
The Fastest Way to Read and Process Text Files using C# .Net

This article benchmarks a number of techniques to determine the fastest way to read and process text files in C# .Net.

Building upon my previous article, What’s the Fastest Way to Read a Text File? (http://cc.davelozinski.com/c-sharp/fastest-way-to-read-text-files), this one considers applications that require extensive processing of each line of data from the file. So we need to test more than raw reading speed – let’s test the various reading techniques while including some mathematical number crunching on each line.

 

Setting Things Up:

I wrote a C# Console application to test many different techniques for reading a text file and processing the lines contained therein. This isn’t an exhaustive list, but I believe it covers how it’s done most of the time.

The code is written in Visual Studio 2012 targeting .Net Framework version 4.5 x64. The source code is available at the end of this blog so you can benchmark it on your own system if you wish.

In a nutshell, the code does the following:

  1. Generates a GUID
  2. Creates a string object with that GUID repeated either 5, 10, or 25 times
  3. Writes the string object to a local text file 429,496 or 214,748 times.
  4. Reads the text file back in using each of the techniques identified below, clearing all the objects and forcing a garbage collection after each run to make sure every run starts with fresh resources:

The techniques, numbered as they are referenced in the results below:

T1: Reading the entire file into a single string using the StreamReader ReadToEnd() method, then processing the entire string.
T2: Reading the entire file into a single StringBuilder object using the ReadToEnd() method, then processing the entire string.
T3: Reading each line into a string, and processing line by line.
T4: Reading each line into a string using a BufferedStream, and processing line by line.
T5: Reading each line into a string using a BufferedStream with a preset buffer size equal to the size of the biggest line, and processing line by line.
T6: Reading each line into a StringBuilder object, and processing line by line.
T7: Reading each line into a StringBuilder object with its size preset and equal to the size of the biggest line, and processing line by line.
T8: Reading each line into a pre-allocated string array object, then running a Parallel.For loop to process all the lines in parallel.
T9: Reading the entire file into a string array object using the .Net ReadAllLines() method, then running a Parallel.For loop to process all the lines in parallel.

  5. Each line in the file is processed by splitting it into a string array containing its individual GUIDs. Each string is then parsed character by character to determine whether each character is a number and, if so, a mathematical calculation is done based on it. (A rough sketch of this read-and-process step appears just after this list.)
  6. The generated file is then deleted.
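
To make the read-and-process step concrete, here is a minimal sketch of the T8 shape: a pre-allocated string array filled line by line, then a Parallel.For loop crunching each line. The line-count parameter, the comma separator, the ProcessLine helper and the Math.Sqrt stand-in calculation are illustrative assumptions only; this is not the benchmark’s actual code (the full source follows at the end of the post).

    using System;
    using System.IO;
    using System.Threading.Tasks;

    class T8Sketch
    {
        // Read a file with a known number of lines into a pre-allocated array,
        // then process every line in parallel.
        static void Run(string path, int lineCount)
        {
            string[] lines = new string[lineCount];

            using (StreamReader reader = new StreamReader(path))
            {
                for (int i = 0; i < lineCount && !reader.EndOfStream; i++)
                    lines[i] = reader.ReadLine();
            }

            Parallel.For(0, lineCount, i => ProcessLine(lines[i]));
        }

        // Stand-in for the per-line work: split the line into its GUIDs, walk
        // each character, and do a small calculation whenever a digit is found.
        static void ProcessLine(string line)
        {
            if (line == null) return;

            double total = 0;
            foreach (string guid in line.Split(','))      // separator assumed
                foreach (char c in guid)
                    if (char.IsDigit(c))
                        total += Math.Sqrt(c - '0');      // placeholder calculation
        }
    }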

The exe file was installed and run on an Alienware M17X R3 with a single, purely mechanical 7200 rpm drive, as I didn’t want the effects of a “hybrid” drive’s memory or an mSATA card to taint the results. The Alienware runs Windows 7 64-bit with 16 GB of memory on an i7-2820QM processor. The trial was run once, 5 minutes after the machine was up and running from a cold start, to eliminate any background processes still starting up which might detract from the test. There was no reason to run this test multiple times because, as you’ll see, there are clear winners and losers.

 

The Runs:

Before starting, my hypothesis was that the techniques which read the entire file into an array and then use a parallel loop to process all the lines would win hands down.

Let’s see what happened on my machine.

All times are shown in minutes:seconds.milliseconds format (minutes are omitted where zero). Lower numbers indicate faster performance.

Run #1             5 GUIDs Per Line          10 GUIDs Per Line         25 GUIDs Per Line
Lines per file:    429,496     214,748       429,496     214,748       429,496      214,748

T1                 26.0165     12.8108       51.2161     25.6457       2:08.0661    1:04.0958
T2                 25.8557     12.8692       51.2843     25.7571       2:08.7938    1:04.1300
T3                 25.5055     12.9920       50.9340     25.6576       2:07.8043    1:03.8621
T4                 25.5241     12.8205       51.0251     25.5980       2:07.7404    1:03.8547
T5                 25.4960     12.8065       50.9899     25.5554       2:08.1822    1:04.0174
T6                 25.6190     12.8883       51.0363     25.6011       2:07.9028    1:03.8462
T7                 25.5769     12.8838       51.3235     25.5201       2:08.4510    1:03.8346
T8                 07.3555     03.9828       14.7095     07.8946       0:36.2732    0:18.7467
T9                 07.2808     03.9742       14.8749     07.9938       0:38.9168    0:19.1223

 

Sha-Bam! Parallel Processing Dominates!

Looking at the results, there is no clear-cut winner among techniques T1 – T7. T8 and T9, which used the parallel processing techniques, completely dominated: they always finished in less than a third of the time it took any technique that processed line by line.

The surprise for me came when each line was 10 GUIDs in length. From that point forward, the built-in .Net File.ReadAllLines() method (T9) started performing slightly slower than the pre-allocated array (T8). This wasn’t as evident when just plain reading a file. It does indicate that if you really want to micro-optimize your code for speed, pre-allocate the size of a string array whenever possible.

 

In Summary:

On my system, unless someone spots a flaw in my test code, reading an entire file into an array and then processing it line by line with a parallel loop proved significantly faster than reading and processing one line at a time. Unfortunately I still see a lot of C# programmers and C# code running .Net 4 (or above) using the age-old “read a line, process the line, repeat until end of file” technique instead of “read all the lines into memory and then process”. The performance difference is so great it even makes up for the time lost when just reading the file.

This test code is only doing mathematical calculations, too. The difference in performance may be even greater if your processing involves other work as well, such as running a database query.

Obviously you should test on your system before micro-optimizing this functionality for your .Net application.

Otherwise, thanks to .Net 4, parallel processing is now easily accomplished in C#. It’s time to break out of old patterns and start taking advantage of the power made available to us.

 

Bonus Link!

For all the readers who requested it, here’s C# code to serve as a starting point for reading lines from a text file in batches and processing them in parallel. Enjoy!
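
The linked sample itself isn’t reproduced here, but a batched read-and-process loop along those lines might look like the following sketch. The batch size and the doWork callback are placeholders, not the code behind the link.

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Threading.Tasks;

    class BatchedReaderSketch
    {
        // Read a large file in fixed-size batches so the whole file never has
        // to sit in memory at once, and hand each batch to a parallel loop.
        static void ProcessInBatches(string path, int batchSize, Action<string> doWork)
        {
            var batch = new List<string>(batchSize);

            using (var reader = new StreamReader(path))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    batch.Add(line);
                    if (batch.Count == batchSize)
                    {
                        Parallel.ForEach(batch, doWork);
                        batch.Clear();
                    }
                }
            }

            // Process whatever is left in the final, partially filled batch.
            if (batch.Count > 0)
                Parallel.ForEach(batch, doWork);
        }
    }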

 

The Code:

(The full benchmark source code followed here in the original post.)

Comments:

  • Melvin

    With parallel processing this will only work if, say, line one has nothing to do with line three; otherwise, things don’t get processed sequentially.

  • Benjamin Holland

    How would using Memory Mapped Files affect this?

  • areaem

    Fantastic article.

    The question I have is: if I don’t know how many lines my potentially large text file will have, how do I allocate memory for the string array? What would you recommend in this case?

    • FireMystdl

      Two options off the top of my head:
      1) Allocate a set-size array, say 50,000 elements. Then read your file 50,000 lines at a time, process them, and repeat until you reach the end of the file.
      2) Use a List object and read line by line. Or, if you know your file is always going to contain at least x lines, pre-allocate your List to size x first and then read line by line.

  • Moon inmoon

    I think your test may be giving a rather specific picture in favour of parallel. Your …DoStuff() does something numerically intense, so when you do it in parallel it is LIKELY to be faster. However, if you reduce what DoStuff does, the picture is less clear.

    I tried your approach for parsing my very large text files. Each line contains a Date, an int and 2 doubles, and each needs to be parsed. But the parsing doesn’t take that long, so the overhead of starting x tasks (I tried 2, 10, 100, 1000) becomes the key factor.

  • What if I’m uploading a file to an asp.net page and I already have the file stream in memory (file.InputStream)? Should I save it to disk first before processing the lines, or is that just a waste of time? With the file in a stream I can’t use File.ReadAllLines().

    Thinking of your former article, where we just read the file: when I already have the file stream, what should be faster? Reading all the lines into a single string and then splitting that string into lines for processing, or using a buffered stream like T5? Should I build an array of the lines before processing and process it with Parallel.For, or is there no performance gain in that?

    My main doubt is: if the file is already in a stream, is there any performance gain in reading the strings line by line, or “buffered” line by line?

    • Dave

      If you already have a file stream from an upload, just use a technique above and read each line into a List until the end of stream. Similar to T8 above. With Lists, you don’t have to worry about a “fixed size” array.

      Then just Parallel.For over the List object.
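
      A compact sketch of that suggestion, assuming the upload is exposed as a Stream; the crunch callback stands in for whatever per-line work is needed.

        using System;
        using System.Collections.Generic;
        using System.IO;
        using System.Threading.Tasks;

        static class UploadedStreamSketch
        {
            // Read an already-open stream (e.g. file.InputStream from an upload)
            // line by line into a List, then process the lines in parallel.
            public static void Process(Stream uploadStream, Action<string> crunch)
            {
                var lines = new List<string>();

                using (var reader = new StreamReader(uploadStream))
                {
                    string line;
                    while ((line = reader.ReadLine()) != null)
                        lines.Add(line);
                }

                Parallel.ForEach(lines, crunch);
            }
        }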

      • It was exactly what I needed and it worked fast. Thanks!

  • Rolf Wessels

    Great article. I really love it when people take the time to learn more about the code that they are writing and why. I have a bit of a concern with your conclusion, though. I believe that your code tests two variables: 1) reading speed and 2) processing speed. You have done an amazing job of proving that processing over multiple threads/cores is faster than processing on a single thread/core. The numbers even add up: your i7-2820QM processor is a quad core, and therefore we would expect it to be just less than 4 times faster at processing data when using all cores (12/4 = 3). One would really hope that all developers know this by now, but if they didn’t, they do now, thanks to you. What your tests do not prove is that reading everything into memory is faster than reading it line by line. I think the reason why developers still use the “old school” approach is to avoid that age-old “OutOfMemoryException” (which, as I see from your code, you are familiar with). So in my opinion the best approach would really be to read the file line by line but process the data in parallel. This point is obviously moot if you have a maximum file size and a minimum memory size for each application.

    • Dave

      Thanks for your feedback Rolf. I’m actually working on a project now where I’ve implemented a “hybrid” approach to your suggestion which I think a lot of people could do. The C# program I’m working on queries massive amounts of data from an SQL Server database. What I’ve implemented is a system where I keep track of the number of records downloaded from the database and load them into an in-memory datatable. When I hit a specified threshold (currently 180,000 records), I implement a Parallel.For loop to iterate over the records and do what I need to do with them. It’s awesome because 1) I get good network speeds by letting the system continuously download from the DB instead of stopping for milliseconds at a time to process each record individually; 2) it takes the server less than 2 seconds to complete the Parallel.For loop (it’s fun watching the temporary CPU usage spikes across all the cores). Hopefully this gives other developers inspiration to implement similar systems. Basically, you just have to do what’s best for your environment. 🙂
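
      For readers curious how that hybrid looks in outline, here is a rough sketch of the idea. The connection string, query, threshold and processRecord callback are placeholders, and it buffers rows in a List rather than a DataTable purely to keep the example short; it is not the code from the project described above.

        using System;
        using System.Collections.Generic;
        using System.Data.SqlClient;
        using System.Threading.Tasks;

        class HybridDbSketch
        {
            // Keep downloading records; every time the in-memory buffer reaches
            // the threshold, burst-process it with a parallel loop, then continue.
            static void Run(string connectionString, string query, int threshold,
                            Action<object[]> processRecord)
            {
                var buffer = new List<object[]>(threshold);

                using (var conn = new SqlConnection(connectionString))
                using (var cmd = new SqlCommand(query, conn))
                {
                    conn.Open();
                    using (var reader = cmd.ExecuteReader())
                    {
                        while (reader.Read())
                        {
                            var values = new object[reader.FieldCount];
                            reader.GetValues(values);
                            buffer.Add(values);

                            if (buffer.Count >= threshold)
                            {
                                Parallel.ForEach(buffer, processRecord);
                                buffer.Clear();
                            }
                        }
                    }
                }

                // Process any records left over after the last full batch.
                if (buffer.Count > 0)
                    Parallel.ForEach(buffer, processRecord);
            }
        }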

      • Rolf Wessels

        Sounds interesting. You should have a look at Reactive Extensions for C#, particularly the buffered extensions (http://rxwiki.wikidot.com/101samples#toc26). It could be a nice addition for future projects without the need to write your own buffer. That, in combination with TPL Dataflow to manage your parallelism, could be another way to skin the cat :-). How did you decide on 180,000 records? Looking forward to seeing your write-up about your current implementation.
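
        For anyone unfamiliar with the TPL Dataflow suggestion, a minimal ActionBlock version of the idea might look like this sketch; the degree of parallelism and the processLine callback are arbitrary placeholder choices (TPL Dataflow ships as a separate NuGet package).

          using System;
          using System.IO;
          using System.Threading.Tasks.Dataflow;

          class DataflowSketch
          {
              // Post each line into an ActionBlock that runs the work on several
              // threads; the block does its own buffering, so no manual batching.
              static void Run(string path, Action<string> processLine)
              {
                  var worker = new ActionBlock<string>(
                      processLine,
                      new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

                  foreach (string line in File.ReadLines(path))
                      worker.Post(line);

                  worker.Complete();          // no more lines coming
                  worker.Completion.Wait();   // wait until all posted lines are done
              }
          }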

  • alex

    How about reading the file async and processing the lines async? Could async beat the parallel approaches T8 and T9?

    • Dave

      I cannot say with 100% certainty, but I doubt it will. Why? You can’t process a line in the file until you finish reading a line in the file, whether it’s done async or not. Using async will help improve an application’s responsiveness if you’re writing an application with a GUI interface, or want other things to happen while you’re processing the file. So no, I don’t think so. However, the test code is published above, so feel free to try it and let us know what happens.
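
      For anyone who wants to run that experiment, a minimal async read loop in .Net 4.5 might look like the sketch below; ReadAllLinesAsync is an invented helper name, not part of the published benchmark. It frees the calling thread while waiting on I/O, but the lines still arrive one at a time, which is why it is unlikely to beat T8/T9 on raw throughput.

        using System.Collections.Generic;
        using System.IO;
        using System.Threading.Tasks;

        class AsyncReadSketch
        {
            // Read every line of a file without blocking the calling thread;
            // the returned list can then be processed with a parallel loop.
            static async Task<List<string>> ReadAllLinesAsync(string path)
            {
                var lines = new List<string>();

                using (var reader = new StreamReader(path))
                {
                    string line;
                    while ((line = await reader.ReadLineAsync()) != null)
                        lines.Add(line);
                }

                return lines;
            }
        }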

  • BJ

    Also, it would be interesting to compare File.ReadAllLines (which is used in T9) but in a single thread vs the other methods.

    • FireMystdl

      I’m not sure what you mean. Can you elaborate further on what you mean by “single thread vs the other methods”? File.ReadAllLines is single-threaded by its nature; the other techniques all use a single-threaded algorithm when reading because those are the most common techniques programmers use.

  • BJ

    Interesting article; however, I’m not sure it’s a fair test, as you’re just testing the actual read from the file and then throwing that data away. What would be interesting is which technique reads the file and then returns a collection of objects, as that is surely the normal use case for file reading. I wonder if the overhead of locking the return collection and any subsequent sort would bring it more in line?

    • FireMystdl

      Every programmer and program is different. What’s “normal” is relative. I have come across a lot of production code where people read a line, do something, and then are done with it; on the other hand, I’ve also come across code where people read data into an array, do something, and then chuck the array to the garbage collector.

      The source code is there so feel free to modify, run the tests, and report back. 🙂

  • Rahgozar

    Thanks a lot. This article is very good and useful for me.

  • J

    First, great article and benchmark testing. I understand that your processing is performing arithmetic computations, but would these different methods still perform and have the same end-result if I processed these documents a different way? For my use case, I have two.

    01) I have a custom application that would need to “extract” image files that are stored in a document management system. These would most likely be tiff or pdf files and some text files. However, the program reads a text file to know what documents to extract.

    02) I have another custom application that would need to convert a text file into a pdf file. I would extract a file and save it to a Windows server from the document management system and if the file is a text file, then I would convert the file into a pdf file.

    Would your testings be applicable to either of my use cases? Thank you in advance for your feedback!

    • David Lozinski

      Thank you for the compliments and feedback. 🙂

      but would these different methods still perform and have the same end-result if I processed these documents a different way?

      Depends on what you’re doing. Generally speaking, “yes” if you can operate on each line independently of the others; “no” or “maybe” if any of the lines are dependent on any other line. Your examples illustrate this:

      01) I have a custom application that would need to “extract” image files that are stored in a document management system. These would most likely be tiff or pdf files and some text files. However, the program reads a text file to know what documents to extract.

      In this example things should still work as per my test. You read one file in its entirety to find out all the image files you need to extract, then you use the parallel loop to run through the list and extract each file.

      02) I have another custom application that would need to convert a text file into a pdf file. I would extract a file and save it to a Windows server from the document management system and if the file is a text file, then I would convert the file into a pdf file.

      In this situation it depends, because you don’t give enough detail, but I’ll give you a guide:
      1) If it’s just ONE file you need to convert, then no, unless you want to do some complicated programming; otherwise you’ll basically read through the one file and convert it.
      2) If there are multiple files, you could have a parallel loop which loops over the files and converts each one. But you’ll have to benchmark single-threaded vs multi-threaded, because a parallel loop could take longer if you have a relatively small number of files to convert as opposed to a lot. This depends on a number of factors, such as how big the files to convert are, how long the conversion actually takes, how long it takes the system to allocate and keep track of the multiple threads, etc.

      In situation #2, it will probably be best for you to write everything as a single threaded loop, test it, and then go back and rewrite for parallelism, benchmark, and compare the results.

      Good luck!

  • cem

    This was a really nice comparison and it worked quite nicely for my case. It is especially useful when the processing task is parallelizable.

    • David Lozinski

      Glad it helped! I’ve found that in most cases, rewriting code to take advantage of the parallelism .Net 4+ allows significantly improves performance. We need to get out of the old mindset of reading a line, processing a line, reading, processing… now it’s “read all” and then “process everything in parallel, independently”.