The Fastest Way to Read and Process Text Files using C# .Net


This article benchmarks several techniques to determine the fastest way to read and process text files in C# .Net.

My previous article, “What’s the fastest way to read a text file?” (http://cc.davelozinski.com/c-sharp/fastest-way-to-read-text-files), measured raw reading speed. Some applications, however, require extensive processing of each line of data from the file. So we need to test more than raw reading speeds: let’s test the various reading techniques while including some mathematical number crunching on each line.

 

Setting Things Up:

I wrote a C# Console application to test many different techniques to read a text file and process the lines contained therein. This isn’t an exhaustive list, but I believe it covers how it’s done most of the time.

The code is written in Visual Studio 2012 targeting .Net Framework version 4.5 x64. The source code is available at the end of this blog so you can benchmark it on your own system if you wish.

In a nutshell, the code does the following:

  1. Generates a GUID
  2. Creates a string object with that GUID repeated either 5, 10, or 25 times
  3. Writes the string object to a local text file 429,496 or 214,748 times.
  4. It then reads the text file in using various techniques, identified below, clearing all the objects and doing a garbage collection after each run to make sure we start each run with fresh resources:

| # | Technique |
|---|---|
| T1 | Reading the entire file into a single string using the StreamReader ReadToEnd() method, then processing the entire string. |
| T2 | Reading the entire file into a single StringBuilder object using the ReadToEnd() method, then processing the entire string. |
| T3 | Reading each line into a string, and processing line by line. |
| T4 | Reading each line into a string using a BufferedStream, and processing line by line. |
| T5 | Reading each line into a string using a BufferedStream with a preset buffer size equal to the size of the biggest line, and processing line by line. |
| T6 | Reading each line into a StringBuilder object, and processing line by line. |
| T7 | Reading each line into a StringBuilder object with its size preset and equal to the size of the biggest line, and processing line by line. |
| T8 | Reading each line into a pre-allocated string array object, then running a Parallel.For loop to process all the lines in parallel. |
| T9 | Reading the entire file into a string array object using the .Net ReadAllLines() method, then running a Parallel.For loop to process all the lines in parallel. |

  5. Each line in the file is processed by being split into a string array containing its individual guids. Then each string is parsed character by character to determine if it’s a number and, if so, a mathematical calculation is done based on it.
  6. The generated file is then deleted.
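The file generation and per-line processing steps above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual source: the class and method names are hypothetical, the comma separator between guids is an assumption (the article does not say how they are delimited), and summing the digits stands in for the unspecified mathematical calculation.

```csharp
using System;
using System.IO;
using System.Text;

public static class GuidFileSketch
{
    // Steps 1-3: generate one GUID, repeat it guidsPerLine times on a line,
    // and write that line to the file lineCount times.
    // ASSUMPTION: GUIDs are separated by commas.
    public static void WriteTestFile(string path, int guidsPerLine, int lineCount)
    {
        string guid = Guid.NewGuid().ToString();
        var builder = new StringBuilder();
        for (int i = 0; i < guidsPerLine; i++)
        {
            if (i > 0) builder.Append(',');
            builder.Append(guid);
        }
        string text = builder.ToString();

        using (var writer = new StreamWriter(path))
        {
            for (int i = 0; i < lineCount; i++)
                writer.WriteLine(text);
        }
    }

    // Step 5: split the line into its GUIDs, then inspect each character
    // and fold every decimal digit into a running total.
    // ASSUMPTION: the digit sum is a stand-in for the real calculation.
    public static long ProcessLine(string line)
    {
        long total = 0;
        foreach (string guid in line.Split(','))
        {
            foreach (char c in guid)
            {
                if (c >= '0' && c <= '9')
                    total += c - '0';
            }
        }
        return total;
    }
}
```

Calling `GuidFileSketch.WriteTestFile(path, 5, 429496)` would produce a file shaped like the smallest 429,496-line test case; each reading technique then feeds its lines to something like `ProcessLine`.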

The exe file was installed and run on an Alienware M17X R3 with a single, purely mechanical 7200 rpm drive, as I didn’t want the memory of a “hybrid” drive or mSata card to taint the results. The Alienware runs Windows 7 64-bit with 16 GB of memory on an i7-2820QM processor. The trial was run once, 5 minutes after the machine was up and running from a cold start, to eliminate any background processes still starting up which might detract from the test. There was no reason to run this test multiple times because, as you’ll see, there are clear winners and losers.
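For readers who want to reproduce the measurements, a timing harness along these lines is one way to start each run with fresh resources, as described in step 4. It is only a sketch: `BenchmarkHarness`, `Measure`, and `Format` are hypothetical names, and the exact garbage collection calls the original program makes are not shown in this article.

```csharp
using System;
using System.Diagnostics;

public static class BenchmarkHarness
{
    // Forces a full garbage collection, then times a single run of the
    // supplied technique with a Stopwatch.
    public static TimeSpan Measure(Action technique)
    {
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        Stopwatch watch = Stopwatch.StartNew();
        technique();
        watch.Stop();
        return watch.Elapsed;
    }

    // Formats an elapsed time as minutes:seconds.fraction,
    // with four fractional digits.
    public static string Format(TimeSpan elapsed)
    {
        return elapsed.ToString(@"m\:ss\.ffff");
    }
}
```

A run would then look like `BenchmarkHarness.Format(BenchmarkHarness.Measure(() => SomeTechnique(path)))`, where `SomeTechnique` is whichever reading method is under test.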

 

The Runs:

Before starting, my hypothesis was that the techniques which read the entire file into an array and then use parallel for loops to process all the lines would win hands down.

Let’s see what happened on my machine. Green cells indicate the winner(s) for that run; yellow cells indicate the runners-up.

All times are indicated in minutes:seconds.milliseconds format. Lower numbers indicate faster performance.

| Run #1 | 5 Guids Per Line, 429,496 lines | 5 Guids Per Line, 214,748 lines | 10 Guids Per Line, 429,496 lines | 10 Guids Per Line, 214,748 lines | 25 Guids Per Line, 429,496 lines | 25 Guids Per Line, 214,748 lines |
|---|---|---|---|---|---|---|
| T1 | 26.0165 | 12.8108 | 51.2161 | 25.6457 | 2:08.0661 | 1:04.0958 |
| T2 | 25.8557 | 12.8692 | 51.2843 | 25.7571 | 2:08.7938 | 1:04.1300 |
| T3 | 25.5055 | 12.9920 | 50.9340 | 25.6576 | 2:07.8043 | 1:03.8621 |
| T4 | 25.5241 | 12.8205 | 51.0251 | 25.5980 | 2:07.7404 | 1:03.8547 |
| T5 | 25.4960 | 12.8065 | 50.9899 | 25.5554 | 2:08.1822 | 1:04.0174 |
| T6 | 25.6190 | 12.8883 | 51.0363 | 25.6011 | 2:07.9028 | 1:03.8462 |
| T7 | 25.5769 | 12.8838 | 51.3235 | 25.5201 | 2:08.4510 | 1:03.8346 |
| T8 | 07.3555 | 03.9828 | 14.7095 | 07.8946 | 0:36.2732 | 0:18.7467 |
| T9 | 07.2808 | 03.9742 | 14.8749 | 07.9938 | 0:38.9168 | 0:19.1223 |

 

Sha-Bam! Parallel Processing Dominates!

Seeing the results, there is no clear-cut winner among techniques T1 – T7. T8 and T9, which implemented the parallel processing techniques, completely dominated. They always finished in less than a third (33%) of the time taken by any technique processing line by line.
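To make the contrast concrete, here is a rough sketch of the two styles: a sequential read-and-process loop in the spirit of T3, and a read-everything-then-`Parallel.For` approach in the spirit of T9. It is not the benchmark's code. `ReadStrategies` and its methods are hypothetical names, and the digit-summing `ProcessLine` is an assumed stand-in for the real per-line work.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class ReadStrategies
{
    // ASSUMPTION: stand-in workload that sums the decimal digits in a line.
    private static long ProcessLine(string line)
    {
        long total = 0;
        foreach (char c in line)
        {
            if (c >= '0' && c <= '9')
                total += c - '0';
        }
        return total;
    }

    // T3 style: read a line, process it, repeat until end of file.
    public static long LineByLine(string path)
    {
        long total = 0;
        using (var reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                total += ProcessLine(line);
        }
        return total;
    }

    // T9 style: load every line first, then process them in parallel.
    // Each worker keeps a thread-local subtotal, which is merged into the
    // shared total with Interlocked.Add once that worker finishes.
    public static long ReadAllThenParallel(string path)
    {
        string[] lines = File.ReadAllLines(path);
        long total = 0;
        Parallel.For<long>(0, lines.Length,
            () => 0L,
            (i, state, local) => local + ProcessLine(lines[i]),
            local => Interlocked.Add(ref total, local));
        return total;
    }
}
```

The thread-local subtotal is a design choice for the sketch: it avoids contending on a shared counter for every line, which would otherwise eat into the parallel speed-up.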

The surprise for me came when each line was 10 guids long. From that point forward, the .Net built-in File.ReadAllLines() method (T9) started performing slower than reading into a pre-allocated array (T8). This wasn’t quite so evident when just plain reading a file. However, it indicates that if you really want to micro-optimize your code for speed, always pre-allocate the size of a string array when possible.
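A pre-allocated read in the style of T8 might look like the sketch below. It assumes the line count is known in advance, as it is in this benchmark because the program writes the file itself; `PreallocatedReader` and `expectedLines` are hypothetical names, not taken from the original source.

```csharp
using System;
using System.IO;

public static class PreallocatedReader
{
    // Reads lines into an array sized up front instead of letting a
    // collection grow as lines arrive.
    // ASSUMPTION: the caller knows (or can bound) the number of lines.
    // Lines beyond expectedLines are ignored; if the file is shorter,
    // the array is trimmed to the number of lines actually read.
    public static string[] ReadLines(string path, int expectedLines)
    {
        var lines = new string[expectedLines];
        int count = 0;
        using (var reader = new StreamReader(path))
        {
            string line;
            while (count < lines.Length && (line = reader.ReadLine()) != null)
                lines[count++] = line;
        }
        if (count < lines.Length)
            Array.Resize(ref lines, count);
        return lines;
    }
}
```

The returned array can be handed straight to a `Parallel.For` loop. When the line count is not known ahead of time, File.ReadAllLines() remains the simpler choice.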

 

In Summary:

On my system, unless someone spots a flaw in my test code, reading an entire file into an array and then processing the lines using a parallel loop proved significantly faster than reading a line, then processing it. Unfortunately I still see a lot of C# programmers and C# code running .Net 4 (or above) doing the age-old “read a line, process the line, repeat until end of file” technique instead of “read all the lines into memory and then process”. The performance difference is so great it even makes up for the loss of time when just reading a file.

This test code is just doing mathematical calculations too. The difference in performance may be even greater if you need to do other things to process your data as well, such as running a database query.

Obviously you should test on your system before micro-optimizing this functionality for your .Net application.

Otherwise, thanks to .Net 4, parallel processing is now easy to accomplish in C#. It’s time to break out of old patterns and start taking advantage of the power made available to us.

 

Bonus Link!

For all the readers who requested it, here’s C# code to serve as a starting point for reading lines from a text file in batches and processing them in parallel! Enjoy!
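As a rough idea of what batching can look like (this is a sketch under stated assumptions, not the linked code): read up to `batchSize` lines into a reusable buffer, process that batch with `Parallel.For`, and repeat until the file is exhausted. `BatchProcessor`, `batchSize`, and the digit-summing `ProcessLine` are all hypothetical.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class BatchProcessor
{
    // ASSUMPTION: stand-in workload that sums the decimal digits in a line.
    private static long ProcessLine(string line)
    {
        long total = 0;
        foreach (char c in line)
        {
            if (c >= '0' && c <= '9')
                total += c - '0';
        }
        return total;
    }

    // Reads the file batchSize lines at a time and processes each batch
    // in parallel, so only one batch is held in memory at once.
    public static long ProcessInBatches(string path, int batchSize)
    {
        long total = 0;
        var batch = new string[batchSize];
        using (var reader = new StreamReader(path))
        {
            while (true)
            {
                // Fill the buffer with up to batchSize lines.
                int count = 0;
                string line;
                while (count < batchSize && (line = reader.ReadLine()) != null)
                    batch[count++] = line;

                if (count == 0) break;

                // Process the lines read in this batch in parallel.
                Parallel.For(0, count, i =>
                {
                    Interlocked.Add(ref total, ProcessLine(batch[i]));
                });

                // A short batch means the end of the file was reached.
                if (count < batchSize) break;
            }
        }
        return total;
    }
}
```

Batching trades some parallel throughput for a bounded memory footprint, which matters when the file is too large to load whole; the best batch size depends on the workload and should be measured.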

 

The Code: