Using C# .Net: Fastest Way to Read and Process Text Files
This article benchmarks a number of techniques to determine the fastest way to read and process text files in C# .Net.
Building upon my previous article, what’s the fastest way to read a text file (http://cc.davelozinski.com/c-sharp/fastest-way-to-read-text-files), this one recognizes that some applications require extensive processing of each line of data from the file. So we need to test more than raw reading speeds – let’s test the various reading techniques while including some mathematical number crunching on each line.
Setting Things Up:
I wrote a C# Console application to test many different techniques to read a text file and process the lines contained therein. This isn’t an exhaustive list, but I believe it covers how it’s done most of the time.
The code is written in Visual Studio 2012 targeting .Net Framework version 4.5 x64. The source code is available at the end of this blog so you can benchmark it on your own system if you wish.
In a nutshell, the code does the following:
- Generates a GUID
- Creates a string object with that GUID repeated either 5, 10, or 25 times
- Writes the string object to a local text file 429,496 or 214,748 times.
- It then reads the text file in using various techniques, identified below, clearing all the objects and doing a garbage collection after each run to make sure we start each run with fresh resources:
| # | Technique |
|---|---|
| T1 | Reading the entire file into a single string using the StreamReader ReadToEnd() method, then process the entire string. |
| T2 | Reading the entire file into a single StringBuilder object using the ReadToEnd() method, then process the entire string. |
| T3 | Reading each line into a string, and process line by line. |
| T4 | Reading each line into a string using a BufferedStream, and process line by line. |
| T5 | Reading each line into a string using a BufferedStream with a preset buffer size equal to the size of the biggest line, and process line by line. |
| T6 | Reading each line into a StringBuilder object, and process line by line. |
| T7 | Reading each line into a StringBuilder object with its size preset and equal to the size of the biggest line, and process line by line. |
| T8 | Reading each line into a pre-allocated string array object, then run a Parallel.For loop to process all the lines in parallel. |
| T9 | Reading the entire file into a string array object using the .Net ReadAllLines() method, then run a Parallel.For loop to process all the lines in parallel. |
- Each line in the file is processed by being split into a string array containing its individual guids. Then each string is parsed character by character to determine if it’s a number and, if so, a mathematical calculation is done based on it.
- The generated file is then deleted.
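The per-line work described in the list above can be sketched roughly as follows. This is a minimal illustration rather than the benchmark itself: the names `LineSketch`, `BuildLine` and `CountDigits` are made up for this example, and a simple digit count stands in for the heavier math the real test performs on each digit.

```csharp
using System;
using System.Linq;

public static class LineSketch
{
    // Builds one line of the test file: the same GUID repeated `repeat` times,
    // separated by single spaces.
    public static string BuildLine(Guid guid, int repeat)
    {
        return String.Join(" ", Enumerable.Repeat(guid.ToString(), repeat));
    }

    // Splits a line into its GUIDs and counts, for each GUID,
    // how many of its characters parse as a decimal digit.
    public static int[] CountDigits(string line)
    {
        string[] guids = line.Split(' ');
        int[] counts = new int[guids.Length];
        for (int i = 0; i < guids.Length; i++)
        {
            foreach (char c in guids[i])
            {
                int digit;
                if (int.TryParse(c.ToString(), out digit))
                {
                    counts[i]++; // the real test does a calculation with the digit here
                }
            }
        }
        return counts;
    }

    public static void Main()
    {
        string line = BuildLine(Guid.NewGuid(), 5);
        int[] counts = CountDigits(line);
        Console.WriteLine("GUIDs on line: " + counts.Length);
        Console.WriteLine("Digits in first GUID: " + counts[0]);
    }
}
```

Because every GUID on a line is identical, every entry in the returned array holds the same count.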
The tests were run on a Windows 7 64-bit machine with 16GB of memory, using a purely mechanical 7200 rpm drive, as I didn’t want the effects that the memory of a “hybrid” drive or mSata card might have on the system to taint the results.
This trial was run once, waiting 5 minutes after the machine was up and running from a cold start up. This was to eliminate any other background processes starting up which might detract from the test. There was no reason to run this test multiple times because, as you’ll see, there are clear winners and losers.
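The "time one technique, then garbage collect before the next" pattern can be sketched as below. Note the assumptions: the benchmark itself takes timestamps with DateTime.Now, whereas this sketch swaps in Stopwatch, and `TimingSketch` and `Measure` are hypothetical names used only for illustration.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class TimingSketch
{
    // Runs `work` exactly once and returns the elapsed wall-clock time.
    // A garbage collection is forced afterwards so the next measurement
    // starts with fresh resources.
    public static TimeSpan Measure(Action work)
    {
        Stopwatch watch = Stopwatch.StartNew();
        work();
        watch.Stop();

        GC.Collect();
        GC.WaitForPendingFinalizers();
        return watch.Elapsed;
    }

    public static void Main()
    {
        // Thread.Sleep stands in for "read and process the file".
        TimeSpan elapsed = Measure(() => Thread.Sleep(50));
        Console.WriteLine("Elapsed: " + elapsed);
    }
}
```

Keeping the collection outside the timed region means the clean-up cost is not charged to the technique that just ran.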
The Runs:
Before starting, my hypothesis was that the techniques that read the entire file into an array and then use parallel for loops to process all the lines would win hands down.
Let’s see what happened on my machine. Green cells indicate the winner(s) for that run; yellow second runners up.
All times are indicated in minutes:seconds.milliseconds format. Lower numbers indicate faster performance.
| Run #1 | 5 Guids Per Line, 429,496 lines | 5 Guids Per Line, 214,748 lines | 10 Guids Per Line, 429,496 lines | 10 Guids Per Line, 214,748 lines | 25 Guids Per Line, 429,496 lines | 25 Guids Per Line, 214,748 lines |
|---|---|---|---|---|---|---|
| T1: string, ReadToEnd, process | 26.0165 | 12.8108 | 51.2161 | 25.6457 | 2:08.0661 | 1:04.0958 |
| T2: StringBuilder, ReadToEnd, process | 25.8557 | 12.8692 | 51.2843 | 25.7571 | 2:08.7938 | 1:04.1300 |
| T3: StreamReader, read line by line, process | 25.5055 | 12.9920 | 50.9340 | 25.6576 | 2:07.8043 | 1:03.8621 |
| T4: BufferedStream, read line by line, process | 25.5241 | 12.8205 | 51.0251 | 25.5980 | 2:07.7404 | 1:03.8547 |
| T5: BufferedStream with buffer size preset, read line by line, process | 25.4960 | 12.8065 | 50.9899 | 25.5554 | 2:08.1822 | 1:04.0174 |
| T6: StreamReader, read line by line into StringBuilder, process | 25.6190 | 12.8883 | 51.0363 | 25.6011 | 2:07.9028 | 1:03.8462 |
| T7: as above with StringBuilder size preset | 25.5769 | 12.8838 | 51.3235 | 25.5201 | 2:08.4510 | 1:03.8346 |
| T8: StreamReader, read into preset String[], process using Parallel.For | 07.3555 | 03.9828 | 14.7095 | 07.8946 | 0:36.2732 | 0:18.7467 |
| T9: ReadAllLines into String[], process using Parallel.For | 07.2808 | 03.9742 | 14.8749 | 07.9938 | 0:38.9168 | 0:19.1223 |
Sha-Bam! Parallel Processing Dominates!
Seeing the results, there is no clear-cut winner between techniques T1 – T7. T8 & T9, which implemented the parallel processing techniques, completely dominated. Those techniques always finished in less than a third (33%) of the time it took any technique processing line by line.
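The "less than a third" claim can be checked with a little arithmetic on the table. The sketch below takes, for each of the six columns, the fastest line-by-line time (T1 to T7) and the slowest parallel time (T8 or T9), both converted to seconds (so 2:07.7404 becomes 127.7404), and reports the worst ratio. The class and member names are made up for illustration.

```csharp
using System;

public static class RatioSketch
{
    // Fastest line-by-line time (T1-T7) in each column of the results table, in seconds.
    public static readonly double[] FastestSequential =
        { 25.4960, 12.8065, 50.9340, 25.5201, 127.7404, 63.8346 };

    // Slowest parallel time (T8 or T9) in each column, in seconds.
    public static readonly double[] SlowestParallel =
        { 7.3555, 3.9828, 14.8749, 7.9938, 38.9168, 19.1223 };

    // Returns the largest parallel/sequential ratio across all columns,
    // i.e. the case least favourable to the parallel techniques.
    public static double WorstRatio()
    {
        double worst = 0.0;
        for (int i = 0; i < FastestSequential.Length; i++)
        {
            worst = Math.Max(worst, SlowestParallel[i] / FastestSequential[i]);
        }
        return worst;
    }

    public static void Main()
    {
        Console.WriteLine("Worst-case parallel/sequential ratio: " + WorstRatio().ToString("P1"));
    }
}
```

Even in this least favourable pairing (the 10-guid, 214,748-line column), the ratio stays below one third.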
The surprise for me came where each line was 10 guids in length. From that point forward, the .Net inbuilt File.ReadAllLines() method started performing slower. This wasn’t quite so evident when just plain reading a file. However, it indicates that if you really want to micro-optimize your code for speed, always pre-allocate the size of a string array when possible.
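For readers who want to see the two parallel variants side by side, here is a minimal sketch of the T8 idea (fill a pre-allocated array) next to the T9 idea (File.ReadAllLines). It assumes the line count is known in advance and matches the file exactly. The helper names `ReadIntoPresetArray` and `TotalLength` are hypothetical, and summing line lengths stands in for the real per-line work.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelReadSketch
{
    // T8 style: the number of lines is known up front, so the array is allocated once.
    public static string[] ReadIntoPresetArray(string path, int lineCount)
    {
        string[] lines = new string[lineCount];
        using (StreamReader reader = File.OpenText(path))
        {
            string line;
            int i = 0;
            while (i < lineCount && (line = reader.ReadLine()) != null)
            {
                lines[i++] = line;
            }
        }
        return lines;
    }

    // Processes every line in parallel; Interlocked keeps the shared total thread-safe.
    public static long TotalLength(string[] lines)
    {
        long total = 0;
        Parallel.For(0, lines.Length, i =>
        {
            Interlocked.Add(ref total, lines[i].Length);
        });
        return total;
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllLines(path, new[] { "alpha", "beta", "gamma" });
            string[] preset = ReadIntoPresetArray(path, 3); // T8 style
            string[] all = File.ReadAllLines(path);         // T9 style
            Console.WriteLine(TotalLength(preset) + " " + TotalLength(all));
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

Both variants hand the same array shape to Parallel.For, so any difference between them comes from how the array is filled.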
In Summary:
On my system, unless someone spots a flaw in my test code, reading an entire file into an array and then processing line-by-line using a parallel loop proved significantly more beneficial than reading a line, processing a line. Unfortunately I still see a lot of C# programmers and C# code running .Net 4 (or above) doing the age old “read a line, process line, repeat until end of file” technique instead of “read all the lines into memory and then process”. The performance difference is so great it even makes up for the loss of time when just reading a file.
This test code is just doing mathematical calculations too. The difference in performance may be even greater if you need to do other things to process your data as well, such as running a database query.
Obviously you should test on your system before micro-optimizing this functionality for your .Net application.
Otherwise, thanks to .Net 4, parallel processing is now easy to accomplish in C#. It’s time to break out of old patterns and start taking advantage of the power made available to us.
Bonus Link!
For all the readers who requested it, here’s C# code to serve as a starting point for you to do your own reading lines from a text file in batches and processing in parallel! Enjoy!
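As a rough idea of what batch reading with parallel processing can look like, here is a minimal sketch. It is not the code behind the link: `BatchSketch`, `ReadBatches` and `ProcessInBatches` are names invented for this example, and it assumes a positive batch size and per-line work that is safe to run concurrently.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class BatchSketch
{
    // Reads the file lazily and yields lists of at most `batchSize` lines,
    // so only one batch needs to be held in memory at a time.
    public static IEnumerable<List<string>> ReadBatches(string path, int batchSize)
    {
        List<string> batch = new List<string>(batchSize);
        foreach (string line in File.ReadLines(path))
        {
            batch.Add(line);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<string>(batchSize);
            }
        }
        if (batch.Count > 0)
        {
            yield return batch; // final, possibly smaller, batch
        }
    }

    // Processes each batch in parallel and returns the number of lines processed.
    public static int ProcessInBatches(string path, int batchSize, Action<string> work)
    {
        int processed = 0;
        foreach (List<string> batch in ReadBatches(path, batchSize))
        {
            Parallel.ForEach(batch, line =>
            {
                work(line);
                Interlocked.Increment(ref processed);
            });
        }
        return processed;
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllLines(path, new[] { "a", "b", "c", "d", "e" });
            int count = ProcessInBatches(path, 2, line => { /* per-line work goes here */ });
            Console.WriteLine("Lines processed: " + count);
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

The batch size is the knob to tune: larger batches give the parallel loop more to chew on at the cost of more memory per batch.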
The Code:
    using System;
    using System.Collections.Generic;
    using System.Collections;
    using System.Collections.Concurrent;
    using System.IO;
    using System.Linq;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.Threading.Tasks;
    using System.Threading;

    namespace TestApplication
    {
        class Program
        {
            static void Main(string[] args)
            {
                DateTime end;
                DateTime start = DateTime.Now;

                Console.WriteLine("### Overall Start Time: " + start.ToLongTimeString());
                Console.WriteLine();

                //Int32.MaxValue / 5000 = 429,496 lines; Int32.MaxValue / 10000 = 214,748 lines
                TestReadingAndProcessingLinesFromFile((int)Math.Floor((double)(Int32.MaxValue / 5000)), 5);
                TestReadingAndProcessingLinesFromFile((int)Math.Floor((double)(Int32.MaxValue / 5000)), 10);
                TestReadingAndProcessingLinesFromFile((int)Math.Floor((double)(Int32.MaxValue / 5000)), 25);
                TestReadingAndProcessingLinesFromFile((int)Math.Floor((double)(Int32.MaxValue / 10000)), 5);
                TestReadingAndProcessingLinesFromFile((int)Math.Floor((double)(Int32.MaxValue / 10000)), 10);
                TestReadingAndProcessingLinesFromFile((int)Math.Floor((double)(Int32.MaxValue / 10000)), 25);

                end = DateTime.Now;
                Console.WriteLine();
                Console.WriteLine("### Overall End Time: " + end.ToLongTimeString());
                Console.WriteLine("### Overall Run Time: " + (end - start));
                Console.WriteLine();
                Console.WriteLine("Hit Enter to Exit");
                Console.ReadLine();
            }

            //####################################################
            //Does a comparison of reading all the lines in from a file and performing some rudimentary
            //operations on them. Which way is fastest?
            static void TestReadingAndProcessingLinesFromFile(int numberOfLines, int numTimesGuidRepeated)
            {
                Console.WriteLine("######## " + System.Reflection.MethodBase.GetCurrentMethod().Name);
                Console.WriteLine("######## Number of lines in file: " + numberOfLines);
                Console.WriteLine("######## Number of times Guid repeated on each line: " + numTimesGuidRepeated);
                Console.WriteLine("###########################################################");
                Console.WriteLine();

                //Guid.NewGuid() generates a random GUID (new Guid() would yield the all-zero GUID)
                string g = String.Join(" ", Enumerable.Repeat(Guid.NewGuid().ToString(), numTimesGuidRepeated));
                string[] AllLines = null;
                string fileName = "Performance_Test_File.txt";
                int MAX = numberOfLines;
                DateTime end;
                DateTime start = DateTime.Now;

                //Create the file populating it with GUIDs
                Console.WriteLine("Generating file: " + start.ToLongTimeString());
                using (StreamWriter sw = File.CreateText(fileName))
                {
                    for (int x = 0; x < MAX; x++)
                    {
                        sw.WriteLine(g);
                    }
                }
                end = DateTime.Now;
                Console.WriteLine("Finished at: " + end.ToLongTimeString());
                Console.WriteLine("Time: " + (end - start));
                Console.WriteLine();

                GC.Collect();
                Thread.Sleep(1000); //give disk hardware time to recover

                //Just read everything into one string
                Console.WriteLine("Reading file reading to end into string: ");
                start = DateTime.Now;
                try
                {
                    using (StreamReader sr = File.OpenText(fileName))
                    {
                        string s = sr.ReadToEnd();
                        TestReadingAndProcessingLinesFromFile_DoStuff(s);
                    }
                    end = DateTime.Now;
                    Console.WriteLine("Finished at: " + end.ToLongTimeString());
                    Console.WriteLine("Time: " + (end - start));
                    Console.WriteLine();
                }
                catch (OutOfMemoryException)
                {
                    end = DateTime.Now;
                    Console.WriteLine("Not enough memory. Couldn't perform this test.");
                    Console.WriteLine("Finished at: " + end.ToLongTimeString());
                    Console.WriteLine("Time: " + (end - start));
                    Console.WriteLine();
                }
                catch (Exception)
                {
                    end = DateTime.Now;
                    Console.WriteLine("EXCEPTION. Couldn't perform this test.");
                    Console.WriteLine("Finished at: " + end.ToLongTimeString());
                    Console.WriteLine("Time: " + (end - start));
                    Console.WriteLine();
                }

                GC.Collect();
                Thread.Sleep(1000); //give disk hardware time to recover

                //Read the entire contents into a StringBuilder object
                Console.WriteLine("Reading file reading to end into stringbuilder: ");
                start = DateTime.Now;
                try
                {
                    using (StreamReader sr = File.OpenText(fileName))
                    {
                        StringBuilder sb = new StringBuilder();
                        sb.Append(sr.ReadToEnd());
                        TestReadingAndProcessingLinesFromFile_DoStuff(sb.ToString()); //to simulate work
                    }
                    end = DateTime.Now;
                    Console.WriteLine("Finished at: " + end.ToLongTimeString());
                    Console.WriteLine("Time: " + (end - start));
                    Console.WriteLine();
                }
                catch (OutOfMemoryException)
                {
                    end = DateTime.Now;
                    Console.WriteLine("Not enough memory. Couldn't perform this test.");
                    Console.WriteLine("Finished at: " + end.ToLongTimeString());
                    Console.WriteLine("Time: " + (end - start));
                    Console.WriteLine();
                }
                catch (Exception)
                {
                    end = DateTime.Now;
                    Console.WriteLine("EXCEPTION. Couldn't perform this test.");
                    Console.WriteLine("Finished at: " + end.ToLongTimeString());
                    Console.WriteLine("Time: " + (end - start));
                    Console.WriteLine();
                }

                GC.Collect();
                Thread.Sleep(1000); //give disk hardware time to recover

                //Standard and probably most common way of reading a file.
                Console.WriteLine("Reading file assigning each line to string: ");
                start = DateTime.Now;
                using (StreamReader sr = File.OpenText(fileName))
                {
                    string s = String.Empty;
                    while ((s = sr.ReadLine()) != null)
                    {
                        TestReadingAndProcessingLinesFromFile_DoStuff(s); //to simulate work
                    }
                }
                end = DateTime.Now;
                Console.WriteLine("Finished at: " + end.ToLongTimeString());
                Console.WriteLine("Time: " + (end - start));
                Console.WriteLine();

                GC.Collect();
                Thread.Sleep(1000); //give disk hardware time to recover

                //Doing it the most common way, but using a Buffered Reader now.
                Console.WriteLine("Buffered reading file assigning each line to string: ");
                start = DateTime.Now;
                using (FileStream fs = File.Open(fileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
                using (BufferedStream bs = new BufferedStream(fs))
                using (StreamReader sr = new StreamReader(bs))
                {
                    string s;
                    while ((s = sr.ReadLine()) != null)
                    {
                        TestReadingAndProcessingLinesFromFile_DoStuff(s); //to simulate work
                    }
                }
                end = DateTime.Now;
                Console.WriteLine("Finished at: " + end.ToLongTimeString());
                Console.WriteLine("Time: " + (end - start));
                Console.WriteLine();

                GC.Collect();
                Thread.Sleep(1000); //give disk hardware time to recover

                //Reading each line using a buffered reader again, but setting the buffer size since we know what it will be.
                Console.WriteLine("Buffered reading with preset buffer size assigning each line to string: ");
                start = DateTime.Now;
                using (FileStream fs = File.Open(fileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
                using (BufferedStream bs = new BufferedStream(fs, Encoding.Unicode.GetByteCount(g)))
                using (StreamReader sr = new StreamReader(bs))
                {
                    string s;
                    while ((s = sr.ReadLine()) != null)
                    {
                        TestReadingAndProcessingLinesFromFile_DoStuff(s); //to simulate work
                    }
                }
                end = DateTime.Now;
                Console.WriteLine("Finished at: " + end.ToLongTimeString());
                Console.WriteLine("Time: " + (end - start));
                Console.WriteLine();

                GC.Collect();
                Thread.Sleep(1000); //give disk hardware time to recover

                //Read every line of the file reusing a StringBuilder object to save on string memory allocation times
                Console.WriteLine("Reading file assigning each line to StringBuilder: ");
                start = DateTime.Now;
                using (StreamReader sr = File.OpenText(fileName))
                {
                    StringBuilder sb = new StringBuilder();
                    //Appending null (end of file) leaves the length at 0, which ends the loop
                    while (sb.Append(sr.ReadLine()).Length > 0)
                    {
                        TestReadingAndProcessingLinesFromFile_DoStuff(sb.ToString()); //to simulate work
                        sb.Clear();
                    }
                }
                end = DateTime.Now;
                Console.WriteLine("Finished at: " + end.ToLongTimeString());
                Console.WriteLine("Time: " + (end - start));
                Console.WriteLine();

                GC.Collect();
                Thread.Sleep(1000); //give disk hardware time to recover

                //Reading each line into a StringBuilder, but setting the StringBuilder object to an initial
                //size since we know how long the longest line in the file is.
                Console.WriteLine("Reading file assigning each line to preset size StringBuilder: ");
                start = DateTime.Now;
                using (StreamReader sr = File.OpenText(fileName))
                {
                    StringBuilder sb = new StringBuilder(g.Length);
                    while (sb.Append(sr.ReadLine()).Length > 0)
                    {
                        TestReadingAndProcessingLinesFromFile_DoStuff(sb.ToString()); //to simulate work
                        sb.Clear();
                    }
                }
                end = DateTime.Now;
                Console.WriteLine("Finished at: " + end.ToLongTimeString());
                Console.WriteLine("Time: " + (end - start));
                Console.WriteLine();

                GC.Collect();
                Thread.Sleep(1000); //give disk hardware time to recover

                //Read each line into an array index.
                Console.WriteLine("Reading each line into string array. Process with Parallel.For: ");
                start = DateTime.Now;
                try
                {
                    AllLines = new string[MAX]; //only allocate memory here
                    using (StreamReader sr = File.OpenText(fileName))
                    {
                        int x = 0;
                        while (!sr.EndOfStream)
                        {
                            //we're just testing read speeds
                            AllLines[x] = sr.ReadLine();
                            x += 1;
                        }
                    } //CLOSE THE FILE because we are now DONE with it.

                    Parallel.For(0, AllLines.Length, x =>
                    {
                        TestReadingAndProcessingLinesFromFile_DoStuff(AllLines[x]); //to simulate work
                    });
                    end = DateTime.Now;
                    Console.WriteLine("Finished at: " + end.ToLongTimeString());
                    Console.WriteLine("Time: " + (end - start));
                    Console.WriteLine();
                }
                catch (OutOfMemoryException)
                {
                    end = DateTime.Now;
                    Console.WriteLine("Not enough memory. Couldn't perform this test.");
                    Console.WriteLine("Finished at: " + end.ToLongTimeString());
                    Console.WriteLine("Time: " + (end - start));
                    Console.WriteLine();
                }
                catch (Exception)
                {
                    end = DateTime.Now;
                    Console.WriteLine("EXCEPTION. Couldn't perform this test.");
                    Console.WriteLine("Finished at: " + end.ToLongTimeString());
                    Console.WriteLine("Time: " + (end - start));
                    Console.WriteLine();
                }
                finally
                {
                    if (AllLines != null)
                    {
                        Array.Clear(AllLines, 0, AllLines.Length);
                        AllLines = null;
                    }
                }

                GC.Collect();
                Thread.Sleep(1000);

                //Read the entire file using File.ReadAllLines.
                Console.WriteLine("Performing File ReadAllLines into array. Process with Parallel.For: ");
                start = DateTime.Now;
                try
                {
                    AllLines = new string[MAX]; //only allocate memory here
                    AllLines = File.ReadAllLines(fileName);
                    Parallel.For(0, AllLines.Length, x =>
                    {
                        TestReadingAndProcessingLinesFromFile_DoStuff(AllLines[x]); //to simulate work
                    });
                    end = DateTime.Now;
                    Console.WriteLine("Finished at: " + end.ToLongTimeString());
                    Console.WriteLine("Time: " + (end - start));
                    Console.WriteLine();
                }
                catch (OutOfMemoryException)
                {
                    end = DateTime.Now;
                    Console.WriteLine("Not enough memory. Couldn't perform this test.");
                    Console.WriteLine("Finished at: " + end.ToLongTimeString());
                    Console.WriteLine("Time: " + (end - start));
                    Console.WriteLine();
                }
                catch (Exception)
                {
                    end = DateTime.Now;
                    Console.WriteLine("EXCEPTION. Couldn't perform this test.");
                    Console.WriteLine("Finished at: " + end.ToLongTimeString());
                    Console.WriteLine("Time: " + (end - start));
                    Console.WriteLine();
                }
                finally
                {
                    if (AllLines != null)
                    {
                        Array.Clear(AllLines, 0, AllLines.Length);
                        AllLines = null;
                    }
                }

                File.Delete(fileName);
                fileName = null;
                GC.Collect();
            }

            //Just simulates doing work on a line read from an input file
            static void TestReadingAndProcessingLinesFromFile_DoStuff(string s)
            {
                //Split on the space character so each element holds one guid
                string[] sa = s.Split(' ');
                int[] ia = new int[sa.Length];
                int num = 0;
                for (int x = 0; x < sa.Length; x++)
                {
                    foreach (char c in sa[x])
                    {
                        if (int.TryParse(c.ToString(), out num))
                        {
                            //just doing some bogus mathematical calculations to simulate work
                            ia[x] = (int)((Math.Sqrt(Math.Log(num) % Math.Log10(num))) * (Math.Log(Math.Log10(num) / Math.Sqrt(num))));
                        }
                    }
                }

                //clean up
                Array.Clear(ia, 0, ia.Length);
                Array.Clear(sa, 0, sa.Length);
                ia = null;
                sa = null;
            }
        } //class
    } //namespace