C# .Net: Fastest Way to Read Text Files
This article examines several techniques to determine the fastest way to read a text file in C# .Net.
I have seen a lot of people around the internet asking, “what’s the fastest way to read a text file?” I’ve had to write numerous applications which did this, but never gave it serious consideration until I had to write an application to read and process text files with several hundred million lines.
The Set Up:
I wrote a C# Console application to test 9 different techniques to read a text file. This isn’t an exhaustive list, but I believe it covers how it’s done most of the time.
The code is written in Visual Studio 2012 targeting .Net Framework version 4.5 x64. The source code is available at the end so you can benchmark it on your own system if you wish.
In a nutshell, the code does the following:
1) Generates a GUID
2) Creates a string object with that GUID repeated either 5, 10, or 25 times
3) Writes the string object to a local text file 4,294,967 times, 2,147,483 times, or 214,748 times.
4) It then reads the text file in using 9 techniques, identified below, clearing all the objects and doing a garbage collection after each run to make sure we start each run with fresh resources:
| # | Technique |
|---|---|
| T1 | Reading the entire file into a single string using a StreamReader ReadToEnd() method |
| T2 | Reading the entire file into a single StringBuilder object using a StreamReader ReadToEnd() method |
| T3 | Reading each line into a string using StreamReader ReadLine() method |
| T4 | Reading each line into a string using a BufferedStream |
| T5 | Reading each line into a string using a BufferedStream with a preset buffer size equal to the size of the biggest line |
| T6 | Reading each line into a StringBuilder object using StreamReader ReadLine() method |
| T7 | Reading each line into a StringBuilder object with a preset size equal to the length of the longest line |
| T8 | Reading each line into a pre-allocated string array object |
| T9 | Reading the entire file into a string array object using the .Net ReadAllLines() method |
5) The generated file is then deleted.
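As a rough illustration of the line-by-line techniques, here is a minimal sketch of T3, T4, and T6. The class and method names (`ReadTechniques`, `ReadLineByLine`, and so on) are made up for this example, and the T6 loop uses a null check rather than the exact loop from the benchmark. Note that `ReadLine()` returns a newly allocated string in every variant, so copying it into a StringBuilder cannot avoid that allocation.

```csharp
using System;
using System.IO;
using System.Text;

static class ReadTechniques
{
    // T3: read each line into a string with StreamReader.ReadLine().
    public static int ReadLineByLine(string path)
    {
        int count = 0;
        using (StreamReader sr = File.OpenText(path))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                count++; // a real application would process 'line' here
            }
        }
        return count;
    }

    // T4: the same loop, with a BufferedStream between the file and the reader.
    public static int ReadLineByLineBuffered(string path)
    {
        int count = 0;
        using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        using (BufferedStream bs = new BufferedStream(fs))
        using (StreamReader sr = new StreamReader(bs))
        {
            while (sr.ReadLine() != null)
            {
                count++;
            }
        }
        return count;
    }

    // T6: copy each line into one reused StringBuilder.
    // ReadLine() has already allocated the string before Append() runs.
    public static int ReadLineIntoStringBuilder(string path)
    {
        int count = 0;
        StringBuilder sb = new StringBuilder();
        using (StreamReader sr = File.OpenText(path))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                sb.Clear();
                sb.Append(line);
                count++;
            }
        }
        return count;
    }

    static void Main()
    {
        // Hypothetical three-line sample file, created only for this demo.
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, new[] { "first", "second", "third" });

        Console.WriteLine("T3 lines: " + ReadLineByLine(path));
        Console.WriteLine("T4 lines: " + ReadLineByLineBuffered(path));
        Console.WriteLine("T6 lines: " + ReadLineIntoStringBuilder(path));

        File.Delete(path);
    }
}
```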
The exe file was run on Windows 7 64-bit with a single, purely mechanical 7200 rpm drive, as I didn’t want the memory of a “hybrid” drive or mSata card to taint the results. The trial was run three times over the course of three days, once each day, waiting 5 minutes after the machine was up and running from a cold start. This was to eliminate any background processes starting up which might detract from the test.
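If you want to repeat the measurements yourself, a small timing helper can keep each run isolated. The sketch below is an assumption-laden alternative to the benchmark's own approach: the benchmark code subtracts two `DateTime.Now` values, whereas this helper uses `System.Diagnostics.Stopwatch`, which generally has finer resolution. The names `Bench` and `TimeSeconds` are invented for the example.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class Bench
{
    // Forces a garbage collection, then runs 'action' once
    // and returns the elapsed wall-clock time in seconds.
    public static double TimeSeconds(Action action)
    {
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        Stopwatch sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.Elapsed.TotalSeconds;
    }

    static void Main()
    {
        // Stand-in workload; replace with one of the file-reading techniques.
        double seconds = TimeSeconds(() => Thread.Sleep(50));
        Console.WriteLine("Elapsed: " + seconds.ToString("F4") + " s");
    }
}
```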
So what happens? Give us the scoop already!
Before starting, my hypothesis was that reading each line into the same StringBuilder object would excel, since no time would be spent constantly creating new string objects (since they’re immutable, a new one has to be created and reassigned with each read).
All times are indicated in seconds. The lower the number, the faster the technique performed.
Cells with a “–” character indicate the test couldn’t be performed because an “out of memory exception” was thrown. For example, apparently 16GB isn’t enough memory to read a 4,294,967 line text file with 25 Guids per line into a single string.
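One likely explanation for the “–” cells is not physical RAM but the size limit of a single .NET string. The sketch below works out how many characters each test file holds, assuming 36 characters per Guid and a two-character Windows line ending, and compares that with an approximate ceiling of about 1 billion characters (a 2 GB object at 2 bytes per char). That ceiling is an assumption used for illustration, not an exact runtime constant. Under these assumptions the configurations that exceed it are the same ones that failed in T1 and T2.

```csharp
using System;

static class StringSizeCheck
{
    const int GuidLength = 36;    // e.g. "00000000-0000-0000-0000-000000000000"
    const int NewLineLength = 2;  // "\r\n", as written by WriteLine on Windows

    // Assumption: roughly 2 GB per object at 2 bytes per char.
    public const long ApproxMaxStringChars = int.MaxValue / 2;

    // Total characters in a file of 'lines' lines with 'guidsPerLine' Guids each.
    public static long TotalChars(long lines, int guidsPerLine)
    {
        return lines * (GuidLength * guidsPerLine + NewLineLength);
    }

    static void Main()
    {
        int[] guidCounts = { 5, 10, 25 };
        long[] lineCounts = { 4294967, 2147483, 214748 };

        foreach (int guids in guidCounts)
        {
            foreach (long lines in lineCounts)
            {
                long chars = TotalChars(lines, guids);
                Console.WriteLine(
                    "{0} Guids x {1:N0} lines = {2:N0} chars, fits in one string: {3}",
                    guids, lines, chars, chars <= ApproxMaxStringChars);
            }
        }
    }
}
```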
Run #1

| Technique | 5 Guids × 4,294,967 lines | 5 Guids × 2,147,483 lines | 5 Guids × 214,748 lines | 10 Guids × 4,294,967 lines | 10 Guids × 2,147,483 lines | 10 Guids × 214,748 lines | 25 Guids × 4,294,967 lines | 25 Guids × 2,147,483 lines | 25 Guids × 214,748 lines |
|---|---|---|---|---|---|---|---|---|---|
| T1: into single string | 2.7456 | 1.5756 | 0.2652 | – | 2.8392 | 0.3120 | – | – | 0.7332 |
| T2: into single StringBuilder | 3.4476 | 1.9032 | 0.1872 | – | 3.6504 | 0.4368 | – | – | 0.9360 |
| T3: each line into a string | 2.7768 | 1.3416 | 0.1560 | 5.4912 | 2.7144 | 0.2964 | 13.9620 | 6.9576 | 0.6552 |
| T4: T3 using BufferedStream | 2.6676 | 1.3728 | 0.1716 | 5.2728 | 2.5896 | 0.2808 | 13.8060 | 6.9108 | 0.7020 |
| T5: T4 w/ preset buffer size | 2.7612 | 1.3884 | 0.1560 | 5.0076 | 2.5116 | 0.2964 | 14.0244 | 6.9264 | 0.7176 |
| T6: each line into StringBuilder | 2.9328 | 1.4508 | 0.1716 | 5.5848 | 2.7924 | 0.3432 | 14.0712 | 7.2696 | 0.7020 |
| T7: T6 w/ preset size | 2.7144 | 1.4352 | 0.1716 | 5.5692 | 2.7768 | 0.3120 | 14.1180 | 7.4412 | 0.6708 |
| T8: into preallocated string[] | 5.9748 | 2.8704 | 0.2652 | 13.6968 | 5.1792 | 0.4680 | 57.3301 | 15.9588 | 1.0608 |
| T9: File.ReadAllLines() | 5.7720 | 2.6832 | 0.3276 | 13.1352 | 5.0076 | 0.4836 | 71.9785 | 15.6936 | 1.1388 |
Run #2

| Technique | 5 Guids × 4,294,967 lines | 5 Guids × 2,147,483 lines | 5 Guids × 214,748 lines | 10 Guids × 4,294,967 lines | 10 Guids × 2,147,483 lines | 10 Guids × 214,748 lines | 25 Guids × 4,294,967 lines | 25 Guids × 2,147,483 lines | 25 Guids × 214,748 lines |
|---|---|---|---|---|---|---|---|---|---|
| T1: into single string | 2.8704 | 1.5444 | 0.1716 | – | 3.1200 | 0.2964 | – | – | 0.7332 |
| T2: into single StringBuilder | 3.4320 | 1.9656 | 0.2028 | – | 3.5568 | 0.4212 | – | – | 0.9204 |
| T3: each line into a string | 2.7612 | 1.3728 | 0.1560 | 5.4132 | 2.7768 | 0.2808 | 13.9776 | 7.1292 | 0.6864 |
| T4: T3 using BufferedStream | 2.7300 | 1.4040 | 0.1560 | 5.4444 | 2.8392 | 0.2964 | 14.0400 | 7.0668 | 0.8580 |
| T5: T4 w/ preset buffer size | 2.7144 | 1.4040 | 0.1560 | 5.3820 | 2.7768 | 0.2964 | 14.8356 | 7.4880 | 0.7176 |
| T6: each line into StringBuilder | 2.8548 | 1.5600 | 0.1716 | 5.5380 | 2.7924 | 0.2964 | 14.2272 | 7.1760 | 0.7488 |
| T7: T6 w/ preset size | 2.6832 | 1.4664 | 0.1872 | 5.5692 | 2.8548 | 0.2964 | 14.2740 | 7.2072 | 0.7020 |
| T8: into preallocated string[] | 6.1776 | 2.8860 | 0.2808 | 15.0228 | 5.7252 | 0.4680 | 58.8745 | 15.9588 | 1.2012 |
| T9: File.ReadAllLines() | 5.9904 | 2.7456 | 0.3120 | 11.5284 | 5.2884 | 0.4992 | 70.9021 | 16.3332 | 1.0608 |
Run #3

| Technique | 5 Guids × 4,294,967 lines | 5 Guids × 2,147,483 lines | 5 Guids × 214,748 lines | 10 Guids × 4,294,967 lines | 10 Guids × 2,147,483 lines | 10 Guids × 214,748 lines | 25 Guids × 4,294,967 lines | 25 Guids × 2,147,483 lines | 25 Guids × 214,748 lines |
|---|---|---|---|---|---|---|---|---|---|
| T1: into single string | 2.8548 | 1.4664 | 0.1404 | – | 2.8860 | 0.3120 | – | – | 0.7332 |
| T2: into single StringBuilder | 3.2136 | 1.8252 | 0.2340 | – | 3.3384 | 0.4056 | – | – | 0.9048 |
| T3: each line into a string | 2.6208 | 1.3572 | 0.1716 | 5.1636 | 2.5584 | 0.2964 | 13.1820 | 6.6768 | 0.6708 |
| T4: T3 using BufferedStream | 2.6364 | 1.2948 | 0.1248 | 5.2416 | 2.5896 | 0.3744 | 13.1196 | 6.6456 | 0.6864 |
| T5: T4 w/ preset buffer size | 2.6520 | 1.3104 | 0.1248 | 5.2260 | 2.5896 | 0.2964 | 14.1648 | 7.2384 | 0.7644 |
| T6: each line into StringBuilder | 2.8236 | 1.4820 | 0.1560 | 5.3508 | 2.6988 | 0.3120 | 13.4160 | 6.8484 | 0.7020 |
| T7: T6 w/ preset size | 2.7768 | 1.3884 | 0.1716 | 5.3196 | 2.6832 | 0.3120 | 13.4160 | 6.8172 | 0.7176 |
| T8: into preallocated string[] | 5.9748 | 2.7924 | 0.2652 | 13.8216 | 4.7580 | 0.4680 | 57.7513 | 15.5688 | 1.1388 |
| T9: File.ReadAllLines() | 5.7564 | 2.6676 | 0.2808 | 16.0368 | 5.0076 | 0.4863 | 70.1065 | 15.5376 | 1.0608 |
The Results:
Seeing the results, there is no clear-cut winner between techniques T3, T4, T5, T6, & T7. I have read lots of postings across the internet, especially on StackOverflow.com, with people advocating that using a buffered reader is faster. That’s not always the case according to my results. Even when it is, the difference in time is so negligible one has to ask, “Is it worth it?”
There were several more surprises for me:
1) Reading each line into a string, buffered or unbuffered, always topped the list. I was expecting reading into a StringBuilder to dominate.
2) Reading the entire file into a single string or StringBuilder object didn’t perform well at all, relatively speaking.
3) The built-in .Net File.ReadAllLines() method performed practically on par with, or better than, reading each line into a pre-allocated string array. I’m surprised by this because I thought that allocating and resizing the array as it goes along would have been costly to the underlying system. The gap between the two when reading 4,294,967 lines with 25 Guids per line, where the pre-allocated array is clearly faster, is what I would have expected to see in every test.
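One plausible reason File.ReadAllLines() keeps up with a pre-allocated array is that a growable list only resizes occasionally. The sketch below is my understanding of the general approach, not a copy of the framework source: lines are collected in a `List<string>`, whose capacity grows geometrically, and copied to an array once at the end. `GrowableRead` and `ReadAllLinesSketch` are names invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class GrowableRead
{
    // Reads every line into a growable list, then copies to an array once.
    // Because the list's capacity grows geometrically, the number of
    // resize operations stays small even for millions of lines.
    public static string[] ReadAllLinesSketch(string path)
    {
        List<string> lines = new List<string>();
        using (StreamReader sr = File.OpenText(path))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                lines.Add(line);
            }
        }
        return lines.ToArray();
    }

    static void Main()
    {
        // Hypothetical sample file, created only for this demo.
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, new[] { "one", "two", "three", "four" });

        string[] result = ReadAllLinesSketch(path);
        Console.WriteLine("Lines read: " + result.Length);

        File.Delete(path);
    }
}
```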
So what technique should you use?
On my system, unless someone spots a flaw in my test code, it really makes no significant performance difference whether you use a regular reader or buffered reader. Plus, now you have code and supporting evidence to dispute someone saying a buffered reader is always faster.
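Part of the reason may be that FileStream and StreamReader already buffer internally, so an extra BufferedStream has little left to add. If you want to experiment with buffering on your own system, both constructors accept an explicit buffer size. This is a minimal sketch; the class name `BufferSizes`, the method `CountLines`, and the buffer sizes tried are arbitrary choices for illustration.

```csharp
using System;
using System.IO;
using System.Text;

static class BufferSizes
{
    // Counts lines using explicit buffer sizes (in bytes for the FileStream,
    // in characters for the StreamReader) instead of a separate BufferedStream.
    public static int CountLines(string path, int bufferSize)
    {
        int count = 0;
        using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                              FileShare.Read, bufferSize))
        using (StreamReader sr = new StreamReader(fs, Encoding.UTF8, true, bufferSize))
        {
            while (sr.ReadLine() != null)
            {
                count++;
            }
        }
        return count;
    }

    static void Main()
    {
        // Hypothetical sample file, created only for this demo.
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, new[] { "alpha", "beta", "gamma" });

        foreach (int size in new[] { 4096, 65536 })
        {
            Console.WriteLine("Buffer " + size + ": " + CountLines(path, size) + " lines");
        }

        File.Delete(path);
    }
}
```

Wrapping each call in your own timing code would show whether buffer size matters on your hardware.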
Obviously you should test on your system before micro-optimizing this functionality for your .Net application.
Even though the two techniques of reading the entire file contents into an array were the slowest, in the end these could be the best way for your application if you have a lot of processing to do for each line. For example, if one line’s value isn’t dependent on the next, you can work on each line independently via parallel processing, such as with a Parallel.For or Parallel.ForEach loop.
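To make that concrete, here is a minimal sketch of processing an array of lines with Parallel.For. The per-line "work" is just summing line lengths, a placeholder for whatever your application does; `ParallelLines` and `TotalLength` are hypothetical names. Each worker keeps a thread-local subtotal, and the subtotals are combined with Interlocked.Add so no lock is needed inside the loop body.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

static class ParallelLines
{
    // Reads all lines into an array, then processes them in parallel.
    public static long TotalLength(string path)
    {
        string[] lines = File.ReadAllLines(path);
        long total = 0;

        Parallel.For<long>(0, lines.Length,
            () => 0L,                                      // per-thread subtotal starts at 0
            (i, state, local) => local + lines[i].Length,  // placeholder per-line work
            local => Interlocked.Add(ref total, local));   // merge subtotals safely

        return total;
    }

    static void Main()
    {
        // Hypothetical sample file, created only for this demo.
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, new[] { "ab", "cde", "f" });

        Console.WriteLine("Total characters: " + TotalLength(path));

        File.Delete(path);
    }
}
```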
Read my blog post, Using C# .Net: Fastest Way to Read and Process Text Files to see just how big of a difference implementing a Parallel.For loop can make!
The results will astound. See them at http://cc.davelozinski.com/c-sharp/the-fastest-way-to-read-and-process-text-files
Bonus Link!
For all the readers who requested it, here’s C# code to serve as a starting point for you to do your own reading lines from a text file in batches and processing in parallel! Enjoy!
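The linked code is not reproduced here. As a stand-in, the following is a minimal sketch of the general idea, not the author's implementation: read a bounded batch of lines, process that batch in parallel, then move on to the next batch so the whole file never has to sit in memory. The names `BatchProcessor`, `ProcessInBatches`, and `ProcessBatch`, and the per-line work (summing lengths), are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

static class BatchProcessor
{
    // Streams the file lazily, handing 'batchSize' lines at a time to ProcessBatch.
    public static long ProcessInBatches(string path, int batchSize)
    {
        long total = 0;
        List<string> batch = new List<string>(batchSize);

        foreach (string line in File.ReadLines(path))
        {
            batch.Add(line);
            if (batch.Count == batchSize)
            {
                total += ProcessBatch(batch);
                batch.Clear();
            }
        }

        // Handle the final, partially filled batch.
        if (batch.Count > 0)
        {
            total += ProcessBatch(batch);
        }
        return total;
    }

    // Processes one batch in parallel; the work here is a placeholder.
    static long ProcessBatch(List<string> batch)
    {
        long sum = 0;
        Parallel.ForEach<string, long>(batch,
            () => 0L,
            (line, state, local) => local + line.Length,
            local => Interlocked.Add(ref sum, local));
        return sum;
    }

    static void Main()
    {
        // Hypothetical sample file, created only for this demo.
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, new[] { "aa", "bbb", "c", "dddd", "ee" });

        Console.WriteLine("Total characters: " + ProcessInBatches(path, 2));

        File.Delete(path);
    }
}
```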
The Code:

    using System;
    using System.IO;
    using System.Linq;
    using System.Text;
    using System.Threading;

    namespace TestApplication
    {
        class Program
        {
            static void Main(string[] args)
            {
                DateTime end;
                DateTime start = DateTime.Now;

                Console.WriteLine("### Overall Start Time: " + start.ToLongTimeString());
                Console.WriteLine();

                TestReadingLinesFromFile((int)Math.Floor((double)(Int32.MaxValue / 500)), 5);
                TestReadingLinesFromFile((int)Math.Floor((double)(Int32.MaxValue / 500)), 10);
                TestReadingLinesFromFile((int)Math.Floor((double)(Int32.MaxValue / 500)), 25);
                TestReadingLinesFromFile((int)Math.Floor((double)(Int32.MaxValue / 1000)), 5);
                TestReadingLinesFromFile((int)Math.Floor((double)(Int32.MaxValue / 1000)), 10);
                TestReadingLinesFromFile((int)Math.Floor((double)(Int32.MaxValue / 1000)), 25);
                TestReadingLinesFromFile((int)Math.Floor((double)(Int32.MaxValue / 10000)), 5);
                TestReadingLinesFromFile((int)Math.Floor((double)(Int32.MaxValue / 10000)), 10);
                TestReadingLinesFromFile((int)Math.Floor((double)(Int32.MaxValue / 10000)), 25);

                end = DateTime.Now;
                Console.WriteLine();
                Console.WriteLine("### Overall End Time: " + end.ToLongTimeString());
                Console.WriteLine("### Overall Run Time: " + (end - start));

                Console.WriteLine();
                Console.WriteLine("Hit Enter to Exit");
                Console.ReadLine();
            }

            //####################################################
            //Does a comparison of reading all the lines in from a file. Which way is fastest?
            static void TestReadingLinesFromFile(int numberOfLines, int numTimesGuidRepeated)
            {
                Console.WriteLine("######## " + System.Reflection.MethodBase.GetCurrentMethod().Name);
                Console.WriteLine("######## Number of lines in file: " + numberOfLines);
                Console.WriteLine("######## Number of times Guid repeated on each line: " + numTimesGuidRepeated);
                Console.WriteLine("###########################################################");
                Console.WriteLine();

                //Guid.NewGuid() generates a random GUID (new Guid() would yield the all-zero GUID)
                string g = String.Join("", Enumerable.Repeat(Guid.NewGuid().ToString(), numTimesGuidRepeated));
                string[] AllLines = null;
                string fileName = "Performance_Test_File.txt";
                int MAX = numberOfLines;
                DateTime end;
                DateTime start = DateTime.Now;

                //Create the file populating it with GUIDs
                Console.WriteLine("Generating file: " + start.ToLongTimeString());
                using (StreamWriter sw = File.CreateText(fileName))
                {
                    for (int x = 0; x < MAX; x++)
                    {
                        sw.WriteLine(g);
                    }
                }
                end = DateTime.Now;
                Console.WriteLine("Finished at: " + end.ToLongTimeString());
                Console.WriteLine("Time: " + (end - start));
                Console.WriteLine();

                GC.Collect();
                Thread.Sleep(1000); //give disk hardware time to recover

                //Just read everything into one string
                Console.WriteLine("Reading file reading to end into string: ");
                start = DateTime.Now;
                try
                {
                    using (StreamReader sr = File.OpenText(fileName))
                    {
                        string s = sr.ReadToEnd();
                        //Obviously you'd then have to process the string
                    }
                    end = DateTime.Now;
                    Console.WriteLine("Finished at: " + end.ToLongTimeString());
                    Console.WriteLine("Time: " + (end - start));
                    Console.WriteLine();
                }
                catch (OutOfMemoryException)
                {
                    end = DateTime.Now;
                    Console.WriteLine("Not enough memory. Couldn't perform this test.");
                    Console.WriteLine("Finished at: " + end.ToLongTimeString());
                    Console.WriteLine("Time: " + (end - start));
                    Console.WriteLine();
                }
                catch (Exception)
                {
                    end = DateTime.Now;
                    Console.WriteLine("EXCEPTION. Couldn't perform this test.");
                    Console.WriteLine("Finished at: " + end.ToLongTimeString());
                    Console.WriteLine("Time: " + (end - start));
                    Console.WriteLine();
                }

                GC.Collect();
                Thread.Sleep(1000); //give disk hardware time to recover

                //Read the entire contents into a StringBuilder object
                Console.WriteLine("Reading file reading to end into stringbuilder: ");
                start = DateTime.Now;
                try
                {
                    using (StreamReader sr = File.OpenText(fileName))
                    {
                        StringBuilder sb = new StringBuilder();
                        sb.Append(sr.ReadToEnd());
                        //Obviously you'd then have to process the string
                    }
                    end = DateTime.Now;
                    Console.WriteLine("Finished at: " + end.ToLongTimeString());
                    Console.WriteLine("Time: " + (end - start));
                    Console.WriteLine();
                }
                catch (OutOfMemoryException)
                {
                    end = DateTime.Now;
                    Console.WriteLine("Not enough memory. Couldn't perform this test.");
                    Console.WriteLine("Finished at: " + end.ToLongTimeString());
                    Console.WriteLine("Time: " + (end - start));
                    Console.WriteLine();
                }
                catch (Exception)
                {
                    end = DateTime.Now;
                    Console.WriteLine("EXCEPTION. Couldn't perform this test.");
                    Console.WriteLine("Finished at: " + end.ToLongTimeString());
                    Console.WriteLine("Time: " + (end - start));
                    Console.WriteLine();
                }

                GC.Collect();
                Thread.Sleep(1000); //give disk hardware time to recover

                //Standard and probably most common way of reading a file.
                Console.WriteLine("Reading file assigning each line to string: ");
                start = DateTime.Now;
                using (StreamReader sr = File.OpenText(fileName))
                {
                    string s = String.Empty;
                    while ((s = sr.ReadLine()) != null)
                    {
                        //we're just testing read speeds
                    }
                }
                end = DateTime.Now;
                Console.WriteLine("Finished at: " + end.ToLongTimeString());
                Console.WriteLine("Time: " + (end - start));
                Console.WriteLine();

                GC.Collect();
                Thread.Sleep(1000); //give disk hardware time to recover

                //Doing it the most common way, but using a Buffered Reader now.
                Console.WriteLine("Buffered reading file assigning each line to string: ");
                start = DateTime.Now;
                using (FileStream fs = File.Open(fileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
                using (BufferedStream bs = new BufferedStream(fs))
                using (StreamReader sr = new StreamReader(bs))
                {
                    string s;
                    while ((s = sr.ReadLine()) != null)
                    {
                        //we're just testing read speeds
                    }
                }
                end = DateTime.Now;
                Console.WriteLine("Finished at: " + end.ToLongTimeString());
                Console.WriteLine("Time: " + (end - start));
                Console.WriteLine();

                GC.Collect();
                Thread.Sleep(1000); //give disk hardware time to recover

                //Reading each line using a buffered reader again, but setting the buffer size since we know what it will be.
                //Encoding.Unicode is UTF-16, so this is 2 bytes per character of the line.
                Console.WriteLine("Buffered reading with preset buffer size assigning each line to string: ");
                start = DateTime.Now;
                using (FileStream fs = File.Open(fileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
                using (BufferedStream bs = new BufferedStream(fs, Encoding.Unicode.GetByteCount(g)))
                using (StreamReader sr = new StreamReader(bs))
                {
                    string s;
                    while ((s = sr.ReadLine()) != null)
                    {
                        //we're just testing read speeds
                    }
                }
                end = DateTime.Now;
                Console.WriteLine("Finished at: " + end.ToLongTimeString());
                Console.WriteLine("Time: " + (end - start));
                Console.WriteLine();

                GC.Collect();
                Thread.Sleep(1000); //give disk hardware time to recover

                //Read every line of the file reusing a StringBuilder object to save on string memory allocation times
                //Note: this loop ends at the first empty line or at end of file; the test file has no empty lines.
                Console.WriteLine("Reading file assigning each line to StringBuilder: ");
                start = DateTime.Now;
                using (StreamReader sr = File.OpenText(fileName))
                {
                    StringBuilder sb = new StringBuilder();
                    while (sb.Append(sr.ReadLine()).Length > 0)
                    {
                        //we're just testing read speeds
                        sb.Clear();
                    }
                }
                end = DateTime.Now;
                Console.WriteLine("Finished at: " + end.ToLongTimeString());
                Console.WriteLine("Time: " + (end - start));
                Console.WriteLine();

                GC.Collect();
                Thread.Sleep(1000); //give disk hardware time to recover

                //Reading each line into a StringBuilder, but setting the StringBuilder object to an initial
                //size since we know how long the longest line in the file is.
                Console.WriteLine("Reading file assigning each line to preset size StringBuilder: ");
                start = DateTime.Now;
                using (StreamReader sr = File.OpenText(fileName))
                {
                    StringBuilder sb = new StringBuilder(g.Length);
                    while (sb.Append(sr.ReadLine()).Length > 0)
                    {
                        //we're just testing read speeds
                        sb.Clear();
                    }
                }
                end = DateTime.Now;
                Console.WriteLine("Finished at: " + end.ToLongTimeString());
                Console.WriteLine("Time: " + (end - start));
                Console.WriteLine();

                GC.Collect();
                Thread.Sleep(1000); //give disk hardware time to recover

                //Read each line into an array index.
                Console.WriteLine("Reading each line into string array: ");
                start = DateTime.Now;
                try
                {
                    AllLines = new string[MAX]; //only allocate memory here
                    using (StreamReader sr = File.OpenText(fileName))
                    {
                        int x = 0;
                        while (!sr.EndOfStream)
                        {
                            //we're just testing read speeds
                            AllLines[x] = sr.ReadLine();
                            x += 1;
                        }
                    }
                    end = DateTime.Now;
                    Console.WriteLine("Finished at: " + end.ToLongTimeString());
                    Console.WriteLine("Time: " + (end - start));
                    Console.WriteLine();
                }
                catch (OutOfMemoryException)
                {
                    end = DateTime.Now;
                    Console.WriteLine("Not enough memory. Couldn't perform this test.");
                    Console.WriteLine("Finished at: " + end.ToLongTimeString());
                    Console.WriteLine("Time: " + (end - start));
                    Console.WriteLine();
                }
                catch (Exception)
                {
                    end = DateTime.Now;
                    Console.WriteLine("EXCEPTION. Couldn't perform this test.");
                    Console.WriteLine("Finished at: " + end.ToLongTimeString());
                    Console.WriteLine("Time: " + (end - start));
                    Console.WriteLine();
                }
                finally
                {
                    if (AllLines != null)
                    {
                        Array.Clear(AllLines, 0, AllLines.Length);
                        AllLines = null;
                    }
                }

                GC.Collect();
                Thread.Sleep(1000);

                //Read the entire file using File.ReadAllLines.
                Console.WriteLine("Performing File ReadAllLines into array: ");
                start = DateTime.Now;
                try
                {
                    //Note: this pre-allocation is redundant, because ReadAllLines returns its own array,
                    //but it is kept because it was part of the timed run.
                    AllLines = new string[MAX]; //only allocate memory here
                    AllLines = File.ReadAllLines(fileName);
                    end = DateTime.Now;
                    Console.WriteLine("Finished at: " + end.ToLongTimeString());
                    Console.WriteLine("Time: " + (end - start));
                    Console.WriteLine();
                }
                catch (OutOfMemoryException)
                {
                    end = DateTime.Now;
                    Console.WriteLine("Not enough memory. Couldn't perform this test.");
                    Console.WriteLine("Finished at: " + end.ToLongTimeString());
                    Console.WriteLine("Time: " + (end - start));
                    Console.WriteLine();
                }
                catch (Exception)
                {
                    end = DateTime.Now;
                    Console.WriteLine("EXCEPTION. Couldn't perform this test.");
                    Console.WriteLine("Finished at: " + end.ToLongTimeString());
                    Console.WriteLine("Time: " + (end - start));
                    Console.WriteLine();
                }
                finally
                {
                    if (AllLines != null)
                    {
                        Array.Clear(AllLines, 0, AllLines.Length);
                        AllLines = null;
                    }
                }

                File.Delete(fileName);
                fileName = null;
                GC.Collect();
            }
        }
    }