0

C# .Net: Fastest Way to count the number of times a character occurs in a string

The Fastest Way to count the number of times a character occurs in a string using C# .Net:

This test stemmed from my other test on the fastest way to count the number of times a string occurs within another string (http://cc.davelozinski.com/c-sharp/fastest-way-to-check-if-a-string-occurs-within-a-string). I couldn’t find any native C# methods to do this although other languages like PHP have such methods (count_chars).

Again, I initially implemented the custom counting method that I’ve used numerous times. Then I tried using a foreach construct to loop over the string to count the occurrences.

So then this curious consultant started wondering… are there other techniques to count how many times a character occurs in a string? If so, what is the fastest way to count the number of times a character occurs in a string using C# .Net?

 

The Set Up:

I wrote a C# Console application to test several different techniques: some single-threaded; some using parallel processing.

The code is written in Visual Studio 2012 targeting .Net Framework version 4.5 x64. The source code is available at the end of this blog so you can benchmark it on your own system if you wish.

In a nutshell, the code does the following:

1) Creates an array of random strings that will serve as the strings to be searched using the System.Web.Security.Membership.GeneratePassword method. We’ll call these “A”.

2) Creates an array of random characters that will be searched for. These were generated using the System.Web.Security.Membership.GeneratePassword method with a length of 1.

3) Creates an array of random 1-char length strings from the character array generated. This is because some techniques only accept strings or string arrays as parameters to methods.

4) Executes the various techniques counting how many times the current character occurs in the strings to be searched. The techniques are as follows:

#

Technique

Code Snippet (ss = SearchStrings array; ch = array of chars to search for)

T1

A common counting method

T2

Splitting each string on the character and counting the results.

T3

using Linq.Count()

T4

using a foreach construct to loop over the string and count the number of occurrences.

T5

same as T4 except using a straight for-loop.

T6

Using Regex.Matches().Count

T7

Regex.Matches() with AsParallel()

T8

T1 with a Parallel.For loop implemented

T9

T2 with a Parallel.For loop implemented

T10

T3 with a Parallel.For loop implemented

T11

Linq.AsParallel().Sum().Count()

T12

T4 with an outer Parallel.For loop

T13

T5 with a Parallel.For loop implemented

T14

T6 with a Parallel.For loop implemented

T15

T7 with a Parallel.For loop implemented

 

5) The sum from adding up the number of times a match is found is displayed along with the execution time

6) The arrays are cleared out and deleted being the responsible programmer I am. 🙂

 

The code assumes all tests will happen with no exception testing because I’m just wanting to test the raw speed of counting characters in a string. There may be other techniques, but I couldn’t find or think of any. However, if you have a great alternative solution, please post!

 

The exe file was installed and run on an Alienware M17X R3 running Windows 7 64-bit with 16 GB memory on an i7-2820QM processor.

The test was run for each technique over the following number of comparisons:

  • Counting the number of times 1 character occurs in 5,000, 25,000, 100,000, and 1,000,000 million strings.
  • Counting the number of times 100 different characters occur in 5,000, 25,000, 100,000, and 1,000,000 million strings.
  • Counting the number of times 1,000 different characters occur in 5,000, 25,000, 100,000, and 1,000,000 million strings.

 

For the usual disclaimer, I’m guessing that in most real world applications programmers are only going to count how many times less than a dozen or so difference characters occur in a string. But:

1) We’re having fun aren’t we?

2) I’d rather see time differences that are in seconds and minutes as unless I’m measuring how far electricity travels in a nanosecond, most of the time speeds with differences in nanoseconds don’t concern me.

 

The Runs:

I expected the custom counting method to be the winner both among the single threaded implementations and parallel implementations since it generally does perform the best when counting substring occurrences in strings. It’s amazing how fast and versatile this method is. I think the foreach loop will come second since loops are generally fast.

Amongst the parallel implementations, my guesses are the same.

Let’s see what happened on my machine.

All times are indicated in minutes:seconds.milliseconds format. Lower numbers are better because they indicate faster performance.

Winners in each group are highlighted in green; second runners up are in yellow.

Counting the number of occurrences of 1 character in each of

5,000

25,000

100,000

1,000,000

T1: Common Counting

0:00

0:00

0:00.0090005

0:00.1000058

T2: Count Split String on Char

0:00

0:00.0100000

0:00.0220013

0:00.1750100

T3: Linq.Count() method

0:00

0:00.0100000

0:00.0230013

0:00.2250128

T4: Using a Foreach loop

0:00

0:00

0:00.0030002

0:00.0340020

T5: Using a straight For loop

0:00

0:00

0:00.0030002

0:00.0310018

T6: Regex.Matches().Count

0:00.0100000

0:00.0200000

0:00.0700040

0:00.6850391

T7: Regex.Matches() w/ AsParallel()

0:00.1300002

0:00.6640034

0:02.6831535

0:32.1327834

————————————

—————————————————————————————————-

T8: T1 using Parallel.For

0:00.0200000

0:00

0:00.0020001

0:00.0200000

T9: T2 using Parallel.For

0:00

0:00.0040002

0:00.0070004

0:00.1100002

T10: T3 using Parallel.For

0:00

0:00.0020001

0:00.0090005

0:00.1100001

T11: Linq.AsParallel().Sum().Count()

0:00.0200001

0:00.0060004

0:00.0230013

0:00.2300003

T12: T4 w/ outer Parallel.For

0:00

0:00

0:00.0010001

0:00

T13: T5 w/ a Parallel.For loop

0:00

0:00

0:00

0:00.0100000

T14: T6 w/ a Parallel.For loop

0:00.0100000

0:00.0150008

0:00.1200068

0:02.8171556

T15: T7 w/ a Parallel.For loop

0:04.0700057

0:00.8620494

0:06.0600576

2:57.4024249

Counting the number of occurrences of 100 characters in each of

5,000

25,000

100,000

1,000,000

T1: Common Counting

0:00.0400001

0:00.2080119

0:00.8080462

0:07.7300811

T2: Count Split String on Char

0:00.0700001

0:00.3800218

0:01.4240815

0:12.2523097

T3: Linq.Count() method

0:00.0900001

0:00.5360306

0:02.0390653

0:19.8895015

T4: Using a Foreach loop

0:00.0100000

0:00.0680039

0:00.2600004

0:02.5700036

T5: Using a straight For loop

0:00.0100001

0:00.0660038

0:00.2500003

0:02.5811454

T6: Regex.Matches().Count

0:01.2100016

0:05.9241932

0:23.8045932

3:25.6541196

T7: Regex.Matches() w/ AsParallel()

0:02.1771162

0:10.0142949

0:38.9638934

6:10.5981013

————————————

—————————————————————————————————-

T8: T1 using Parallel.For

0:00.0130007

0:00.0600001

0:00.2000003

0:01.9971143

T9: T2 using Parallel.For

0:00.0650038

0:00.5300008

0:00.5300007

0:04.5300314

T10: T3 using Parallel.For

0:00.0300017

0:00.1600002

0:00.5500008

0:05.8501600

T11: Linq.AsParallel().Sum().Count()

0:00.0360020

0:00.1800002

0:00.6500009

0:06.8791418

T12: T4 w/ outer Parallel.For

0:00.0020002

0:00.0100001

0:00.0400001

0:00.4790246

T13: T5 w/ a Parallel.For loop

0:00.0020001

0:00.0100000

0:00.0490022

0:00.4550260

T14: T6 w/ a Parallel.For loop

0:50.5456623

4:09.9217926

0:28.2587944

6:44.9385592

T15: T7 w/ a Parallel.For loop

1:09.6211301

5:28.3786036

0:47.7781771

17:45.0864363

Counting the number of occurrences of 1,000 characters in each of

5,000

25,000

100,000

1,000,000

T1: Common Counting

0:00.3910224

0:01.8300026

0:07.4012889

01:15.1327787

T2: Count Split String on Char

0:00.6170353

0:02.9400041

0:11.8072959

01:58.7759565

T3: Linq.Count() method

0:00.9880370

0:05.0552439

0:19.7883562

03:18.8558825

T4: Using a Foreach loop

0:00.1300002

0:00.6400366

0:02.5641467

00:25.1036541

T5: Using a straight For loop

0:00.1200001

0:00.6050073

0:02.4770898

00:24.5996253

T6: Regex.Matches().Count

0:10.5512464

0:51.8051882

3:35.6503236

35:01.7138958

T7: Regex.Matches() w/ AsParallel()

0:11.7842829

0:57.4234783

3:57.3748046

39:16.2812479

————————————

—————————————————————————————————-

T8: T1 using Parallel.For

0:00.1060061

0:00.4700006

0:01.9391109

00:21.1805983

T9: T2 using Parallel.For

0:00.4290245

0:00.9800014

0:04.4810236

00:42.1928995

T10: T3 using Parallel.For

0:00.2800160

0:01.3200019

0:05.7781553

01:02.7566210

T11: Linq.AsParallel().Sum().Count()

0:00.3000166

0:01.5300021

0:06.6711455

01:07.7516639

T12: T4 w/ outer Parallel.For

0:00.0300001

0:00.1130037

0:00.4510231

00:04.5612581

T13: T5 w/ a Parallel.For loop

0:00.0200000

0:00.1020058

0:00.4360249

00:04.4922570

T14: T6 w/ a Parallel.For loop

0:27.4546028

1:10.5877826

5:35.2173197

48:41.3312784

T15: T7 w/ a Parallel.For loop

0:21.8845905

0:47.9591842

3:18.2509539

35:10.9285042

 

The Results:

Well was I ever surprised seeing just how fast the for-loop was in relation to all the other single threaded methods. It even beat out my favorite custom technique T1. Both the for and foreach loops continuously completed in 1/3 the time as the next fastest technique, T1.

On the multi-threaded parallel side, the Parallel.For loop was the clear winner, completing in at least 1/4 the time (sometimes in 1/5 the time) as the next best besides the Parallel.ForEach loop.

To those people who like to debate whether the For or ForEach is faster, my results clearly show that as the number of characters we’re searching for increases, the For loop continuously outperforms the ForEach version (albeit not by much).

The overall winner is obviously one of the For-loop implementations.

Using the AsParallel() method didn’t do squat. Who in Microsoft wrote the underlying code for this? It’s slow compared to everything else!

Linq has once again shown just how slow it is. Why do people use it?

What boggles my mind is why does the Parallel.For implementation of the Regex.Matches().Count technique take twice as long as the single threaded version?! Can any of my readers explain this to me please?

 

In Summary:

On my system, unless someone spots a flaw in my test code, in the single threaded category the for-loop was the clear winner. There’s no reason why in any application the for-loop technique shouldn’t be used. Nothing else comes close for pure speed except the ForEach loops.

When it comes to parallel processing, the Parallel.For loop won every time, suffering for second place twice.

So if you’re wanting pure speed, the Parallel.For technique should be used when searching 25,000 or more strings.

Otherwise, the other single-threaded techniques should hold up rather well and lend for easy code readability.

Just don’t use Regex, Linq, and/or AsParallel(). 🙂

 

The Code: