6

C#.Net: The Fastest Way to count substring occurrences in a string

Spread the love

C#.Net: The Fastest Way to count substring occurrences in a string

This will examine many techniques to determine in C# .Net: the Fastest Way to count substring occurrences in a string.

 

Background:

This test stemmed from a project where I had to do a lot of substring searches. I couldn’t use some simple built-in methods like String.Contains() or IndexOf() straight out of the box because they only test to see if a substring exists, not how many times a string occurs in another string.

Initially I fell back on the good old tried, tested, and proven custom counting method that I’ve used numerous times. Also, if you followed my Fastest Way to check if a string occurs within a string benchmarks:

 

So that’s when this curious consultant started wondering… are there other techniques to count how many times a substring occurs in a string? If so, what is the fastest way to count substring occurrences within another string?

 

The Set Up:

A C# Console application was written to test several counting techniques: 5 single-threaded; 5 multi-threaded.

The code is written in Visual Studio 2012 targeting .Net Framework version 4.5 x64. The source code is available at the end of this blog so you can benchmark it on your own system if you wish.

In a nutshell, the code does the following:

  1. Creates an array of random strings that will serve as the strings to be searched using the System.Web.Security.Membership.GeneratePassword method. We’ll call these “A”.
  2. Creates an array of random strings that will serve as the strings being searched for using the System.Web.Security.Membership.GeneratePassword method. We’ll call these “B”.
  3. Executes the techniques to see if any of the search strings occur in the strings to be searched. The techniques are as follows where “ss” is the array of “Strings to Search” and “sf” is the array of strings to “Search For”:

 

#

Technique

Code Snippet

T1

Common Counting

T2

Code from DotNetPerls

T3

String.Split; count results.

T4

C# Regex

T5

C# Regex w/ AsParallel()

T6

T1 implemented with a multi-threaded Parallel.For loop

T7

T2 implemented with a multi-threaded Parallel.For loop

T8

T3 implemented with a multi-threaded Parallel.For loop

T9

T4 implemented with a multi-threaded Parallel.For loop

T10

T5 implemented with a multi-threaded Parallel.For loop

  1. The sum from adding up the number of times a match is found is displayed along with the execution time
  2. The arrays are cleared out and deleted being the responsible programmer I am. 🙂

 

The code assumes all tests will happen with no exception testing because I’m just wanting to test the raw speed of finding substrings in a string. There may be other techniques, but I couldn’t find or think of any. However, if you have a great alternative solution, please post!

It also assumes that substrings do not overlap.

The exe file was installed and run on an Alienware M17X R3 running Windows 7 64-bit with 16 GB memory on an i7-2820QM processor.
The test was run for each technique over the following number of comparisons:

  • Counting the number of times 1 string occurs in 5,000, 25,000, 100,000, and 1,000,000 million strings.
  • Counting the number of times 100 strings occur in 5,000, 25,000, 100,000, and 1,000,000 million strings.
  • Counting the number of times 1,000 strings occur in 5,000, 25,000, 100,000, and 1,000,000 million strings.

 

Yes in most cases, programmers are only going to test if 1 or 2 strings occur in another string. But:

  1.  We’re having fun aren’t we?
  2.  Since the project I worked on had several hundred thousand strings that needed to be searched, I was definitely interested in how all the techniques handled larger quantities.

 

The Runs:

Before starting, my hypothesis was that I expected the variations of the custom counting method to be the winner since it generally does perform the best whether counting substring occurrences in strings or characters in strings. It’s amazing how fast and versatile this method is.

Let’s see what happened on my machine.

All times are indicated in hours:minutes:seconds.milliseconds format. Yes, some took over an hour! Lower numbers are better because they indicate faster performance.

 

Winners are highlighted in green.

 

Counting the number of occurrences of 1 string in each of

Run #1:

5,000

25,000

100,000

1,000,000

T1: Common Counting

0.00

0.0010000

0.0070004

00.0710040

T2: Code from DotNetPerls

0.0010000

0.00

0.0130008

00.1330082

T3: Split String Count

0.0020001

0.0070004

0.0270017

00.2520133

T4: Regex.Matches.Count

0.0040003

0.0170010

0.0580033

00.5440323

T5: AsParallel w/ Regex

0.1570085

0.5600008

2.4421396

29.8928806

 

T6: T1 using Parallel.For

0.0190010

0.0010001

0.0020001

0.0210012

T7: T2 using Parallel.For

0.0020002

0.0010000

0.0040003

0.0380021

T8: T3 using Parallel.For

0.0030001

0.0100001

0.0140008

0.1800048

T9: T4 using Parallel.For

0.0060004

0.0200010

0.1020036

1.7210413

T10: T5 using Parallel.For

5.5300077

0.8320046

5.3341221

2:36.9828740

 

Counting the number of occurrences of 100 strings in each of

Run #2:

5,000

25,000

100,000

1,000,000

T1: Common Counting

0.0310018

00.1570090

00.6170353

0:06.0542581

T2: Code from DotNetPerls

0.0580033

00.2930167

01.1640666

0:11.3702949

T3: Split String Count

0.0890051

00.4480256

02.1521002

0:16.9683032

T4: Regex.Matches.Count

2.0531174

10.3155901

43.2109377

6:30.6880561

T5: AsParallel w/ Regex

2.7730793

14.1273355

51.6757052

8:12.0271304

 

T6: T1 using Parallel.For

0.0070004

00.0380021

00.1420059

00:01.5230843

T7: T2 using Parallel.For

0.0180010

00.0750049

00.3050192

00:03.3811934

T8: T3 using Parallel.For

0.1030022

00.6790102

00.7300389

00:07.3602943

T9: T4 using Parallel.For

1:25.7861022

8:01.4317111

42.4874429

10:36.6637675

T10: T5 using Parallel.For

1:46.1456623

9:01.4171999

41.8417822

27:55.6335064

 

Counting the number of occurrences of 1,000 strings in each of

Run #3:

5,000

25,000

100,000

1,000,000

T1: Common Counting

00.2900004

0:01.4860621

00:05.8750850

0:00:58.7494732

T2: Code from DotNetPerls

00.5600007

0:02.7800039

00:11.3132955

0:01:53.0386768

T3: Split String Count

00.8400012

0:04.2170303

00:16.8953990

0:02:50.4861649

T4: Regex.Matches.Count

20.5655871

1:47.1916370

06:46.6989794

1:05:35.8084806

T5: AsParallel w/ Regex

16.6234057

1:24.2900854

05:36.0162879

0:56:57.8961879

 

T6: T1 using Parallel.For

00.0810018

0:00.3720185

00:01.5030859

0:00:15.1558518

T7: T2 using Parallel.For

00.1710098

0:00.8580491

00:03.3751931

0:00:33.7619311

T8: T3 using Parallel.For

00.3880016

0:01.8071199

00:08.6232645

0:02:03.7490701

T9: T4 using Parallel.For

20.0905242

1:40.5890094

07:46.7686951

1:12:04.7092390

T10: T5 using Parallel.For

15.5083999

1:21.5940674

05:02.5684600

0:52:34.3056667

 

The Results:

The technique T1 implementing a common counting method won by running faster every single time. In more than half the tests it completed roughly twice as fast as the next fastest technique, T2 from DotNetPerls. As the old saying goes, “green across the board”.

T1, T2, & their multi-threaded equivalents dominated.

The real surprise was how the single-threaded T1, in most cases, ran faster than half the multi-threaded parallel techniques even at the higher string counts!

Regex is a powerful tool, but as one can see from the results, if you know you’re going to perform counts 100 times or more for a simple string rather than a complicated pattern, Regex should not be used.

The other techniques were respectable: not the greatest, not the worst. They only seriously started falling behind when counting 100 or more substrings in more than 1000 strings, which as I stated probably doesn’t happen too often in real world programming applications.

 

In Summary:

On my system, unless someone spots a flaw in my test code, Technique 1 with the common counting method was the single-threaded winner. Implementing the parallel version was the multi-threaded winner. Any application that needs micro optimization or where speed is of the essence, nothing else should be considered.

Otherwise, there are two other techniques that can be utilized. It really comes down to personal taste and how much one cares about code readability versus speed.

If you’re only going to be counting a few substrings within a small number of search strings, you can consider Regex as a readable alternative. Again it’s that speed versus readability trade-off. If you’re going to count more than just a few substrings occurrences, either don’t use Regex or consider taking a long lunch break while waiting for the results.

 

The Code:

 


Spread the love

David Lozinski