22

C# .Net: Fastest Way to check if a string occurs within a string

Using C# .Net: Fastest Way to check if a string occurs within a string.

This will examine many techniques to determine in C# .Net: Fastest Way to check if a string occurs within a string.

 

The Background:

How many of us C# programmers have had to check if a string is contained within another string? A simple match. We don’t care how many times it may exist, we only want to know if it does.

There are numerous native C# methods for doing this: String.Contains(), String.IndexOf(), through Regex regular expressions, and similar options for those programmers obsessed with LINQ.

So that’s when this curious consultant started wondering… what is the fastest way to test and see if a string is contained within another string?

 

The Set Up:

I wrote a C# Console application to test 16 different search techniques: 8 single threaded loop; 8 multi-threaded parallel.for loop.

 

The code is written in Visual Studio 2012 targeting .Net Framework version 4.5 x64. The source code is available at the end of this blog so you can benchmark it on your own system if you wish.

 

In a nutshell, the code does the following:

1) Creates an array of random strings that will serve as the strings to be searched using the System.Web.Security.Membership.GeneratePassword method. We’ll call these “A”.

2) Creates an array of random strings that will serve as the strings being searched for using the System.Web.Security.Membership.GeneratePassword method. We’ll call these “B”.

3) Executes the techniques to see if any of the search strings occur in the strings to be searched. The techniques are as follows:

#

Technique

Code Snippet

T1

A common counting method

T2

splitting each string in A by string B and counting the results.

T3

String.Contains()

T4

String.IndexOf()

T5

Using Contains() on a LINQ query

T6

Using IndexOf() on a LINQ query

T7

Regex.IsMatch()

T8

AsParallel() with Regex

T9

T1 implemented with a Parallel.For construct

T10

T2 implemented with a Parallel.For construct

T11

T3 implemented with a Parallel.For construct

T12

T4 implemented with a Parallel.For construct

T13

T5 implemented with a Parallel.For construct

T14

T6 implemented with a Parallel.For construct

T15

T7 implemented with a Parallel.For construct

T16

T8 implemented with a Parallel.For construct

4) The sum from adding up the number of times a match is found is displayed along with the execution time

5) The arrays are cleared out and deleted being the responsible programmer I am. 🙂

 

The code assumes all tests will happen with no exception testing because I’m just wanting to test the raw speed of finding a string in a string. There may be other techniques, but I think these are the most common. However, if you have a great alternative solution, please post!

 

The exe file was run on a Windows 7 64-bit with 16 GB memory.

 

The test was run for each technique over the following number of comparisons:

  • Searching for 1 string in 5,000, 25,000, 100,000, and 1,000,000 million strings.
  • Searching for 100 strings in 5,000, 25,000, 100,000, and 1,000,000 million strings.
  • Searching for 1000 strings in 5,000, 25,000, 100,000, and 1,000,000 million strings.

 

In regards to the last two bullet points, I do realize though that in most cases, programmers are only going to test if 1 or 2 strings occur in another string. But:

1) We’re having fun aren’t we?

2) Why not?

 

Off To The Races!

Before starting, my hypothesis was that I expected the custom counting method to be the winner in both the single and multi-threaded groups since it generally does perform the best whether counting substring occurrences in strings or characters in strings. It’s amazing how fast and versatile this method is.

 

All times are indicated in hours:minutes:seconds.milliseconds format. Yes, some took over an hour! The lower the number, the faster the technique performed.

Green cells indicate the winner(s) for that run; yellow cells indicate the second runner(s) up.

 

Searching for 1 string in each of

 

5,000

25,000

100,000

1,000,000

T1: Common Counting

0.00

0.00

0.0100000

0.0690040

T2: Count Split String

0.00

0.01

0.0300001

0.2390137

T3: String.Contains()

0.00

0.00

0.0100000

0.1190068

T4: String.IndexOf()

0.00

0.00

0.0200000

0.1220070

T5: Linq.Contains()

0.00

0.00

0.0100000

0.0880050

T6: Linq.IndexOf()

0.00

0.0100001

0.0100001

0.1200069

T7: Regex.IsMatch()

0.01

0.01

0.0500000

0.5340306

T8: AsParallel() with Regex

0.1090063

0.5928010

2.4271172

28.0040902

 

T9: T1 using Parallel.For

0.04

0.00

0.00

0.0220012

T10: T2 using Parallel.For

0.00

0.01

0.0100000

0.0770044

T11: T3 using Parallel.For

0.00

0.00

0.0100001

0.0240014

T12: T4 using Parallel.For

0.00

0.00

0.00

0.0340019

T13: T5 using Parallel.For

0.00

0.00

0.0100000

0.0920053

T14: T6 using Parallel.For

0.00

0.00

0.0100000

0.1210069

T15: T7 using Parallel.For

0.00

0.02

0.0600001

0.5930339

T16: T8 using Parallel.For

9.5066852

0.9984017

4.1209639

02:40.4553675

 

Searching for 100 strings in each of

 

5,000

25,000

100,000

1,000,000

T1: Common Counting

0.0310018

00.1500003

00.6110350

00:05.9901200

T2: Count Split String

0.0890051

00.4200005

01.7090514

00:17.0874199

T3: String.Contains()

0.0400023

00.2000003

00.7900011

00:08.0801731

T4: String.IndexOf()

0.0580033

00.2800004

01.1000016

00:11.1742954

T5: Linq.Contains()

0.0400023

00.2000003

00.8000011

00:08.3932904

T6: Linq.IndexOf()

0.0560032

00.2700004

01.1200016

00:11.4832952

T7: Regex.IsMatch()

1.9920827

10.0042932

39.9819828

06:18.0251803

T8: AsParallel() with Regex

2.6716222

13.2043531

47.6943911

07:51.0616836

 

T9: T1 using Parallel.For

0.0100000

0.0300000

00.1320076

00:01.3670726

T10: T2 using Parallel.For

0.0400001

0.1700014

00.9670553

00:08.1112882

T11: T3 using Parallel.For

0.0100000

0.0610007

00.2090119

00:02.1291190

T12: T4 using Parallel.For

0.0100000

0.0720041

00.2890166

00:02.9511688

T13: T5 using Parallel.For

0.0100000

0.0580033

00.2230127

00:02.4241386

T14: T6 using Parallel.For

0.0200000

0.0830025

00.3260187

00:03.3881732

T15: T7 using Parallel.For

1.5900023

7.9991942

56.9313084

13:04.8093665

T16: T8 using Parallel.For

1:28.8255242

7:37.3664312

35.9700924

06:47.5672122

 

Searching for 1,000 strings in each of

 

5,000

25,000

100,000

1,000,000

T1: Common Counting

00.3400005

0:01.4500020

0:05.8460253

0:00:58.7804783

T2: Count Split String

00.9260396

0:04.1800059

0:16.9595644

0:02:48.1291422

T3: String.Contains()

00.4140236

0:02.0341091

0:08.0100821

0:01:20.2320263

T4: String.IndexOf()

00.5780331

0:02.8061605

0:11.0782430

0:01:51.8527094

T5: Linq.Contains()

00.4080233

0:01.9950192

0:08.1842706

0:01:23.5070696

T6: Linq.IndexOf()

00.5830334

0:02.7400039

0:11.2912949

0:01:54.8588055

T7: Regex.IsMatch()

20.3724376

1:40.5314815

6:28.7074927

1:06:19.6771386

T8: AsParallel() with Regex

16.1745050

1:21.2604800

5:29.5454050

0:56:01.1048914

 

T9: T1 using Parallel.For

00.0770011

0:00.3400194

0:01.3430768

0:00:13.4787570

T10: T2 using Parallel.For

00.3550013

0:01.7851021

0:06.7901925

0:01:12.7307796

T11: T3 using Parallel.For

00.1160039

0:00.5270302

0:02.1261188

0:00:23.4993413

T12: T4 using Parallel.For

00.1510086

0:00.7440425

0:02.9851708

0:00:33.0888926

T13: T5 using Parallel.For

00.1140065

0:00.5470291

0:02.2521288

0:00:25.5383988

T14: T6 using Parallel.For

00.1560067

0:00.7890429

0:03.1391778

0:00:35.0009991

T15: T7 using Parallel.For

18.2395857

2:57.7773771

6:24.3384817

1:45:56.4084164

T16: T8 using Parallel.For

14.9139029

1:15.3743422

4:58.9888284

0:52:50.3983364

 

The Results:

It’s quite obvious technique T1 implementing a common counting method won by running faster every single time than any of the native C# methods, Linq queries, or Regex matching. As the old saying goes, “green across the board”.

 

To my surprise, String.Contains() consistently ran second fastest, and stood out from the rest of the pack of contenders as we increased the number of strings we were searching for.

 

I’m also surprised the String.IndexOf() ran so slow in comparison. Definitely a method to avoid using except when necessary.

 

The Linq.Contains() technique isn’t a bad alternative to using String.Contains(). Personally, I’m not crazy about Linq and avoid it like the plague, but I have to give it a fair assessment in these tests.

 

Regex is a powerful tool, but as one can see from the results, if you know you’re going to perform checks 100 times or more for a simple string rather than a complicated pattern, Regex should not be used.

 

In Summary:

On my system, unless someone spots a flaw in my test code, Technique 1 with the common counting method was the winner. Any application that needs micro optimization or where speed is of the essence, nothing else should be considered. You just need to decide when to use the single-threaded or multi-threaded technique depending on your application’s requirements.

 

Here’s the basis of that code. If it returns a value > 0, the substring was found:

 

Otherwise, if you don’t mind code running ever so slightly slower and want easy code readability, C#’s native String.Contains() method is the way to go. It’s a simple, elegant, one-liner, and performs relatively well in both single and multi-threaded renditions.

Unless you need the actual starting index in the string where the match is found, then when it comes to speed, everything else shouldn’t be considered.

 

The Code:

 

David Lozinski