17

C# .Net: Fastest Way to check if a string occurs within a string

Using C# .Net: Fastest Way to check if a string occurs within a string.

This will examine many techniques to determine in C# .Net: Fastest Way to check if a string occurs within a string.

 

The Background:

How many of us C# programmers have had to check if a string is contained within another string? A simple match. We don’t care how many times it may exist, we only want to know if it does.

There are numerous native C# methods for doing this: String.Contains(), String.IndexOf(), through Regex regular expressions, and similar options for those programmers obsessed with LINQ.

So that’s when this curious consultant started wondering… what is the fastest way to test and see if a string is contained within another string?

 

The Set Up:

I wrote a C# Console application to test 16 different search techniques: 8 single threaded loop; 8 multi-threaded parallel.for loop.

 

The code is written in Visual Studio 2012 targeting .Net Framework version 4.5 x64. The source code is available at the end of this blog so you can benchmark it on your own system if you wish.

 

In a nutshell, the code does the following:

1) Creates an array of random strings that will serve as the strings to be searched using the System.Web.Security.Membership.GeneratePassword method. We’ll call these “A”.

2) Creates an array of random strings that will serve as the strings being searched for using the System.Web.Security.Membership.GeneratePassword method. We’ll call these “B”.

3) Executes the techniques to see if any of the search strings occur in the strings to be searched. The techniques are as follows:

#

Technique

Code Snippet

T1

A common counting method

T2

splitting each string in A by string B and counting the results.

T3

C#’s String.Contains() method.

T4

C#’s String.IndexOf() method

T5

Using Contains() on a LINQ query

T6

Using IndexOf() on a LINQ query

T7

C#’s Regex.IsMatch() method

T8

AsParallel() with Regex

T9

T1 implemented with a Parallel.For construct

T10

T2 implemented with a Parallel.For construct

T11

T3 implemented with a Parallel.For construct

T12

T4 implemented with a Parallel.For construct

T13

T5 implemented with a Parallel.For construct

T14

T6 implemented with a Parallel.For construct

T15

T7 implemented with a Parallel.For construct

T16

T8 implemented with a Parallel.For construct

4) The sum from adding up the number of times a match is found is displayed along with the execution time

5) The arrays are cleared out and deleted being the responsible programmer I am. 🙂

 

The code assumes all tests will happen with no exception testing because I’m just wanting to test the raw speed of finding a string in a string. There may be other techniques, but I think these are the most common. However, if you have a great alternative solution, please post!

 

The exe file was installed and run on an Alienware M17X R3 running Windows 7 64-bit with 16 GB memory on an i7-2820QM processor.

 

The test was run for each technique over the following number of comparisons:

  • Searching for 1 string in 5,000, 25,000, 100,000, and 1,000,000 million strings.
  • Searching for 100 strings in 5,000, 25,000, 100,000, and 1,000,000 million strings.
  • Searching for 1000 strings in 5,000, 25,000, 100,000, and 1,000,000 million strings.

 

In regards to the last two bullet points, I do realize though that in most cases, programmers are only going to test if 1 or 2 strings occur in another string. But:

1) We’re having fun aren’t we?

2) Why not?

 

Off To The Races!

Before starting, my hypothesis was that I expected the custom counting method to be the winner in both the single and multi-threaded groups since it generally does perform the best whether counting substring occurrences in strings or characters in strings. It’s amazing how fast and versatile this method is.

 

All times are indicated in hours:minutes:seconds.milliseconds format. Yes, some took over an hour! The lower the number, the faster the technique performed.

Green cells indicate the winner(s) for that run.

Yellow cells indicate the second runner(s) up.

 

Searching for 1 string in each of

5,000

25,000

100,000

1,000,000

T1: Common Counting

0.00

0.00

0.0100000

0.0690040

T2: Count Split String

0.00

0.01

0.0300001

0.2390137

T3: String.Contains()

0.00

0.00

0.0100000

0.1190068

T4: String.IndexOf()

0.00

0.00

0.0200000

0.1220070

T5: Linq.Contains()

0.00

0.00

0.0100000

0.0880050

T6: Linq.IndexOf()

0.00

0.0100001

0.0100001

0.1200069

T7: Regex.IsMatch()

0.01

0.01

0.0500000

0.5340306

T8: AsParallel() with Regex

0.1090063

0.5928010

2.4271172

28.0040902

T9: T1 using Parallel.For

0.04

0.00

0.00

0.0220012

T10: T2 using Parallel.For

0.00

0.01

0.0100000

0.0770044

T11: T3 using Parallel.For

0.00

0.00

0.0100001

0.0240014

T12: T4 using Parallel.For

0.00

0.00

0.00

0.0340019

T13: T5 using Parallel.For

0.00

0.00

0.0100000

0.0920053

T14: T6 using Parallel.For

0.00

0.00

0.0100000

0.1210069

T15: T7 using Parallel.For

0.00

0.02

0.0600001

0.5930339

T16: T8 using Parallel.For

9.5066852

0.9984017

4.1209639

02:40.4553675

 

Searching for 100 strings in each of

5,000

25,000

100,000

1,000,000

T1: Common Counting

0.0310018

00.1500003

00.6110350

00:05.9901200

T2: Count Split String

0.0890051

00.4200005

01.7090514

00:17.0874199

T3: String.Contains()

0.0400023

00.2000003

00.7900011

00:08.0801731

T4: String.IndexOf()

0.0580033

00.2800004

01.1000016

00:11.1742954

T5: Linq.Contains()

0.0400023

00.2000003

00.8000011

00:08.3932904

T6: Linq.IndexOf()

0.0560032

00.2700004

01.1200016

00:11.4832952

T7: Regex.IsMatch()

1.9920827

10.0042932

39.9819828

06:18.0251803

T8: AsParallel() with Regex

2.6716222

13.2043531

47.6943911

07:51.0616836

T9: T1 using Parallel.For

0.0100000

0.0300000

00.1320076

00:01.3670726

T10: T2 using Parallel.For

0.0400001

0.1700014

00.9670553

00:08.1112882

T11: T3 using Parallel.For

0.0100000

0.0610007

00.2090119

00:02.1291190

T12: T4 using Parallel.For

0.0100000

0.0720041

00.2890166

00:02.9511688

T13: T5 using Parallel.For

0.0100000

0.0580033

00.2230127

00:02.4241386

T14: T6 using Parallel.For

0.0200000

0.0830025

00.3260187

00:03.3881732

T15: T7 using Parallel.For

1.5900023

7.9991942

56.9313084

13:04.8093665

T16: T8 using Parallel.For

1:28.8255242

7:37.3664312

35.9700924

06:47.5672122

 

Searching for 1,000 strings in each of

5,000

25,000

100,000

1,000,000

T1: Common Counting

00.3400005

0:01.4500020

0:05.8460253

0:00:58.7804783

T2: Count Split String

00.9260396

0:04.1800059

0:16.9595644

0:02:48.1291422

T3: String.Contains()

00.4140236

0:02.0341091

0:08.0100821

0:01:20.2320263

T4: String.IndexOf()

00.5780331

0:02.8061605

0:11.0782430

0:01:51.8527094

T5: Linq.Contains()

00.4080233

0:01.9950192

0:08.1842706

0:01:23.5070696

T6: Linq.IndexOf()

00.5830334

0:02.7400039

0:11.2912949

0:01:54.8588055

T7: Regex.IsMatch()

20.3724376

1:40.5314815

6:28.7074927

1:06:19.6771386

T8: AsParallel() with Regex

16.1745050

1:21.2604800

5:29.5454050

0:56:01.1048914

T9: T1 using Parallel.For

00.0770011

0:00.3400194

0:01.3430768

0:00:13.4787570

T10: T2 using Parallel.For

00.3550013

0:01.7851021

0:06.7901925

0:01:12.7307796

T11: T3 using Parallel.For

00.1160039

0:00.5270302

0:02.1261188

0:00:23.4993413

T12: T4 using Parallel.For

00.1510086

0:00.7440425

0:02.9851708

0:00:33.0888926

T13: T5 using Parallel.For

00.1140065

0:00.5470291

0:02.2521288

0:00:25.5383988

T14: T6 using Parallel.For

00.1560067

0:00.7890429

0:03.1391778

0:00:35.0009991

T15: T7 using Parallel.For

18.2395857

2:57.7773771

6:24.3384817

1:45:56.4084164

T16: T8 using Parallel.For

14.9139029

1:15.3743422

4:58.9888284

0:52:50.3983364

 

The Results:

It’s quite obvious technique T1 implementing a common counting method won by running faster every single time than any of the native C# methods, Linq queries, or Regex matching. As the old saying goes, “green across the board”.

 

To my surprise, String.Contains() consistently ran second fastest, and stood out from the rest of the pack of contenders as we increased the number of strings we were searching for.

 

I’m also surprised the String.IndexOf() ran so slow in comparison. Definitely a method to avoid using except when necessary.

 

The Linq.Contains() technique isn’t a bad alternative to using String.Contains(). Personally, I’m not crazy about Linq and avoid it like the plague, but I have to give it a fair assessment in these tests.

 

Regex is a powerful tool, but as one can see from the results, if you know you’re going to perform checks 100 times or more for a simple string rather than a complicated pattern, Regex should not be used.

 

In Summary:

On my system, unless someone spots a flaw in my test code, Technique 1 with the common counting method was the winner. Any application that needs micro optimization or where speed is of the essence, nothing else should be considered. You just need to decide when to use the single-threaded or multi-threaded technique depending on your application’s requirements.

 

Here’s the basis of that code. If it returns a value > 0, the substring was found:

 

Otherwise, if you don’t mind code running ever so slightly slower and want easy code readability, C#’s native String.Contains() method is the way to go. It’s a simple, elegant, one-liner, and performs relatively well in both single and multi-threaded renditions.

Unless you need the actual starting index in the string where the match is found, then when it comes to speed, everything else shouldn’t be considered.

 

The Code:

 

(Visited 2,146 times, 2 visits today)
  • Some Guy Somewhere

    Did you try IndexOf() with StringComparer.Ordinal?

  • maxkoryukov

    WHAAAT???? It is a nonsense!

    The first note: the program source contains errors (mismatching quotes).

    Go next.

    I’ve taken the best method *T1*, and tried to use it in other environment. Do you know what? It works worse, than default string.Contains

    Let’s try to analyze:
    1. Contains. Go through the string, and compare char by char.
    2. CUSTOM (T1): measure length (for both strings), create a new string (calling Replace), and then subtract and divide integers.

    How, tell me, how the second could work faster than the first one?? If you still have doubts, think, how the Replace works. It should traverse the string and SEARCH THE SUBSTRING, and generate new string without substring (COPY chars), plus additional actions in our code..

    I don’t know, who has wrote this article, and I don’t know why this article is the first search result in google. But I would like to give an advice to author:

    Results of my test-run app are here: https://ci.appveyor.com/project/maxkoryukov/contentgrabber-addon/build/0.2.1.17-master#L67

    Sources: https://github.com/maxkoryukov/ContentGrabber.Addon/blob/master/TestSubstrings/Program.cs

    PS: also, you could think about this: https://blog.codinghorror.com/the-sad-tragedy-of-micro-optimization-theater/

    • StackOverflow41

      https://uploads.disquscdn.com/images/525bcef957163bfa6353ed2b5ed5327d2bee65944f408c6d7979b8e5be919680.jpg

      Hmmmm… I took your code using Visual Studio 2015 on a Windows 10 Pro machine, compiled for X64 with the option “optimize code” ticked, and the screen capture shows my results. The “custom method” ran faster on my machine too, but not by as much a difference as the author experienced.

      A point of the article is the code for the various techniques were provided, and in his/her environment (which they stated) at the time, that one technique proved to be the fastest.

      I don’t doubt your results or the author’s results. The author’s article is 3.5 years old, and who knows what underlying architecture/optimizations have been changed or tweaked in .Net, Windows, or anything else that would affect the results.

      Since the source is supplied, people can try every method to see which works best in their environment and adapt accordingly.

  • Manuel Afonso

    Hello. Thank you very much for sharing!
    Maybe we could consider adding StringComparison.Ordinal when calling IndexOf.
    In my current project, that made a big difference.

    • FireMystdl

      Thanks for the suggestion! I’ll leave it with others to experiment with in hopes it helps them too. 🙂

  • James Dartnell

    I know I am just digging this up from the past, but I am curious why the in the text code the Regex test was done without it being compiled? Surely that would provide a marked improvment in speed of execution

    • FireMystdl

      Not as much as you might think. In all the tests I’ve been running, Regex, compiled or otherwise, always comes out slow. Code is there. Try it. I predict you might see a minor speed improvement relatively speaking, but nothing significant enough to knock the other methods from their top spots. But let us know how you go. 🙂

      • James Dartnell

        I’m really surprised about that considering how many people always sing its praises, although the other methods are impressively fast! I’m not doubting the testing at all btw, just asking if someone had tried it already haha 🙂

        • FireMystdl

          I know. Not taken personally or anything. 🙂

          Don’t get me wrong about Regex. It’s great and definitely has its useful purposes. However, micro-optimization for speed isn’t one of them.

          What I would be curious though is how fast the .Net Regex is vs Perl’s regular expressions implementation. That would be a cool comparison to see who has done it faster. 🙂

  • Jg

    contains versus index of (native) results are quite weird knowing the fact that the index of method is used within the contains().

  • jeff

    can someone please explain why it is necessary to divide by the search string length? Seems unnecessary to me…

  • Jason Coyne

    Is there any method for using T1 in a case insensitive manner?

    • Dave

      Yes. Use the ToLower() or ToUpper() methods to make both strings the same.

      Try this:
      (ss.Length – ss.ToLower().Replace(sf.ToLower(), String.Empty).Length) / sf.Length

      or convert your strings to all lower or upper case before calling Technique T1.

      Obviously doing this may affect the execution time, but hopefully not too much. I’ll leave that as an exercise to a reader to benchmark. 🙂

  • I suspect that you may get a slight performance gain by modifying your T1 method to use foreach() instead of for() in both loops. Since foreach() is a read only iteration, it usually performs slightly better than for(). The following test case showed slight performance gains over T1 for all test cases > 5000:

    int i = 0;
    foreach(string SS in ss)
    {
    foreach (string SF in sf)
    {
    c[i] += ((SS.Length – SS.Replace(SF, String.Empty).Length) / SF.Length > 0 ? 1 : 0);
    i++;
    }
    i = 0;
    }

  • seo

    Your style is so unique in comparison to other folks I’ve read stuff from.

    Thanks for posting when you’ve got the opportunity, Guess I’ll
    just book mark this blog.

    • Dave

      Thank you! I’m glad it helps.

  • Good analysis. Though typically for timed tests you will want to use the Stopwatch class versus DateTime.Now. Though it’s less important when measuring events that take minutes or hours. Also, in most real world situations we are searching for strings in human text (words and sentences), not password generated random strings. With human text, there will be many more possible matches and “soft matches” where the letters match but not the case. Though it would certainly affect the overall execution times, I’m not sure it would change the relative rankings from your test. Finally, your advice on Regex is correct. Due to the extensive overhead it’s best to use Regex only when you need flexible or advanced pattern matching. Thanks for sharing!