Tuesday, March 24, 2009

Regular Expression in C#

Regular expressions are a language-independent feature supported by many languages, notably Perl, Java, JavaScript, and C#. Perl's support for regular expressions is so extensive that it gave its name to PCRE (Perl Compatible Regular Expressions), a widely used library that implements Perl-style patterns.

.NET follows a similar pattern syntax.

 

The Base Class Library includes a namespace (System.Text.RegularExpressions) that exposes a set of classes for working with regular expressions.

 

The following summarizes the most widely used regular expression classes in C#:

 

Use the static or instance methods of the Regex class to match or replace a pattern. A successful call to Matches returns a collection (MatchCollection) of Match objects. Within each Match object is a collection (GroupCollection) of Group objects. Each Group object within the GroupCollection represents either the entire match or a sub-match that was defined via parentheses. Within each Group object is a collection (CaptureCollection) of Capture objects. Each Capture object contains the result of a single subexpression capture.

 

I will explain each of them with examples for better understanding.

 

The Regex class provides several static methods that let you check for a match or retrieve matches without instantiating a Regex object.

 

Escape: Escapes all meta-characters within a pattern string.

Unescape: Un-escapes any escaped meta-characters within a pattern string.
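
As a minimal sketch of why Escape matters, the snippet below searches for the literal text "1+1=2", in which '+' would otherwise act as a quantifier. The class name EscapeDemo and the helper ContainsLiteral are made up for this illustration and are not part of the framework.

```csharp
using System;
using System.Text.RegularExpressions;

static class EscapeDemo
{
    // Hypothetical helper: true when 'literal' occurs verbatim in 'input',
    // even if 'literal' contains regex metacharacters.
    public static bool ContainsLiteral(string input, string literal)
    {
        string pattern = Regex.Escape(literal);
        return Regex.IsMatch(input, pattern);
    }

    static void Main()
    {
        string literal = "1+1=2";
        string escaped = Regex.Escape(literal);

        Console.WriteLine(escaped);                 // 1\+1=2
        Console.WriteLine(Regex.Unescape(escaped)); // 1+1=2

        // Unescaped, the pattern means "one or more '1', then '1=2'",
        // which does not occur in the literal text.
        Console.WriteLine(Regex.IsMatch("1+1=2", literal));   // False
        Console.WriteLine(ContainsLiteral("1+1=2", literal)); // True
    }
}
```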

 

IsMatch: A Boolean value is returned depending on whether the pattern is matched in the string or not.

Match: A Match instance is returned for the first substring that matches the pattern.

Matches: A collection of all matches is returned as a MatchCollection.

Replace: Replaces every occurrence of the pattern in the string with a replacement string.

Split: Splits the string at each occurrence of the pattern and returns an array of strings.
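
The static methods listed above can be tried out with a short sketch like the following. The input string, the class name StaticMethodsDemo and the helpers MaskDigits and SplitOnDigits are assumptions chosen only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class StaticMethodsDemo
{
    // Hypothetical helper: replaces every run of digits with "#".
    public static string MaskDigits(string input)
    {
        return Regex.Replace(input, @"\d+", "#");
    }

    // Hypothetical helper: splits the input wherever a run of digits occurs.
    public static string[] SplitOnDigits(string input)
    {
        return Regex.Split(input, @"\d+");
    }

    static void Main()
    {
        string input = "ab12cd345ef";

        // Match: only the first run of digits.
        Match first = Regex.Match(input, @"\d+");
        Console.WriteLine(first.Value + " at index " + first.Index); // 12 at index 2

        // Matches: every run of digits.
        Console.WriteLine(Regex.Matches(input, @"\d+").Count);       // 2

        // Replace: all occurrences are replaced, not just the first one.
        Console.WriteLine(MaskDigits(input));                        // ab#cd#ef

        // Split: the digits act as separators.
        Console.WriteLine(string.Join("|", SplitOnDigits(input)));   // ab|cd|ef
    }
}
```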

 

Except for Escape and Unescape, all of the above methods are also available as instance members of the Regex class. The static methods are provided to allow an isolated, single use of a regular expression without explicitly creating a Regex object.
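
For comparison, here is a rough sketch of the instance-based style, where a Regex object is created once and reused. The class name InstanceDemo, the field RunOfA and the helper CountRuns are illustrative names, not framework members.

```csharp
using System;
using System.Text.RegularExpressions;

static class InstanceDemo
{
    // The pattern is parsed once here and reused by every call below.
    static readonly Regex RunOfA = new Regex("a+");

    // Hypothetical helper: counts the runs of consecutive 'a' characters.
    public static int CountRuns(string input)
    {
        return RunOfA.Matches(input).Count;
    }

    static void Main()
    {
        string content = "123abbbabbaaa123baaaabbbbcccaaa123cccbbb123";

        Console.WriteLine(RunOfA.IsMatch(content));      // True (instance method)
        Console.WriteLine(CountRuns(content));           // 5
        Console.WriteLine(Regex.IsMatch(content, "a+")); // True (static equivalent)
    }
}
```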

 

Let’s write some sample code to see how regular expressions work in C#:

 

Sample1:

 

string content = "123abbbabbaaa123baaaabbbbcccaaa123cccbbb123";

// Match one or more occurrences of 'a'

string pattern = "a+";

if(Regex.IsMatch(content, pattern))

      Console.WriteLine("Pattern Found");

else

      Console.WriteLine("Pattern Not Found");

 

Output:  Pattern Found

 

 

Sample2:

 

string content = "123abbbabbaaa123baaaabbbbcccaaa123cccbbb123";

// Match one or more occurrences of 'a'; the ^ and $ anchors require

// the whole string to be matched

string pattern = "^a+$";

if(Regex.IsMatch(content, pattern))

      Console.WriteLine("Pattern Found");

else

      Console.WriteLine("Pattern Not Found");

 

Output:  Pattern Not Found

 

 

Sample3:

 

string content = "123abbbabbaaa123baaaabbbbcccaaa123cccbbb123";

// Match one or more digits preceded by one or more occurrences of 'a'

// and optionally followed by any number of 'b's

string pattern = @"a+(\d+)b*";

MatchCollection mc = Regex.Matches(content, pattern);

string spacer = "";

if(mc.Count > 0)

{

      Console.WriteLine("Printing matches...");

      for(int i = 0; i < mc.Count; i++)

      {

            spacer = "";

            Console.WriteLine();

            Console.WriteLine(spacer+ "Match["+i+"]: "+ mc[i].Value);                    

            Console.WriteLine(spacer+ "Printing groups for this match...");

            GroupCollection gc = mc[i].Groups;

            for(int j = 0; j < gc.Count; j++)

            {

                  spacer = " ";

                  Console.WriteLine(spacer+ "Group["+j+"]: "+ gc[j].Value);                                

                  Console.WriteLine(spacer+ "Printing captures for this group...");

                  CaptureCollection cc = gc[j].Captures;

                  for(int k = 0; k < cc.Count; k++)

                  {

                        spacer = "  ";

                        Console.WriteLine(spacer+ "Capture["+k+"]: "+ cc[k].Value);                              

                  }

            }                            

      }

}

else

{

      Console.WriteLine("Pattern Not Found");

}

 

Output:

Printing matches...

 

Match[0]: aaa123b

Printing groups for this match...

 Group[0]: aaa123b

 Printing captures for this group...

  Capture[0]: aaa123b

 Group[1]: 123

 Printing captures for this group...

  Capture[0]: 123

 

Match[1]: aaa123

Printing groups for this match...

 Group[0]: aaa123

 Printing captures for this group...

  Capture[0]: aaa123

 Group[1]: 123

 Printing captures for this group...

  Capture[0]: 123

 

 

The first two samples are simple enough to understand. In ‘Sample 1’ I am just checking whether one or more consecutive ‘a’s (i.e. at least one ‘a’) occur anywhere in the given string. Similarly, in ‘Sample 2’ I am checking whether the entire string consists of nothing but one or more ‘a’s, which is false because the given string also contains other characters.
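
The difference between the unanchored and the anchored pattern can be seen in a small sketch like this one. The test strings, the class name AnchorDemo and the helper IsOnlyA are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class AnchorDemo
{
    // Hypothetical helper: true only when the whole input is a run of 'a's.
    public static bool IsOnlyA(string input)
    {
        return Regex.IsMatch(input, "^a+$");
    }

    static void Main()
    {
        // Unanchored: a single 'a' anywhere in the string is enough.
        Console.WriteLine(Regex.IsMatch("xxaxx", "a+")); // True

        // Anchored: every character must be part of the match.
        Console.WriteLine(IsOnlyA("xxaxx"));             // False
        Console.WriteLine(IsOnlyA("aaaa"));              // True
        Console.WriteLine(IsOnlyA("a"));                 // True
    }
}
```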

 

Now let us examine the third sample. You might be wondering why .NET has so many different classes: Match, Group, and Capture. In this example the difference between them is not very evident: for each match, Match, Group[0] and Capture[0] all contain the same value.

 

Let us see what MSDN says about these:

 

Match: The Match class represents the results of a regular expression matching operation.

Group: The Group class represents the results from a single capturing group.

Capture: The Capture class contains the results from a single subexpression capture.

 

In simple words, a Match holds everything that is matched by the given pattern when a regular expression search is performed on the given string. The pattern can occur multiple times in the given string; in that case you can get the collection of all such matches by calling the Matches method provided by the Regex class.

 

A Group corresponds to a part of the pattern enclosed in ‘(’ and ‘)’. A group therefore contains the part of the matched string that was matched by the subpattern inside the parentheses. The exception is Group[0], which always contains the whole match (the same value as the Match itself).

 

A Capture is a single substring captured by a group. When a group carries a quantifier, it can capture more than once within a single match, and each of those substrings is recorded as a separate Capture. To understand it better I will slightly modify ‘Sample 3’ so that the group matches a single digit and is repeated by a quantifier.

 

Sample4:

 

string content = "123abbbabbaaa123baaaabbbbcccaaa123cccbbb123";

// Match one or more of the digits 1, 2 or 3, preceded by one or more

// occurrences of 'a' and optionally followed by any number of 'b's

string pattern = @"a+(1|2|3)+b*";

MatchCollection mc = Regex.Matches(content, pattern);

string spacer = "";

if(mc.Count > 0)

{

      Console.WriteLine("Printing matches...");

      for(int i = 0; i < mc.Count; i++)

      {

            spacer = "";

            Console.WriteLine();

            Console.WriteLine(spacer+ "Match["+i+"]: "+ mc[i].Value);                    

            Console.WriteLine(spacer+ "Printing groups for this match...");

            GroupCollection gc = mc[i].Groups;

            for(int j = 0; j < gc.Count; j++)

            {

                  spacer = " ";

                  Console.WriteLine(spacer+ "Group["+j+"]: "+ gc[j].Value);                                

                  Console.WriteLine(spacer+ "Printing captures for this group...");

                  CaptureCollection cc = gc[j].Captures;

                  for(int k = 0; k < cc.Count; k++)

                  {

                        spacer = "  ";

                        Console.WriteLine(spacer+ "Capture["+k+"]: "+ cc[k].Value);                              

                  }

            }                            

      }

}

else

{

      Console.WriteLine("Pattern Not Found");

}

 

Output:

 

Printing matches...

 

Match[0]: aaa123b

Printing groups for this match...

 Group[0]: aaa123b

 Printing captures for this group...

  Capture[0]: aaa123b

 Group[1]: 3

 Printing captures for this group...

  Capture[0]: 1

  Capture[1]: 2

  Capture[2]: 3

 

Match[1]: aaa123

Printing groups for this match...

 Group[0]: aaa123

 Printing captures for this group...

  Capture[0]: aaa123

 Group[1]: 3

 Printing captures for this group...

  Capture[0]: 1

  Capture[1]: 2

  Capture[2]: 3

 

The value of Group[1] in this example may look confusing: the group captured three times, yet its value is only the last digit. Again quoting from MSDN:

 

Because Group can capture zero, one, or more strings in a single match (using quantifiers), it contains a collection of Capture objects. Because Group inherits from Capture, the last substring captured can be accessed directly (the Group instance itself is equivalent to the last item of the collection returned by the Captures property).
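
The quoted behaviour can be checked with a short sketch that reuses the pattern from ‘Sample 4’ on the shortened input "aaa123b". The class name LastCaptureDemo and the helper GroupEqualsLastCapture are hypothetical names introduced only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class LastCaptureDemo
{
    // Hypothetical helper: true when group 1 reports the same value
    // as the last item in its Captures collection.
    public static bool GroupEqualsLastCapture(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        Group g = m.Groups[1];
        CaptureCollection cc = g.Captures;
        return m.Success && cc.Count > 0 && g.Value == cc[cc.Count - 1].Value;
    }

    static void Main()
    {
        Match m = Regex.Match("aaa123b", @"a+(1|2|3)+b*");
        Group g = m.Groups[1];

        Console.WriteLine(g.Captures.Count);     // 3
        Console.WriteLine(g.Captures[0].Value);  // 1
        Console.WriteLine(g.Value);              // 3 (the last capture)
        Console.WriteLine(g.Index);              // 5 (position of the last capture)
        Console.WriteLine(GroupEqualsLastCapture("aaa123b", @"a+(1|2|3)+b*")); // True
    }
}
```

If every captured digit is needed rather than only the last one, iterate over the Captures collection instead of reading the Group value.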