Tuesday, March 24, 2009

How to Write a Regular Expression


What is a Regular Expression?
A regular expression is a pattern that describes a set of text strings. One of its most common uses is validating input.

Where and when to use Regular Expressions?
They can be used in any programming language that has a built-in regular expression class or supports a third-party regular expression library.

Regular expressions can validate different types of data without bloating the code with if and case conditions. A whole series of if conditions can often be replaced by a single regular expression check.
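As a rough illustration of that trade-off, the sketch below checks the same rule (digits only) once with a hand-written loop and once with a regular expression. It is written as C# top-level statements (C# 9 or later, an assumption about your toolchain), and the helper names IsDigitsManual and IsDigitsRegex are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hand-written check: every character must be one of the digits 0 to 9.
bool IsDigitsManual(string input)
{
    foreach (char c in input)
    {
        if (c < '0' || c > '9')
        {
            return false;
        }
    }
    return true;
}

// The same rule expressed as a single regular expression.
bool IsDigitsRegex(string input)
{
    return Regex.IsMatch(input, "^[0-9]*$");
}

// Compare both checks on a few illustrative samples.
foreach (string sample in new[] { "12345", "12a45", "" })
{
    Console.WriteLine(
        $"'{sample}': manual={IsDigitsManual(sample)}, regex={IsDigitsRegex(sample)}");
}
```

Both helpers agree on these samples. If the rule changes later, only the pattern string in IsDigitsRegex has to change.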

Benefits of Regular Expressions:
The following are some of the benefits of using regular expressions (the list is not exhaustive).
a) The number of lines of code can be reduced.
b) Faster coding.
c) Easy maintenance (when the validation criteria change, you only need to change the regular expression string, not the surrounding code).
d) Easy to understand (you don’t need to follow the programmer’s logic through large if and case statements).

Elements of Regular Expression:
Here are the basic elements (special characters and literals) from which larger regular expressions are built:

^ ----> Start of a string.
$ ----> End of a string.
. ----> Any character (except \n newline).
{...} ----> Explicit quantifier notation.
[...] ----> Explicit set of characters to match.
(...) ----> Logical grouping of part of an expression.
* ----> 0 or more of the previous expression.
+ ----> 1 or more of the previous expression.
? ----> 0 or 1 of the previous expression; also forces minimal matching when an expression might match several strings within a search string.
\ ----> Preceding one of the above, it makes it a literal instead of a special character. Preceding certain other characters, it forms a special matching character; see below.
\w ----> Matches any word character; for ASCII text, equivalent to [a-zA-Z0-9_].
\W ----> Matches any non-word character; for ASCII text, equivalent to [^a-zA-Z0-9_].
\s ----> Matches any white-space character, equivalent to [ \f\n\r\t\v].
\S ----> Matches any non-white-space character, equivalent to [^ \f\n\r\t\v].
\d ----> Matches any decimal digit; for ASCII text, equivalent to [0-9].
\D ----> Matches any non-digit character; for ASCII text, equivalent to [^0-9].

\a ----> Matches a bell (alarm) \u0007.
\b ----> Matches a backspace \u0008 if in a [] character class; otherwise it matches a word boundary.
\t ----> Matches a tab \u0009.
\r ----> Matches a carriage return \u000D.
\v ----> Matches a vertical tab \u000B.
\f ----> Matches a form feed \u000C.
\n ----> Matches a new line \u000A.
\e ----> Matches an escape \u001B.

$number ----> Substitutes the last substring matched by the group with the given decimal number.
${name} ----> Substitutes the last substring matched by a (?<name>...) group.
$$ ----> Substitutes a single "$" literal.
$& ----> Substitutes a copy of the entire match itself.
$` ----> Substitutes all the text of the input string before the match.
$' ----> Substitutes all the text of the input string after the match.
$+ ----> Substitutes the last group captured.
$_ ----> Substitutes the entire input string.

(?(expression)yes|no) ----> Matches the yes part if expression matches at this point; otherwise matches the no part. The no part can be omitted.
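The substitution elements are easiest to see in a replacement string. The following is a minimal sketch using Regex.Replace, written as C# top-level statements (C# 9 or later assumed); the sample strings and variable names are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "2009-03-24";
string pattern = @"(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})";

// ${name} inserts the text captured by the named group.
string reordered = Regex.Replace(input, pattern, "${day}/${month}/${year}");

// $& inserts a copy of the whole match; here it is wrapped in brackets.
string bracketed = Regex.Replace("cost 42 units", @"\d+", "[$&]");

// $$ inserts a literal dollar sign; $1 inserts the text of group number 1.
string priced = Regex.Replace("price: 15", @"(\d+)", "$$$1");

Console.WriteLine(reordered);  // 24/03/2009
Console.WriteLine(bracketed);  // cost [42] units
Console.WriteLine(priced);     // price: $15
```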


Simple Example:
Let us start with a small example that validates integer values.
An integer is always made up of the same fixed set of characters, the digits 0 to 9, and we will use that fact to write the regular expression in steps.

a) The regular expression starts with “^”.
b) As we are validating against a set of characters, we can use [].
c) So the expression becomes “^[1234567890]”.
d) As the series is continuous, we can use “-” to shorten the expression. It becomes “^[0-9]”.
e) This checks only one digit. To make it work for any number of digits, we can use “*”; the expression becomes “^[0-9]*”. (Because “*” means 0 or more, the empty string is also accepted; use “+” instead if at least one digit is required.)
f) As the expression must end with “$”, the final expression becomes “^[0-9]*$”.

Note: The double quotes are not part of the expressions; I used them only to set the expressions apart from the surrounding text.
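To see what each step actually accepts, the intermediate expressions can be run against a few sample strings. This is only a sketch in C# top-level statements (C# 9 or later assumed); the Accepts helper and the samples are made up for this illustration, and the last pattern with “+” is shown as an alternative, not as part of the steps above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: true when the text matches the pattern.
bool Accepts(string pattern, string text) => Regex.IsMatch(text, pattern);

string[] patterns = { "^[0-9]", "^[0-9]*", "^[0-9]*$", "^[0-9]+$" };
string[] samples = { "7", "42", "4x2", "" };

foreach (string p in patterns)
{
    foreach (string s in samples)
    {
        Console.WriteLine($"{p,-10} '{s}' -> {Accepts(p, s)}");
    }
}

// Points worth noticing in the output:
// "^[0-9]"   accepts "4x2" because only the first character is checked.
// "^[0-9]*"  accepts everything, since zero digits at the start always match.
// "^[0-9]*$" rejects "4x2" but still accepts the empty string.
// "^[0-9]+$" also rejects the empty string.
```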

Is this the only way to write it?
This is one way to write the regular expression. Depending on the requirements and your own expertise, a regular expression can often be written more compactly. For example, the expression above could be shortened as follows.

a) The regular expression starts with “^”.
b) As we are checking for digits, we can use the special character for digits, “\d”.
c) As digits can follow digits, we use “*”.
d) As the expression ends with “$”, the final regular expression becomes
“^\d*$”

Digits can therefore be validated with several different regular expressions:

1) ^[1234567890]*$
2) ^[0-9]*$
3) ^\d*$

Which one to choose?
For the ordinary digits 0 to 9, all of the above expressions work in the same way, so choose the one you are comfortable with. It is always recommended to keep an expression short, self-explanatory, and understandable, as this matters more and more when you write big regular expressions.
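One caveat when choosing between them in .NET: by default “\d” also matches decimal digits from other scripts, while “[0-9]” does not. The sketch below (C# top-level statements, C# 9 or later assumed; variable names are illustrative) compares the three expressions on an ASCII string and on a string of Arabic-Indic digits, and shows RegexOptions.ECMAScript as one way to restrict “\d” to 0 through 9.

```csharp
using System;
using System.Text.RegularExpressions;

string[] patterns = { "^[1234567890]*$", "^[0-9]*$", @"^\d*$" };

string ascii = "2009";
string arabicIndic = "\u0662\u0660\u0660\u0669"; // the digits 2, 0, 0, 9 in Arabic-Indic script

foreach (string p in patterns)
{
    Console.WriteLine(
        $"{p,-16} ascii={Regex.IsMatch(ascii, p)} arabicIndic={Regex.IsMatch(arabicIndic, p)}");
}

// With the ECMAScript option, \d behaves like [0-9].
bool ecmaScriptResult = Regex.IsMatch(arabicIndic, @"^\d*$", RegexOptions.ECMAScript);
Console.WriteLine($"\\d with ECMAScript option: {ecmaScriptResult}");
```

If the input must consist of the characters 0 through 9 only, “[0-9]” states that intent most directly.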

Example on exclude options:
Many situations require us to exclude only certain characters, for example:
a) Accept all alphanumeric characters and special symbols except “&”
b) Accept all digits except “7”
In such cases we do not need to prepare a big list of everything that is allowed. Instead, we use a negated set that lists only the characters or symbols that are not allowed.
E.g.: “^[^&]*$” accepts all alphanumeric characters and special symbols except “&”.

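Other Examples:
a) There should not be “1” as the first digit?
^[^1]\d*$ ----> this excludes 1 as the first character. Note that [^1] accepts any other character, including letters; to accept digits only, use ^[02-9]\d*$.

b) There should not be “1” at any place?
^[^1]*$ ----> this excludes 1 at any place in the sequence. To accept digits only, use ^[02-9]*$.

Note: Here the ^ operator is used not only to mark the start of the string but also, as the first character inside [], to negate the set.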
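The difference between a negated set such as [^1] and an explicit digit set such as [02-9] can be checked directly. This is a small sketch in C# top-level statements (C# 9 or later assumed); the Accepts helper and the sample strings are invented for the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: true when the text matches the pattern.
bool Accepts(string pattern, string text) => Regex.IsMatch(text, pattern);

// First character must not be 1.
Console.WriteLine(Accepts(@"^[^1]\d*$", "223"));   // True
Console.WriteLine(Accepts(@"^[^1]\d*$", "123"));   // False
Console.WriteLine(Accepts(@"^[^1]\d*$", "a23"));   // True: [^1] also accepts letters
Console.WriteLine(Accepts(@"^[02-9]\d*$", "a23")); // False: only digits other than 1

// No 1 anywhere in the string.
Console.WriteLine(Accepts("^[^1]*$", "2034"));     // True
Console.WriteLine(Accepts("^[^1]*$", "2134"));     // False
Console.WriteLine(Accepts("^[^1]*$", "2a34"));     // True: letters are not excluded
Console.WriteLine(Accepts("^[02-9]*$", "2a34"));   // False: digits only
```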

Testing of Regular Expressions:
There are several ways of testing a regular expression:
a) You can write a Windows-based program.
b) You can write a web-based application.
c) You can even write a service-based application.


Windows-based sample code:
Here are the steps for checking a regular expression in .NET:

a) Add a using directive for System.Text.RegularExpressions to make the Regex class available.
b) Create a Regex object as follows:
Regex regDollar = new System.Text.RegularExpressions.Regex("^[0-9]*$");
c) Call the IsMatch(string) method of the Regex class, which returns true or false.
d) Depending on the return value, you can decide whether the passed string is valid for the regular expression or not.

Here is the code as a function:

public bool IsValid(string regexpObj, string passedString)
{
    // Requires: using System.Text.RegularExpressions;
    // Direct check with no exception handling: an invalid pattern throws ArgumentException.
    Regex regDollar = new Regex(regexpObj);
    return regDollar.IsMatch(passedString);
}
With minor changes, the above function can be used in a Windows application, a web application, or even a service.
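One such change might be to guard against an invalid pattern, since the Regex constructor throws ArgumentException when it cannot parse the expression. The following is only a sketch of that idea in C# top-level statements (C# 9 or later assumed); TryIsValid is a hypothetical name, not part of the .NET library.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical variant of IsValid: returns false when the check could not run
// (for example, because the pattern is malformed) and reports the match
// result through the out parameter.
bool TryIsValid(string pattern, string input, out bool isValid)
{
    isValid = false;
    try
    {
        isValid = new Regex(pattern).IsMatch(input);
        return true;
    }
    catch (ArgumentException)
    {
        // Thrown for a malformed pattern; also covers null arguments,
        // because ArgumentNullException derives from ArgumentException.
        return false;
    }
}

// A valid pattern with a matching and a non-matching input.
bool ran = TryIsValid("^[0-9]*$", "2009", out bool ok);
Console.WriteLine($"ran={ran}, valid={ok}");

ran = TryIsValid("^[0-9]*$", "20a9", out ok);
Console.WriteLine($"ran={ran}, valid={ok}");

// A malformed pattern: the character set is never closed.
ran = TryIsValid("[0-9", "2009", out ok);
Console.WriteLine($"ran={ran}, valid={ok}");
```

Whether to swallow the exception like this or let it propagate depends on where the pattern comes from; for patterns typed in by a user, reporting the failure is usually friendlier.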

Another way -- Online checking:
Finally, if you have an internet connection and don’t have time to write a sample program, you can use the following link to test expressions online:

http://www.regexplib.com/RETester.aspx

MORE INFO:
You can find more information on these types of characters at

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpgenref/html/cpconcharacterescapes.asp
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpgenref/html/cpconcharacterclasses.asp
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpgenref/html/cpcongroupingconstructs.asp


--Here is the end of the article; I hope this basic foundation will definitely be useful for writing big and good regular expressions ---

Express your code with REGULAR EXPRESSIONS