Regular expressions, commonly known as RegEx, are often considered one
of the most complex concepts to learn. However, this is not really true.
Unless you have worked with regular expressions before, a pattern that
mixes special characters such as /, $, ^, \, ?, and * with alphanumeric
characters may look like a mess. RegEx is a kind of language, and once
you have learnt its symbols and understood their meaning, you will find
it one of the most useful tools for solving complex problems related to
text searches.
Just consider how you would make a search for files on your computer.
You most likely use the ? and * characters to help find the files
you're looking for. The ? character matches a single character in a
file name, while the * matches zero or more characters. A pattern such
as 'file?.txt' would find the following files:
file1.txt
filer.txt
files.txt
Using the * character instead of the ? character expands the number
of files found. 'file*.txt' matches all of the following:
file1.txt
file2.txt
file12.txt
filer.txt
filedce.txt
While this method of searching for files can certainly be useful,
it is also very limited. The limited ability of the ? and * wildcard
characters gives you an idea of what regular expressions can do, but
regular expressions are much more powerful and flexible.
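The wildcard behaviour described above can be tried directly in PHP. The following is a minimal sketch that assumes PHP 7.4 or later and uses PHP's built-in fnmatch() function for shell-style wildcards; the helper name match_wildcard() is hypothetical and only serves this illustration.

```php
<?php
// File names taken from the examples above.
$files = ['file1.txt', 'file2.txt', 'file12.txt', 'filer.txt', 'filedce.txt'];

// Hypothetical helper: return the names that match a shell-style wildcard.
function match_wildcard(string $pattern, array $names): array
{
    return array_values(array_filter(
        $names,
        fn(string $name): bool => fnmatch($pattern, $name)
    ));
}

$single = match_wildcard('file?.txt', $files); // ? stands for exactly one character
$many   = match_wildcard('file*.txt', $files); // * stands for zero or more characters

echo implode(', ', $single), "\n"; // file12.txt and filedce.txt are left out
echo implode(', ', $many), "\n";   // every name in the list
```

The ? pattern skips file12.txt and filedce.txt because they have more than one character between "file" and ".txt", while the * pattern accepts all five names.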
Let Us Start on RegEx
A regular expression is a pattern of text that consists of
ordinary characters (for example, letters a through z) and special
characters, known as metacharacters. The pattern describes
one or more strings to match when searching a body of text. The
regular expression serves as a template for matching a character
pattern to the string being searched.
The following table contains the list of some metacharacters and their behavior in the context of regular expressions:
Character | Description
\ | Marks the next character as either a special character, a literal, a backreference, or an octal escape. For example, 'n' matches the character "n". '\n' matches a newline character. The sequence '\\' matches "\" and '\(' matches "(".
^ | Matches the position at the beginning of the input string.
$ | Matches the position at the end of the input string.
* | Matches the preceding subexpression zero or more times.
+ | Matches the preceding subexpression one or more times.
? | Matches the preceding subexpression zero or one time.
{n} | Matches exactly n times, where n is a nonnegative integer.
{n,} | Matches at least n times, where n is a nonnegative integer.
{n,m} | Matches at least n and at most m times, where m and n are nonnegative integers and n <= m.
? (after a quantifier) | When this character immediately follows any of the other quantifiers (*, +, ?, {n}, {n,}, {n,m}), the matching pattern is non-greedy. A non-greedy pattern matches as little of the searched string as possible, whereas the default greedy pattern matches as much of the searched string as possible.
. | Matches any single character except "\n".
x|y | Matches either x or y.
[xyz] | A character set. Matches any one of the enclosed characters.
[^xyz] | A negative character set. Matches any character not enclosed.
[a-z] | A range of characters. Matches any character in the specified range.
[^a-z] | A negative range of characters. Matches any character not in the specified range.
\b | Matches a word boundary, that is, the position between a word character and a nonword character (such as a space).
\B | Matches a nonword boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never".
\d | Matches a digit character.
\D | Matches a nondigit character.
\f | Matches a form-feed character.
\n | Matches a newline character.
\r | Matches a carriage return character.
\s | Matches any whitespace character, including space, tab, form-feed, etc.
\S | Matches any non-whitespace character.
\t | Matches a tab character.
\v | Matches a vertical tab character.
\w | Matches any word character, including underscore.
\W | Matches any nonword character.
\un | Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
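A few of the metacharacters from the table can be checked with preg_match(), which returns 1 when the pattern matches and 0 when it does not. This is only an illustrative sketch; the sample strings and variable names are made up for the demonstration.

```php
<?php
// Anchors: ^ and $ force the whole string to be exactly five digits.
$whole_zip   = preg_match('/^\d{5}$/', '92692');    // matches
$partial_zip = preg_match('/^\d{5}$/', 'CA 92692'); // does not match

// Greedy versus non-greedy: .+ grabs as much as it can, .+? as little as it can.
preg_match('/<.+>/', '<b>bold</b>', $greedy);  // whole string
preg_match('/<.+?>/', '<b>bold</b>', $lazy);   // first tag only

// \B: 'er' must be followed by another word character.
$in_verb  = preg_match('/er\B/', 'verb');   // matches
$in_never = preg_match('/er\B/', 'never');  // does not match

// Alternation with an optional trailing 's', and a lowercase-only range.
$plural    = preg_match('/^(cat|dog)s?$/', 'dogs'); // matches
$lowercase = preg_match('/^[a-z]+$/', 'RegEx');     // does not match

echo $greedy[0], "\n";
echo $lazy[0], "\n";
```

The two echo statements print the greedy match and the non-greedy match, which makes the difference between the two quantifier forms easy to see.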
RegEx functions in PHP
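PHP has functions for complex string manipulation using RegEx. The
following are the POSIX-style RegEx functions provided in PHP. Note that
these functions were deprecated in PHP 5.3 and removed in PHP 7, so
current code should use the Perl-compatible preg_ functions (such as
preg_match(), used later in this article) instead.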
Function | Description
ereg | This function matches a text pattern in a string using a RegEx pattern.
eregi | This function is similar to ereg(), but ignores case.
ereg_replace | This function matches a text pattern in a string using a RegEx pattern and replaces it with the given text.
eregi_replace | This is similar to ereg_replace(), but ignores case.
split | This function splits a string into an array using a RegEx.
spliti | This is similar to split(), but ignores case.
sql_regcase | This function creates a RegEx from the given string to make a case-insensitive match.
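For readers on a PHP version where the functions above are no longer available, most of them have a close Perl-compatible counterpart. The sketch below shows one possible mapping using preg_match(), preg_replace(), and preg_split(); the sample string and the patterns are chosen only for illustration.

```php
<?php
$str = 'Mission Viejo, CA 92692';

// eregi('viejo', $str) becomes preg_match() with the i (case-insensitive) modifier.
$found = preg_match('/viejo/i', $str);

// ereg_replace('[0-9]{5}', 'XXXXX', $str) becomes preg_replace().
$masked = preg_replace('/[0-9]{5}/', 'XXXXX', $str);

// split(',? +', $str) becomes preg_split(): split on an optional comma plus spaces.
$parts = preg_split('/,? +/', $str);

echo $found, "\n";
echo $masked, "\n";
echo implode('|', $parts), "\n";
```

The main visible difference is that the preg_ patterns are wrapped in delimiters (slashes here), and case-insensitivity is requested with a modifier after the closing delimiter rather than with a separate function.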
Finding US Zip Code
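Now let us see a simple example that matches a US 5-digit zip code in a string:
<?php
// ereg() uses POSIX syntax (no delimiters); it is only available before PHP 7.
$zip_pattern = "[0-9]{5}";
$str = "Mission Viejo, CA 92692";
// On success, $regs[0] holds the text matched by the whole pattern.
ereg($zip_pattern, $str, $regs);
echo $regs[0];
?>
This script outputs the following: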
92692
The above example can also be rewritten using Perl-compatible regular expression syntax with the preg_match() function:
<?php
// Perl-compatible patterns are wrapped in delimiters (here, slashes).
$zip_pattern = "/\d{5}/";
$str = "Mission Viejo, CA 92692";
preg_match($zip_pattern, $str, $regs);
echo $regs[0];
?>
Note the change in the RegEx pattern between the two examples: the Perl-compatible pattern is enclosed in / delimiters and uses \d in place of [0-9]. preg_match() is often a faster alternative to ereg().
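The zip code examples read $regs[0] without checking whether anything matched. A slightly more defensive variant is sketched below; the helper name find_zip() is hypothetical, and the added \b word boundaries are an assumption about what is wanted (they stop the pattern from matching five digits inside a longer run of digits).

```php
<?php
// Hypothetical helper: return the first five-digit zip code in $text, or null.
function find_zip(string $text): ?string
{
    // \b on both sides keeps the pattern from matching inside a longer number.
    if (preg_match('/\b\d{5}\b/', $text, $regs) === 1) {
        return $regs[0];
    }
    return null; // no match, so there is nothing to read from $regs
}

$zip  = find_zip('Mission Viejo, CA 92692');
$none = find_zip('Order number 1234567');

echo $zip, "\n";
var_dump($none);
```

Checking the return value of preg_match() before using $regs avoids an undefined-index notice when the string contains no zip code.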
RegEx for US Phone Numbers
Now let us try to create a RegEx pattern to match a US telephone
number. US telephone numbers are 10-digit numbers usually written in
three parts, like xxx xxx xxxx. The parts are normally separated by
hyphens or blank spaces, and the first part is sometimes enclosed in
parentheses. The most common patterns are as follows:
XXX XXX XXXX
(XXX) XXX XXXX
XXX-XXX-XXXX
(XXX) XXX-XXXX
In some cases, the US ISD (country) code is added at the front, like +1 XXX XXX XXXX.
Let us create a Perl-compatible RegEx pattern to match the above
patterns. First we need to match the single-digit ISD code (let us
not restrict it to 1). But this may or may not be present in a phone
number, hence we write it as follows:
$Phone_Pattern = "/(\d)?/";
Here \d is equivalent to [0-9], and the '?' that follows indicates that the digit may appear once or not at all. (The leading + is not part of the pattern; a match simply begins at the digit that follows it.)
Now what would appear next in the sequence? The possibilities are a
blank space or a hyphen. So we add the pattern "(\s|-)?" to the
above RegEx. This pattern indicates that either a blank space or a
hyphen may or may not appear. So our RegEx becomes:
$Phone_Pattern = "/(\d)?(\s|-)?/";
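To see what this partial pattern consumes, it can be run against the start of a few sample numbers. In the sketch below the ^ anchor is an addition made only for the demonstration, so that the fragment is tried at the beginning of each sample; the sample numbers are hypothetical.

```php
<?php
// ^ is added only for this demonstration; it is not part of the article's pattern.
$prefix = '/^(\d)?(\s|-)?/';

$samples = ['1 650 253 0000', '1-650-253-0000', '(650) 253-0000'];
$consumed = [];
foreach ($samples as $sample) {
    preg_match($prefix, $sample, $m);
    $consumed[] = $m[0]; // text matched by the fragment so far
}

var_export($consumed);
```

For the third sample both optional groups match nothing, so the fragment succeeds with an empty match. This is what "may or may not appear" means in practice.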
The next sequence would be either XXX or (XXX). To match this
sequence, we first need to match the opening parenthesis with the
pattern "(\()?". Because parentheses are used to group patterns in
RegEx, they are metacharacters, and to match a metacharacter literally
we need to precede it with the escape character "\". Hence we
use "\(" in our RegEx pattern. Now we need to match the three digits
and an optional closing parenthesis. This can be written as
"(\d){3}(\))?". Adding these patterns, our RegEx becomes:
$Phone_Pattern = "/(\d)?(\s|-)?(\()?(\d){3}(\))?/";
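The area-code fragment can be tested on its own. The sketch below again adds a ^ anchor purely for the demonstration and uses made-up sample numbers. It also shows a side effect worth knowing: because the two parentheses are optional independently of each other, an unbalanced "(650" is accepted as well.

```php
<?php
// Area-code fragment from the pattern above; ^ is added only for this demonstration.
$area = '/^(\()?(\d){3}(\))?/';

preg_match($area, '(650) 253-0000', $with);
preg_match($area, '650 253 0000', $without);
preg_match($area, '(650 253 0000', $unbalanced);

echo $with[0], "\n";
echo $without[0], "\n";
echo $unbalanced[0], "\n";

// (\d){3} repeats a one-digit group, so the group keeps only the last digit.
echo $with[2], "\n";
```

If the individual digits are not needed as separate captures, writing \d{3} without the group would match the same text.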
After the first part XXX, there should be either a blank space or a hyphen. So we add "(\s|-){1}" to the phone pattern (the {1} is optional, since a group matches exactly once by default):
$Phone_Pattern = "/(\d)?(\s|-)?(\()?(\d){3}(\))?(\s|-){1}/";
Further construction of the RegEx is much simpler, as we only need
to match either XXX-XXXX or XXX XXXX. This can be written as
"(\d){3}(\s|-){1}(\d){4}". Adding this part of the pattern to our RegEx:
$Phone_Pattern = "/(\d)?(\s|-)?(\()?(\d){3}(\))?(\s|-){1}(\d){3}(\s|-){1}(\d){4}/";
Yippee!!! We have created a RegEx to match US phone numbers.
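Before using the pattern on real data, it is worth checking it against the formats listed earlier. The following sketch does that with hypothetical sample numbers, including one without separators that the pattern is not expected to match.

```php
<?php
$Phone_Pattern = '/(\d)?(\s|-)?(\()?(\d){3}(\))?(\s|-){1}(\d){3}(\s|-){1}(\d){4}/';

// Hypothetical sample numbers covering the formats discussed in the text.
$samples = [
    '650 253 0000',
    '(650) 253 0000',
    '650-253-0000',
    '(650) 253-0000',
    '+1 650 253 0000',
    '6502530000', // no separators: the pattern requires a space or hyphen
];

$results = [];
foreach ($samples as $sample) {
    // Store the matched text, or null when the pattern does not match.
    $results[] = preg_match($Phone_Pattern, $sample, $m) === 1 ? $m[0] : null;
}

var_export($results);
```

For the +1 sample the matched text starts at the digit 1, because the pattern has no element for the + sign. The last sample yields null, since each part must be followed by a space or a hyphen.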
Now let us put this RegEx to work, so that we can understand its
significance better. Let us write a script that fetches the phone
numbers from Google's contact us page. First we need to fetch the
HTML content of that page:
$str = implode("",file("http://www.google.com/intl/en/contact/index.html"));
Then we need to search for the phone number pattern with the help of
our newly created RegEx. preg_match() stops after the first match, so
to get every match we use preg_match_all():
preg_match_all($Phone_Pattern,$str,$phone);
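The difference between the two functions can be seen without a network request by running them on a short local string. The text in $str below is a hypothetical stand-in for the page content, not the actual page.

```php
<?php
$Phone_Pattern = '/(\d)?(\s|-)?(\()?(\d){3}(\))?(\s|-){1}(\d){3}(\s|-){1}(\d){4}/';

// Hypothetical sample text standing in for the fetched HTML.
$str = 'Phone: (650) 253-0000, Fax: (650) 253-0001';

// preg_match() stops at the first match and returns 0 or 1.
$first_count = preg_match($Phone_Pattern, $str, $first);

// preg_match_all() keeps going and returns the number of full matches.
$all_count = preg_match_all($Phone_Pattern, $str, $phone);

// The optional (\s|-)? at the front can swallow one leading space, so trim it.
$numbers = array_map('trim', $phone[0]);

echo $first_count, ' ', $all_count, "\n";
echo implode("\n", $numbers), "\n";
```

$phone[0] holds the full matches, while $phone[1], $phone[2], and so on hold what each parenthesised group captured for every match.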
Now putting all these pieces into a single script:
<?php
// Read the page line by line and join the lines into one string.
$str = implode("", file("http://www.google.com/intl/en/contact/index.html"));
$Phone_Pattern = "/(\d)?(\s|-)?(\()?(\d){3}(\))?(\s|-){1}(\d){3}(\s|-){1}(\d){4}/";
// $phone[0] receives every full match found in $str.
preg_match_all($Phone_Pattern, $str, $phone);
foreach ($phone[0] as $number) {
    echo $number . "<br>";
}
?>
This script will display the following output:
(650) 253-0000
(650) 253-0001
Wrap Up
I hope you had a good session with RegEx and now have some
understanding of how to tackle text pattern matching problems
with it. To become a specialist in RegEx, you need to practice
continuously, seek out complex problems, and try to solve
them. Happy practicing with RegEx.