By Vikram Vaswani and Harish Kamath
April 12, 2000
Printed from DevShed.com
URL: http://www.devshed.com/Server_Side/Administration/RegExp
Introduction
Ask any relatively-experienced *NIX user to list his top ten favorite
things about the operating system, and you're almost certain to hear him
mutter, somewhere between "99% uptime" and "remote system reboots", the
phrase "regular expressions".
Ask any relatively-experienced *NIX user to list the ten things he hates
most about the operating system, and somewhere between "zombie processes"
and "installation", he's sure to spit out the phrase "regular expressions".
It's precisely this complex love-hate equation that spawned the idea for a
tutorial on regular expressions - surely, went the reasoning, something
that induced such strong emotions in normally hard-headed *NIX
administrators was worthy of investigation. And so, regardless of whether
you're new to regular expressions, or an old hand at putting them together,
the next few pages should help you resolve your conflicted feelings on the
subject. Hey - it *is* cheaper than therapy...
And First There Was Love...
Regular expressions, also known as "regex" by the geek community, are a
powerful tool used in pattern-matching and substitution. They are commonly
associated with almost all *NIX-based tools, including editors like vi,
scripting languages like Perl and PHP, and shell programs like awk and sed.
You'll even find them in client-side scripting languages like JavaScript -
kinda like Madonna, their popularity cuts across languages and territorial
boundaries...
A regular expression lets you build patterns using a set of special
characters; these patterns can then be compared with text in a file, data
entered into an application, or input from a form filled in by users on a
Web site. Depending on whether or not there's a match, appropriate action
can be taken, and appropriate program code executed.
For example, one of the most common applications of regular expressions is
to check whether or not a user's email address, as entered into an online
form, is in the correct format; if it is, the form is processed, whereas if
it's not, a warning message pops up asking the user to correct the error.
Regular expressions thus play an important role in the decision-making
routines of Web applications - although, as you'll see, they can also be
used to great effect in complex find-and-replace operations.
A regular expression usually looks something like this:
/love/
All this does is match the pattern "love" in the text it's applied to. Like
many other things in life, it's simpler to get your mind around the pattern
than the concept - but then, that's neither here nor there...
How about something a little more complex? Try this:
/fo+/
This would match the words "fool", "footsie" and "four-seater". And
although it's a pretty silly example, you have to admit that there's truth
to it - after all, who but fools in love would play footsie in a
four-seater?
The "+" that you see above is the first of what are called
"meta-characters" - these are characters that have a special meaning when
used within a pattern. The "+" meta-character is used to match one or more
occurrences of the preceding character - in the example above, the letter
"f" followed by one or more occurrences of the letter "o".
Similar to the "+" meta-character, we have "*" and "?" - these are used to
match zero or more occurrences of the preceding character, and zero or one
occurrence of the preceding character, respectively. So,
/eg*/
would match "easy", "egocentric" and "egg"
while
/Wil?/
would match "Winnie", "Wimpy", "Wilson" and "William", though not "Wendy" or
"Wolf".
In case all this seems a little too imprecise, you can also specify a range
for the number of matches. For example, the regular expression
/jim{2,6}/
would match "jimmy" and "jimmmmmy!", but not "jim". The numbers in the
curly braces represent the lower and upper values of the range to match;
you can leave out the upper limit for an open-ended range match.
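If you'd like to see the quantifiers at work, here's a minimal sketch in JavaScript. It assumes a modern engine such as Node.js, and the variable names are made up for illustration. Note that each quantifier applies only to the single character right before it, so the range in /jim{2,6}/ counts "m"s, not whole "jim"s.

```javascript
const oneOrMore = /fo+/;      // "f" followed by one or more "o"
const zeroOrMore = /eg*/;     // "e" followed by zero or more "g"
const zeroOrOne = /Wil?/;     // "Wi" followed by an optional "l"
const twoToSix = /jim{2,6}/;  // "ji" followed by two to six "m"s
const twoOrMore = /jim{2,}/;  // upper limit left out: two or more "m"s

// test() returns true if the pattern is found anywhere in the string.
const quantifierResults = {
  fool: oneOrMore.test("fool"),
  fry: oneOrMore.test("fry"),        // no "o" right after the "f"
  easy: zeroOrMore.test("easy"),     // "e" followed by zero "g"s still counts
  Winnie: zeroOrOne.test("Winnie"),
  Wendy: zeroOrOne.test("Wendy"),    // "We", not "Wi"
  jimmy: twoToSix.test("jimmy"),
  jim: twoToSix.test("jim"),         // only one "m"
  jimmmmmmmmy: twoOrMore.test("jimmmmmmmmy"),
};

console.log(quantifierResults);
```

Only "fry", "Wendy" and "jim" come back false.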
Of Carrots, Bombshells And Four-Figure Incomes
Now that you've got the basics down, how about taking it to the next level?
It's also possible to search for white space, numbers and alphabetic
characters with a regular expression - and here's the merry gang of
meta-characters that will help you do just that:
\s = used to match a single white space character, including tabs and
newline characters
\S = used to match any single character that is *not* white space
\d = used to match a single digit from 0 to 9
\w = used to match a single letter, digit or underscore
\W = used to match any single character that \w does not match
. = used to match any single character except the newline character
Now, you're probably thinking, "Hey, that's great - but what does it all
mean?!". Well, suppose you wanted to find all the white space in a
document...
/\s+/
Easy, isn't it? If you're looking only for numbers, try
/\d/
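So, if you had a complex financial spreadsheet in front of you, and you
wanted to quickly find round thousands like "1000" or "25000", you could
use
/\d000/
which matches any digit followed by three zeroes. To catch every amount of
a thousand dollars or more (assuming the figures are written without
commas), look for four or more digits in a row instead:
/\d{4,}/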
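Here's a rough sketch of these patterns applied to one line of text, written in JavaScript for a modern engine such as Node.js. The sample line and the variable names are invented for this example, and the "g" flag simply asks for every match rather than just the first.

```javascript
// A made-up line from an imaginary spreadsheet.
const line = "Rent 1000, food 450, car 25000, misc 1234";

// Break the line apart wherever there is a run of white space.
const fields = line.split(/\s+/);

// Collect every run of digits in the line.
const numbers = line.match(/\d+/g);

// A digit followed by three zeroes.
const roundThousands = line.match(/\d000/g);

console.log(fields.length);
console.log(numbers);
console.log(roundThousands);
```

The last pattern picks up "1000" and the "5000" inside "25000", but skips "1234" altogether, since it looks for literal zeroes rather than for "any four-digit number".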
How about limiting your search to the beginning or end of a string? Well,
that's why we have "pattern anchors" - these simply tie your regular
expression to either the first or last character of the string, and come in
very useful when you're looking for a way to filter through a mass of
matches.
There are two basic pattern anchors - the first one is represented by a
caret [^], and is used to indicate that the expression should be matched
only at the beginning of the string that it is applied to. For example, the
expression
/^hell/
will return a match only if the string itself begins with "hell" - "hello"
and "hellhound", but not "shell".
And similarly, to match the end of a string, there's the "$" pattern
anchor. So
/ar$/
would match "scar", "car" and "bar", though not "art", "army" or "arrow".
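A quick way to watch the anchors do their job is to filter a list of words with and without them. This is only a sketch, assuming a modern JavaScript engine such as Node.js; the word list and variable names are ours.

```javascript
const words = ["hello", "hellhound", "shell", "scar", "car", "art", "arrow"];

// ^ ties the pattern to the start of the string.
const startsWithHell = words.filter((word) => /^hell/.test(word));

// $ ties the pattern to the end of the string.
const endsWithAr = words.filter((word) => /ar$/.test(word));

// Without the caret, "shell" gets through as well.
const containsHell = words.filter((word) => /hell/.test(word));

console.log(startsWithHell); // [ 'hello', 'hellhound' ]
console.log(endsWithAr);     // [ 'scar', 'car' ]
console.log(containsHell);   // [ 'hello', 'hellhound', 'shell' ]
```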
There's also a related meta-character that works on words rather than on
the string as a whole - \b. This matches at a word boundary, the point
where a word character (as matched by \w) meets a non-word character or the
start or end of the string, and it can be placed either at the beginning or
end of the pattern to be matched - like this:
/\bbom/
This would match both "bombay" and "bombshell", while
/man\b/
would match "human", "woman" and "man", though not "manitou" or
"mannequin". And the converse of this is \B, which matches everywhere but
at a word boundary.
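The sketch below (JavaScript, assuming a modern engine such as Node.js, with illustrative variable names) shows that \b works on words wherever they sit in the text, and that \B demands the opposite.

```javascript
const startsWord = /\bbom/;  // "bom" at the start of a word
const endsWord = /man\b/;    // "man" at the end of a word
const insideWord = /man\B/;  // "man" with more word characters after it

const boundaryResults = {
  bombshell: startsWord.test("bombshell"),
  abomination: startsWord.test("abomination"),      // "bom" sits mid-word
  womanInSentence: endsWord.test("the woman waved"), // boundary in mid-string
  mannequin: endsWord.test("mannequin"),
  mannequinInside: insideWord.test("mannequin"),
  humanInside: insideWord.test("human"),            // "man" ends the word
};

console.log(boundaryResults);
```

Here "abomination", the \b test on "mannequin" and the \B test on "human" are the three that fail.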
Ranging Far And Wide...
Just as you can specify a range for the number of characters to be matched,
you can also specify a range of characters. For example, the range
/[A-Z]/
would match any single upper-case letter, while
/[a-z]/
would match any single lowercase letter, and
/[0-9]/
would match any single digit from 0 to 9.
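Using these three ranges, it's pretty easy to create a regular expression
to match an alphanumeric sequence.
/([a-z][A-Z][0-9])+/
would match a lowercase letter, an upper-case letter and a digit, in that
order, one or more times over - so "aB0" matches, although "abc" does not.
(To accept any mix of letters and digits, put all three ranges inside a
single pair of square brackets and anchor both ends, as in
/^[a-zA-Z0-9]+$/.) Note the parentheses around the patterns - contrary to
what you might think, these are not there purely to confuse you; they come
in handy when grouping sections of a regular expression together.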
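To make the difference concrete, here's a small JavaScript sketch (assuming a modern engine such as Node.js) that sets the grouped pattern above against a single combined character class. The second pattern, the helper name classify and the sample strings are our own assumptions, added purely for comparison.

```javascript
// The grouped pattern: lowercase, then upper-case, then digit, repeated.
const fixedOrder = /([a-z][A-Z][0-9])+/;

// Assumption: one combined class, anchored at both ends, for "letters and digits only".
const anyAlnum = /^[a-zA-Z0-9]+$/;

// Hypothetical helper that reports how both patterns treat a value.
function classify(value) {
  return {
    value: value,
    fixedOrder: fixedOrder.test(value),
    anyAlnum: anyAlnum.test(value),
  };
}

const rows = ["aB0", "abc", "abc123", "xaB0!"].map(classify);
console.log(rows);
```

"abc" and "abc123" pass only the combined class, while "xaB0!" passes only the grouped pattern, because that one is not anchored and finds "aB0" in the middle.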
Choice is very important when building regular expressions - as in most
other languages, it's possible to use the pipe [|] operator to indicate
multiple options in a regex. For example,
/to|too|2/
would match any one of the three strings "to", "too" and "2". As you can
imagine, this comes in pretty useful when building expressions that have
many possible variants.
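One thing worth knowing is that the alternatives are tried from left to right, and an unanchored pattern is happy with a partial hit. The JavaScript sketch below (modern engine assumed; the anchored variant and the variable names are ours, not part of the original example) shows both effects.

```javascript
const loose = /to|too|2/;      // the pattern from the text
const whole = /^(to|too|2)$/;  // assumption: anchored so the entire string must be one option

const samples = ["to", "too", "2", "two", "tool"];

const looseHits = samples.filter((s) => loose.test(s));
const wholeHits = samples.filter((s) => whole.test(s));

// The first alternative that fits wins, so "too" is matched as just "to".
const firstAlternative = "too".match(loose)[0];

console.log(looseHits);        // [ 'to', 'too', '2', 'tool' ]
console.log(wholeHits);        // [ 'to', 'too', '2' ]
console.log(firstAlternative); // to
```

Grouping the options in parentheses is what lets the anchors apply to all three alternatives at once.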
You can also invert the regular sense of a regular expression with the
negation operator, represented by a caret [^] - so the pattern
/[^A-C]/
would match any single character that does not appear in the brackets - namely,
anything except the upper-case letters "A", "B" and "C". Note how the caret, when
used in a bracketed expression, is used to invert the match; this behaviour
is different from when it is used outside a bracketed expression, where it
serves as a pattern anchor.
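The two meanings of the caret are easy to mix up, so here's a short JavaScript sketch of each (assuming a modern engine such as Node.js; the sample strings and variable names are invented for the purpose).

```javascript
const notAtoC = /[^A-C]/;  // caret inside brackets: any character except A, B or C

const onlyAbc = notAtoC.test("CAB");     // false: nothing outside A-C to find
const hasOther = notAtoC.test("CABD");   // true: the "D" qualifies

// Strip out everything that is not A, B or C.
const stripped = "A1B2C3".replace(/[^A-C]/g, "");

// Caret outside brackets: an anchor for the start of the string.
const startsAtoC = /^[A-C]/.test("Banana");      // true
const startsNotAtoC = /^[^A-C]/.test("Durian");  // true: both meanings in one pattern

console.log(onlyAbc, hasOther, stripped, startsAtoC, startsNotAtoC);
```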
And finally, one important thing to remember - should you decide, for
reasons best known to you and your mental health specialist, to add any of
the meta-characters described above to your pattern and explicitly match
them, you need to "escape" then with a back slash [\]. So, the pattern
/Th\*/
would match "Th*" but not "The" - the \* ensures that the asterisk is
matched as a literal character, not a meta-character.
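Escaping matters most when the text you're searching for comes from somewhere else, a form field for instance. The JavaScript sketch below assumes a modern engine such as Node.js; escapeRegExp is a hypothetical helper written for this example, not a built-in function.

```javascript
const literalStar = /Th\*/;  // escaped: a real asterisk
const quantified = /Th*/;    // unescaped: "T" followed by zero or more "h"

const starResults = {
  escapedOnStar: literalStar.test("Th*"),  // true
  escapedOnThe: literalStar.test("The"),   // false
  unescapedOnThe: quantified.test("The"),  // true
};

// Hypothetical helper: put a backslash in front of every meta-character.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const sum = "1+1=2";
const rawMatch = new RegExp(sum).test(sum);                 // "+" acts as a quantifier
const safeMatch = new RegExp(escapeRegExp(sum)).test(sum);  // "+" is matched literally

console.log(starResults, rawMatch, safeMatch);
```

Without escaping, the pattern built from "1+1=2" fails to match its own source text, because the "+" is read as "one or more 1s" instead of a plus sign.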
How To Say "Ummmm...." In Three Different Languages
Now that we've got all that out of the way, let's take a closer look at
some examples of how regular expressions are used in Perl, PHP and
JavaScript. In Perl, for example, you can perform some pretty advanced
pattern matching using both the rules you've already learnt, and some
Perl-specific additions.
A pattern-matching command in Perl usually looks like this:
operator / regular-expression / replacement-string / modifiers
Let's take a closer look at each of these components.
The operator can either be an "m" or an "s", depending on the purpose of
the regular expression - "m" is used for "match" operations only, while "s"
is used for "substitution" operations.
The regular expression is the pattern that is to be matched. This pattern
can be constructed using a variety of characters, meta-characters and
pattern anchors.
The replacement string is...well, the string that replaces whatever the
pattern matched in a find-and-replace operation; it appears only with the
"s" operator. Yeah, every once in a while, we slip you an easy one.
Finally, the modifiers are used to control the manner in which a particular
regex is applied. There are a whole bunch of modifiers, some of them with
pretty exotic names; unfortunately, none of them are single, or interested
in going out to dinner with you.
So, the statement
s/love/lust/
would replace the first occurrence of the word "love" with "lust". And if
you wanted to perform a global search-and-replace operation, you'd use the
"g" modifier, like this
s/love/lust/g
And they say romance is dead!
You can also use case-insensitive pattern matching - simply add the "i"
modifier, as in the following example, and watch in awe as Perl matches
"jewel", "Jewel" and "JEWEL".
m/JewEL/i
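The "g" and "i" modifiers are not unique to Perl. As a point of comparison, here's a sketch of the same idea using JavaScript's replace() method, where the modifiers are written after the closing slash in just the same way. A modern engine such as Node.js is assumed, and the sample sentence is our own.

```javascript
const text = "Love me, love my dog; love is love.";

// No modifier: only the first lowercase "love" is replaced.
const first = text.replace(/love/, "lust");

// "g": every lowercase "love" is replaced.
const all = text.replace(/love/g, "lust");

// "g" and "i" together: every "love", whatever its case.
const allAnyCase = text.replace(/love/gi, "lust");

console.log(first);      // Love me, lust my dog; love is love.
console.log(all);        // Love me, lust my dog; lust is lust.
console.log(allAnyCase); // lust me, lust my dog; lust is lust.
```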
In Perl, a regular expression is applied to a variable via the binding
operator, represented by =~; this is used as follows.
$flag =~ m/abc/
returns true if $flag contains "abc"
$flag =~ s/abc/ABC/
replaces the first occurrence of abc in the variable $flag with ABC
And here's an example of a simple Perl program which asks for your email
address, and compares it with a regex to verify whether or not it's in the
correct format.
#!/usr/bin/perl
# get input
print "So what's your email address, anyway?\n";
$email = <STDIN>;
chomp($email);
# match and display result
if ($email =~ /^([a-zA-Z0-9_-])+@([a-zA-Z0-9_-])+(\.[a-zA-Z0-9_-])+/)
{
    print("Ummmmm....that sounds good!\n");
}
else
{
    print("Hey - who do you think you're kidding?\n");
}
As you can see, the most important part of this program is the regular
expression - it's been dissected below:
^([a-zA-Z0-9_-])+@([a-zA-Z0-9_-])+(\.[a-zA-Z0-9_-])+
The first part
^([a-zA-Z0-9_-])+
matches the username part of the email address - one or more letters,
digits, underscores or hyphens, in any combination, starting at the very
beginning of the string.
This is followed by an @ symbol, which is followed by the domain part of
the address; this could again include letters or numbers, and uses a period
as a delimiter - note our usage of an "escaped" period and the "+"
meta-character to represent these conditions in the second half of the
expression
([a-zA-Z0-9_-])+(\.[a-zA-Z0-9_-])+
Obviously, this is simply an illustrative example - if you're planning to
use it on your Web site, you need to refine it a bit. For example, the
script above won't accept email addresses of the form
firstname.lastname@somedomain.com - although such addresses are also pretty
common on the Web. And because the pattern has no "$" anchor at the end, it
only checks how the address begins, so junk like "me@home.c!!!" sails right
through. You have been warned!
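As one possible refinement, the JavaScript sketch below allows dots in the username and anchors the pattern at both ends. Treat it as an assumption-laden starting point rather than a complete validator: the stricter pattern and the function name looksLikeEmail are ours, and plenty of legal addresses (those containing a "+", for example) would still be turned away.

```javascript
// The pattern used in the examples in this article.
const originalPattern = /^([a-zA-Z0-9_-])+@([a-zA-Z0-9_-])+(\.[a-zA-Z0-9_-])+/;

// Assumption: a somewhat stricter variant with dotted usernames and a "$" anchor.
const stricterPattern =
  /^[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+)*@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)+$/;

// Hypothetical helper wrapping the stricter pattern.
function looksLikeEmail(address) {
  return stricterPattern.test(address);
}

const addresses = [
  "john@example.com",
  "firstname.lastname@somedomain.com",
  "me@home.c!!!",
  "john@example",
];

const comparison = addresses.map((address) => ({
  address: address,
  original: originalPattern.test(address),
  stricter: looksLikeEmail(address),
}));

console.log(comparison);
```

The original pattern rejects the dotted username and accepts the address with trailing junk; the stricter one does the reverse, and both turn down "john@example" for want of a period in the domain.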
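If you prefer PHP to Perl, the ereg() function is the one to use for
pattern matching operations in PHP 3 and PHP 4 (newer versions favour
preg_match(); ereg() was deprecated in PHP 5.3 and removed in PHP 7.0). It
usually takes the format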
ereg(pattern, string)
where "pattern" is the pattern to be matched, and "string" is the character
string to be searched for the pattern. The next example should illustrate
this a little more clearly:
<?php
if (ereg("^([a-zA-Z0-9_-])+@([a-zA-Z0-9_-])+(\.[a-zA-Z0-9_-])+", $email))
{
    echo "Ummmmm....that sounds good!";
}
else
{
    echo "Hey - who do you think you're kidding?";
}
?>
And finally, JavaScript. JavaScript 1.2 comes with a powerful RegExp()
object, which can be used to match patterns in strings and variables. The
important thing here is the test() method, which searches for a pattern in
a string or variable, and returns either true or false - it's illustrated
in the example below.
<html>
<head>
<script language="Javascript1.2">
<!-- start hiding
function verifyAddress(obj)
{
    // obtain form value into variable
    var email = obj.email.value;
    // define regex
    var pattern = /^([a-zA-Z0-9_-])+@([a-zA-Z0-9_-])+(\.[a-zA-Z0-9_-])+/;
    // test for pattern (declared with var so it stays local to the function)
    var flag = pattern.test(email);
    if (flag)
    {
        alert("Ummmmm....that sounds good!");
        return true;
    }
    else
    {
        alert("Hey - who do you think you're kidding?");
        return false;
    }
}
// stop hiding -->
</script>
</head>
<body>
<form onSubmit="return verifyAddress(this);">
<input name="email" type="text">
<input type="submit">
</form>
</body>
</html>
Obviously, there's a whole lot more that you can do with regular
expressions - checking email addresses is just the tip of the iceberg. You
can use regular expressions to validate phone numbers, currency figures,
Web site URLs, and a whole lot more - all you need is a little bit of
creativity and patience, a few slices of leftover pizza...and a therapist
who cares.
Note: All program code and examples in this article have been tested on
Linux 2.2.13/i386 with Perl 5.004, PHP 3.0.9 and JavaScript 1.2.
This article copyright Melonfire
2000-2002. All rights reserved.