Monday 16 March 2009

Some regex for the new to the Web programmer

I was asked today how, on a web form, you could check that an email address was correctly formatted before the form was submitted. With the advent of XPages and the pervasiveness of AJAX frameworks, web pages in the Domino world, like any web-facing back end, are not exempt from the double requirement of being shiny sexy things that users love AND having functions that stop users doing the stupid things they always do.

The question was posed by an RPG programmer new to the world of the web, and when I said "och that's a piece of piss, use a RegEx" they looked at me like I had grown an extra head (which I do on occasion for extra effect when I am being flash).

RegEx is short for Regular Expression, a string that describes a search pattern. The wildcard *.txt is a close cousin that will be familiar to everyone (strictly speaking it is a file glob rather than a RegEx). Proper RegEx is like that but on steroids, human growth hormone and crack cocaine.
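To make the comparison with *.txt concrete, here is a rough sketch of the same idea written as a JS RegEx. It assumes the wildcard means "anything at all, followed by .txt"; the variable names are made up for the example.

```javascript
// ^ start of string, .* any run of characters, \.txt a literal ".txt", $ end of string
var txtFile = /^.*\.txt$/;

var plainName = txtFile.test("notes.txt");      // true
var wrongEnding = txtFile.test("notes.txt.bak"); // false, does not END in .txt
var noDot = txtFile.test("notestxt");            // false, the . must really be there
```

The wildcard needed one special character; the RegEx needs five. That is the price of the extra power.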

There are loads of web sites that go into great detail about how to correctly form a RegEx and I will let you go find them yourself .. there is a thing called Google and it is really useful ;-)

So how do you use a RegEx pattern to test an email address?

OK let's say you have this HTML on a form:
Email Address <input type='text' name='email' value=''>
I am going to use this in a JS function that returns true if the email address entered is valid and false if it is not:
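function emailOK() {
    // RegEx literal, delimited by /.../ rather than quotes
    var myRegEx = /^\w+((-\w+)|(\.\w+))*\@[A-Za-z0-9]+((\.|-)[A-Za-z0-9]+)*\.[A-Za-z0-9]+$/;
    // search() returns -1 when the value of the email field does not match
    if (document.forms[0].email.value.search(myRegEx) != -1) {
        return true;
    }
    return false;
}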
You will notice that I do NOT use a string literal surrounded by a pair of ""s for myRegEx. Why? Because in a string literal \ is special (as in new line \n), so the RegEx \w+ would become "\\w+" in a string literal, which makes something that is very hard to read even harder to read. JS allows you to delimit a RegEx like this /pattern/ which makes things much nicer!
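If you want to see the difference for yourself, here is a small sketch. It uses a cut-down pattern rather than the full email one, and the variable names are just for illustration.

```javascript
// The same pattern written two ways.
var fromLiteral = /^\w+\.\w+$/;
var fromString = new RegExp("^\\w+\\.\\w+$"); // every \ has to be doubled

// Both end up holding exactly the same pattern.
var sameSource = fromLiteral.source === fromString.source; // true

// Forget to double the \ and the string quietly turns "\w" into "w",
// so the pattern really becomes ^w+$.
var broken = new RegExp("^\w+$");
var brokenMatchesDigits = broken.test("123"); // false
var brokenMatchesW = broken.test("www");      // true
```

No error is raised for the broken version, it just matches the wrong things, which is another good reason to stick with the /pattern/ form.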

I pass this RegEx to the search method of the JS String object (which is what the .value of an input field is BTW) and it returns -1 if the pattern does not match, or the position of the match (0 or greater) if it does. Because this pattern is anchored to the start of the string, a successful match always comes back as 0, which is why the code compares against -1. Which is all very simple BUT ... what does all that gobbledegook mean? If you don't care, then fair enough, copy and paste the code and use it in your own app ;-) If you are curious, here is what it means:

1. It begins with a / and the matching / is the very last character; these act as the "" would in a normal string.

2. The ^ symbol means start matching at the absolute beginning of the string.

3. \w+ matches one or more "word" characters: letters, digits and the underscore.

4. This is where this gets interesting :-)
First, note that the whole thing is surrounded by (...)* which means that we want to match zero or more of whatever is inside.

Inside the ()* parentheses, we have (-\w+) and (\.\w+) separated by the | character. This means it will match EITHER -\w+ OR \.\w+

The first one indicates that we should have a match if we find a hyphen followed immediately by a set of alphanumeric characters. The second part matches if we find a period followed immediately by a set of alphanumeric characters.

(NOTE . by itself is a special character that matches any character, so we must escape it by placing a backslash in front of it.)

So far we have a pattern that will match [some alphanumerics] on its own, [some alphanumerics]-[some alphanumerics] or [some alphanumerics].[some alphanumerics], with as many -something or .something sections as you like.

5. After this match comes an \@ sign. (@ has no special meaning in a JS RegEx, so the \ is not strictly needed here, but it does no harm.)

6. Immediately following the @ is [A-Za-z0-9]+ which matches a set of alphanumeric characters of either upper or lower case. I don't use \w+ because this would also allow the _ character ... OK?

7. Next up is ((\.|-)[A-Za-z0-9]+)*

8. Note that the search is surrounded by ( ... )* which means we are matching zero or more repeats of the pattern inside the ( ... )

9. So take them away and we are left with (\.|-)[A-Za-z0-9]+

10. First (\.|-) This is trying to match the first character as a . or a -

11. Then comes [A-Za-z0-9]+, the match only works if the period or hyphen is followed by a set of alphanumeric characters. This effectively represents an email address that contains a (possible) set of .something or -something sections. Because the * is used, the pattern works if they are present and also if they aren't.

12. Finally the \.[A-Za-z0-9]+ pattern matches a . followed by a set of alphanumerics. Because it is the last part of the regular expression, it represents the final part of the email address, which is the .com bit.

13. The $ symbol ensures that the previous bit only matches at the END of the string.

14. The last character (the /) closes the first /
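To check the walkthrough above against some real input, here is a sketch that runs the same pattern over a handful of sample addresses. The helper name looksLikeEmail and the addresses are made up for the example; it takes the address as an argument instead of reading the form, so you can try it anywhere JS runs.

```javascript
// The pattern from the post, copied unchanged.
var myRegEx = /^\w+((-\w+)|(\.\w+))*\@[A-Za-z0-9]+((\.|-)[A-Za-z0-9]+)*\.[A-Za-z0-9]+$/;

// Hypothetical helper: same test as emailOK(), minus the form lookup.
function looksLikeEmail(address) {
    return address.search(myRegEx) != -1;
}

var results = {
    plain: looksLikeEmail("joe@example.com"),                 // true
    dottedName: looksLikeEmail("joe.bloggs@example.com"),     // true, step 4
    hyphenName: looksLikeEmail("joe-bloggs@example.com"),     // true, step 4
    subDomains: looksLikeEmail("joe@mail.example-site.co.uk"),// true, steps 7 to 11
    doubleDot: looksLikeEmail("joe..bloggs@example.com"),     // false, . must be followed by \w
    noDotCom: looksLikeEmail("joe@example"),                  // false, step 12 needs a final .something
    underscoreHost: looksLikeEmail("joe@my_host.com"),        // false, step 6 excludes _
    plusTag: looksLikeEmail("joe+tag@example.com")            // false, + is not covered by this pattern
};

// A match on an anchored pattern is reported at position 0, not as "true".
var matchIndex = "joe@example.com".search(myRegEx); // 0
```

Note the last two entries: the pattern checks the format described in the steps, nothing more, so an address like joe+tag@example.com gets turned away even though such addresses do exist in the wild.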

Simple .. innit? ;-)
