Regular Expressions Part XII: Bad Greed

December 2, 2006

Now that I have learned and explained the Regular Expression characters that Google Analytics uses:

Dots .
Carets ^
Dollar signs $
Question marks ?
Pipes |
Parentheses ()
Square brackets [] and dashes -
Plus signs +
Stars *

I want to explore another area: Regular Expressions and the concept of greediness.

You might be tempted to write a Regular Expression like this:

mypage

expecting it to match only the page on your site called /mypage/.

And this Regular Expression really does match /mypage/. But it also matches /mypage/thirdpage-and-something-else. For that matter, it matches /secondpage/mypage.html, mypage.htm and mypage.asp.
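Here is a quick way to see this for yourself. It is only a sketch: it uses Python's re module as a stand-in for Google Analytics, and it assumes the pattern in question is the bare word mypage. re.search looks for the pattern anywhere in the string, which is the "partial match" behavior described above.

```python
import re

# Assumption: the unanchored pattern is just the bare page name.
pattern = re.compile(r"mypage")

urls = [
    "/mypage/",
    "/mypage/thirdpage-and-something-else",
    "/secondpage/mypage.html",
    "mypage.htm",
    "mypage.asp",
]

# re.search succeeds if the pattern occurs anywhere in the string.
matched = [url for url in urls if pattern.search(url)]

for url in matched:
    print("matches:", url)
```

Every string in the list matches, because each one contains the letters mypage somewhere.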


Regular Expressions are greedy — they match and match as much as they can. Greed can be good, but first I want to write about the obvious problem, i.e. the RegEx (Regular Expression) will match too many strings to be useful.

We can deal with this in various ways:

1) Tell the RegEx where to start. In the above example, if we wrote the RegEx like this:

^/mypage/

it will only match when /mypage/ is at the beginning of the line, so it will never match /secondpage/mypage/, etc.
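A small sketch of the same idea, again using Python's re module as a stand-in for Google Analytics. The pattern ^/mypage/ and the sample URLs are assumptions chosen to mirror the example.

```python
import re

# Assumed pattern: a caret pins the match to the start of the string.
anchored = re.compile(r"^/mypage/")

urls = [
    "/mypage/",
    "/mypage/otherpages.htm",
    "/secondpage/mypage/",
    "/secondpage/mypage.html",
]

# Only URLs that begin with /mypage/ survive the filter.
matched = [url for url in urls if anchored.search(url)]

for url in matched:
    print("matches:", url)
```

Notice that /mypage/otherpages.htm still matches. The caret fixes where the match starts, but says nothing about where it ends.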

2) Tell the RegEx when to stop. We can do this in various ways, but we need to know how the expression ends. For example, are we only looking for mypage.htm, or are we also looking for all the pages that are in the /mypage/ folder, such as /mypage/otherpages.htm? If only mypage.htm matters, then we can include that in the RegEx:

/mypage\.htm$

Notice that I used a backslash to make the dot into a real dot and not a special character, and a dollar sign to say that this only matches at the end of the line. (That way, mypage.html won’t match.)

We can combine that with #1 above and create this RegEx:

^/mypage\.htm$

and never get any unexpected characters before the slash or after the htm.
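To compare the variants side by side, here is a sketch in Python (a stand-in for Google Analytics). The three patterns and the sample URLs are assumptions: one pattern anchored only at the end, one anchored at both ends, and one with the dot left unescaped to show why the backslash matters.

```python
import re

# Assumed patterns, written out for comparison.
end_only = re.compile(r"/mypage\.htm$")    # stop at the end of the line
both_ends = re.compile(r"^/mypage\.htm$")  # start and stop are both pinned
unescaped = re.compile(r"^/mypage.htm$")   # the dot matches ANY character here

urls = [
    "/mypage.htm",
    "/mypage.html",
    "/secondpage/mypage.htm",
    "/mypage-htm",
]


def matches(pattern):
    """Return the URLs in which the pattern is found."""
    return [url for url in urls if pattern.search(url)]


end_hits = matches(end_only)
both_hits = matches(both_ends)
loose_hits = matches(unescaped)

print("end only: ", end_hits)
print("both ends:", both_hits)
print("unescaped:", loose_hits)
```

With only the dollar sign, /secondpage/mypage.htm still gets through. With both anchors, only /mypage.htm is left. Without the backslash, /mypage-htm sneaks in, because a bare dot stands for any single character.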

On the other hand, if all the pages in the /mypage/ folder that have .htm suffixes are of interest, we could do this differently:

^/mypage/.*\.htm$

I really hate it when people throw this kind of gobbledygook at me, so let me see if I can explain it in pieces:

^/mypage/ = only consider the match if /mypage/ is at the start of a line…
.* = match everything that comes next until…
\.htm$ = you get to the last real period followed by htm, and it’s at the end of a line.
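The three pieces can be tried out together with a short sketch in Python (a stand-in for Google Analytics). The parentheses around .* are my own addition, purely so the code can print what the .* swallowed; the sample URLs are made up for illustration.

```python
import re

# Same three pieces as above; the group around .* is added only for display.
folder_htm = re.compile(r"^/mypage/(.*)\.htm$")

urls = [
    "/mypage/otherpages.htm",
    "/mypage/a.htm/b.htm",
    "/mypage/otherpages.html",
    "/secondpage/mypage/otherpages.htm",
]

# Map each URL to the text that .* consumed, or None if there was no match.
consumed = {}
for url in urls:
    m = folder_htm.search(url)
    consumed[url] = m.group(1) if m else None

for url, middle in consumed.items():
    print(url, "->", middle)
```

For /mypage/a.htm/b.htm, the .* takes a.htm/b, running right past the first .htm and stopping only at the last one. The .html page and the page under /secondpage/ do not match at all.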

Clear as mud, eh?