Regular Expressions Part XII: Now Let's Practice

November 27, 2006
By Robbin Steif

Now that I have learned and then explained all the Regular Expressions for Google Analytics:

Backslashes
Dots .
Carets ^
Dollar signs $
Question marks ?
Pipes |
Parentheses ()
Square brackets [] and dashes -
Plus signs +
Stars *
Regular Expressions for Google Analytics: Now let’s Practice
Bad Greed
RegEx and Good Greed
Intro to RegEx
{Braces}
Minimal Matching

let’s work backwards, i.e., look at some expressions and figure out what they mean and why.


I always hate when techies give a really simple explanation and then jump to the hardest example possible, so I will try not to do the same (which is easy for me, not being a techie and all.) Let’s start with these warm-up examples from the Wikipedia entry on Regular Expressions.

  • “.at” matches any three-character string ending in “at”, like hat, cat or bat. Reason: A dot matches any single character. So hat, cat and bat are all good matches, as would be any other single character followed by “at”.
  • “[hc]at” matches hat and cat. Reason: Square brackets create a list of characters, and you can match any one character in the list. So this expression matches hat by pulling the “h” out of the square brackets, and it matches cat by pulling the “c” out of the square brackets. Unlike the former example, it doesn’t match bat, because there is no “b” in the square brackets.
  • “[^b]at” matches all the matched strings from the regex “.at” except bat. Reason: This is an alternative use of the caret ^. When it is the first character inside square brackets, it means “not.” Thus, [^b] means, match any one character except a b.
  • “^[hc]at” matches hat and cat but only at the beginning of a line. Reason: This is the more standard use of the caret ^. It is not inside square brackets, so it means the RegEx will match only if your string starts at the beginning of the line.
  • “[hc]at$” matches hat and cat but only at the end of a line. Reason: This is identical to the second example in this list, except for the dollar sign at the end. The dollar sign ensures that the RegEx matches only if your string’s characters come at the end of a line.
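
If you want to check these five for yourself, here is a small sketch in Python. Python's re module is my assumption here, not what Google Analytics runs, but these basic symbols behave the same way in most RegEx flavors. The helper name matches and the sample strings “hat trick” and “top hat” are made up for illustration. I use re.search because it looks for the pattern anywhere in the string.

```python
import re

words = ["hat", "cat", "bat"]

def matches(pattern, candidates):
    """Return the candidates in which the pattern is found anywhere."""
    return [c for c in candidates if re.search(pattern, c)]

dot_at = matches(r".at", words)        # the dot accepts h, c and b
hc_at = matches(r"[hc]at", words)      # only h or c may come first
not_b_at = matches(r"[^b]at", words)   # anything except b may come first

# The anchors only make a difference when there is other text around the word.
phrases = ["hat trick", "top hat"]
starts = matches(r"^[hc]at", phrases)  # hat must begin the string
ends = matches(r"[hc]at$", phrases)    # hat must end the string

print(dot_at, hc_at, not_b_at, starts, ends)
```

The last two lists each contain one phrase: “hat trick” for the caret version and “top hat” for the dollar sign version.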

OK, here is a slightly harder one, also from Wikipedia:

((great )*grand)?((fa|mo)ther)

I will take this apart to make it easier to understand.

The parentheses create groups, and a question mark sits between the two outer groups. So we effectively have:

(expression in this set of parentheses)?(another expression in this set of parentheses)

Since a question mark usually means “include zero or one of the preceding expression,” we know that this RegEx is allowed to match just the stuff in the second set of parentheses. (Right? That’s what question marks do: they make whatever comes right before them optional. If the optional part is absent, only the characters after the question mark are left to match.) So, let’s start by looking at the second half only, which we know should be able to stand by itself:

((fa|mo)ther)

The pipe symbol | means OR. So this resolves to (father) OR (mother). You might reasonably ask, why do we need all the parentheses? Technically, we don’t need the outside set, but they make the expression easier to read when it is all together like this: ((great )*grand)?((fa|mo)ther). It would be perfectly reasonable to write the second half like this: (fa|mo)ther. We do need the inside set, because if we got rid of them, the expression would look like this: fa|mother, which means either fa OR mother.
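
Here is a quick sketch of why the inside parentheses matter, again using Python's re module as a stand-in (my assumption, not something Google Analytics uses). The variable names and the test word “fan” are just for illustration.

```python
import re

grouped = re.compile(r"(fa|mo)ther")   # fa OR mo, then ther
ungrouped = re.compile(r"fa|mother")   # fa, OR the whole word mother

# With the parentheses, the whole word father is matched.
grouped_father = grouped.search("father").group(0)

# Without them, the match stops after "fa" because "fa" alone satisfies the OR.
ungrouped_father = ungrouped.search("father").group(0)

# So the ungrouped version also matches words that have nothing to do with parents.
ungrouped_fan = ungrouped.search("fan") is not None
grouped_fan = grouped.search("fan") is not None

print(grouped_father, ungrouped_father, ungrouped_fan, grouped_fan)
```

The grouped expression leaves “fan” alone, while the ungrouped one happily matches it.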

Now let’s go back and look at the first half, the part that came before the question mark:

((great )*grand)

The star tells us to match zero, one or more than one instance of the expression before it, which here is “great ” (the word plus a trailing space). So it can match a string which doesn’t include great, in which case we just have grand, and of course, we always have the end of the expression, which will either be mother or father. So we might match grandmother or grandfather. It can match a string which includes great just once, in which case we have great grandmother OR great grandfather. And it can match a string which includes great more than once, so we might end up with great great great great grandmother OR great great grandfather. Notice that the only spaces allowed are the ones after each great. There is no space between grand and mother in the expression, so “great grand mother” as three separate words is not a match for the whole thing.
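
To see the whole expression at work, here is a sketch that tests it against a handful of ancestors. As before, Python's re module is my assumption, and the list of candidate strings is made up. I use fullmatch so that the entire string has to fit the expression. A plain search would still find the “mother” at the end of “great grand mother” and call that a match.

```python
import re

ancestor = re.compile(r"((great )*grand)?((fa|mo)ther)")

candidates = [
    "mother",                               # optional first half skipped entirely
    "grandfather",                          # zero greats
    "great grandmother",                    # one great
    "great great great great grandmother",  # several greats
    "great grand mother",                   # stray space between grand and mother
]

# True when the entire string fits the expression, False otherwise.
results = {c: ancestor.fullmatch(c) is not None for c in candidates}

for name, ok in results.items():
    print(f"{name!r}: {ok}")
```

Every candidate fits except the last one, which has the extra space.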

So there you have it, all your ancestors with just one Regular Expression.

If you didn’t understand any of that, please send me email, steif -at- www.lunametrics.com. (I am always disappointed that no one ever comments on RegEx posts.)

Robbin