Regular Expressions For GA, Bonus III: Lookahead

August 8, 2007

NOTE: GA no longer supports negative lookahead. 

I hope this will be the last installment, for some time to come, of my Regular Expressions (RegEx) for Google Analytics series. At the end of the post, I have finally threaded all the RegEx posts.

Tonight you can see some very cool (if you care) regular expressions for GA that look ahead and decide whether the match is allowed. Sort of a conditional match. There are two kinds of lookahead: negative lookahead (don’t match if …) and positive lookahead (only match if …). Like braces, this is a RegEx feature that works in Google Analytics, but there is no documentation on it.


Here’s an example that I had today. I was working with a GA site where they have membership, not customers. The site owners use the word members in some of their URIs, and they also use membermail and memberthis and memberthat and membertheother. And a whole lot of other memberexamples.

So let’s say that I need to create a filter with a regular expression to include all those member URIs in GA, but I want to make sure that I don’t include membermail, since mail is in a special category in my marketing mix. Now we’re in a position to formulate the question really well: how do we include all the Request URIs that have the string member in them where that string is not followed by the string mail? In other words: don’t match member if it is followed by mail.

Steve gets the credit for this one. He suggested that GA might work with negative lookahead: the ability to combine regular expressions to say, “don’t match if it is followed by…” In our membermail example, the expression would be member(?!mail).

The opening parenthesis, followed by a question mark, tells the RegEx engine, “Watch out, lookahead coming.” The exclamation point says, “And it’s a negative.” Combined, they mean it’s a negative lookahead: don’t match the first part of the string if the second part is there. Don’t match member if it is followed by mail.
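Here is a minimal sketch of that negative lookahead, run through Python’s re module as a stand-in for GA (which, per the note at the top, no longer supports it). The URIs are made up for illustration, and Python’s engine is not guaranteed to behave exactly like GA’s.

```python
import re

# Hypothetical request URIs, invented for this example.
uris = [
    "/members/login",
    "/membermail/inbox",
    "/memberthis/page",
    "/about",
]

# Match "member" only when it is NOT immediately followed by "mail".
pattern = re.compile(r"member(?!mail)")

# Keep every URI where the pattern matches somewhere in the string.
included = [uri for uri in uris if pattern.search(uri)]
print(included)
```

The membermail URI is left out, and so is /about, since it never contained member in the first place.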

GA also handles positive lookahead. So if we want to match only membermail and not memberthis or memberthat and all the other URIs with member, we can write our RegEx like this: member(?=mail). In this case the open paren and the question mark do the same thing (“Watch out, lookahead coming”), but the equal sign says, “And it’s a positive match.”
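The positive version can be sketched the same way. Again, this uses Python’s re module rather than GA itself, and the URIs are invented for the example.

```python
import re

# Hypothetical request URIs, invented for this example.
uris = [
    "/members/login",
    "/membermail/inbox",
    "/memberthis/page",
]

# Match "member" only when it IS immediately followed by "mail".
pattern = re.compile(r"member(?=mail)")

# Keep every URI where the pattern matches somewhere in the string.
matched = [uri for uri in uris if pattern.search(uri)]
print(matched)
```

Only the membermail URI comes through.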

There is one last little fine point to wrap your head around, assuming you are not dizzy already. The lookahead string, or whatever you want to call the string mail in my example, is not part of the match. I know this sounds like gibberish, so let me give a last example. This one’s for all you RegEx fans. And for everyone on the Paris metro:

Example: Let’s say that I am doing a positive lookahead like this: member(?=s)hip. This means, match member only if it is followed by an s, and then please match hip, too. However, the string membership would not be a match. That seems a little ridiculous. After all, it is member, followed by an s, followed by hip, right?

Well, it doesn’t work that way. That’s because the s is only a condition. In the eyes of the RegEx engine, you sort of have a conditional regex that looks like this:

memberhip (notice, no s)

And, you are trying to match to

membership

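It’s not a match, because the s isn’t part of the match. It was just being used as a condition: the lookahead checks that an s comes next but doesn’t use it up, so the engine then tries to match the h against that same s and fails. To match membership, you would have to write the s again after the lookahead: member(?=s)ship.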
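If you want to see this for yourself, here is a small sketch in Python’s re module (an assumption on my part that it is a fair stand-in; GA’s engine may differ in other details).

```python
import re

# The lookahead sees the "s" but does not consume it, so "h" is then
# compared against that same "s" and the match fails.
no_match = re.search(r"member(?=s)hip", "membership")
print(no_match)

# Writing the "s" again after the lookahead lets the match succeed.
full = re.search(r"member(?=s)ship", "membership")
print(full.group())

# With the lookahead alone, the matched text stops before the condition.
prefix = re.search(r"member(?=mail)", "membermail")
print(prefix.group(), prefix.span())
```

The last line shows the fine point directly: the match is member, and mail was only ever the condition.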

OK, we are done for tonight.

Late note: A reader here, Alan, wrote a wonderful comment in which he shows other ways to implement and use the power of positive lookahead. You should read his comment, but the short version is, you can use positive lookahead to match if the “condition” is somewhere down the line. The “lookahead string” (the condition) doesn’t have to *immediately* follow the (?= if you use the syntax that he figured out. (See, now you have to read his stuff….)
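I can’t reproduce Alan’s exact syntax here, but one common way to get a “somewhere down the line” condition is to put .* inside the lookahead. The sketch below is only my assumption of the general idea, written in Python’s re module with made-up URIs; it is not necessarily what his comment uses.

```python
import re

# Assumed technique: ".*" inside the lookahead lets "mail" appear
# anywhere later in the string, not just right after "member".
pattern = re.compile(r"member(?=.*mail)")

# Hypothetical URIs, invented for this example.
later = pattern.search("/members/settings/mail")
never = pattern.search("/members/settings/profile")

print(later.group() if later else None)
print(never)
```

The first URI matches on member because mail shows up further along; the second has no mail at all, so there is no match.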

Here is the thread with all the RegEx posts.

Backslashes
Dots .
Carets ^
Dollar signs $
Question marks ?
Pipes |
Parentheses ()
Square brackets [] and dashes –
Plus signs +
Stars *
Regular Expressions for Google Analytics: Now we will Practice
Bad Greed
RegEx and Good Greed
Intro to RegEx
{Braces}
Minimal Matching

— Robbin