Cleaning Up URLs In Google Analytics

December 17, 2010
By Jonathan Weber,
Director of Data Platforms

We wrote recently about a visitor who asked about enforcing capitalization consistency in URLs in Google Analytics. This is a pretty common problem.

In fact, there are a variety of ways your URLs can show up inconsistently in Google Analytics. What you should recognize is that whatever URL appears in your browser is what Google Analytics records. Often, though, your webserver treats slightly different URLs as exactly the same page (differences in capitalization, a missing trailing slash, and so on). So if multiple versions of a URL are really the same page, we want to clean up those URLs in Google Analytics.

Default pages

Here’s one scenario: you look in your Top Content report and you see your home page in a couple of places, like this:

  • /
  • /index.php

Or something like that. (Remember the URLs we see here are only the part after the .com or .org or whatever. So one of these represents http://www.example.com/ and one represents http://www.example.com/index.php.) We know these are both really the home page, but seeing them as two separate URLs in this report isn’t very helpful. We have to add up the pageviews for both to see what the total number is.

Google Analytics gives us a simple way to fix this. It’s the “Default page” setting in your Profile Settings.

Here, we can just put in “index.php” as the default page. Now Google Analytics will just add “index.php” to any URL that ends in a slash. Tada!
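To make the effect concrete, here is a minimal Python sketch of what that setting does to the paths in the report. It is only an illustration of the behavior described above, not Google Analytics code, and the function name apply_default_page is made up for the example.

```python
def apply_default_page(path: str, default_page: str = "index.php") -> str:
    """Append the default page to any path that ends in a slash."""
    if path.endswith("/"):
        return path + default_page
    return path


# The two home page entries from the report, plus a directory URL.
paths = ["/", "/index.php", "/careers/"]
normalized = [apply_default_page(p) for p in paths]
print(normalized)  # ['/index.php', '/index.php', '/careers/index.php']
```

Both versions of the home page now collapse into a single row, which is the whole point.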

Multiple default pages

That doesn’t work for every scenario, however. Consider this:

  • /
  • /home.php
  • /careers/
  • /careers/index.php

Well, we can’t go and use the “Default page” setting from above, because now there are multiple possibilities, depending on where we are in the site.

Or, for that matter, what if we like the nice, clean, trailing slash URLs and want to get rid of all the index.php?

Well, this you can do with a Search and Replace filter. In the setup, I’m searching for anything ending in “/index.php” (the dollar sign means “ends with” in Regular Expressions), and I’m replacing that with just the slash. Strictly speaking, the dot should be escaped as “\.” so that it matches only a literal dot rather than any character. In the example above with both “index.php” and “home.php”, I could just create two filters, one for each. Once that’s done, for my data going forward, I just get the trailing-slash versions of the URLs.
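If you want to try the patterns out before creating the filters, a small Python sketch along these lines can help. It assumes one search-and-replace step per default page, applied in order; the names DEFAULT_PAGE_PATTERNS and strip_default_pages are invented for the example, and the dots are escaped so they match only a literal dot.

```python
import re

# One pattern per "filter": each matches a default page at the end of the path.
DEFAULT_PAGE_PATTERNS = [r"/index\.php$", r"/home\.php$"]


def strip_default_pages(path: str) -> str:
    """Replace a trailing default page with just the slash."""
    for pattern in DEFAULT_PAGE_PATTERNS:
        path = re.sub(pattern, "/", path)
    return path


for path in ["/", "/home.php", "/careers/", "/careers/index.php", "/careers/jobs.php"]:
    print(path, "->", strip_default_pages(path))
```

Here “/home.php” becomes “/” and “/careers/index.php” becomes “/careers/”, while “/careers/jobs.php” is left alone because it does not end in one of the default pages.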

Trailing Slashes

Here’s one more thing that can be a problem, and this one is really challenging:

  • /careers
  • /careers/
  • /careers/index.php

We’ve already solved the “index.php” problem. But notice we also have a problem with slashes. A lot of webservers automatically correct for this kind of thing with redirects. (Here’s some information on how to do it with Apache.) But if yours doesn’t, you can fix the data in Google Analytics (again, with filters).

This one is a little hard, because the patterns we want to match are kind of ambiguous. Here’s what I came up with, but chime in in the comments if you have a cleaner solution.

So, here’s the regular expression I used to match these URLs:

^(/[a-z0-9/_\-]*[^/])$

OK, so it starts with a slash (duh), then it contains 0 or more characters that are alphabetic (a-z), numeric (0-9), a slash, an underscore, or a hyphen. (You may have to adjust a little if you have other characters in your URLs.) Then it ends with a character that is NOT A SLASH. (That’s the important part.) The parentheses capture the whole URL so we can reuse it in the filter.

Why such a specific Regular Expression? Why not just:

[^/]$

That says “ends with any character that is not a slash”. Well, unfortunately that’s probably not specific enough. Because we might have pages like the following:

  • /careers/jobs.php
  • /careers/?search=web%20analyst

Notice that neither of those ends in a slash, but they’re not the kind of URLs we want to end in a slash, either. So we need a regular expression that doesn’t match a few key characters (like the dot and question mark) that clue us in that we have not just a directory name, but a full page in the URL.
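You can see the difference by running both patterns over a handful of sample paths. This is just a quick Python check, with the variable names LOOSE and STRICT chosen for the example. Note that Python matches case-sensitively by default, so the stricter pattern as written only covers lowercase paths here.

```python
import re

LOOSE = re.compile(r"[^/]$")                     # "ends with any non-slash"
STRICT = re.compile(r"^(/[a-z0-9/_\-]*[^/])$")   # the more specific pattern

paths = [
    "/careers",
    "/careers/",
    "/careers/jobs.php",
    "/careers/?search=web%20analyst",
]

loose_matches = [p for p in paths if LOOSE.search(p)]
strict_matches = [p for p in paths if STRICT.search(p)]

print(loose_matches)   # ['/careers', '/careers/jobs.php', '/careers/?search=web%20analyst']
print(strict_matches)  # ['/careers']
```

The loose pattern catches the page and the search URL along with the directory; the stricter one picks out only “/careers”.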

So then the Advanced Filter just grabs the part of the URL captured by the parentheses and writes it back out with a slash appended.
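As a rough sketch of that step in Python: extract the captured group and rebuild the URL with a trailing slash. In the filter itself the output would typically be written with a placeholder for the captured group followed by a slash (something like $A1/, which is an assumption here since the original setup isn't shown). The name add_trailing_slash is invented for the example.

```python
import re

DIRECTORY_PATTERN = re.compile(r"^(/[a-z0-9/_\-]*[^/])$")


def add_trailing_slash(path: str) -> str:
    """Append a slash to directory-style paths; leave everything else alone."""
    match = DIRECTORY_PATTERN.match(path)
    if match:
        # group(1) is the whole matched URL, captured by the parentheses.
        return match.group(1) + "/"
    return path


for path in ["/careers", "/careers/", "/careers/jobs.php"]:
    print(path, "->", add_trailing_slash(path))
```

Only “/careers” is rewritten (to “/careers/”); paths that fail to match pass through unchanged.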

So those are a few of the ways you can clean up URLs and improve the data you have in Google Analytics.