How To Filter Bot Traffic From Your Google Analytics

September 05, 2013
By Jim Gianoglio

UPDATE: July 30, 2014 – Google announced a feature to automatically filter out bots and spiders. Learn more here.

[Image: party crashers]

Don’t let bad data crash your analytics party.

One of the benefits of client-side, tag-based analytics (as opposed to server-side analytics) is that you generally don’t have to filter out traffic from bots, since most bots never execute the JavaScript tracking code.

However, it seems lately that some bots (*cough, cough* Microsoft) have been showing up in Google Analytics like an uninvited guest, crashing the data party.

For example, this is the graph of traffic to a site showing visits from a Bing bot:

[Graph: Visits from Bing Bot]

Overall visits to this site increased (artificially) 80.5% on August 14 (see the big spike to the right) – with about 90,000 visits from Bing’s crawler.

This is bad for at least two reasons:

  • Bot visits skew your data: they artificially inflate visits and unique visitors, increase bounce rate, and decrease pages/visit, average visit duration, goal conversion rate, ecommerce conversion rate, etc.
  • Bot visits worsen the negative side effects of sampled data in Google Analytics. Even though the visits come from bots, they still count toward the visit total when it comes to sampling.
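
To get a feel for how much a spike like this can distort blended metrics, here is a rough sketch in JavaScript. The 90,000 bot visits and the 80.5% increase come from the example above. The human bounce rate and pages/visit are invented for illustration, and the baseline assumes the whole increase came from the bot.

```javascript
// Hypothetical illustration of how bot visits distort blended metrics.
// Only botVisits and the 80.5% increase come from the article's example;
// the "human" bounce rate and pages/visit are made-up numbers.

function blend(human, bot) {
  const visits = human.visits + bot.visits;
  return {
    visits,
    // Visit-weighted averages of the two populations.
    bounceRate:
      (human.visits * human.bounceRate + bot.visits * bot.bounceRate) / visits,
    pagesPerVisit:
      (human.visits * human.pagesPerVisit + bot.visits * bot.pagesPerVisit) / visits,
  };
}

const botVisits = 90000;
// If 90,000 extra visits are an 80.5% increase, the baseline is 90,000 / 0.805.
const humanVisits = Math.round(botVisits / 0.805);

const human = { visits: humanVisits, bounceRate: 0.4, pagesPerVisit: 3.0 }; // assumed
const bot = { visits: botVisits, bounceRate: 1.0, pagesPerVisit: 1.0 }; // bot-like profile

const blended = blend(human, bot);
console.log("Baseline visits:", humanVisits);
console.log("Blended visits:", blended.visits);
console.log("Blended bounce rate:", (blended.bounceRate * 100).toFixed(1) + "%");
console.log("Blended pages/visit:", blended.pagesPerVisit.toFixed(2));
```

With these assumed inputs, a site whose real bounce rate is 40% would report a figure well above that, purely because of the crawler.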

Below, I will show you:

  • How to find out if you have this problem with bot traffic
  • How to get rid of the bot traffic from your reports

Uncovering Bot Traffic

You might be wondering if you have this problem in the first place. To find out if bots are crashing your data party, go to the Audience > Technology > Browser & OS report. The browser to look for is Mozilla Compatible Agent.

[Screenshot: Mozilla Compatible Agent in the Browser & OS report]

Now, just because the browser is Mozilla Compatible Agent doesn’t mean it’s a bot. There are other non-bots that use that user agent (some browsers in mobile apps, for example).

If you do have a problem with bot traffic, however, this is the canary in the coal mine.

If you see an unusually high number of visits from Mozilla Compatible Agent, you can go over to your Audience > Technology > Network report and apply this advanced segment (to show only visits where the Browser contains Mozilla Compatible Agent).

Look for visits from the following service providers:

  • microsoft corp
  • google inc.
  • yahoo! inc.
  • inktomi corporation
  • stumbleupon inc.

Also pay attention to the metrics – visits from bots will likely have close to 100% new visits, 100% bounce rate, 00:00:00 average visit duration, and 1 page/visit.
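
If you export the Network report, a few lines of JavaScript can do this screening for you. This is only a sketch: the row shape (`serviceProvider`, `pctNewVisits`, and so on), the 2% tolerance, and the sample rows are all assumptions made for the example, not anything Google Analytics hands you in this form.

```javascript
// Service providers listed in the article as likely bot sources.
const SUSPECT_PROVIDERS = [
  "microsoft corp",
  "google inc.",
  "yahoo! inc.",
  "inktomi corporation",
  "stumbleupon inc.",
];

// Flags a report row when it comes from a suspect provider AND its metrics
// look bot-like: ~100% new visits, ~100% bounce rate, ~0 duration, ~1 page/visit.
// The tolerance is an arbitrary choice for this sketch.
function looksLikeBot(row, tolerance = 0.02) {
  const suspectProvider = SUSPECT_PROVIDERS.includes(
    row.serviceProvider.toLowerCase()
  );
  const botLikeMetrics =
    row.pctNewVisits >= 1 - tolerance &&
    row.bounceRate >= 1 - tolerance &&
    row.avgVisitDurationSec <= 1 &&
    row.pagesPerVisit <= 1 + tolerance;
  return suspectProvider && botLikeMetrics;
}

// Made-up rows for illustration only.
const rows = [
  { serviceProvider: "microsoft corp", pctNewVisits: 0.998, bounceRate: 0.999, avgVisitDurationSec: 0, pagesPerVisit: 1.0 },
  { serviceProvider: "google inc.", pctNewVisits: 0.45, bounceRate: 0.38, avgVisitDurationSec: 142, pagesPerVisit: 3.2 },
  { serviceProvider: "example isp", pctNewVisits: 1.0, bounceRate: 1.0, avgVisitDurationSec: 0, pagesPerVisit: 1.0 },
];

const flagged = rows
  .filter((row) => looksLikeBot(row))
  .map((row) => row.serviceProvider);
console.log(flagged);
```

Requiring both conditions is deliberate: plenty of real people browse from Microsoft or Google networks, so the provider name alone is not proof of a bot.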

Kicking out these uninvited guests

No one likes a party crasher, and you’ll want to kick them out quickly. The easiest way to do this is to create and apply a filter to your view (profile) that excludes based on the ISP Organization (i.e. Service Provider).

Set your filter up as follows:

[Screenshot: Google Analytics filter to exclude smart bots]

For the Filter Pattern, use the following regular expression:

^(microsoft corp(oration)?|inktomi corporation|yahoo! inc\.|google inc\.|stumbleupon inc\.)$|gomez

This will take care of the main offenders. Of course, if you noticed other service providers in your data that look like bots that aren’t included in the filter above, be sure to include them!
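
Before saving the filter, it can be worth checking the pattern against a few names from your own report. The sketch below uses the filter pattern (dots escaped so they match only a period) in JavaScript. The `i` flag is there on the assumption that the filter is left case-insensitive, and the sample names other than those listed above are hypothetical.

```javascript
// The filter pattern, with literal dots escaped.
// The "i" flag assumes the filter is not set to be case sensitive.
const BOT_PROVIDERS =
  /^(microsoft corp(oration)?|inktomi corporation|yahoo! inc\.|google inc\.|stumbleupon inc\.)$|gomez/i;

// Sample service provider names; the last three are invented for this check.
const samples = [
  "microsoft corp",
  "microsoft corporation",
  "Google Inc.",
  "gomez networks", // matches because "gomez" is not anchored
  "microsoft corp internal", // no match: the group is anchored with ^ and $
  "example cable communications",
];

for (const name of samples) {
  const verdict = BOT_PROVIDERS.test(name) ? "excluded" : "kept";
  console.log(name + " -> " + verdict);
}
```

Note the asymmetry in the pattern: the names inside the parentheses must match the whole field, while `gomez` matches anywhere in it. Keep that in mind when you add your own providers.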

Unfortunately, filters only apply to your data moving forward (not to historical data). So to remove these bot visits from your historical data, you’ll need to create an advanced segment (or just copy this one).

Making sure they never get through the front door

Unfortunately, even though you can filter these bots out of your data, they still count toward the total number of “visits” to your site from GA’s perspective. To put it another way, these bot visits can cause your data to be heavily sampled, even though you’re filtering them out.

Sampling happens at the web property level whenever there are more than 250,000* visits for the selected date range and you request data that is not pre-calculated (for example, when you apply an advanced segment, secondary dimension, or custom report).

* This sample size can be adjusted to a maximum of 500,000 visits
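
The arithmetic is simple, but it is easy to forget that filtered visits still count. Here is a small sketch using the thresholds quoted above. The function name and the visit counts are hypothetical, and the check only models the visit threshold, not the other conditions that trigger sampling.

```javascript
// Thresholds quoted in the article.
const DEFAULT_SAMPLING_THRESHOLD = 250000;
const MAX_SAMPLING_THRESHOLD = 500000;

// Hypothetical helper: would a non-pre-calculated report be sampled?
// Bot visits count toward the property-level total even when a view
// filter hides them from your reports.
function isSampled(humanVisits, botVisits, threshold = DEFAULT_SAMPLING_THRESHOLD) {
  return humanVisits + botVisits > threshold;
}

// Invented numbers: 200,000 real visits in the date range, plus a 90,000-visit bot spike.
console.log(isSampled(200000, 0)); // real visits alone stay under the default threshold
console.log(isSampled(200000, 90000)); // the bot spike pushes the total over it
console.log(isSampled(200000, 90000, MAX_SAMPLING_THRESHOLD)); // raising the limit helps here
```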

We’ve covered the problems with sampling (and how to work around it) before.

Getting rid of this unwanted guest once and for all requires a more sophisticated solution, which involves modifying your Google Analytics tracking code.

At a high level, you would wrap your tracking code in a function that checks whether the “visitor” is a human or a bot: if it is a human, execute the Google Analytics tracking code; otherwise, skip the tracking code altogether. To keep with the analogy, this would be like having a bouncer at the front door, only letting real visitors past the velvet rope and telling the bots to “take a hike!”
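
To make the idea concrete, here is a minimal sketch of such a bouncer in JavaScript. It is not the specific solution mentioned in this post, just one simple way to do it. The function names and the user agent pattern are assumptions, and a user agent check only catches bots that identify themselves, so treat it as a starting point.

```javascript
// Assumed pattern: substrings commonly found in self-identifying crawler
// user agents (e.g. "bingbot", "Googlebot", "Yahoo! Slurp", "BingPreview").
const BOT_UA_PATTERN = /bot|crawl|spider|slurp|bingpreview/i;

// Hypothetical helper: treats a missing user agent as a bot, which is a
// judgment call made for this sketch.
function isLikelyBot(userAgent) {
  return !userAgent || BOT_UA_PATTERN.test(userAgent);
}

// The "bouncer": runs the tracking callback only for likely humans.
// Returns true when tracking ran, false when it was skipped.
function trackIfHuman(userAgent, runTracking) {
  if (isLikelyBot(userAgent)) {
    return false;
  }
  runTracking();
  return true;
}

// Demo with a stand-in for the tracking snippet.
let pageviewsSent = 0;
const fakeTracking = () => {
  pageviewsSent += 1;
};

trackIfHuman(
  "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
  fakeTracking
);
trackIfHuman(
  "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36",
  fakeTracking
);
console.log("Pageviews sent:", pageviewsSent);

// On a real page you would pass navigator.userAgent and put your existing
// Google Analytics snippet inside the callback:
// trackIfHuman(navigator.userAgent, function () { /* GA tracking code here */ });
```

Because the check runs before the tracking code, skipped bot visits never reach Google Analytics at all, so they cannot count toward the sampling threshold.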

Are you interested?

If you’re interested in a specific solution for doing the above, let me know in the comments. If there’s enough interest, I’ll follow up with the code to do it.