Eliminating Bot Traffic From Google Analytics Once And For All

April 1, 2015 | Alex Moore
If you’ve used Google Analytics, you’ve probably wanted to know: how much of our traffic actually comes from real human beings? In Google Analytics, it’s not always clear. Imperva Incapsula reports that bots now account for 56% of all traffic on a typical website. We would hope that most of this traffic is eliminated from Google Analytics. Is it?

Imperva Incapsula

It used to be simple: Bots would not process JavaScript. Google Analytics uses JavaScript. Therefore, bots would not show up in Google Analytics.

bot-keyboard

However, with the proliferation of jQuery, single-page web applications and dynamic Ajax calls, smart bots have taken over. Now a bot can process JavaScript (and potentially Google Analytics) in much the same way as a real human’s web browser. And it’s a good thing too: if today’s search engine crawlers could not process JavaScript, much of the human-readable web would be hidden from search engines.

But there are also evil smart bots: bots that go crawling for content they can scrape and use for their own nefarious gain. Some bots crawl web pages just to wreak havoc on web servers and increase costs for site owners.

Good search engine bots, generally speaking, will be excluded from Google Analytics automatically. They also follow directives that are outlined in a website’s robots.txt file, or in its meta tags, and crawl only the pages they’re supposed to. Good bots intentionally prevent requests from being sent to Google Analytics’ servers.

It’s the evil ones, the ones that break all the rules and process JavaScript, that we’re most concerned about. These “bad bots” account for a staggering 27% of all web traffic, at least according to the same Incapsula study.

We need a solution to filter out bad bots from Google Analytics, so that we can confidently report on our data and know that the traffic totals, behavior statistics and conversion percentages we see are really from human beings.

What Can Be Done About It?

Our team has developed the ultimate step-by-step process to totally eliminate bots from Google Analytics, once and for all. Despite all the articles over the years surrounding this topic, we believe that no method has ever so thoroughly and systematically extinguished bot traffic.

In fact, we guarantee that not a single bot will be recorded inside your Google Analytics reports, if you follow each of these steps.

Step 1 – Check the Box in the Admin View Settings

admin-bot-filtering

There is an option (since July 2014) inside the Admin, under View Settings, to remove known bots from Google Analytics. Sayf Sharif wrote about this in the past:

DO NOT check this box on your main reporting View. He recommends creating a Test View first, so that you can see what impact this option will have on your data collection moving forward.

Google Analytics matches each visitor’s User Agent string to the list of known bots and spiders on the IAB/ABC International Spiders & Bots List. This is a paid list, so we’re not entirely sure which bots are included. (However, a Google search reveals this IAB list from 2013.)

Step 2 – Eliminate Bots by IP Address

While you can’t see IP addresses in Google Analytics reports, you can block IPs using Google Analytics View Filters. If you’re feeling adventurous, or if there is a pesky IP that is eating up your monthly quota of Google Analytics hits (10 million in the free version, 1 billion+ for GA Premium), you can follow Jon Meck’s instructions and block by IP address in Google Tag Manager, so those users aren’t even served Google Analytics code.
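
An exclude filter on IP address takes a pattern rather than a list, so it can help to generate that pattern instead of typing it by hand. Here is a minimal sketch, assuming the filter field accepts a regular expression. The function name is ours (it is not part of Google Analytics), and the addresses are placeholders from the reserved documentation ranges.

```python
import re


def build_ip_exclude_pattern(addresses):
    """Return an anchored regex that matches exactly the given IPv4 addresses."""
    # re.escape turns each "." into "\." so it only matches a literal dot.
    escaped = [re.escape(addr) for addr in sorted(set(addresses))]
    return "^(" + "|".join(escaped) + ")$"


# Placeholder addresses; substitute the IPs you actually want to exclude.
pattern = build_ip_exclude_pattern(["203.0.113.7", "198.51.100.24"])
print(pattern)
```

Escaping the dots matters: an unescaped "." matches any character, so a sloppy pattern can quietly exclude more addresses than you intended.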

As a side note, you may want to block bots from visiting your website at all, especially if a particularly nasty bot keeps spamming your website or is engaging in a DDoS attack. To do so in an Apache environment, you can edit your website’s .htaccess file to block IP addresses from loading web resources, using this tool and perhaps a little regular expression trickery.
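
If you would rather generate those .htaccess rules than write them by hand, something like the following could work. This is only a sketch: it assumes Apache 2.4 with mod_authz_core, assumes your host allows authorization directives in .htaccess, and uses a helper name and addresses that we made up for illustration.

```python
import ipaddress


def htaccess_block_rules(blocked):
    """Build Apache 2.4 authorization rules denying the given IPs or CIDR ranges."""
    lines = ["<RequireAll>", "    Require all granted"]
    for entry in blocked:
        # ip_network validates the entry (raising ValueError on garbage)
        # and normalizes a single address to a /32 network.
        network = ipaddress.ip_network(entry, strict=False)
        lines.append(f"    Require not ip {network}")
    lines.append("</RequireAll>")
    return "\n".join(lines)


# Placeholder entries from the reserved documentation ranges.
rules = htaccess_block_rules(["203.0.113.7", "198.51.100.0/24"])
print(rules)
```

Validating each entry before it reaches the file is the main point here, since a malformed directive in .htaccess can take the whole site down with a server error.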

Step 3 – Eliminate Bots by User Agent

custom-dimension-user-agent

Of course, bots can switch between several IP addresses, making identifying them especially difficult. Enter Custom Dimensions. It is possible to pass each of your visitors’ User Agent strings into Google Analytics as a custom dimension, using Google Tag Manager. (You are using Google Tag Manager, aren’t you?) Then you can exclude all sessions based on User Agents that you know to be bots.

Begin by creating a Custom Dimension in Google Analytics, in the Admin. The Custom Dimension will be called “User Agent”, and we’ll set it at the Session scope. (Take note of the “Index” so we can refer to it in a moment.)

Next, in Google Tag Manager, we’ll create a new Variable. Dan Wilkerson explains how easy it is to retrieve the User Agent in Google Tag Manager. Simply create a JavaScript Variable with the value navigator.userAgent. See below:


user-agent-variable

The final step is to populate your Google Analytics Pageview tag with the custom dimension slot, using the “Index” from earlier. Enter the Variable {{User Agent}} for the Value.
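
Google Tag Manager builds the hit for you, but it can be useful to see what the custom dimension looks like on the wire. The sketch below assembles a Universal Analytics Measurement Protocol pageview payload by hand, where a custom dimension travels as the parameter cd followed by its index. The tracking ID, client ID, and the index of 1 are placeholder assumptions, and the function name is ours.

```python
from urllib.parse import parse_qs, urlencode


def pageview_payload(tracking_id, client_id, page_path, user_agent, dimension_index=1):
    """Encode a pageview hit that carries the User Agent as a custom dimension."""
    params = {
        "v": "1",                # Measurement Protocol version
        "tid": tracking_id,      # property ID (placeholder below)
        "cid": client_id,        # anonymous client ID
        "t": "pageview",         # hit type
        "dp": page_path,         # document path
        f"cd{dimension_index}": user_agent,  # custom dimension slot
    }
    return urlencode(params)


payload = pageview_payload(
    "UA-XXXXX-Y", "555", "/home", "Mozilla/5.0 (X11; Linux x86_64)", dimension_index=1
)
print(payload)
# Decoding the payload confirms the dimension survived URL encoding.
print(parse_qs(payload)["cd1"])
```

If your Custom Dimension was assigned a different Index in the Admin, pass that number as dimension_index instead.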

tag-custom-dimension-user-agent

Now you’ll want to wait a day or two for your User Agents to begin to enter Google Analytics, so that you can identify which User Agents in particular are behaving like bots. Take note of users with consistently high bounce rates, or users with hundreds of repeat visits in a single day, for example. Over time you can begin to write Google Analytics Filters to eliminate those User Agents, using the settings under Admin -> View -> Filters:

bot-filter
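
The "behaving like bots" test above can be roughed out in a few lines. This sketch assumes you have exported one day of sessions as (user agent, bounced) pairs. The thresholds, function names, and sample User Agents are all our own assumptions, so tune them against your real traffic before excluding anything.

```python
import re
from collections import defaultdict


def suspicious_user_agents(sessions, min_sessions=100, min_bounce_rate=0.95):
    """Flag User Agents with many sessions in one day and a very high bounce rate."""
    totals = defaultdict(lambda: [0, 0])  # user agent -> [sessions, bounces]
    for user_agent, bounced in sessions:
        totals[user_agent][0] += 1
        totals[user_agent][1] += 1 if bounced else 0
    return sorted(
        ua for ua, (count, bounces) in totals.items()
        if count >= min_sessions and bounces / count >= min_bounce_rate
    )


def build_user_agent_filter(user_agents):
    """Join the flagged User Agents into one escaped pattern for an exclude filter."""
    if not user_agents:
        # An empty pattern would match every session, so refuse to build one.
        raise ValueError("no User Agents to exclude")
    return "|".join(re.escape(ua) for ua in user_agents)


# Made-up sample data: one noisy scraper and one ordinary browser.
sessions = (
    [("ExampleScraper/1.0", True)] * 150
    + [("Mozilla/5.0 (Windows NT 6.1)", True)] * 3
    + [("Mozilla/5.0 (Windows NT 6.1)", False)] * 7
)
flagged = suspicious_user_agents(sessions)
pattern = build_user_agent_filter(flagged)
print(flagged, pattern)
```

Requiring both conditions keeps a popular browser with a mediocre bounce rate from being flagged just because it shows up often.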

Step 4 – Require Completion of a CAPTCHA

google-recaptcha

Sometimes filtering by User Agent just isn’t specific enough, because some bots mask their true identity, or spoof themselves as regular users. We’ll need another solution.

In December 2014, Google introduced a new version of their popular reCAPTCHA service, nicknamed “No CAPTCHA”. (reCAPTCHA was actually developed here in Pittsburgh!) This new version of reCAPTCHA is able to detect subtle cues based on typical human behavior (mouse usage, in particular) and in most cases eliminates the need for your human users to type in a CAPTCHA phrase at all!

When your visitors come to your website for the first time, you will display this reCAPTCHA; just follow the simple instructions in the documentation. In this case, you will NOT fire Google Analytics until the reCAPTCHA is completed successfully.

You’ll require each new user to complete the reCAPTCHA, and then your application will set a session cookie upon a successful completion. We are confident that the vast majority of your bot traffic will be eliminated from Google Analytics using this method. This is a good start, but for full protection, continue reading.
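
On the server, the flow described above might look roughly like this. It is a sketch under stated assumptions: the token is whatever the reCAPTCHA widget submitted with the form, the secret is your own key, and the session is modeled as a plain dictionary with a flag name (captcha_passed) that we invented. The HTTP call is injectable so the logic can be exercised without touching the network.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"


def post_form(url, fields):
    """POST form fields and return the decoded JSON response."""
    data = urlencode(fields).encode("utf-8")
    with urlopen(url, data=data, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def is_human(token, secret, post=post_form):
    """Ask the reCAPTCHA verification endpoint whether the token is valid."""
    if not token:
        return False  # nothing was submitted, so there is nothing to verify
    result = post(VERIFY_URL, {"secret": secret, "response": token})
    return bool(result.get("success"))


def should_load_analytics(session):
    """Only emit the Google Analytics snippet for sessions that passed the check."""
    return session.get("captcha_passed") is True
```

After a successful is_human call, your application would set session["captcha_passed"] = True, and your page template would consult should_load_analytics before rendering the tracking code.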

Step 5 – Confirm Your Users’ Email

email



After the successful CAPTCHA completion, your website must now present to your users a form, asking them to fill in their email address. The email must be valid. Upon entering an email, your website should display a message that reads, “Thank you for entering your email. Please check your inbox for our confirmation email, and follow the instructions in that message.” (You may change the messaging as you see fit.)

Your users will then need to check their email, and click on the confirmation link to access the website. This may take up to 24 hours, and some users’ spam filters will inevitably block the activation email, but this is the necessary cost of bot-free Google Analytics data.

Warning: Some bots can auto-verify emails sent as part of a confirmation process. The most advanced bots have even begun to register their own email addresses, which then become impossible to blocklist! That’s why you’ll need to require users to demonstrate their authenticity by answering a few more questions.

Step 6A – Answer a Math Problem

math-captcha

After the user verifies their email address, present them with a short mathematical CAPTCHA.
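
A short mathematical CAPTCHA does not need much machinery. The sketch below is one hypothetical way to do it, with function names of our own choosing; it assumes you keep the expected answer on the server (in the session, say) and never send it to the browser.

```python
import random


def make_math_challenge(rng=random):
    """Return a question to display and the answer to keep server-side."""
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    return f"What is {a} + {b}?", a + b


def check_answer(submitted, expected):
    """Compare the visitor's typed answer against the stored one."""
    try:
        return int(submitted.strip()) == expected
    except ValueError:
        return False  # anything that is not a whole number fails


question, answer = make_math_challenge()
print(question)
```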

Step 6B – Find the Cat

recatcha-small



We call this step the reCATCHA. You will now present to your users a dialog box asking them to identify the cats in each picture. While some smart bots would be able to answer basic arithmetic, only the most sophisticated bots can identify the subtle differences between, say, cats and guinea pigs (bottom-center).

Step 7 – Magic Eye – Identify the 18th-Century Warship

One of the great challenges of eliminating bot traffic in Google Analytics is that, with each advancement in CAPTCHA technology, bot developers are expanding into once-inconceivable realms of human imitation. In order to ensure that your Google Analytics setup is built for the future, we strongly recommend that you present to your visitors a Magic Eye®, requiring them to type a description of the hidden image they see in the picture. See below:

magic-eye-sailboat-small

The answer here, of course, is “Man-of-war”. Sailboat, barque, ship of the line, galley, trireme, and schooner are incorrect responses. Only a robot would make such a mistake.

Step 8 – Enter Your Phone Number

two-step-auth-phone

Any thorough Google Analytics bot elimination strategy requires that your visitors enter their phone number, to which you’ll automatically send a confirmation code for two-step authentication. (Mobile carrier rates may apply.) In addition, we recommend a random audit of these phone numbers, with rolling spot-checks where you and your team actually call as many of the numbers as you can.

In practice, we’ve learned that even the most advanced robots make computerized, emotionally devoid and soulless responses on the phone. The success of this tactic for bot elimination depends on your visitor demographic.

Step 9 – Enter Your Mailing Address

Bots that can process JavaScript, have a valid email address, can answer a series of CAPTCHAs, can interpret three-dimensional images, have a broad knowledge of 18th-century British nautical history, and possess a valid phone number would normally appear inside Google Analytics. Luckily, you will now require that your visitors enter a valid mailing address to receive a two-step authentication hardware token. (No PO boxes allowed.)

two-factor-auth

The token will update every 30 seconds, synced to a series of satellites in orbit. Your users will have to enter the code in order to proceed to the website. You can also fire a Measurement Protocol hit with each update from the satellite. Please allow 2-3 weeks for delivery. (You may also wish to invoice your visitors.)
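
Satellites aside, codes that rotate every 30 seconds are commonly generated with the time-based one-time password algorithm (TOTP, RFC 6238). To be clear, that is our substitution and not something this process specifies. The sketch below uses the shared secret from the RFC's published test vectors, which you should obviously never use in production.

```python
import hashlib
import hmac
import struct


def totp(secret, unix_time, step=30, digits=6):
    """Compute a time-based one-time code (HMAC-SHA1 variant)."""
    # Every timestamp inside the same 30-second window shares one counter.
    counter = int(unix_time) // step
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low nibble of the last byte picks a 4-byte slice.
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


# Test-vector secret from the RFC, for demonstration only.
demo_secret = b"12345678901234567890"
print(totp(demo_secret, 59))
```

Your server computes the same code from the same secret and the current time, then compares it with what the visitor typed.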

Step 10 – Complete a Turing Test

turing-test

As it is theoretically possible that a bot would be able to correctly guess any two-step authentication code, a trained psychologist must now deliver a Turing Test to all remaining visitors, in order to determine if they are human. The test may be completed via Google Hangout or SurveyMonkey. Questions should include, “What did you come to this website to do?” and, “How disappointed would you be if you could not use this website?”

Step 11 – Find a Local Doctor

Your remaining visitors must now schedule a visit to a local physician’s office. The doctor must be a participating Google Analytics partner, in a Premium healthcare network only. Each visitor will be required to undertake a broad physical examination, to ensure that the visitor is actually a human being. A voluntary blood sample should be extracted and sent for additional laboratory testing.

Another blood sample will be sent to your company office to be recorded as a Custom Dimension, called “Blood Type”. We recommend setting this dimension at the “User” scope.

Step 12 – “Why I Am a Human” Essay

humanity-small



Very few bots will remain at this stage. However, in order to ensure that Google Analytics is completely, undeniably free of all bot traffic, you must require your remaining users to write a 10,000-word essay, entitled, “Why I Am a Human: A Retrospective.” This essay will be reviewed by a panel of dignitaries: academics, scientists, theologians, philosophers and poets, across a broad spectrum of humanity.

After weeks, or perhaps months, of careful deliberation, heated debate, and thoughtful reflection, the panel will deliver its ultimate judgement: is the visitor to your website truly human?

If so, then you should enable Google Analytics.

PLEASE NOTE: This may affect your conversion rate.

Start the Process of Verifying Your Humanity

Begin by filling out our CAPTCHA below:

recaptcha fools

Don’t forget to check out our (real) post on Bot and Spam filtering.