How To Solve Google Analytics Sampling: 8 Ways To Get More Data

June 24, 2013
What do Bald Eagles have to do with Data sampling? Nothing. They're just awesome.


What is Sampling?

Sampling is a tried and true statistical technique. You see it every time you hear about a political poll, or anything like that. 68% of people prefer dogs to cats. 28% of Americans think that God created Doritos. They didn’t actually ASK everyone. They asked a small subset of people, that hopefully is large enough to make what they’re reporting on accurate. So if they’re saying “All Americans” they might interview 1,000 people out of the 300 million that live here, and pretend it’s a legitimate sample.

It also leads to all the political polls being different, and to arguments about whose poll is better. Those arguments are generally about what sort of sample is being taken. How many people are they asking, what’s their makeup, how many men versus women, etc.

The basic idea is that you simply can’t ask 100% of people certain questions, so you ask a subset, a.k.a. a sample, and that subset then represents the larger group.
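To put a rough number on "large enough", here is a minimal sketch of the textbook margin-of-error formula for a proportion. It assumes a simple random sample and a 95% confidence level, which real polls (and Google Analytics) may not match exactly, and the function name is just for illustration.

```python
import math

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion.

    Assumes a simple random sample; proportion=0.5 is the worst case.
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

# The 1,000-person poll from the example above
poll_moe = margin_of_error(1000)
print(f"1,000 respondents: +/- {poll_moe:.1%}")

# A much smaller sample is much less precise
small_moe = margin_of_error(100)
print(f"100 respondents: +/- {small_moe:.1%}")
```

Under these assumptions the margin depends on how many people you asked, not on how many people exist, which is why 1,000 respondents can stand in for a very large population.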

How does it work in Google Analytics?

The default sample is 250,000 visits (not pageviews… visits. It’s session based), but you can normally adjust a slider on the page (located in the top right… it looks like a checkerboard grid) to make the sample more precise (but slower to process) and have it cover 500,000 visits. Whenever you’re looking at a Standard Report you’re going to see unsampled data. So if you were to go look at All your Traffic Sources, you’ll see the correct numbers. However, if you are looking at a set of visits over 250k (or 500k if you up your limit) and you create a custom report, filter the report in any way, add a segment, etc., then you’re going to be looking at a sample.

How do I know if What I’m Seeing is Sampled?


Keep your eyes peeled for the yellow sampling box in the top right of your Google Analytics interface. It lets you know how many visits are part of the sample, and what percent that is. Keep in mind that this is the percent of the PROPERTY, not the Profile you’re in. It’s a sample from ALL the visits that come into that web property, regardless of what profile you are in. Say the profile you are in shows only 50,000 visits over the time span you’re looking at, but the property as a whole has 20 websites in it, and each one tracks 50k in traffic. The property itself will then have a million visits. So while you might be looking at a profile filtered to show just that single website, applying a segment to those 50,000 visits will cause sampling to kick in, at best at 50% if you have it on the highest precision.
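The arithmetic behind that 50% figure can be sketched in a few lines. This is only a back-of-the-envelope model using the 250,000 and 500,000 visit limits described in this post; the function name is made up, and how Google actually picks the sample is not something this sketch claims to reproduce.

```python
def sample_rate(property_visits, max_sample=500_000):
    """Best-case share of the property's visits that fits in the sample."""
    if property_visits <= max_sample:
        return 1.0  # under the limit, so nothing needs to be sampled
    return max_sample / property_visits

sites = 20
visits_per_site = 50_000
property_visits = sites * visits_per_site  # 1,000,000 visits in the property

highest_precision = sample_rate(property_visits)
default_precision = sample_rate(property_visits, max_sample=250_000)

print(f"Highest precision: {highest_precision:.0%}")
print(f"Default precision: {default_precision:.0%}")
```

Note that the 50,000 visits in the single profile never enter the calculation. Only the property total does.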

Why is Sampling a Problem?

Well, it’s not always. If you’re in one profile, and it accounts for 99+% of the total traffic to the web property, and you apply a segment and you’re looking at an 80-90% sample, then it can be generally pretty accurate. You have to be aware that they’re not precise numbers, so if you need to pay authors based on pageviews of their articles, you definitely don’t want to do it based on a sampled report. However if you’re just looking at keywords or general trends an 80%+ sample rate is very usable.

At what point can you be comfortable? Well, when you see particularly small sample rates, like under 1%, you can be pretty sure that it’s junk data. Once the sample gets small enough, it’ll become obvious. You’ll see the same number repeated over and over and over. 330 visits on this campaign, 330 visits on that campaign, etc. But even before that point, the data can be very inaccurate.


With samples under 10% we’ve seen huge swings. Once, in a sampled report, a premium client’s third-party vendor was worried about the revenue from paid ads because they had generated “only” $900,000 in revenue on the site. When we looked at the unsampled data (rather than a less-than-1% sample), however, there had actually been $1.6 million in revenue from the ads, a nearly 80% difference.

Even at higher sample rates we’ve noticed issues. Once we detected a swing of about 10 percentage points at a near-50% sample. A client site, compared month to month, year over year, showed a 5% increase with a 48% sample. However, when we looked at the data unsampled, we were able to show that instead of improving by 5% it had actually decreased by 5%. That’s an issue.

Even though the math doesn’t line up, I once described the Google Analytics sample rate as akin to my personal confidence in the data. At 90% sample, I’m 90% confident that the data is close to correct. At 50% I’m 50% confident. At 1% sample, I’m 1% confident. I don’t generally base opinions on data when my confidence in it is no better than a coin flip.

What are My Alternatives?

1. Google Analytics Premium

Well, if you’ve got so much data that you’re sampling all the time, you might want to consider Google Analytics Premium. Premium brings a number of benefits, including the ability to export unsampled reports with up to 3 million rows of data (you can actually have more than 3 million rows, it’ll just aggregate the extra), and up to 100 million visits. If you’re constantly relying on sampled reports under 50% for your reporting, then you really might want to consider better data accuracy. Premium brings a number of other benefits, but when you’ve got so much data that you can’t do anything with your data on a quarterly, monthly, and/or especially daily basis, then you really want to consider it. (Disclosure: We’re a Google Analytics Premium Reseller)

Possible within GA Interface Right Now

2. Change Your Date Range

What date range are you looking at? If you’re trying to look at a 3 year period, then you might want to consider changing that to a smaller time span, to get the total visits to the property under that 500k level. If you are under 500k visits per month, then you could look month to month, and aggregate the data yourself in a spreadsheet, rather than use the Google Analytics interface.
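If you go the month-by-month route, a small helper can generate the date ranges to look up and then add up whatever you copy out of each one. This is a rough sketch: the helper name and the visit counts are invented for illustration, and nothing here talks to Google Analytics.

```python
import calendar
from datetime import date

def month_ranges(start_year, start_month, months):
    """Yield (first_day, last_day) for consecutive calendar months."""
    year, month = start_year, start_month
    for _ in range(months):
        last_day = calendar.monthrange(year, month)[1]
        yield date(year, month, 1), date(year, month, last_day)
        month += 1
        if month > 12:
            year, month = year + 1, 1

ranges = list(month_ranges(2012, 11, 4))  # Nov 2012 through Feb 2013

# Made-up monthly visit counts, as if copied from each monthly report
monthly_visits = [410_000, 480_000, 395_000, 360_000]
total_visits = sum(monthly_visits)

for (first, last), visits in zip(ranges, monthly_visits):
    print(first, last, visits)
print("Total:", total_visits)
```

Adding months together works for counts like visits and pageviews. It does not work for something like unique visitors, because the same person can show up in more than one month.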

3. Use Standard Reports

The standard reports in Google Analytics are never sampled. You’ll know this also because when you’re on a standard report you won’t see the yellow boxed sample message. Sometimes you might be applying a segment, or attempting to use a custom report, when you can get the same information from a standard report.

4. Create New Profiles with different filters

Sometimes you want to dig a little deeper in your reports than the standard reports allow. Maybe you want to look at the content reports for just your organic visitors, and applying the organic medium advanced segment forces you into a sample. You can create a new profile to just capture your organic traffic, and then put a filter on that profile to only allow organic traffic. If you apply any segments in the profile it’ll get sampled, but the standard reports will hold the unsampled information for just that organic traffic.

Requires Some Programming

5. Limit the # of sites tracking into the same Web Property

Another thing you can do is reduce the amount of traffic into that property. It is very common to aggregate all your tracking into a single account and property, and then look at the different websites by creating filters on various profiles, based on the hostname/domain. If you’re over the traffic level that generates sampled reports, you could consider breaking those websites out into different properties and tracking them separately. 20 websites with 30k visits a month will create sampled data on the monthly report, but if you were to break those 20 websites out into their own properties, it won’t even sample it when you look at a full year.
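The numbers in that example are easy to check for yourself. A quick sketch, assuming the 500,000-visit limit at highest precision described earlier in this post:

```python
SAMPLE_LIMIT = 500_000  # highest-precision limit described earlier

sites = 20
visits_per_site_per_month = 30_000

# One shared property: every site counts toward the same limit
combined_month = sites * visits_per_site_per_month
combined_month_sampled = combined_month > SAMPLE_LIMIT

# One property per site: only that site's own visits count
single_site_year = visits_per_site_per_month * 12
single_site_year_sampled = single_site_year > SAMPLE_LIMIT

print(f"Shared property, one month: {combined_month:,} sampled={combined_month_sampled}")
print(f"Separate property, one year: {single_site_year:,} sampled={single_site_year_sampled}")
```

A single month in the shared property is already over the limit, while a full year of one site on its own stays under it.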

6. Set Sampling yourself to record fewer visits

You can set your own sample rate on your website via the _setSampleRate() method.

The thing to keep in mind here is that this samples the traffic that gets sent into your Google Analytics account. So if you set your sample rate at 80, you’re going to be, by default, looking at sampled data in Google Analytics, even when it doesn’t say it’s being sampled. It lets you use the GA interface without sampling happening on the server side. Instead, the sampling is applied automatically when visitors come to your site. I’m not a huge fan of this because it obfuscates that sample rate from the user of Google Analytics, but I’ll mention it here in fairness.
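In the asynchronous ga.js snippet this is typically a `_gaq.push(['_setSampleRate', '80'])` call placed before `_trackPageview`, though check the tracking documentation for your own setup. The catch is that every number in the interface then needs scaling in your head. A small sketch of that mental math, with a made-up helper name and made-up numbers, and assuming the sampled visitors behave like everyone else:

```python
def estimate_actual(reported, sample_rate_percent):
    """Scale a reported count back up to an estimate of the real count.

    Only an estimate: it assumes the sampled visitors are representative.
    """
    return reported * 100 / sample_rate_percent

reported_visits = 80_000   # what the GA interface shows (made-up number)
sample_rate_percent = 80   # the value passed to _setSampleRate

estimated_visits = estimate_actual(reported_visits, sample_rate_percent)
print(f"Reported: {reported_visits:,}, estimated actual: {estimated_visits:,.0f}")
```

Nothing in the reports reminds you to do this, which is exactly the obfuscation problem described above.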

7. Server Side Solution with Alternative Tracking

Another interesting option is to have a second tracker used only with a certain subset of your visitors, delivered dynamically. You can create a second web property entirely, and then send different visitors different tracking code depending on who they are. Maybe you are a university, and you want to track your prospective students differently than your current students. As long as you’re able to identify these people, you can use a cookie to record which tracking code they should be sent. Your main tracker still gets sent out on every single page, but you only fire the second tracker for those specific users. Separating out your cohorts of visitors to different trackers will get you more accurate data with that second tracker because you’ll have smaller numbers. Maybe you have a million visitors a month, but your premium members, whom you recognize via a cookie you’ve set on their computer, only number about 20,000 per month. By sending that secondary tracker ONLY to your premium members, you now have a second web property that contains JUST those members’ visits, which you can analyze without sampling even over a whole year.
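On the server, the decision boils down to something like the sketch below. The property IDs, the cookie name, and the function are all hypothetical placeholders; your own site will have its own way of recognizing members.

```python
MAIN_PROPERTY = "UA-XXXXX-1"     # hypothetical ID: the tracker everyone gets
PREMIUM_PROPERTY = "UA-XXXXX-2"  # hypothetical ID: the premium-only tracker

def trackers_for(cookies):
    """Return the property IDs whose tracking code this visitor should get."""
    trackers = [MAIN_PROPERTY]  # the main tracker goes out on every page
    # "member_type" is an assumed cookie name set when a member logs in
    if cookies.get("member_type") == "premium":
        trackers.append(PREMIUM_PROPERTY)
    return trackers

anonymous = trackers_for({})
premium = trackers_for({"member_type": "premium"})
print(anonymous)
print(premium)
```

Your page template would then render one tracking snippet per ID in the returned list.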

8. GA API

Another option is to hook into the Google Analytics API itself, and pull your data out to your own spreadsheets. You can make up to 50,000 requests per day to the API with 10,000 rows per request. Depending on the amount of data you have, this might require you to perform a number of different calls to Google Analytics via the API to capture all the data you need. In theory if you had a heavily sampled site, you could pull your data for every day of the month into a spreadsheet by making a different call (or more) for every day, depending on your data levels. This would give you the unsampled data in your own reports that you could use outside of the Google Analytics UI.
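To give a feel for the "one call per day" approach, here is a rough sketch that only builds the query parameters for each day. The parameter names follow the Core Reporting API (v3) as I understand it; the profile ID is a placeholder, and authentication, the actual HTTP requests, and paging past 10,000 rows are all left out.

```python
from datetime import date, timedelta

MAX_ROWS_PER_REQUEST = 10_000  # per-request row limit quoted above

def daily_queries(profile_id, start, end, metrics, dimensions):
    """Build one set of query parameters per day between start and end."""
    queries = []
    day = start
    while day <= end:
        queries.append({
            "ids": f"ga:{profile_id}",
            "start-date": day.isoformat(),
            "end-date": day.isoformat(),
            "metrics": metrics,
            "dimensions": dimensions,
            "max-results": MAX_ROWS_PER_REQUEST,
            "start-index": 1,  # raise this to page through longer results
        })
        day += timedelta(days=1)
    return queries

# "12345678" is a placeholder profile ID
queries = daily_queries(
    "12345678", date(2013, 6, 1), date(2013, 6, 30),
    metrics="ga:visits", dimensions="ga:source,ga:medium",
)
print(len(queries), "requests for the month")
```

Thirty requests for a month is nowhere near the daily request limit mentioned above, so there is plenty of room to split further if a single day still comes back sampled.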

9. Analytics Canvas

http://www.analyticscanvas.com/

I’ve never personally used Analytics Canvas, but I’ve heard good things. Much of what they do involves using the Google Analytics API for reporting, and they can help companies looking to use the GA API as described above, except that they pull their hair out for you, rather than you going bald yourself.

Requires an Invitation (for now)

10. BigQuery

Last but not least is Google BigQuery. It’s the new hotness.

https://developers.google.com/bigquery/

Do you want to analyze terabytes of data with the click of a button? Interactively analyze BILLIONS of rows of data? Sign up for the beta. It’s not open to the public yet, but if you’re interested in crunching some really big numbers, BigQuery is going to be an option that you might want to consider in the near future.

Wait…This blog post title says 8 ways to get more from your data, and you just listed 10

It’s a sampled title.

I still don’t understand why you had a picture of a bald eagle

Life is a mystery sometimes, isn’t it?