Most of the time, web analytics tools like Google Analytics generate data about your website that’s about as accurate as you’re likely to get from any web analytics package. However, once your site starts to get more than a certain amount of traffic, this accuracy breaks down and your reports could in some cases start to paint a misleading picture of what’s going on. This happens because of data sampling, something that high traffic volume sites in particular need to be wary of.
What is data sampling?
I first learned about data sampling in my introductory stats classes at university. The idea is that, for one reason or another, it isn't always practical to measure something we’re interested in as completely as we’d like.
So, if I ask ‘an expert’ what the average height of women in the UK is, and he says ‘five foot eight’, I know that he probably didn’t get that figure by running around with a tape measure, travelcard and a checklist of all the women in the UK. Instead, our expert probably got that figure by saying, ‘right, let’s measure a representative sample of women in the UK, and assume their average height is pretty close to the actual average height of the whole population of women’. Simple, right? The idea is that we figure out the properties of a population by examining a sample of that population.
And that, in a nutshell, is what sampling is. Social scientists, pollsters, statisticians and you (yes, you, GA user, even if you don’t realise it) use sampling all the time to draw conclusions about what big big groups of people are like based on what a small group of them does.
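To make the idea concrete, here’s a toy sketch in Python (every number is invented for illustration): we simulate a large population of heights, ‘measure’ only a small random sample, and compare the sample average with the true average.

```python
import random

# Hypothetical illustration: estimate the average height of a large
# "population" from a small random sample. All numbers are made up.
random.seed(42)

# Simulate a population of 100,000 heights (cm), roughly bell-curved.
population = [random.gauss(164, 7) for _ in range(100_000)]
true_mean = sum(population) / len(population)

# Measure only a random sample of 1,000 people instead.
sample = random.sample(population, 1_000)
sample_mean = sum(sample) / len(sample)

print(f"true mean:   {true_mean:.1f} cm")
print(f"sample mean: {sample_mean:.1f} cm")  # close, for 1% of the work
```

Measuring 1% of the population gets us an estimate within a fraction of a centimetre of the real figure, which is the whole appeal of sampling.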
Why do analytics tools sample?
Processing your data costs money. Google and other analytics providers rack up massive power bills and burn out loads of servers every day just by virtue of having the hardware work so hard at processing terabytes of your website data.
Aside from that, there’s another problem: processing all your data takes time. Unless you have plenty of super fast servers (again, expensive), running a report using all the data for a decent sized site takes a long time - probably longer than most people are willing to spend waiting in front of an interface for a report to load.
So your analytics providers have this problem - how can they keep the costs of data processing low and deliver reports quickly, while still keeping the reports you receive accurate? The silver bullet here is sampling. When you do it well, sampling will give you an answer that’s accurate to within a few percentage points of the actual figure while still doing only a fraction of the work. Works for everyone, right?
When is sampling a problem?
Long story short, if you don’t sample well (yes, there are some pretty involved rules for doing sampling well), you can end up with all sorts of misleading results. How and why? Let me illustrate how bad sampling can happen with some graphs.
Let’s say the red box is the sample you take, and the bell curve is your data:
So let’s start with an easy scenario. You can actually sample all your data. This is the ideal situation. Here sampling won’t be a problem because you’re not actually taking a sample, you’re measuring every piece of data you have.
But let’s look at a different scenario. Say you can’t actually sample all your data, so you need to sample little chunks of it. When that happens, you want something like this to happen:
In this situation, we’re only looking at part of the data and extrapolating what the whole looks like based on these little snapshots. That isn’t so much of a problem for us though, because the snapshots are doing a pretty good job of capturing all the interesting parts of the dataset. In other words, the sample is pretty representative of all the data, and any conclusions we draw from the sample will reflect reality well.
But now we hit the problem scenario. Let’s say this happens:
This is not so good. Yes, we measured our data, maybe we measured just as much data as before, but we left out some really important parts that were different to everything else and would have changed our overall picture if we’d taken them into account.
It’s like going to a boring party and leaving before you realise that what was going on in the one room you didn’t visit was awesome. In other words, you didn’t take a representative sample of the party, so your final conclusion about how enjoyable it was ends up misleading. It’s the same thing with your website traffic.
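Here’s a toy sketch of that last scenario in Python (all numbers invented): a year of daily visits with a big seasonal spike at the end. A sample that misses the spike - the room you never visited - badly underestimates the average, while a sample drawn from the whole year does much better.

```python
import random

# Toy illustration of a non-representative sample (numbers invented).
# A year of daily "visits": steady traffic, plus a big seasonal spike
# in the last month (think Christmas on an ecommerce site).
random.seed(0)
steady = [random.randint(900, 1100) for _ in range(335)]
spike = [random.randint(4000, 5000) for _ in range(30)]
year = steady + spike

true_avg = sum(year) / len(year)

# Bad sample: only days from the quiet period (misses the spike).
bad_sample = random.sample(steady, 60)
bad_avg = sum(bad_sample) / len(bad_sample)

# Good sample: drawn at random from the whole year.
good_sample = random.sample(year, 60)
good_avg = sum(good_sample) / len(good_sample)

print(f"true average:       {true_avg:.0f}")
print(f"biased sample says: {bad_avg:.0f}")   # way too low
print(f"random sample says: {good_avg:.0f}")
```

Both samples measure exactly the same number of days; only the biased one gives a misleading answer.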
Web analytics tools won’t sample this badly on purpose; they’re actually designed to be quite clever and to avoid sampling badly where they can. But if you have 100 million hits and a sampling limit of 10 hits, no algorithm is clever enough to overcome the limitations of such a small sample. Unfortunately, there comes a point where you just need to apply more brute force (i.e. more processing power) to the problem to deal with it effectively.
Tip: To a rough approximation you can guess the kinds of sites that will suffer most and least from sampling. Simple sites with limited content and a small number of traffic sources are least vulnerable, whereas huge sites with great gobs of different and rapidly rotating content (e.g. big ecommerce sites and news sites) and really complicated, elaborate traffic profiles will tend to suffer the worst. Big sites that do a lot of seasonal business are also disproportionately vulnerable to sampling unless they analyse their data in small chunks.
That was very abstract. Can we have an example?
Yes, you can. If you understand how and when GA samples, and you’re good with Excel, you can actually see how much of a difference sampling makes to your reports. This is a sampled versus unsampled report from one of our bigger clients, whose data we’ve anonymised. The ‘s’ columns show you the results from the sampled report; the ‘us’ columns show you the results when that same report was unsampled. The really interesting parts are the ‘% difference’ columns, which tell you the percentage by which the sampled reports were off. Have a quick look and you can see that the sampling’s doing a reasonable job for some traffic sources, but with others it’s all over the place.
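The calculation behind those ‘% difference’ columns is simple enough to sketch. The figures below are invented stand-ins for the anonymised client data; in practice you’d export the same report once sampled and once unsampled and compare row by row.

```python
# Hypothetical recreation of the sampled-vs-unsampled comparison.
# All visit counts below are invented for illustration.
rows = [
    # (traffic source, sampled visits, unsampled visits)
    ("google / organic",   96_500, 100_200),
    ("direct / (none)",    41_800,  40_900),
    ("email / newsletter",  3_100,   5_400),  # small segment: badly off
]

for source, sampled, unsampled in rows:
    # How far off the sampled figure is, as a percentage of the truth.
    pct_diff = (sampled - unsampled) / unsampled * 100
    print(f"{source:20s} {pct_diff:+6.1f}%")
```

Notice the pattern: the big traffic sources come out within a few percent, while the small segment is wildly wrong - exactly the ‘all over the place’ behaviour in the real table.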
Does this mean that Google Analytics is a bad web analytics tool?
No! Absolutely not. The free version of GA is designed for - and works really well on - most websites. Once visits for a given time period surpass 0.5 million, you get into territory where sampling may be applied in certain circumstances. When you get over 1 million, sampling might start to be a problem.
The problem arises because, to save resources, GA has a hard upward limit on the amount of your data it will sample to make a report. It will still provide you with unsampled data for standard reports (the ones you see in the interface as soon as you open it up), but it will start to use sampled data when you do one of these 3 things:
- Apply an advanced segment
- Build a custom report
- Add a secondary dimension to reports
If you’re not familiar enough with GA to know what these features are, they’re all different ways to slice and dice your data in the GA interface (or the API for that matter). In my experience it’s very difficult to analyse your data in much depth or do anything tailored without using one of these three features. So from a data accuracy point of view, standard reports are fine, but any reports you’ve cobbled together for custom analysis or reporting will start to suffer the effects of sampling.
How can you tell if data sampling is affecting your numbers in the GA interface?
Without outlining all the involved and detailed Excel jiggery-pokery I used to create that table, there are a couple of quick and dirty giveaways in the Google Analytics interface that suggest you have a problem with sampling.
1. You see this little yellow box appear in GA reports when you apply a segment, and it’s consistently reporting that it’s using a small amount of your data to generate reports (roughly 15% or less).
The GA interface was recently upgraded to let you know when and to what extent reports were being sampled. You can even control the size of your sample with the little checkerboard button. But there's still a fixed upward limit on how big you can make the sample, and even the biggest sample it will let you use is not big enough for many sites that would have issues with sampling anyway.
2. You often find that GA says different things to other tracking systems you use.
This is a good sign that you’re having sampling issues with your data. If you have lots of site tracking tools and all of them are correctly configured, but some of them frequently disagree with what GA’s saying despite measuring the same things, that’s a dead giveaway that you’re probably getting a lot of errors in your sampled GA reports because your sampling limit isn’t adequate for the size of the query. As ever, comparisons between multiple tools are difficult, so apply this logic with caution.
3. Is yours a pretty high volume site anyway?
Just having a lot of data doesn’t automatically mean you have a problem with sampling, but it is pretty much only large sites that will ever have a significant problem with it. So, if you are a bigger site, be aware that you should be looking into this. If you count monthly visit volumes in the millions rather than the thousands then we are talking about you.
How can I fix it if I have the problem?
There are a couple of technical workarounds you can use to avoid some of the negative effects of sampling, but they’re partial fixes, not permanent solutions.
Workarounds for sampling
Avoid splitting or segmenting data:
If you can get away with using standard reports in GA, then go for it. Those won’t be sampled till you do one of the 3 things specified earlier (segments, custom reports, secondary dimensions).
Look at smaller date ranges:
The way sampling works in GA means that the bigger the dataset a report queries, the bigger the sampling inaccuracies. A smaller date range contains fewer visits, so a larger proportion of them fits under the sampling limit, and the data you get back is more accurate.
Be careful when you do this though that you don’t look at so little data that your conclusions aren’t valid because you accidentally cherrypicked some parts and not others (tip from an Analyst - it's easy to do this without realising it).
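One systematic way to apply this workaround - and to avoid accidental cherrypicking - is to break a big date range into consistent chunks and pull each chunk separately. Here’s a minimal sketch of the date-splitting part in Python; the `month_chunks` helper is hypothetical, not a GA feature, and you’d still run one report per chunk yourself.

```python
from datetime import date, timedelta

def month_chunks(start: date, end: date):
    """Split [start, end] into calendar-month chunks.

    Hypothetical helper: run one (smaller, less heavily sampled)
    report per chunk, then stitch the results back together.
    """
    chunks = []
    chunk_start = start
    while chunk_start <= end:
        # First day of the following month, handling December rollover.
        next_month = date(chunk_start.year + chunk_start.month // 12,
                          chunk_start.month % 12 + 1, 1)
        # Chunk ends at month-end, or at the overall end date.
        chunk_end = min(next_month - timedelta(days=1), end)
        chunks.append((chunk_start, chunk_end))
        chunk_start = chunk_end + timedelta(days=1)
    return chunks

for s, e in month_chunks(date(2013, 1, 15), date(2013, 4, 10)):
    print(s, "->", e)
```

One caution when stitching chunks back together: counts (visits, transactions) can simply be summed, but ratio metrics like conversion rate can’t be averaged across chunks - recompute them from the summed counts.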
Use filtered profiles:
If you know in advance what reports you’re going to use and how you need them filtered, you can use profile filters to create pre-filtered GA profiles.
GA will start to sample at the point where you try to segment or customise reports in the interface. However, the data that is filtered at the profile level will be unsampled.
So, if you are looking at a report and apply a segment to show email only traffic, your data will be sampled.
However, if you create a filtered profile to only show email data, the reports will show you the email data, and be unsampled.
This is one of the best workarounds for badly sampled data, but it’s not ideal: it requires a lot of forward thinking and a pretty advanced knowledge of GA to set up, it consumes additional profiles very quickly (remember you can only have 200 per account), and the profiles you get tend to be good for the one thing you filtered them for and that one thing only. Still, it’s a decent workaround for those who can live with those issues, and it’s better than nothing for those who don’t have the budget for an enterprise level analytics package.
The sure fire way to fix sampling:
Although there are workarounds you can use for sampling in Google Analytics, the ones I’ve outlined above don’t really fix the problem and you’ll find them very limiting very quickly. Ultimately, the only sustainable fix for trying to process big volumes of data is to...
...upgrade your Analytics package to something like GA Premium.
Yes yes, I have a vested interest in telling you to buy bigger analytics products, but the truth is that if you want accurate analysis on really big quantities of data, you can’t rely on free tools intended for smaller sites.
Which is, in part, why GA Premium was born in the first place. Don’t get me wrong, GA Premium is a very cool product with lots of good selling points, but we often find that the main thing clients want it for is the much more accurate data for their very big sites (that and the much faster data processing, which means you get fresh reports every 4 hours at most instead of every 24 hours at most - very important to a fast moving site with lots of content).
At the time of writing, GA Premium has a data sampling limit 200 times that of standard GA (and it’s going up all the time). Big ecommerce sites would find no problem with those data limits at all, and a quick back-of-the-napkin calculation instantly tells you that there can only be a tiny, tiny handful of sites on the internet that would find even GA Premium’s data sampling limit inadequate for their needs.
And the end benefit of having the right sized analytics tool? The analyst/marketer/ecommerce director can analyse to their heart’s content and report accurate figures to their board without having to worry about whether or not the data they’re looking at is off by 80% or so.
<End of sales>
Want to find out more?
Get in touch with us if you’d like to talk to some professionals about sampling in your GA account and want to explore your options. If you call in, make sure to ask for the Analytics team.
Google’s official guidance on how and when GA samples reports:
Some advice from Avinash Kaushik (he’s the Silverback Gorilla of the web analytics world):
A short post by Justin Cutroni (one of Google’s official analytics evangelists):
Here’s a quick primer on the statistics behind sampling for those of you with a background in math and/or the curious: