Google Analytics is a very powerful tool, but I think we’d all agree it might be a bit much to expect it to process monster amounts of data on demand and return reports instantaneously without a tiny tradeoff. Some heavily trafficked sites expecting instant reports from their oceans and oceans of data (which obviously take time to generate) instead find themselves running into sampling issues in their reports, where GA is forced to base its calculations on a smaller sample of the overall data in order to return a report instantly. The problem is, sometimes that sample might not be statistically significant or sufficiently representative of the data, so any insights drawn from it aren’t…well…accurate.
In general, sampling isn’t an issue if all you’re looking at are the standard out-of-box reports, because they are all unsampled. However, when leveraging GA’s segmentation capabilities (which is where the real beauty of deep insights resides), whenever a data set is greater than 250,000 or 500,000 sessions within a selected time period, sampling might come into play.
Sampled data is just that: sampled. It’s not fully representative of the actual data. While Google Analytics has an intelligent algorithm to ensure that sampling minimizes adverse effects on the data, the reality is that a dataset that is a 5% sample of your actual data really isn’t usable. How you determine what is usable and what is not depends on the nature of your data and the type of analysis being performed, but in general, it’s best to keep the sample size as high as possible. These reports will undoubtedly be used as a reference point for marketing decisions, so it’s important that they’re accurate and provide actionable insights.
Is the Core Reporting API a solution to this dilemma? Not entirely. Sampling isn’t solved with just this API, even if you have GA Premium, because the API has the same sampling thresholds applied to it as GA Standard.
So what to do?
Hold tight, the following are 4 solutions to help you get clean data and clean insights again!
The first solution is to reduce the date range. When you look at a report for which you’ve met or crossed the sampling threshold, the interface indicates that the report is being sampled. Instead of looking at the entire month all at once, it may help to look at a smaller timeframe, such as a week. This way, only a subset of the data is being viewed, so the report that is pulled contains fewer sessions, which keeps us under the sampling threshold. You would have to look at subsequent weeks one at a time, which is a bit tedious, but once this is done, you can combine the weekly exports of the same report outside of GA into a single report. Read on to the next solution to find out what to do with all those reports.
Note: The only way to export unsampled data directly is to be a Google Analytics Premium customer. There are some third-party tools available for non-Premium users, discussed below. These tools are designed to reduce, but not eliminate, the effects of sampling.
One way of reducing the effects of sampling is to break up the timeframe into smaller timeframes. For example, a year of data can be pulled as 12 separate months (12 separate queries), or a month of data can be pulled as 4 separate weeks (4 separate queries). So instead of pulling data for all of 2014, I can pull Jan. 2014, then Feb. 2014, and so on. Obviously, we all have better things to do… A feature called query partitioning, available in tools such as ShufflePoint and Analytics Canvas (more details below), does the above for you in an automated fashion. The tools partition the query and programmatically loop through the desired timeframe, aggregating the report back together once done. This way, when you pull the report, the tool appears to make a single query, but in reality it makes several queries behind the scenes, with the number depending on how granularly you configure the query partitioning. It may take some experimenting to find a balance between speed and accuracy (sample size).
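To make the idea concrete, here is a minimal Python sketch of query partitioning by calendar month. It is not how ShufflePoint or Analytics Canvas actually implement it. The `fetch_report` callable is a hypothetical stand-in for whatever pulls one report for one date range (an API call, a CSV export, etc.) and is assumed to return a dictionary of additive metric totals.

```python
from collections import Counter
from datetime import date, timedelta
from typing import Callable, Dict, Iterator, Tuple


def month_partitions(start: date, end: date) -> Iterator[Tuple[date, date]]:
    """Yield (chunk_start, chunk_end) pairs, one per calendar month,
    that together cover start..end inclusive."""
    current = start
    while current <= end:
        # Work out the first day of the following month.
        if current.month == 12:
            next_month = date(current.year + 1, 1, 1)
        else:
            next_month = date(current.year, current.month + 1, 1)
        # The chunk ends on the last day of this month, or on `end` if sooner.
        chunk_end = min(next_month - timedelta(days=1), end)
        yield current, chunk_end
        current = next_month


def partitioned_report(
    start: date,
    end: date,
    fetch_report: Callable[[date, date], Dict[str, float]],
) -> Dict[str, float]:
    """Run one query per month and sum the metric totals.

    `fetch_report` is a hypothetical function supplied by the caller.
    """
    totals: Counter = Counter()
    for chunk_start, chunk_end in month_partitions(start, end):
        # Counter.update adds values key by key instead of overwriting.
        totals.update(fetch_report(chunk_start, chunk_end))
    return dict(totals)
```

Keep in mind that only additive metrics (sessions, pageviews, revenue) can be summed across partitions this way. Unique-user counts would be overstated, because the same user can appear in more than one month.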
More detail about the tools:
- ShufflePoint has a drag-and-drop interface that supports Google Analytics and a few other Google products. The nice thing about ShufflePoint is that it uses Excel’s web-querying capability, so you can write SQL-like queries to retrieve your data, make built-in calculations and display the data essentially any way you want.
- Analytics Canvas is another tool which allows you to connect to the Google Analytics API without coding. Analytics Canvas uses a “canvas” on which you can construct a visual flowchart of the query and the subsequent transformations and joins of your data, showing what series of modifications will take place. It also allows for automating data extraction from BigQuery. If you are using Google Sheets for your data, Analytics Canvas has an add-on in Chrome that allows you to create dashboards within Sheets.
Both of these tools can extract your data from Google Analytics, analyze it, and create reports.
If you are a Google Analytics Premium user, you can download unsampled reports (you will have to export them). Google Analytics just announced an exciting new feature available on Premium accounts called Custom Tables, which allows you to create a custom table with metrics and dimensions of your choice (although there are some limitations). In other words, you can essentially designate a report that would otherwise be sampled as a “Custom Table”, which is then available to you as an unsampled report, similar to the out-of-box reports. You can create up to 100 Custom Tables. This is awesome because you won’t have to worry about sampled data for the reports you use often.
If you have GA Premium, it integrates with Google BigQuery, which allows for moving massive datasets and super fast SQL-like querying. It runs on Google’s cloud infrastructure and is able to process data on the order of billions of rows. GA Premium allows for your data to be exported daily into BigQuery. In the Core Reporting API, the data is sampled at the same threshold as in GA Standard. BigQuery, by contrast, allows you to access unsampled hit-level data instead of the aggregate-level data within the user interface, which in turn opens doors for very powerful and previously impossible analysis!
Here are some examples of the type of analysis possible with BigQuery to help illustrate its use.
- What is the average amount of money spent by users per visit?
- What is the sequence of hits (sequence of clicks, pages, events, etc)?
- What other products are purchased by those customers who purchased a specific product?
For more details, visit here.
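To give a feel for what hit-level data makes possible, here is a small Python sketch that answers the first question above (average money spent per visit) from a list of hit rows. The field names `session_id` and `revenue` are made up for illustration and are not the actual BigQuery export schema. In practice you would express the same grouping as a query in BigQuery itself.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def average_revenue_per_visit(hits: Iterable[Mapping[str, object]]) -> float:
    """Group hit-level rows by visit and average the per-visit revenue.

    Each row is assumed to carry a hypothetical `session_id` and an
    optional `revenue` amount for that hit.
    """
    revenue_by_session: Dict[str, float] = defaultdict(float)
    for hit in hits:
        # Visits with no purchase still count, with revenue 0.
        revenue_by_session[str(hit["session_id"])] += float(hit.get("revenue", 0.0))
    if not revenue_by_session:
        return 0.0
    return sum(revenue_by_session.values()) / len(revenue_by_session)


# Illustrative, made-up rows: three visits, two of which include a purchase.
example_hits = [
    {"session_id": "a", "revenue": 20.0},
    {"session_id": "a"},
    {"session_id": "b", "revenue": 10.0},
    {"session_id": "c"},
]
average = average_revenue_per_visit(example_hits)
```

Because every hit is available, the calculation runs on all visits rather than on a sample, which is the point of going to BigQuery in the first place.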
There you have it, four solutions to help you deal with sampling! Happy Reporting!