

Google Analytics

Sampling

Sampling in Google Analytics or in any web analytics software refers to the practice of selecting a subset of data from your website traffic. Sampling is widely used in statistical analysis because analyzing a subset of data gives similar results to analyzing all of the data (see Confidence Interval below). In addition, sampling speeds up processing for reports when the volume of data is so large as to slow down report queries.

If your website has many millions of pageviews per month, sampling the traffic data collection for your site means that you will get good report results in a reasonable amount of time. Even if your site collection is not sampled, certain types of reports will contain sampled results, due to the nature of the query. For more information, see the Wikipedia article on sampling.

This document describes the two kinds of sampling that can occur in Analytics, along with the way we report the accuracy of the sampled data for automatic report sampling.

  1. Sampling Data Collection
  2. Automatic Report Sampling
  3. Confidence Interval

Sampling Data Collection

You can sample data collection for your site traffic by modifying the tracking code snippet on your website. This type of sampling is referred to as client-side sampling. With this kind of sampling, only a percentage of your website traffic is collected rather than all of it. You will likely do this only when your website generates an excessive number of pageviews per month for your account. To enable client-side sampling, call the _setSampleRate() method in the tracking code and pass a parameter that sets the percentage of traffic you want sampled. In the example below, the sample rate is set to 80%.

pageTracker._setSampleRate("80"); //sets sampling rate to 80 percent

With client-side sampling, Analytics uses a methodology designed to obtain a systematic distribution of unique visitors across your site. The number you provide represents the percentage of visitors, by unique ID, that will be included in your sample. In the example above, 80 percent of unique visitor IDs will be tracked.
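To illustrate what sampling "by unique ID" can look like, here is a minimal sketch. It is not the actual Analytics implementation, which this document does not describe. The function names hashVisitorId and isVisitorSampled, and the choice of an FNV-1a hash, are assumptions made for illustration only. The idea is that each visitor ID maps to a stable bucket from 0 to 99, so a given visitor is either always tracked or never tracked.

```javascript
// Hypothetical sketch, NOT the real Analytics algorithm.
// Maps a visitor ID to a stable bucket in [0, 100) and includes the
// visitor only when the bucket falls below the sample rate.

// 32-bit FNV-1a string hash (an arbitrary choice for this example).
function hashVisitorId(visitorId) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < visitorId.length; i++) {
    hash ^= visitorId.charCodeAt(i);
    // Multiply by the FNV prime and keep the result as an unsigned 32-bit value.
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Returns true when the visitor belongs to the sampled population.
function isVisitorSampled(visitorId, sampleRatePercent) {
  const bucket = hashVisitorId(visitorId) % 100;
  return bucket < sampleRatePercent;
}
```

Because the decision depends only on the ID, every pageview from the same visitor receives the same answer, which keeps each sampled visitor's sessions complete.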

With client-side sampling, reports still reflect the actual visitor behavior you need for accurate analysis, but with the added benefit of faster report generation.

Automatic Report Sampling

Regardless of whether you have traffic collection sampled, Analytics may examine only a sample of the data it has collected when calculating a report. This type of sampling is called report sampling. It occurs automatically when you query for report data that is not available in aggregate.

For example, suppose you query a Content Detail report for your top page, which received 80,000 pageviews over the past month. That information has been automatically compiled in the Analytics database, so the report can quickly display the actual pageview number. However, if you then query that same page for pageviews by browser, you are requesting data that has not automatically been compiled, which means that a special query is needed to do the calculation.

For such a query, Analytics retrieves the data from a set of user sessions. Analytics uses a sample set of 10,000 sessions and estimates the actual number from that sample. This enables Analytics to deliver timely reporting information for large data sets. If the number of sessions being retrieved for that time frame is 10,000 or fewer, no sampling is needed and the actual number is reported.
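The scaling step can be sketched as follows. This document says only that Analytics "estimates the actual number from that sample", so the simple proportional scaling here is an assumption, and estimateMetric is a hypothetical name rather than part of any Analytics API.

```javascript
// Hypothetical sketch of estimating a metric from a fixed-size session sample.
const SAMPLE_SIZE = 10000; // sample size stated in the text above

// totalSessions: number of sessions in the queried time frame.
// sampledMetricValue: metric counted within the sampled sessions
// (or within all sessions when no sampling is needed).
function estimateMetric(totalSessions, sampledMetricValue, sampleSize = SAMPLE_SIZE) {
  if (totalSessions <= sampleSize) {
    // Few enough sessions: the counted value is the actual value.
    return { value: sampledMetricValue, sampled: false };
  }
  // Assumption: scale the sampled count by the inverse of the sampling fraction.
  const scale = totalSessions / sampleSize;
  return { value: Math.round(sampledMetricValue * scale), sampled: true };
}
```

For example, if a query covers 50,000 sessions and the 10,000 sampled sessions contain 1,600 matching pageviews, estimateMetric(50000, 1600) returns a value of 8000 with sampled set to true.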

Analytics indicates that a report is sampled with a notification in yellow at the top of the screen. It provides further information about the sampled metrics, as described below.

Confidence Interval

For report queries that are sampled, Analytics reports have a notice at the top stating that the report is based on sampled data. When this occurs, Analytics uses an estimation method known in statistics as a confidence interval. A confidence interval indicates a range of values that is likely to include the true value of the statistic.

It will help you understand the term confidence interval if you remember that it is distinct from the phrase confidence level. Confidence interval applies to the margin of error of the estimated number. Thus, the smaller the confidence interval, the smaller the margin of error, and the more accurate the number. A confidence interval of zero (0) means that the number is completely accurate and that there is no sampling or estimate involved. For more information on confidence intervals, see the Wikipedia article on confidence interval.

When sampling does occur and confidence intervals apply, an additional specification called the confidence level is used to indicate the probability that the interval contains the true value, such as 90%, 95%, or 99%. Google Analytics uses a 95% confidence level.
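As a rough illustration of how a 95% confidence level turns into a margin of error, the sketch below uses the textbook normal approximation for a sampled proportion. This document does not state which formula Analytics uses, so the formula, the constant Z_95, and the function name marginOfError are all assumptions made for the example.

```javascript
// Hypothetical sketch: margin of error for a sampled proportion,
// using the normal approximation (an assumption, not the documented
// Analytics method).
const Z_95 = 1.96; // z-score for a 95% confidence level

// matchingSessions: sampled sessions that match the query.
// sampleSize: total number of sessions in the sample.
// Returns the margin of error as a fraction (0.01 means +/- 1 percentage point).
function marginOfError(matchingSessions, sampleSize) {
  const p = matchingSessions / sampleSize;
  return Z_95 * Math.sqrt((p * (1 - p)) / sampleSize);
}
```

Under these assumptions, 2,000 matching sessions in a sample of 10,000 gives a margin of roughly 0.78 percentage points around the 20% estimate, and the margin grows as the sample gets smaller.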

For the metric values themselves, Analytics reports the reliability of the estimate in one of three ways.

Level of Accuracy: Completely accurate
Displayed as: The metric value with no additional decoration.
Description: In the case where the data is not sampled, the confidence interval is 0, so only the metric value itself is supplied for the report.

Level of Accuracy: Accurate to within a given range
Displayed as: The metric value, followed by a number indicating the range of possible deviation at a 95% level of confidence.
Description: For example, you might have a number of pageviews such as 80,000 followed by +/- 5%. The confidence interval is the 5% range of values between the upper and lower bounds. So, the range of values for 80,000 pageviews is between 76,000 and 84,000 (since 5% of 80,000 is 4,000). Remember: Smaller percentages indicate a more accurate estimate.

Level of Accuracy: Unreliable estimate
Displayed as: The metric value, followed by an asterisk (*).
Description: In some cases, the number of sessions for a query can be extremely high, which makes the sample set only a small percentage of the actual data for the site. For example, suppose the data set contained only one session where the browser was Chrome, and you are querying for pageviews where the browser is Chrome. In that situation, the metrics reported for Chrome users would be limited to a single session. Thus there is so little data that the pageview number for Chrome users is followed by an asterisk.
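The arithmetic behind the "+/- 5%" example can be checked with a few lines of code. The helper name confidenceBounds is hypothetical and is not part of any Analytics API.

```javascript
// Hypothetical helper: converts a reported value and its +/- percentage
// into the lower and upper bounds of the confidence interval.
function confidenceBounds(value, plusMinusPercent) {
  // Multiply before dividing so whole-number inputs stay exact.
  const delta = (value * plusMinusPercent) / 100;
  return { lower: value - delta, upper: value + delta };
}
```

Calling confidenceBounds(80000, 5) returns a lower bound of 76000 and an upper bound of 84000, matching the range given above. A percentage of 0 collapses the range to the value itself, which corresponds to unsampled data.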

For data sampling, keep in mind that the larger the data set being sampled, the more reliable the estimate, and vice versa.

 

 
