Replicating The Google Analytics All Pages Report In BigQuery

June 01, 2018
By Becky West

While BigQuery is often the perfect tool for doing data science and machine learning with your Google Analytics data, it can sometimes be frustrating to query basic web analytics metrics. In this post, I’ll walk through calculating some fundamental metrics at the page level by replicating the All Pages report for the Google Merchandise Store in BigQuery.

I will be using Google’s sample dataset for the Google Merchandise store for August 1, 2016. You can compare this data to data in Google Analytics by viewing the Google Analytics demo account. I will be using the standard SQL dialect.

Word of warning: Although this post discusses the All Pages report – probably the most popular report in Google Analytics – it is not for the faint of heart. We will discuss a lot of technical details surrounding Google Analytics definitions, the SQL language, and the format of the BigQuery data. If you have not worked in BigQuery before, you may want to start off with a gentler introduction.

Primary Dimension – Page

In BigQuery the page dimension, or URL, is stored in the field hits.page.pagePath. Recall that Google Analytics data is stored at the session level. To access hit level information, we will need to unnest our table by hits. The query below shows all URLs that were visited by users.

SELECT
  hits.page.pagePath
FROM
  'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA,
  UNNEST(GA.hits) AS hits
GROUP BY
  hits.page.pagePath

Pageviews

Next, we want to add in the number of pageviews associated with each URL. If you look at the Google Analytics schema, the only pageview related field you will see is totals.pageviews. Use extreme caution when using this field! It is a session level field (it’s counting the total number of pageviews in the entire session) and it should only be mixed with hit level fields with extreme care. This metric is not useful for replicating the All Pages report. Instead, we will need to calculate this metric ourselves.

So, what is a pageview and how do we calculate it? We often think of pageviews as the number of times that a page was loaded in the browser. In Google Analytics, this is measured by the number of times we send a “pageview” hit to Google Analytics. (Recall there are other types of hits that can be sent to Google Analytics, such as event, social, timing, and transaction hits.)

To calculate pageviews in Google Analytics, we need to count the number of times a hit of type PAGE is associated with each URL. As above, we will use the field hits.page.pagePath to identify the URL. Then filter the results by setting the hits.type field equal to ‘PAGE’ and count up the hits.

SELECT
  hits.page.pagePath,
  COUNT(*) AS pageviews
FROM
  'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA,
  UNNEST(GA.hits) AS hits
WHERE
  hits.type = 'PAGE'
GROUP BY
  hits.page.pagePath
ORDER BY
  pageviews DESC

Unique Pageviews

In Google Analytics, unique pageviews represent the number of sessions during which a page was viewed. Once again, we want to focus on the hits.type equals ‘PAGE’, but rather than count the hits, now we want a count distinct on the number of sessions.

To count unique sessions, we need a unique way of identifying a session. The visitId field looks promising for this. However, this identifier has some issues when a session ends a midnight, so I recommend using visitStartTime instead. Note that the visitStartTime will only be unique per user, so we need to concatenate the visitStartTime with the fullVisitorId to get a globally unique session ID.

Using this session identifier, we can now compute unique pageviews.

SELECT
  pagepath,
  COUNT(*) AS pageviews,
  COUNT(DISTINCT session_id) AS unique_pageviews
FROM (
  SELECT
    hits.page.pagePath,
    CONCAT(fullVisitorId, CAST(visitStartTime AS STRING)) AS session_id
  FROM
    'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA,
    UNNEST(GA.hits) AS hits
  WHERE
    hits.type = 'PAGE')
GROUP BY
  pagePath
ORDER BY
  pageviews DESC

Average Time on Page

The formula for average time on page is: Total Time on Page / (Pageviews - Exits).

We will need to compute each of the components of this formula separately, then compute the average using the formula above.

Pageviews

Easy – we have already calculated this above.

SELECT
  hits.page.pagePath,
  COUNT(*) AS pageviews
FROM
  'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA,
  UNNEST(GA.hits) AS hits
WHERE
  hits.type = 'PAGE'
GROUP BY
  hits.page.pagePath
ORDER BY
  pageviews DESC

Exits

Fortunately, exit pages are designated with the hits.isExit field. We just need to count these up to get total exits on the page.

SELECT
  pagePath,
  SUM(exits) AS exits
FROM (
  SELECT
    hits.page.pagePath,
    CASE
      WHEN hits.isExit IS NOT NULL THEN 1
      ELSE 0
    END AS exits
  FROM
    'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA,
    UNNEST(GA.hits) AS hits)
GROUP BY
  pagePath
ORDER BY
  exits DESC

Total Time on Page

This one is tricky. Recall that Google Analytics calculates the time on page by comparing the timestamps of hits sent in on different pages. For non-exit pageviews, the time on page is computed by subtracting the timestamp of the next pageview hit from the timestamp of the current pageview hit. For any pageview that is an exit, we compute the time on page as the difference between the timestamp of the last interaction hit minus the timestamp of the pageview hit.

If a page is an exit page and it’s the only page of a session, and there are no interaction events, then this page is considered a bounce and does not contribute to total time on page. Put more simply, bounces do not affect the avg time on page metric.

The timestamp information for each hit can be found in the hits.time field. Note that this field is measured in milliseconds, so you will need to divide the final time by 1,000 to get back to seconds. For each pageviews, we need to compute two additional columns – the timestamp for the next page as well as the timestamp for the last interaction hit.

We can compute the timestamp for the next pageview (within that session) by using a lead window function.

SELECT
  fullVisitorId,
  visitStartTime,
  pagePath,
  hit_time,
  LEAD(hit_time) OVER (PARTITION BY fullVisitorId, visitStartTime ORDER BY hit_time) AS next_pageview
FROM (
  SELECT
    fullVisitorId,
    visitStartTime,
    hits.page.pagePath,
    hits.time / 1000 AS hit_time
  FROM
    'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA,
    UNNEST(GA.hits) AS hits
  WHERE
    hits.type = 'PAGE')

We can create a new column with the timestamp for the last interaction hit by finding the maximum timestamp out of all of the interaction events. Interaction hits are designated by the field hits.isInteraction.

SELECT
  fullVisitorId,
  visitStartTime,
  hits.page.pagePath,
  MAX(IF(hits.isInteraction IS NOT NULL,
      hits.time,
      0)) OVER (PARTITION BY fullVisitorId, visitStartTime) as last_interaction
FROM
  'bigquery-public-data.google_analytics_sample.ga_sessions_20160801',
  UNNEST(hits) AS hits
WHERE
  hits.type = 'PAGE'

Be careful when combining these two queries. Filtering in the where clause is a little tricky because we need to include events to get a last interaction hit value. Therefore, we need to be a little patient before calculating the next_pageview timestamp.

SELECT
  fullVisitorId,
  visitStartTime,
  pagePath,
  hit_time,
  type,
  isExit,
  last_interaction,
  LEAD(hit_time) OVER (PARTITION BY fullVisitorId, visitStartTime ORDER BY hit_time) AS next_pageview
FROM (
  SELECT
    fullVisitorId,
    visitStartTime,
    pagePath,
    hit_time,
    type,
    isExit,
    last_interaction
  FROM (
    SELECT
      fullVisitorId,
      visitStartTime,
      hits.page.pagePath,
      hits.type,
      hits.isExit,
      hits.time / 1000 AS hit_time,
      MAX(IF(hits.isInteraction IS NOT NULL,
          hits.time / 1000,
          0)) OVER (PARTITION BY fullVisitorId, visitStartTime) AS last_interaction
    FROM
      'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA,
      UNNEST(GA.hits) AS hits)
  WHERE
    type = 'PAGE')

Now we can compute time on page by comparing these two columns to the hits.time of the pageview.

SELECT
  fullVisitorId,
  visitStartTime,
  pagePath,
  hit_time,
  type,
  isExit,
  CASE
    WHEN isExit IS NOT NULL THEN last_interaction - hit_time
    ELSE next_pageview - hit_time
  END AS time_on_page
FROM (
  SELECT
    fullVisitorId,
    visitStartTime,
    pagePath,
    hit_time,
    type,
    isExit,
    last_interaction,
    LEAD(hit_time) OVER (PARTITION BY fullVisitorId, visitStartTime ORDER BY hit_time) AS next_pageview
  FROM (
    SELECT
      fullVisitorId,
      visitStartTime,
      pagePath,
      hit_time,
      type,
      isExit,
      last_interaction
    FROM (
      SELECT
        fullVisitorId,
        visitStartTime,
        hits.page.pagePath,
        hits.type,
        hits.isExit,
        hits.time / 1000 AS hit_time,
        MAX(IF(hits.isInteraction IS NOT NULL,
            hits.time / 1000,
            0)) OVER (PARTITION BY fullVisitorId, visitStartTime) AS last_interaction
      FROM
        'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA,
        UNNEST(GA.hits) AS hits)
    WHERE
      type = 'PAGE'))

Put It All Together

Finally, we are ready to aggregate and compute the average time on page.

SELECT
  pagePath,
  pageviews,
  exits,
  total_time_on_page,
  CASE
    WHEN pageviews = exits THEN 0
    ELSE total_time_on_page / (pageviews - exits)
  END AS avg_time_on_page
FROM (
  SELECT
    pagePath,
    COUNT(*) AS pageviews,
    SUM(IF(isExit IS NOT NULL,
        1,
        0)) AS exits,
    SUM(time_on_page) AS total_time_on_page
  FROM (
    SELECT
      fullVisitorId,
      visitStartTime,
      pagePath,
      hit_time,
      type,
      isExit,
      CASE
        WHEN isExit IS NOT NULL THEN last_interaction - hit_time
        ELSE next_pageview - hit_time
      END AS time_on_page
    FROM (
      SELECT
        fullVisitorId,
        visitStartTime,
        pagePath,
        hit_time,
        type,
        isExit,
        last_interaction,
        LEAD(hit_time) OVER (PARTITION BY fullVisitorId, visitStartTime ORDER BY hit_time) AS next_pageview
      FROM (
        SELECT
          fullVisitorId,
          visitStartTime,
          pagePath,
          hit_time,
          type,
          isExit,
          last_interaction
        FROM (
          SELECT
            fullVisitorId,
            visitStartTime,
            hits.page.pagePath,
            hits.type,
            hits.isExit,
            hits.time / 1000 AS hit_time,
            MAX(IF(hits.isInteraction IS NOT NULL,
                hits.time / 1000,
                0)) OVER (PARTITION BY fullVisitorId, visitStartTime) AS last_interaction
          FROM
            'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA,
            UNNEST(GA.hits) AS hits)
        WHERE
          type = 'PAGE')))
  GROUP BY
    pagePath)
ORDER BY
  pageviews DESC

Entrances

Fortunately, entrances by page are easy to compute. There is a field called hits.isEntrance that we can use to determine whether that pageview is an entrance.

SELECT
  pagePath,
  SUM(entrances) AS entrances
FROM (
  SELECT
    hits.page.pagePath,
    CASE
      WHEN hits.isEntrance IS NOT NULL THEN 1
      ELSE 0
    END AS entrances
  FROM
    'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA,
    UNNEST(GA.hits) AS hits)
GROUP BY
  pagePath
ORDER BY
  entrances DESC

Bounce Rate

Bounce rate is another complicated field in BigQuery. The formula for bounce rate is: Bounces / Sessions. We will need to compute bounces and sessions separately.

Bounces

Bounces are attributed to the first interaction hit in a session in which there is exactly one interaction event. We can determine if there was exactly one interaction event in the session by using the totals.bounces field. Now we just need to find the first interaction hit in the session.

The hits.hitNumber field will work well for this task. We will just need to use a window function to identify the hit number for the first interaction hit in the session. (Note that we can determine if a hit is an interaction hit by using the hits.isInteraction field.)

SELECT
  fullVisitorId,
  visitStartTime,
  pagePath,
  CASE
    WHEN hitNumber = first_interaction THEN bounces
    ELSE 0
  END AS bounces
FROM (
  SELECT
    fullVisitorId,
    visitStartTime,
    hits.page.pagePath,
    totals.bounces,
    hits.hitNumber,
    MIN(IF(hits.isInteraction IS NOT NULL,
        hits.hitNumber,
        0)) OVER (PARTITION BY fullVisitorId, visitStartTime) AS first_interaction
  FROM
    'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA,
    UNNEST(GA.hits) AS hits)

Sessions

Sessions are attributed to the first hit (interaction or not) in a session where there is at least one interaction event. (Keep in mind that sessions are closely related to but not the same as entrances). Fortunately, we can use the totals.sessions field to identify whether there is at least one interaction event. Now we just need to identify the first hit in the session.

To find the first hit in the session, it is very tempting to just check if hits.hitNumber equals 1. Unfortunately, like visitId, this field does not restart at midnight. So, we will need to use a window function to identify the first hit number in the session. We can then use this column to compute the number of sessions associated with each page.

SELECT
  fullVisitorId,
  visitStartTime,
  pagePath,
  CASE
    WHEN hitNumber = first_hit THEN visits
    ELSE 0
  END AS sessions
FROM (
  SELECT
    fullVisitorId,
    visitStartTime,
    hits.page.pagePath,
    totals.visits,
    hits.hitNumber,
    MIN(hits.hitNumber) OVER (PARTITION BY fullVisitorId, visitStartTime) AS first_hit
  FROM
    'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA,
    UNNEST(GA.hits) AS hits)

Aggregate and Combine

Now we just need to aggregate bounces and sessions then divide.

select
pagePath,
bounces,
sessions,
CASE
    WHEN sessions = 0 THEN 0
    ELSE bounces / sessions
  END AS bounce_rate
  from (
SELECT
  pagePath,
  SUM(bounces) AS bounces,
  SUM(sessions) AS sessions
FROM (
  SELECT
    fullVisitorId,
    visitStartTime,
    pagePath,
    CASE
      WHEN hitNumber = first_interaction THEN bounces
      ELSE 0
    END AS bounces,
    CASE
      WHEN hitNumber = first_hit THEN visits
      ELSE 0
    END AS sessions
  FROM (
    SELECT
      fullVisitorId,
      visitStartTime,
      hits.page.pagePath,
      totals.bounces,
      totals.visits,
      hits.hitNumber,
      MIN(IF(hits.isInteraction IS NOT NULL,
          hits.hitNumber,
          0)) OVER (PARTITION BY fullVisitorId, visitStartTime) AS first_interaction,
      MIN(hits.hitNumber) OVER (PARTITION BY fullVisitorId, visitStartTime) AS first_hit
    FROM
      'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA,
      UNNEST(GA.hits) AS hits))
GROUP BY
  pagePath)
ORDER BY
  sessions DESC

% Exit

The formula for % Exit is: Exits / Pageviews.

We have already calculated pageviews and exits above. Now we just need to combine these into a single query.

SELECT
  pagePath,
  pageviews,
  exits,
  CASE
    WHEN pageviews = 0 THEN 0
    ELSE exits / pageviews
  END AS exit_rate
FROM (
  SELECT
    pagepath,
    COUNT(*) AS pageviews,
    SUM(exits) AS exits
  FROM (
    SELECT
      hits.page.pagePath,
      CASE
        WHEN hits.isExit IS NOT NULL THEN 1
        ELSE 0
      END AS exits
    FROM
      'bigquery-public-data.google_analytics_sample.ga_sessions_20160801' AS GA,
      UNNEST(GA.hits) AS hits
    WHERE
      hits.type = 'PAGE')
  GROUP BY
    pagePath)
ORDER BY
  pageviews DESC

Page Value

Calculating page value is beyond the scope of this post. In general, I would recommend using the Google Analytics Core Reporting API to pull the page value or create your own custom content scoring method.


Accessing standard Google Analytics metrics through BigQuery can be more painful and time-consuming than using the Core Reporting API. However, BigQuery can give you the power to add more advanced reporting and analysis to the standard Google Analytics reports, and adding in those standard metrics can help provide context to your analysis.