What's Missing In The Google Analytics BigQuery Export Schema?

June 23, 2016
What's Missing In The Google Analytics BigQuery Export Schema?

Google Analytics (via reports or the API) typically deals with aggregated data, where metrics are already summed and averaged for you, and you can easily request a tabular report of (say) Sessions by Date. On the other hand, the Google Analytics BigQuery Export Schema contains a wealth of raw data.

While the raw data opens up infinite possibilities, it also means that most Google Analytics Metrics and some Dimensions are not included in the export. These pieces of information, such as ga:hasSocialSourceReferral, ga:channelGrouping, ga:daysSinceLastSession, ga:pagePathLevel1, ga:landingPagePath, ga:contentGroupXX, and ga:previousPagePath need to be computed.

Perhaps most importantly, the goals that we’ve configured inside of Google Analytics are not stored in BigQuery and will need to be computed from scratch.

Brief Recap of the BigQuery Schema

BigQuery is a structured, table-based SQL database. In the BigQuery export, each row represents a session. Inside each session is the hit, custom dimensions, and other information about the session and hits. Below is an illustration of some of the fields within the export.

BQ Rows

Note that the session-level custom dimensions hits are repeated within the session and how the hit-level custom dimensions are repeated within each hit; this is one of the special properties of BigQuery: repeated fields.

Also note how the custom dimensions, hits, and totals have named fields within them; this is another one of BigQuery’s special properties: nested records.

Recreating Metrics

Since no metrics are contained within BigQuery, let us first examine methods to compute them. If you’re familiar with Custom Reports or the Google Analytics API, then you’re familiar with the concept of having metrics computed for a group of dimensions. In BigQuery, and SQL at large, that concept translates into aggregates and GROUP BY.

A simple example is the number of sessions per day (metric=ga:sessions, dimension=ga:date). We need to aggregate the sessions in the export by counting them, GROUPed BY date.

SELECT date, 
SUM(totals.visits) AS sessions
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
GROUP BY date
date sessions
120130910 63

SELECT date, Filters can be performed with a WHERE clause, when filtering data in the table, or HAVING, when filtering an aggregate value. To count only sessions with transactions, we can filter on totals.transactions.

SUM(totals.visits) AS sessions
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
WHERE totals.transactions > 0
GROUP BY date
date sessions
120130910 16

HAVING is similar to a WHERE but works on aggregate values, e.g. metrics. Using the above example, we could find all days with more than 70 sessions. (Since the LondonCycleHelmet dataset is only a single day, the results are immediately useful.)

SELECT date, 
SUM(totals.visits) AS sessions
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
GROUP BY date
HAVING sessions > 70
date sessions
Zero Results returned

Now let us consider a slightly more advanced example: computing the metric ga:percentNewSessions with dimensions of ga:medium, we can aggregate the number of sessions by a count of all and a count of new sessions, GROUPed BY trafficSource.medium.

(We, unfortunately, cannot aggregate by averaging the new sessions flag in the export, because it is set to NULL, instead of 0, if the session is not new; and aggregates ignore null values, i.e. average would return 1.)

SELECT trafficSource.medium, 
SUM(totals.newVisits) / SUM(totals.visits) AS percentNewSessions
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
GROUP BY trafficSource.medium
trafficSource_medium percentNewSessions
referral 0.5714285714285714
organic 0.4782608695652174
cpc 0.6
(none) 0.7222222222222222

We can also compute multiple metrics at once. Additionally, we are not limited to the 7 dimensions and 10 metrics that the Google Analytics API limits us to.

SELECT
  CONCAT(trafficSource.source,"/",trafficSource.medium) AS sourceMedium,
  newSessions / sessions AS percentNewSessions,
  bounces / sessions AS bounceRate
FROM (
  SELECT
    trafficSource.source,
    trafficSource.medium,
    SUM(totals.visits) AS sessions,
    SUM(totals.newVisits) AS newSessions,
    SUM(totals.bounces) AS bounces,
  FROM
    [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
  GROUP BY
    trafficSource.source,
    trafficSource.medium)
sourceMedium percentNewSessions bounceRate
technologysauce.com/referral 0.5714285714285714 0.14285714285714285
google/organic 0.4782608695652174 0.21739130434782608
google/cpc 0.6 0.2
(direct)/(none) 0.7222222222222222 0.1111111111111111

Interesting New Metrics

Since we can leverage all of BigQuery against our raw data, we can compute metrics that dont’t exist in Google Analytics or the API, e.g. quantiles or bucketing/binning.

Here we’ll compute the 25%ile, median, and 75%ile of ga:sessionDuration (totals.timeOnSite in BigQuery). We’re also showing of the ability, and common usage pattern, of using a subquery to do a calculation, or convert data for usage later on in the query.

SELECT
  trafficSource.medium,
  SUM(totals.visits) AS sessions,
  -- 5 will give the min, 25%, 50%, 75%, max with 20% error
  -- the more buckets, the better the approximation (error = 1/number of buckets)
  -- at the cost of more computation
  -- QUANTILES returns all of the buckets, use NTH to extract the bucket you want
  NTH(2, QUANTILES(totals.timeOnSite, 5)) AS firstQuartile,
  NTH(3, QUANTILES(totals.timeOnSite, 5)) AS mean,
  NTH(3, QUANTILES(totals.timeOnSite, 5)) AS thirdQuartile
FROM (
  SELECT
    trafficSource.medium,
    totals.visits,
    -- Sessions with a single page view will have no time on site reported
    IF(totals.timeOnSite IS NULL, 0, totals.timeOnSite) AS totals.timeOnSite
  FROM
    [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910])
GROUP BY
  trafficSource.medium
trafficSource_medium sessions firstQuartile mean thirdQuartile
referral 7 7 20 20
organic 23 5 18 18
cpc 15 13 23 23
(none) 18 8 18 18

Now we’ll bucket/bin our data in preparation for a histogram.

SELECT
  trafficSource.medium,
  SUM(totals.visits) AS sessions,
  SUM(IF(timeOnSite_bucket5sec = 0,1,0)) AS B0_4,
  SUM(IF(timeOnSite_bucket5sec = 1,1,0)) AS B5_9,
  SUM(IF(timeOnSite_bucket5sec = 2,1,0)) AS B10_14,
  SUM(IF(timeOnSite_bucket5sec = 3,1,0)) AS B15_19,
  SUM(IF(timeOnSite_bucket5sec = 4,1,0)) AS B20_24,
  SUM(IF(timeOnSite_bucket5sec = 5,1,0)) AS B25_29,
  SUM(IF(timeOnSite_bucket5sec > 5,1,0)) AS B30_
FROM (
  SELECT
    trafficSource.medium,
    totals.visits,
    -- Sessions with a single page view will have no time on site reported
    FLOOR(IF(totals.timeOnSite IS NULL, 0, totals.timeOnSite) / 5) AS timeOnSite_bucket5sec
  FROM
    [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910])
GROUP BY
  trafficSource.medium
trafficSource_medium sessions B0_4 B5_9 B10_14 B15_19 B20_24 B25_29 B30_
referral 7 1 1 0 1 1 1 2
organic 23 5 3 2 2 2 1 8
cpc 15 3 0 1 1 3 1 6
(none) 18 2 4 1 2 1 0 8

Dimensions

Some dimensions are simple to compute, while others require more ingenuity. Often these require subqueries, like we have seen earlier.

Extracting Dimensions from Other Fields

ga:pagePathLevel1 is an example of an easy-to-extract dimension. It is the first segment of the ga:pagePath (hits.page.pagePath in BigQuery), which we can pull out using SPLIT and NTH.

SELECT
  hits.page.pagePath,
  NTH(1, SPLIT(hits.page.pagePath, '/')) AS pagePathLevel1,
  NTH(2, SPLIT(hits.page.pagePath, '/')) AS pagePathLevel2
FROM
  -- We need to explicitly FLATTEN otherwise SPLIT will not behave as expected
  FLATTEN([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits.page.pagePath)
hits_page_pagePath pagePathLevel1 pagePathLevel2
/helmets/foldable.html helmets foldable.html
/ null null
/vests/ vests null
/vests/yellow.html vests yellow.html

However, we still need to bring along values we want to use later. For instance, if we wanted to compute the bounce rate per ga:pagePathLevel1, we should bring along totals.bounces.

SELECT
  pagePathLevel1,
  ROUND(SUM(totals.bounces) / SUM(totals.visits), 2) AS bounceRate
FROM (
  SELECT
    totals.visits,
    -- If no session bounced, then SUM of nulls is null, not 0
    IF(totals.bounces IS NULL, 0, 1) AS totals.bounces,
    NTH(1, SPLIT(hits.page.pagePath, '/')) AS pagePathLevel1,
  FROM
    -- We need to explicitly FLATTEN otherwise SPLIT will not behave as expected
    FLATTEN([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits.page.pagePath))
GROUP BY
  pagePathLevel1
pagePathLevel1 bounceRate
helmets 0.07
null 0.09
vests 0.03
login.html 0.0

Beyond simple transformations on fields, sometimes we want to create “arbitrary” groupings. In the case of Content Groupings that can be based on data already in the export (e.g. URL or Custom Dimension), we can tease it out via a CASE statement, among other ways.

SELECT
  hits.page.pagePath,
  CASE 
    WHEN pagePathLevel1 = 'helmets' THEN "Products" 
    WHEN pagePathLevel1 = 'vests' THEN "Products"
    WHEN pagePathLevel1 IN ('basket.html', 'shipping.html', 'billing.html', 'confirm.html') THEN "Checkout"
    WHEN pagePathLevel1 IS NULL OR pagePathLevel1 = 'login.html' THEN "RestOfSite"
    ELSE 'Unknown'
  END AS contentGroup1,
  CASE 
    WHEN pagePathLevel2 = 'foldable.html' THEN "Foldable" 
    WHEN pagePathLevel2 IN ('orange.html', 'yellow.html') THEN "Color"
  END AS contentGroup2
FROM (
  SELECT
    fullVisitorId,
    visitID,
    hits.page.pagePath,
    NTH(1, SPLIT(hits.page.pagePath, '/')) AS pagePathLevel1,
    NTH(2, SPLIT(hits.page.pagePath, '/')) AS pagePathLevel2,
  FROM
    -- We need to explicitly FLATTEN otherwise SPLIT will not behave as expected
    FLATTEN([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits.page.pagePath))
GROUP BY 
    hits.page.pagePath, 
    contentGroup1, 
    contentGroup2
hits_page_pagePath contentGroup1 contentGroup2
/helmets/foldable.html Products Foldable
/ RestOfSite null
/vests/ Products null
/vests/yellow.html Products Color
/vests/orange.html Products Color
/login.html RestOfSite null
/basket.html Checkout null

Slightly More Interesting Example

ga:landingPagePath is similarly easy to find using row-scoped aggregates; what that means is that for each row, we’re going to compute an aggregate based on a repeated value, in this case the hits.page.pagePath and hits.page.type. Since a session need not start with a pageview, we can’t simply grab all the hits with a hit.hitNumber equal to 1. To solve this problem, we’ll take the first hits.page.pagePath where hits.page.type is “PAGE”.

SELECT
  fullVisitorId,
  visitId,
  -- Uses the knowledge that hits are stored in chronological order
  FIRST(IF(hits.type = "PAGE", hits.page.pagePath, NULL)) WITHIN RECORD AS landingPagePath
FROM
  [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
fullVisitorId visitId landingPagePath
380066991751227408 1378805776 /helmets/foldable.html
712553853382222331 1378804218 /vests/
881288060286722202 1378803865 /helmets/
881288060286722202 1378804975 /

More Complex Example

For a metric like ga:daysSinceLastSession that requires knowledge of more than one session, in this case, a user’s previous session, we can use windowing functions.

Windowing functions allow us to compute a value for the current row given the value of other rows in the “window” of data we’re looking at. In the following example, LAG returns the row before the row being looked at as defined by PARTITION BY fullvisitorid ORDER BY visitStartTime ASC. PARTITION BY acts much like a GROUP BY clause, in that it creates new windows for each unique value of the fields listed. ORDER BY sorts the window according to the fields and order listed.

We can read LAG(visitStartTime) OVER (PARTITION BY fullvisitorid ORDER BY visitStartTime ASC) as “looking at only the sessions for a given user, ordered by visitStartTime, what was the visitStartTime previous to the the row I’m currently on”.

SELECT
  fullvisitorid,
  visitId,
  DATEDIFF(SEC_TO_TIMESTAMP(visitStartTime), SEC_TO_TIMESTAMP(prevVisitStartTime)) AS daysSinceLastSession,
  FLOOR((visitStartTime - prevVisitStartTime)/60) as minutesSinceLastSession
FROM (
  SELECT
    fullvisitorid,
    visitId,
    visitStartTime,
    LAG(visitStartTime) OVER (PARTITION BY fullvisitorid ORDER BY visitStartTime ASC) AS prevVisitStartTime
  FROM
    [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910])
fullvisitorid visitId daysSinceLastSession minutesSinceLastSession
3960256913998800485 1378805348 null null
4158255675143559999 1378818879 null null
4158255675143559999 1378819831 0 15.0

We are finding the visitStartTime of the previous session chronologically, and combining it with the visitStartTime of the current session to compute the days between the previous and current session. Above prevVisitStartTime will be NULL if there is no previous session, causing DATEDIFF return a NULL as well.

(The LondonCycleHelmet dataset is only of a single day, so we will only see NULL for the first session and 0 for the subsequent ones. That is why I also included minutesSinceLastSession to make the output a little more exciting!)

Even More Complicated Dimensions

Some dimensions, such as ga:channelGrouping and ga:hasSocialSourceReferral, require more computation that you can (or are willing to) write as a SQL statement. For this we leverage User Defined Functions, (UDFs) in BigQuery. UDFs are pieces of JavaScript that run in V8 on the same machine your data is on. For more information see my post, Self-Joins, Windowing, and User Defined Functions in BigQuery, or the BigQuery Documentation. UDFs take in a row of data, and return 0 or more rows.

In our case, we can use them to compute the Default Channel Grouping found in Google Analytics.

UDF:

function computeDefaultChannelGroup(row, output)
{
    if(   row.trafficSourceSource == '(direct)' 
     && (   row.trafficSourceMedium == '(not set)' 
         || row.trafficSourceMedium == '(none)')
    ) {
    output("Direct");
  }
  else if (row.trafficSourceMedium=="organic") {
    output("Organic");
  }
  else if (row.trafficSourceMedium=="referral") {
    output("Referral");
  }
  else if (row.trafficSourceMedium=="email")
  {
    output("Email");
  }
  else if ( (   row.trafficSourceMedium == "cpv"
             || row.trafficSourceMedium == "cpa"
             || row.trafficSourceMedium == "cpp"
             || row.trafficSourceMedium == "content-text")
  ) {
    output("Other Advertising");
  }
  else if ( ( row.trafficSourceMedium == "cpc"
           || row.trafficSourceMedium == "ppc"
           || row.trafficSourceMedium == "paidsearch")
           && row.adNetworkType != "Content"
  ) {
    output("Paid Search");
  }
  else if ( (   row.trafficSourceMedium == "social"
             || row.trafficSourceMedium == "social-network"
             || row.trafficSourceMedium == "social-media"
             || row.trafficSourceMedium == "sm"
             || row.trafficSourceMedium == "social network"
             || row.trafficSourceMedium == "social media")
           || false // Social Source Referral exactly match Yes
  ) {
    output("Social");
  }
  else if ( (   row.trafficSourceMedium == "display"
             || row.trafficSourceMedium == "cpm"
             || row.trafficSourceMedium == "banner")
           || row.adNetworkType == "Content"
  ) {
        output("Display");
  }
  else
  {
    output("Unknown");
  }
}
function defaultChannelGroup(row, emit)
{
    function output(channelGroup) {
        emit({channelGroup:channelGroup, 
                fullVisitorId: row.fullVisitorId, 
                visitId: row.visitId,
        });
    }
    computeDefaultChannelGroup(row, output);
}

bigquery.defineFunction(
  'defaultChannelGroup',                           // Name of the function exported to SQL
  ['trafficSourceMedium', 'trafficSourceSource', 'fullVisitorId', 'visitId','adNetworkType'],                    // Names of input columns
  [{'name': 'channelGroup', 'type': 'string'},  // Output schemac
   {'name': 'fullVisitorId', 'type': 'string'},
   {'name': 'visitId', 'type': 'integer'},
],
  defaultChannelGroup                       // Reference to JavaScript UDF
);

SQL:

SELECT
  fullVisitorId,
  visitId,
  channelGroup
FROM defaultChannelGroup(
  SELECT
    trafficSource.Medium AS trafficSourceMedium,
    trafficSource.Source AS trafficSourceSource,
    fullVisitorId,
    visitId,
    -- LCH is too old to have the needed field
    -- trafficSource.adwordsClickInfo.adNetworkType
    NULL AS adNetworkType
  FROM
    [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910])
fullVisitorId visitId channelGroup
380066991751227408 1378805776 Referral
712553853382222331 1378804218 Organic
881288060286722202 1378803865 Organic
881288060286722202 1378804975 Organic
881288060286722202 1378805870 Organic
1677140157296205498 1378803386 Paid Search

Note that above doesn’t take into account social referrers, as these are not in the export. A previous post provides a list we can use to build up social refers to check against.

Content Groupings, if too complicated to do with a CASE and/or REGEXP_EXTRACT, can also be computed via UDFs.

Wrap Up

Hopefully this post helps demystify what is and isn’t contained in the BigQuery export and gives enough examples and information on to compute missing dimensions and metrics from the export. BigQuery is an extremely flexible tool that can free your analysis from the constraints of the API, even if everything the API provides isn’t immediately available.