Querying Google Analytics Data In BigQuery

February 11, 2016
Querying Google Analytics Data In BigQuery

BigQuery, a database designed to query massive datasets in parallel using an SQL-like language, is a member of the Google Cloud Platform. What makes BigQuery interesting for Google Analytics users, specifically Premium customers, is that Google can dump raw Google Analytics data into BigQuery daily. While this enables many types of analysis that can’t be performed within the Google Analytics interface, it also doesn’t provide any basic metrics, e.g. bounce rate, to use.

How to Query BigQuery

The BigQuery web interface is a convenient way to test queries. Selecting the google.com:analytics-bigquery:LondonCycleHelmet dataset from the left pane will show two tables, ga_sessions_20130910 and refunds_201309. The ga_sessions_ prefix is given to all Google Analytics data dumps, and the 20130910 suffix is the date of the data in the dump. The refunds table is application specific and not generated by Google. (We’ll deal with the refunds table in another post!)

BigQueryEditor

Clicking on either table will show its schema in the main pane. Clicking the “Query Table” button in the top right corner will present the code editor. Clicking fields will place them in the query editor. Sample queries are placed in this post to help you get started.

How is Data Stored in BigQuery

Each row in the Google Analytics BigQuery dump represents a single session and contains many fields, some of which can be repeated and nested, such as the hits, which contains a repeated set of fields within it representing the page views and events during the session, and custom dimensions, which is a single, repeated field . (This is one of the main differences between BigQuery and a normal database.)

The Export Schema contains many interesting fields and all of the raw data Google collects. This raw data can be used to recreate all of the metrics and reports you’ve come to love in Google Analytics (Although they may not be as pretty!), in addition to many other interesting metrics, such as the likelihood a bounced user’s next session won’t bounce.

Quick Introduction to SQL

BQSQL is a fairly readable language (and very similar to ANSI SQL found in other databases like PostgreSQL and MySQL) and you should be able to understand and reason about it. A quick and dirty way to remember the main (BigQuery SQL) BQSQL clauses is the following sentence.

SELECT some fields

FROM some datasource

WHERE the row is interesting

GROUP BY some fields I want to be unique

HAVING some aggregate I find interesting

ORDER BY some fields

LIMIT to the top so many

Throughout this and later blog posts, I will help familiarize the reader with BQSQL through actual usage. Similar to learning a foreign language through immersion, I’ll try to explain complexity as it arises and hope that the reader will “pick up” feel of the language as we go.

When doing analytics work, we often want to perform aggregate queries. These are queries that aggregate multiple rows of data into a single value. For example, think count the number of sessions generated by organic searches: we don’t want to know anything about each individual session, we just want the total count. Similarly for average revenue. A list of aggregate functions can be found with the BigQuery SQL documentation.

Example: Count the Number of Sessions Generated by Organic Searches

Note, for the examples in this articles will be using the example “London Cycle Helmet” dataset provided by Google. It is located in the google.com:analytics-bigquery project and the LondonCycleHelmet dataset.

By default, aggregate queries will aggregate over the entire dataset. A simple COUNT can be executed on any field and will count non-NULL fields. The WHERE clause allows you to define conditions to filter your data with.

SELECT COUNT(totals.visits) AS visitCount
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
WHERE trafficSource.medium = "organic"

Here, we only want to count sessions where the traffic source’s medium was “organic.”

Example: Bounce Rate

Sessions that have bounced are marked in the totals.bounces field with a 1 and sessions that didn’t bounce with a NULL. The bounce rate is the total number of sessions that bounced divided by the total number of sessions. While the AVERAGE function sounds like it would be what we’re looking for, it (like all aggregates) ignores nulls which would make the average 1! Instead, we need to simply COUNT the bounces and total sessions. (Since COUNT also ignores NULL it only counts the bounces.)

SELECT COUNT(totals.bounces)/COUNT(totals.visits) AS bounceRate
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]

Here, we COUNT the bounces and total sessions, divide those values, and then name the result of the calculation bounceRate. The fields used come FROM the sample table.

Example: Bounce Rate by Day or Hour

We could GROUP BY date and compute the bounce rate for each day instead of over all tables. (Sad note: the London Cycle Helmet dataset is only a single day. We will instead compute the bounce rate per hour instead.)

SELECT HOUR(SEC_TO_TIMESTAMP(visitStartTime)) AS sessionHour,
       COUNT(totals.bounces)/COUNT(totals.visits) AS bounceRate
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
GROUP BY sessionHour

Here, we use SEC_TO_TIMESTAMP to convert the visitStartTime field, which stores the time the session started as a UNIX Timestamp, to a type that the rest of the date-time functions in BigQuery, here HOUR, understand.

Also note that we moved the bounceRate calculation onto a new line and indented it a little. This is purely for aesthetic reasons and all whitespace (spaces and line breaks) are collapsed into a single whitespace: this means you can put statements on new lines, indent as needed, etc to make your code readable and understandable. More over, you can use -- to add comments to code as such. Any after a -- and on the same line is treated as a comment and ignored.

Commenting every line is often unnecessary, but well placed and informative comments will save a lot of time trying to figure out why a query was written the way it was originally. A rule of thumb I use is that if I needed to spend time thinking of how to write a part of a query, that immediately gets a comment about why it was written the way it was. Another rule of thumb is if I wrote a query, found it to be incorrect and had to fix it because of something I wasn’t expecting or thought of, then it also gets a comment. (Think about AVERAGE ignoring NULL. I know that now, but a new person reading my query later may not.)

-- Per-hour bounce rate calculation
SELECT HOUR(SEC_TO_TIMESTAMP(visitStartTime)) AS sessionHour, -- Extracts the hour from the unix timestamp
       -- COUNT is an aggregate function and will only count non-NULL values
       COUNT(totals.bounces)/COUNT(totals.visits) AS bounceRate -- Computes the bounce rate. Don't use AVERAGE! it ignores NULLs
-- Let's just use the one example table
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
GROUP BY sessionHour

Additionally, the capitalization used above (keywords in all caps and fields in CamelCase is only a convention to make reading easier. Capitalization doesn’t matter, but adhering to the convention is highly recommended and encouraged.

Example: Bounce Rates by Traffic Source Medium

Knowing that GROUP BY works on any field, or group of fields, in the table or defined in the SELECT clause is very powerful, especially when thinking about “some statistic” per “some value in a grouping”.

SELECT trafficSource.medium AS medium,
       COUNT(totals.bounces)/COUNT(totals.visits) AS bounceRate
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
GROUP BY medium

Now, let’s get wild and group by multiple fields!

SELECT trafficSource.medium AS medium,
       HOUR(SEC_TO_TIMESTAMP(visitStartTime)) AS sessionHour,
       COUNT(totals.bounces)/COUNT(totals.visits) AS bounceRate
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
GROUP BY medium, sessionHour

“But!” you’re going to tell me, “they’re all in a random order!” and you would be correct! We need to ORDER BY some field. Let’s do sessionHour first.

SELECT trafficSource.medium AS medium,
       HOUR(SEC_TO_TIMESTAMP(visitStartTime)) AS sessionHour,
       COUNT(totals.bounces)/COUNT(totals.visits) AS bounceRate
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
GROUP BY medium, sessionHour
ORDER BY sessionHour, medium

Example: Mediums with the Most Sessions per Hour

Now that you have the inspiration juices flowing and feel like you can conquer any problem with just enough SQL, you begin to think “Hey, which mediums bring in the most people by hour!”

SELECT trafficSource.medium AS medium,
       HOUR(SEC_TO_TIMESTAMP(visitStartTime)) AS sessionHour,
       COUNT(totals.visits) AS sessionCount,
       COUNT(totals.bounces)/COUNT(totals.visits) AS bounceRate
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
GROUP BY medium, sessionHour
ORDER BY sessionHour, sessionCount DESC

Going Further

You might begin to see the types of questions that could be answered with BigQuery that you can’t in Google Analytics, for instance more detailed drill downs into user-based metrics. We’ll save a discussion of those for our next post however!

In addition to future blog posts on using BigQuery and other tools to analyze your data, the book Practical Google Analytics and Google Tag Manager for Developers recently published by Jonathan Weber and the rest of the team here has an entire chapter on learning and using BigQuery with Google Analytics! (In addition to a plethora of good information and best practices when using Google Tag Manager.)