Querying Google Analytics Data In BigQuery
BigQuery, a database designed to query massive datasets in parallel using an SQL-like language, is a member of the Google Cloud Platform. What makes BigQuery interesting for Google Analytics users, specifically Premium customers, is that Google can dump raw Google Analytics data into BigQuery daily. While this enables many types of analysis that can’t be performed within the Google Analytics interface, it also doesn’t provide any basic metrics, e.g. bounce rate, to use.
How to Query BigQuery
The BigQuery web interface is a convenient way to test queries. Selecting the
google.com:analytics-bigquery:LondonCycleHelmet dataset from the left pane will show two tables,
ga_sessions_ prefix is given to all Google Analytics data dumps, and the
20130910 suffix is the date of the data in the dump. The refunds table is application specific and not generated by Google. (We’ll deal with the refunds table in another post!)
Clicking on either table will show its schema in the main pane. Clicking the “Query Table” button in the top right corner will present the code editor. Clicking fields will place them in the query editor. Sample queries are placed in this post to help you get started.
How is Data Stored in BigQuery
Each row in the Google Analytics BigQuery dump represents a single session and contains many fields, some of which can be repeated and nested, such as the hits, which contains a repeated set of fields within it representing the page views and events during the session, and custom dimensions, which is a single, repeated field . (This is one of the main differences between BigQuery and a normal database.)
The Export Schema contains many interesting fields and all of the raw data Google collects. This raw data can be used to recreate all of the metrics and reports you’ve come to love in Google Analytics (Although they may not be as pretty!), in addition to many other interesting metrics, such as the likelihood a bounced user’s next session won’t bounce.
Quick Introduction to SQL
BQSQL is a fairly readable language (and very similar to ANSI SQL found in other databases like PostgreSQL and MySQL) and you should be able to understand and reason about it. A quick and dirty way to remember the main (BigQuery SQL) BQSQL clauses is the following sentence.
SELECT some fields
FROM some datasource
WHERE the row is interesting
GROUP BY some fields I want to be unique
HAVING some aggregate I find interesting
ORDER BY some fields
LIMIT to the top so many
Throughout this and later blog posts, I will help familiarize the reader with BQSQL through actual usage. Similar to learning a foreign language through immersion, I’ll try to explain complexity as it arises and hope that the reader will “pick up” feel of the language as we go.
When doing analytics work, we often want to perform aggregate queries. These are queries that aggregate multiple rows of data into a single value. For example, think count the number of sessions generated by organic searches: we don’t want to know anything about each individual session, we just want the total count. Similarly for average revenue. A list of aggregate functions can be found with the BigQuery SQL documentation.
Example: Count the Number of Sessions Generated by Organic Searches
Note, for the examples in this articles will be using the example “London Cycle Helmet” dataset provided by Google. It is located in the
google.com:analytics-bigquery project and the
By default, aggregate queries will aggregate over the entire dataset. A simple
COUNT can be executed on any field and will count non-NULL fields. The
WHERE clause allows you to define conditions to filter your data with.
SELECT COUNT(totals.visits) AS visitCount FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] WHERE trafficSource.medium = "organic"
Here, we only want to count sessions where the traffic source’s medium was “organic.”
Example: Bounce Rate
Sessions that have bounced are marked in the
totals.bounces field with a
1 and sessions that didn’t bounce with a
NULL. The bounce rate is the total number of sessions that bounced divided by the total number of sessions. While the
AVERAGE function sounds like it would be what we’re looking for, it (like all aggregates) ignores nulls which would make the average 1! Instead, we need to simply
COUNT the bounces and total sessions. (Since
COUNT also ignores
NULL it only counts the bounces.)
SELECT COUNT(totals.bounces)/COUNT(totals.visits) AS bounceRate FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
COUNT the bounces and total sessions, divide those values, and then name the result of the calculation
bounceRate. The fields used come
FROM the sample table.
Example: Bounce Rate by Day or Hour
GROUP BY date and compute the bounce rate for each day instead of over all tables. (Sad note: the London Cycle Helmet dataset is only a single day. We will instead compute the bounce rate per hour instead.)
SELECT HOUR(SEC_TO_TIMESTAMP(visitStartTime)) AS sessionHour, COUNT(totals.bounces)/COUNT(totals.visits) AS bounceRate FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] GROUP BY sessionHour
Here, we use
SEC_TO_TIMESTAMP to convert the
visitStartTime field, which stores the time the session started as a UNIX Timestamp, to a type that the rest of the date-time functions in BigQuery, here
Also note that we moved the
bounceRate calculation onto a new line and indented it a little. This is purely for aesthetic reasons and all whitespace (spaces and line breaks) are collapsed into a single whitespace: this means you can put statements on new lines, indent as needed, etc to make your code readable and understandable. More over, you can use
-- to add comments to code as such. Any after a
-- and on the same line is treated as a comment and ignored.
Commenting every line is often unnecessary, but well placed and informative comments will save a lot of time trying to figure out why a query was written the way it was originally. A rule of thumb I use is that if I needed to spend time thinking of how to write a part of a query, that immediately gets a comment about why it was written the way it was. Another rule of thumb is if I wrote a query, found it to be incorrect and had to fix it because of something I wasn’t expecting or thought of, then it also gets a comment. (Think about
NULL. I know that now, but a new person reading my query later may not.)
-- Per-hour bounce rate calculation SELECT HOUR(SEC_TO_TIMESTAMP(visitStartTime)) AS sessionHour, -- Extracts the hour from the unix timestamp -- COUNT is an aggregate function and will only count non-NULL values COUNT(totals.bounces)/COUNT(totals.visits) AS bounceRate -- Computes the bounce rate. Don't use AVERAGE! it ignores NULLs -- Let's just use the one example table FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] GROUP BY sessionHour
Additionally, the capitalization used above (keywords in all caps and fields in CamelCase is only a convention to make reading easier. Capitalization doesn’t matter, but adhering to the convention is highly recommended and encouraged.
Example: Bounce Rates by Traffic Source Medium
GROUP BY works on any field, or group of fields, in the table or defined in the
SELECT clause is very powerful, especially when thinking about “some statistic” per “some value in a grouping”.
SELECT trafficSource.medium AS medium, COUNT(totals.bounces)/COUNT(totals.visits) AS bounceRate FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] GROUP BY medium
Now, let’s get crazy and group by multiple fields!
SELECT trafficSource.medium AS medium, HOUR(SEC_TO_TIMESTAMP(visitStartTime)) AS sessionHour, COUNT(totals.bounces)/COUNT(totals.visits) AS bounceRate FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] GROUP BY medium, sessionHour
“But!” you’re going to tell me, “they’re all in a random order!” and you would be correct! We need to
ORDER BY some field. Let’s do
SELECT trafficSource.medium AS medium, HOUR(SEC_TO_TIMESTAMP(visitStartTime)) AS sessionHour, COUNT(totals.bounces)/COUNT(totals.visits) AS bounceRate FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] GROUP BY medium, sessionHour ORDER BY sessionHour, medium
Example: Mediums with the Most Sessions per Hour
Now that you have the crazy juices flowing and feel like you can conquer any problem with just enough SQL, you begin to think “Hey, which mediums bring in the most people by hour!”
SELECT trafficSource.medium AS medium, HOUR(SEC_TO_TIMESTAMP(visitStartTime)) AS sessionHour, COUNT(totals.visits) AS sessionCount, COUNT(totals.bounces)/COUNT(totals.visits) AS bounceRate FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] GROUP BY medium, sessionHour ORDER BY sessionHour, sessionCount DESC
You might begin to see the types of questions that could be answered with BigQuery that you can’t in Google Analytics, for instance more detailed drill downs into user-based metrics. We’ll save a discussion of those for our next post however!
In addition to future blog posts on using BigQuery and other tools to analyze your data, the book Practical Google Analytics and Google Tag Manager for Developers recently published by Jonathan Weber and the rest of the team here has an entire chapter on learning and using BigQuery with Google Analytics! (In addition to a plethora of good information and best practices when using Google Tag Manager.)