Self‑Joins, Windowing, And User Defined Functions In BigQuery

May 12, 2016
Self-Joins, Windowing, And User Defined Functions In BigQuery

Google’s BigQuery offers unprecedented access to Google Analytics data for Google Analytics 360 (Premium) customers. With great data comes great challenges, especially when you start attempting complicated calculations beyond the traditional metrics in Google Analytics.

There are many ways to write queries in Google’s BigQuery, each has its strengths and weakness. Some of the basics of writing queries were covered in a previous post, but here we’re going to look at three different approaches to writing complex queries. As with any situation where there are multiple options, each option has use-cases in which they simplify the overall amount of effort needed to reach the goal.

The first method, using a self-join, is flexible and straight-forward, but can be very slow. The second, using windowing functions, is efficient and fairly straight-forward, but also limiting in what exactly can be queried. The last method we’ll discuss, User Defined Functions (UDFs), can be efficient on certain types of computations and can make expressing certain types of functions more straight-forward, but comes at the cost of speed.

Self-Joins

Self-Joins are when you join a table to itself, and one of many different ways to join a table to other datasources. Why would you ever want to join a table to itself? The most common reason is to find rows in a table that are related to each other. While this provides a great deal of flexibility, joins in BigQuery are inefficient — the larger the “smaller” table becomes, the more data needs to be shipped between nodes.

When self-joining, it’s possible to get into a situation where the entire table needs to be shipped to every node working on the query, as opposed to just the single, or small handful, that it would need otherwise.

An example of a query that could only be answered with a self-join would be a question like this: of people who viewed one specific page, which other pages did they view and how many times?

(Note: the Cross Join used below is not what makes a self join, but that the table being joined is the same. The Join Type is irrelevant.)

SELECT
  sp as pageA,
  pp as pageB,
  COUNT(*) AS crossCount
FROM (
  SELECT
    sess.fullvisitorid AS sfvi,
    pages.fullVisitorid AS pfvi,
    sess.hits.page.pagePath AS sp,
    pages.hits.page.pagePath AS pp
  FROM
    FLATTEN([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits) AS sess
  CROSS JOIN
    FLATTEN([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits) AS pages
  WHERE
    -- ensures each combo is only counted a single time
    pages.fullvisitorid < sess.fullvisitorid
    AND pages.hits.page.pagePath < sess.hits.page.pagePath
  GROUP BY sfvi, pfvi, sp, pp )
GROUP BY pageA, pageB
ORDER BY pageA, crossCount DESC

This could then loaded into a spreadsheet and a pivot table could be used to look at trends in how people visit pages.

Pivot Table of LCH co-pageviews

Another example would be to find the the page viewed before a page, we could find all related pages in a session using a self-join, filter out, using a WHERE clause because in BigQuery join conditions, in the ON, cannot have inequalities, all hits who have greater hit numbers, and then aggregating all the results for each hit and finding the greatest pageview less than the current one.

(Note: Again I’ll point out that the Inner Join used below is not what makes a self join, but that the table being joined is the same. The Join Type is irrelevant.)

SELECT
  cur_hit.fullVisitorId AS fullVisitorId,
  cur_hit.visitId AS visitID,
  cur_hit.hits.hitNumber AS cur_hitnumber,
  cur_hit.hits.page.pagePath as cur_pagePath,
  cur_hit.hits.time AS cur_time,
  MAX(prev_hit.hits.hitNumber) AS prev_hitNumber,
FROM
  FLATTEN([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits) AS prev_hit
INNER JOIN
  FLATTEN([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], hits) AS cur_hit
ON
  prev_hit.fullvisitorid = cur_hit.fullvisitorid
  AND prev_hit.visitid = cur_hit.visitid
WHERE
  prev_hit.hits.hitNumber < cur_hit.hits.hitNumber
  AND prev_hit.hits.type = "PAGE"
  AND cur_hit.hits.type = "PAGE"
GROUP BY fullVisitorId, visitID, cur_hitnumber, cur_pagePath, cur_time

However, as we will see in the following sections, this particular query can be done much more cleanly with windowing functions.

Windowing Functions

Windowing Functions allow you to view smaller portions of your data by allowing access to rows before and after the row currently being processed, in essence opening up “window” that allows you to see more than just the current row and to be able to see your data in different ways.

To help us understand this concept, let’s look at a simple example. Our “table” is the sequence 2,4,3,1 in a field named “num”.

SELECT
  LAG(num) OVER () AS lag,
  num,
  LEAD(num) OVER () AS lead,
FROM 
  ( SELECT 2 AS num),
  ( SELECT 4 AS num),
  ( SELECT 3 AS num),
  ( SELECT 1 AS num),

Note the odd OVER () for LAG and LEAD. This tells LAG and LEAD to use the entire table as their window. This is similar to the manner in which aggregate functions behave. More on that later!

Row

lag

num

lead

1 null 2 4
2 2 4 3
3 4 3 1
4 3 1 null

Note that LAG and LEAD are processing the rows in order of their appearance in the table. If we want them to process rows in sorted order by specifying their order in the OVER clause, as each window can be ordered by different values and in different directions, or in an ORDER BY clause.

SELECT
  LAG(num) OVER (ORDER BY num) AS ordered_lag,
  LAG(num) OVER () AS lag,
  num,
  LEAD(num) OVER () AS lead,
  LEAD(num) OVER (ORDER BY num) AS ordered_lead,
FROM 
  ( SELECT 2 AS num),
  ( SELECT 4 AS num),
  ( SELECT 3 AS num),
  ( SELECT 1 AS num),

 

Row ordered_lag lag num lead ordered_lead
1 null 3 1 null 2
2 1 null 2 4 3
3 2 4 3 1 4
4 3 2 4 3 null

Also note that the final output of rows is ordered by the window, not in table order as before. Even though the table is ordered, the LAG and LEAD OVER () return the values in the unordered window.

One of the many useful ways to use a windowing function is to modify the aggregate functions we know and love, e.g. SUM, AVG, STDEV, MIN, and MAX.

SELECT
  num,
  MAX(num) OVER (ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) as max_here_1_following
FROM 
  ( SELECT 2 AS num),
  ( SELECT 4 AS num),
  ( SELECT 3 AS num),
  ( SELECT 1 AS num),

Row

num

max_here_1_following

1 2 4
2 4 4
3 3 3
4 1 1

As stated before, the windows need not be the same for every function. While this complicates understanding the query (remember! the query is ordered by the final ORDER BY), it can be useful at times. Here is a slightly convoluted query illustrating these points.

SELECT
  num,
  AVG(num) OVER (ROWS BETWEEN 1 PRECEDING AND CURRENT ROW ) as moving_avg,
  AVG(num) OVER (ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING ) as moving_avg_middled,
  LAG(num) OVER (ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING ) as lag_mid,
  LEAD(num) OVER (ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING ) as lead_mid,
  MAX(num) OVER (ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING ) as max_here_3_more,
  MAX(num) OVER (ORDER BY num DESC ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING ) as max_here_3_more_desc,
  MIN(num) OVER (ORDER BY num ASC ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING ) as max_here_3_more_asc,
FROM 
  ( SELECT 2 AS num),
  ( SELECT 4 AS num),
  ( SELECT 3 AS num),
  ( SELECT 1 AS num),

Row

num

moving_avg

moving_avg
_middled

lag_mid

lead_mid

max_here
_3_more

max_here_3
_more _desc

max_here_3_
more_asc

1 4 3.0 3.0 2 3 4 4 4
2 3 3.5 2.66 4 1 3 3 3
3 2 2.0 3.0 null 4 4 2 2
4 1 2.0 2.0 3 null 1 1 1

We’re not done with windowing functions yet! Remember how I said that aggregate functions behave as though they were OVER ()? The GROUP BY can be done within the window! However, they’re called “partitions” and you PARTITION BY in a window.

SELECT
  num,
  COUNT(num) OVER (PARTITION BY num) AS partion_count,
  COUNT(num) OVER () AS plain_count
FROM 
  ( SELECT 2 AS num),
  ( SELECT 4 AS num),
  ( SELECT 3 AS num),
  ( SELECT 1 AS num),
  ( SELECT 1 AS num),

Row

num

partion_count

plain_count

1 4 1 5
2 3 1 5
3 1 2 5
4 1 2 5
5 2 1 5

Coming back to Google Analytics, and how to apply windowing functions to it, we can simplify the query to find time-on-page significantly, both in visual and execution complexity.

SELECT
  fullVisitorId,
  visitId,
  hits.hitNumber,
  hits.page.pagePath,
  -- get next and previous hit time to be able to work out length of each hit
  hits.time AS hit_time,
  LEAD(hits.time, 1) OVER (PARTITION BY fullVisitorId, visitId ORDER BY hits.time ASC) AS next_hit_time,
  FROM
    [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
  WHERE
    hits.type = "PAGE"

For another example, one of the reasons we could have used a self-join would be to get the number of a certain kind of event in the next 2 hits following a pageview. We can do this succinctly (and efficiently!) with a windowing function! (“Jim, look. Surely you must know that the page is attached to the event and we can just do a simple GROUP BY and COUNT!” // “Yes, but what if we’re on a different page now! Maybe I want to know how many people login shortly after visiting a product.”).

SELECT
  COUNT(*) as logins_after_product_view
FROM (
  SELECT
    SUM(IF(hits.eventinfo.eventCategory = "Login", 1, 0)) OVER (PARTITION BY fullvisitorid, visitid ROWS BETWEEN 1 FOLLOWING AND 3 FOLLOWING) AS login_count,
    hits.page.pagePath as pagePath
  FROM
    [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] )
WHERE
  login_count > 0
  AND (pagePath CONTAINS 'vests' OR pagePath CONTAINS 'helments')

Row

logins_after_product_view

1 5

An interesting use of windowing functions is to compute which transaction or event a session precedes. Here we take advantage of a window function being executed over each row individually, but carrying the totals forward. If a session has a transaction, then we give it a 1, otherwise a 0. We then add up all of the values after the current session (we’re sorting descending). The sum will only increment if there was a transaction and stay constant for all sessions chronologically before the transaction until the next transaction. This gives us a value that is the same for the transaction and all sessions before it and after the previous transaction.

SELECT
  fullVisitorId,
  visitId,
  SEC_TO_TIMESTAMP(visitStartTime) AS time,
  transacted,
  SUM(transacted) OVER (PARTITION BY fullVisitorId ORDER BY visitStartTime DESC) AS transaction_group
FROM (
  SELECT
    fullVisitorId,
    visitId,
    visitStartTime,
    MAX(IF(hits.item.transactionID IS NOT NULL, 1, 0)) AS transacted
  FROM (
    SELECT
      fullVisitorId,
      visitId,
      visitStartTime,
      hits.item.transactionID
    FROM
      [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910] )
  GROUP BY
    fullVisitorId,
    visitId,
    visitStartTime)
WHERE
  fullVisitorId='5711522334224447562'

fullVisitorId

visitId

time

transacted

transaction_group

5711522334224447562 1378805285 2013-09-10 09:28:05 UTC 0 0
5711522334224447562 1378804654 2013-09-10 09:17:34 UTC 1 1
5711522334224447562 1378803724 2013-09-10 09:02:04 UTC 0 1

User Defined Functions

User Defined Functions (UDFs) are JavaScript functions running in the BigQuery database (using V8). These functions are run as “close” to your data as possible to minimize the amount of time spent shuttling data around. UDFs can do anything you can write a program to do, as such they are extremely powerful, and also extremely easy to abuse and over-complicate a query. More on the details of UDFs can be found in the BigQuery documentation.

The long and short is that the input to a UDF is a query whose output fields are named the same as the fields when registering the UDF (below). The UDF is run once per row, however the UDF may output multiple rows. Take for example, calculating the time on the page, the session and list of hits are a single input row, and multiple rows of output are generated giving the difference between each pageview.

SELECT
  fullVisitorId,
  visitId,
  hitNumber,
  pagePath,
  diff AS timeOnPage_ms
FROM
  timePerPage(
  SELECT
    fullVisitorId,
    visitId,
    NEST(hits.page.pagePath) AS pages,
    NEST(hits.type) AS types,
    NEST(hits.time) AS times
  FROM
    [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
  GROUP BY
    fullVisitorId,
    visitId
  LIMIT
    10)

This query also introduces the NEST function, which aggregates all rows into a single repeated field. Repeated fields (similar to how hits and custom dimensions are normally stored) are exposed to our UDF as an array.

To edit the UDF, click on the UDF Editor button in the top right corner.

Example UDF Screen

function timePerPage(row, emit) {
  // old index -- index of the previous page
  var oi = 0;
  // index -- index of the current page
  var i = 1;
  for (; i < row.types.length;  ) {
    var prev_type = row.types[oi];
    var prev_ts   = row.times[oi];
    
    var curr_type = row.types[i];
    var curr_ts   = row.times[i];
    
    if (curr_type != 'PAGE') { i++; continue; }
    if (prev_type != 'PAGE') { oi = i; i++; continue; }
    
    emit({fullVisitorId: row.fullVisitorId, 
          visitId: row.visitId,
          pagepath: row.pages[oi],
          hitNumber: oi + 1,
          diff: curr_ts - prev_ts});
    oi = i;
    i++;
  }
}

// This is what allows you to use the function in BigQuery
bigquery.defineFunction(
  'timeperpage', // Name in BigQuery
  ['fullVisitorId', 'visitId', 'types', 'times', 'pages'], // Input field names
  [{'name': 'diff', 'type': 'integer'}, // Output Schema
   {'name': 'pagepath', 'type': 'string'},
   {'name': 'fullVisitorId', 'type': 'string'},
   {'name': 'visitId', 'type': 'integer'},
  {'name': 'hitNumber', 'type': 'integer'}],
  timePerPage // reference to the javascript function to run
);

Because UDFs are a full-fledged programming language, we can compute multiple values or introduce complexities that would be hard to model in SQL. For instance, let’s say we’d like to combine subsequent pages if they are the same URL. We could simply update our function to not increment oi or emit a row.

function timePerPage(row, emit) {
  var oi = 0
  var i = 1;
  for (; i < row.types.length;  ) {
    var prev_type = row.types[oi];
    var prev_ts   = row.times[oi];
    
    var curr_type = row.types[i];
    var curr_ts   = row.times[i];
    
    if (curr_type != 'PAGE') { i++; continue; }
    if (prev_type != 'PAGE') { oi = i; i++; continue; }
    
    // consider subsequent pageviews the additional time on the first hit
    if (row.pages[oi] == row.pages[i]) { i++; continue; }
    emit({fullVisitorId: row.fullVisitorId, 
          visitId: row.visitId,
          pagepath: row.pages[oi],
          hitNumber: oi + 1,
          diff: curr_ts - prev_ts});
    oi = i;
    i++;
  }
}

Conclusion

BigQuery offers many powerful ways for you to leverage your data, but have caveats and fortes each their own. By playing around with the small London Cycle Helmet dataset, other example data Google provides, as well as your own data, you can feel your way through these different methods and learn how to leverage them to your advantage.