Logging Raw Google Analytics Data Using Keen IO

September 2, 2014 | Jonathan Weber

Google Analytics export to BigQuery is great for getting at the raw session-level data of Google Analytics, but it’s only for GA Premium (GAP) subscribers. If you have other reasons to need GAP (like increased sampling limits, DoubleClick integration, or additional custom dimensions) and you have the money to spend, GAP is a great option.

Raw GA data?

But what if you’re not a GAP subscriber? Can you still get the raw, session-level data?

In a word: no (at least not from GA). All of the data in GA reports and in its associated reporting APIs is aggregated data. You can create and export reports full of dimensions and metrics, but there’s no report that can give you all of the information for each session the way BigQuery can.

We can do this

Fortunately, there’s an alternative: send the raw data into a repository where it’s easily accessible. Especially if you’re using Google Tag Manager, it’s pretty easy to fire an additional tag at the same time using the same rules as Google Analytics.

There’s a third-party tool that’s perfect for this kind of logging: Keen IO. Keen IO is intended for gathering unstructured event data from any source you like: websites, games, devices. I say “unstructured” because you are free to send any kind of data you like; there are no requirements except a timestamp and whatever set of properties you’d like to record.

Keen IO is a subscription software service. You don’t have to worry about any of the details of how or where the data is stored, and you pay based on the number of events you send per month (up to 50,000/month is free, so it’s easy to try out with no commitment, and it scales at reasonable prices from there). You can send data as JSON through a simple REST API, or use one of the ready-to-go SDKs in a variety of languages (including JavaScript, which is good for us in collecting web data).
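
To make the REST option concrete, here is a minimal sketch of what writing a single event might look like. It only builds the request rather than sending it. The endpoint shape and the Authorization header follow Keen's 3.0 API as I understand it; the helper name buildKeenRequest, the PROJECT_ID and WRITE_KEY placeholders, and the example properties are assumptions for illustration, so check them against Keen's current documentation.

```javascript
// Sketch only: builds (but does not send) a request that records one event.
// buildKeenRequest is a hypothetical helper; PROJECT_ID and WRITE_KEY are placeholders.
function buildKeenRequest(projectId, writeKey, collection, eventData) {
    return {
        method: "POST",
        // One URL per event collection within a project.
        url: "https://api.keen.io/3.0/projects/" + encodeURIComponent(projectId) +
            "/events/" + encodeURIComponent(collection),
        headers: {
            "Authorization": writeKey,
            "Content-Type": "application/json"
        },
        // The event itself is just a JSON object with whatever properties you like.
        body: JSON.stringify(eventData)
    };
}

var request = buildKeenRequest("PROJECT_ID", "WRITE_KEY", "pageviews", {
    clientId: "1234567890.1409616000",
    page: "/blog/"
});
```

The resulting object can be handed to whatever HTTP client you prefer. In the browser, the JavaScript SDK takes care of this step for you.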

There are also simple APIs for querying or exporting the data for analysis. Keen’s query APIs are fairly limited (compared to BigQuery, for example), but for moderate volumes of data (like the number of hits within the non-Premium GA limit of 10 million/month), they’re absolutely fine. Keen doesn’t have any built-in reports, so you’ll be pulling queries or extracting data to use with another tool for analysis or visualization (just like BigQuery).

(Would I recommend that you only use Keen IO and drop GA altogether? Definitely not. They’re both great at what they do. Keen IO fills a gap in GA, but it doesn’t recreate all the functionality of GA.)

How it works

  1. Keen IO JavaScript library and configuration. First, you’ll have to sign up for a (free) Keen IO account. Then load Keen IO’s JavaScript library by including Keen’s embed script in your pages.

    The script loads the library and sets up a tracker object (here I’ve called it keenTracker) with the project ID and write key that Keen supplies when you sign up.

  2. Tracking pageviews, events, and any other interactions. To log something to Keen, we can use the following script:
    keenTracker.addEvent("www.example.com", keenData);
    

    The first argument is the name of the event collection to write to (here, "www.example.com"). Of course, we still have to fill in keenData (a plain JavaScript object that gets sent as JSON), but we’ll get to that in a second. If you’re using GTM (by far the easiest way to do this, rather than hard-coding in your pages), you’ll probably want to create one or more tags that combine the code from step 1 with the code from step 2 and use macros and information from your dataLayer to fill in the data (which we’ll get to next).

  3. Leveraging Google Analytics & Tag Manager data. So how are we filling in all of this data? First off, let’s leverage all the things GA or GTM can tell us. GA has a client ID (a unique ID for the device which is used to count users), as well as a bunch of information about the size of the screen and other details we might be interested in. You can access these properties by using the get command in GA, like this:
    // The callback runs once analytics.js has loaded and receives the default tracker.
    ga(function(tracker) {
        dataLayer.push({
            "clientId": tracker.get("clientId")
        });
    });
    

    Here I simply pushed the value to the dataLayer, since it will be easy to grab in GTM. Besides clientId, there’s a slew of system details you may want to grab, such as screen size; the field reference for GA gives the names of all of these properties. Note that these values won’t be available until after GA does its thing, so use tag priority to make sure this code runs after your GA tag.

    Besides the stuff GA knows, GTM can also get the value of items such as the URL of the page, the referrer, the values of cookies, and more. Use whatever you need!

    Then we’ll likely want to fill all this in for the Keen data using some macros (from the dataLayer or anywhere else you like). In your tag, before the addEvent command, let’s set up the data:

    var keenData = {
        'clientId': {{Client Id macro}},
        // ...and so on...
    };
    

    You may need a couple of different kinds of tags (one to match the types of data in GA pageviews, another in GA events). You’ll want to set them up on the same rules as your GA tags in GTM.

    (Keen even gives a full recipe for capturing pageviews, although we’re short-circuiting some of the steps by leveraging information that already exists in GA & GTM.)

  4. Enrich data with Keen’s add-ons. Keen also has some add-ons that help enrich the data. For example, it can capture the user agent (browser type and version) and parse that into its various pieces. The most important add-on is geo-location by IP address: if you capture the IP address in the data you send, you can have Keen add a city, state, country, and postal code. GA does this too, but only after the data has been sent, so the location never shows up in what we capture on the page; we’ll want to replicate it in our Keen data. Keen’s documentation covers the add-ons in more detail.
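
Pulling steps 3 and 4 together, the event object might be assembled along these lines. This is a sketch under a few assumptions: the add-on names (keen:ip_to_geo, keen:ua_parser) and the ${keen.ip} and ${keen.user_agent} placeholders are as I recall them from Keen's data enrichment documentation, while buildKeenEvent, its input names, and the output property names are made up for illustration. In a real GTM tag the input values would come from macros rather than literals.

```javascript
// Sketch only: assembles one event from values you would normally read from
// GTM macros or the dataLayer. buildKeenEvent and its inputs are hypothetical.
function buildKeenEvent(values) {
    return {
        clientId: values.clientId,
        page: values.page,
        referrer: values.referrer,
        // Placeholders that Keen fills in on its side when the event arrives.
        ip_address: "${keen.ip}",
        user_agent: "${keen.user_agent}",
        keen: {
            // Each add-on reads the property named in "input" and writes
            // its result to the property named in "output".
            addons: [
                {
                    name: "keen:ip_to_geo",
                    input: { ip: "ip_address" },
                    output: "ip_geo_info"
                },
                {
                    name: "keen:ua_parser",
                    input: { ua_string: "user_agent" },
                    output: "parsed_user_agent"
                }
            ]
        }
    };
}

var keenData = buildKeenEvent({
    clientId: "1234567890.1409616000",
    page: "/blog/",
    referrer: "https://www.google.com/"
});
```

The result is what you would pass as the second argument to keenTracker.addEvent.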

Reading the data

Once the data is in Keen IO, it’s just a big list of all the properties you sent. You can query it using Keen IO’s read APIs, which let you do a variety of types of queries from simple counting and summing, to filtering and funnel analysis.

This process is a little different from BigQuery, in that Keen uses a simple REST API with parameters to define the query rather than a SQL-like query syntax. Many of the most common tasks can be accomplished through this API, but note that you can also use the API to extract an entire data set to another tool (BigQuery, even, or to an analysis or visualization tool like Tableau or R).
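
As an illustration of the "REST API with parameters" style, the sketch below builds the URL for a simple count query with one filter. The URL shape and parameter names (api_key, event_collection, timeframe, filters) reflect my understanding of Keen's 3.0 query API; buildCountQueryUrl, PROJECT_ID, READ_KEY, and the "pageviews" collection are placeholders assumed for the example.

```javascript
// Sketch only: builds the URL for a "count" query without sending it.
// buildCountQueryUrl is a hypothetical helper; PROJECT_ID and READ_KEY are placeholders.
function buildCountQueryUrl(projectId, readKey, collection, timeframe, filters) {
    var params = [
        "api_key=" + encodeURIComponent(readKey),
        "event_collection=" + encodeURIComponent(collection),
        "timeframe=" + encodeURIComponent(timeframe)
    ];
    // Filters are passed as a URL-encoded JSON array.
    if (filters && filters.length) {
        params.push("filters=" + encodeURIComponent(JSON.stringify(filters)));
    }
    return "https://api.keen.io/3.0/projects/" + encodeURIComponent(projectId) +
        "/queries/count?" + params.join("&");
}

// Count events for one page over the last seven days.
var queryUrl = buildCountQueryUrl("PROJECT_ID", "READ_KEY", "pageviews", "this_7_days", [
    { property_name: "page", operator: "eq", property_value: "/blog/" }
]);
```

Where BigQuery would have you write a SELECT COUNT(*) with a WHERE clause, here the same intent is spread across the endpoint name and the query parameters.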

Enhancements

There’s lots of additional logic you could use to enhance this data and make it even better:

  • Implement a cookie to sessionize events on the client side.
  • Process GA campaign tags contained in URL query parameters into the data.
  • Leverage GA’s new tasks feature to improve your Keen IO data collection. For example, you could abort the event if the page is rendering in a “Top Sites” preview in Safari, or check whether cookies are enabled.
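
As one example, the campaign-tag idea could be sketched roughly like this. The helper parseCampaignParams is hypothetical (it is not part of GA, GTM, or Keen), and it assumes the standard utm_ prefix on GA campaign parameters; the result could be merged into the event data before sending.

```javascript
// Sketch only: pulls GA campaign tags (utm_*) out of a URL's query string.
// parseCampaignParams is a hypothetical helper written for this example.
function parseCampaignParams(url) {
    var campaign = {};
    var queryStart = url.indexOf("?");
    if (queryStart === -1) {
        return campaign;
    }
    // Drop any #fragment, then split the query string into key=value pairs.
    var query = url.slice(queryStart + 1).split("#")[0];
    var pairs = query.split("&");
    for (var i = 0; i < pairs.length; i++) {
        var parts = pairs[i].split("=");
        var key = decodeURIComponent(parts[0]);
        if (key.indexOf("utm_") === 0) {
            // Treat "+" as a space, as form-style query strings do.
            var rawValue = parts.slice(1).join("=").replace(/\+/g, " ");
            campaign[key] = decodeURIComponent(rawValue);
        }
    }
    return campaign;
}

var campaign = parseCampaignParams(
    "https://www.example.com/?utm_source=newsletter&utm_medium=email&utm_campaign=fall+sale&page=2"
);
```

Parameters without the utm_ prefix (like page above) are ignored, so only campaign information ends up in the event.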

Other tools

Snowplow Analytics is similar to Keen IO, and is available both as software-as-a-service and as open source software for running your own data warehouse on top of technologies like Amazon S3 and Redshift for storage. It’s a solid solution, and in some ways even better suited than Keen IO to this problem, albeit a bit more complicated to set up since it relies on underlying Amazon Web Services. It’s definitely worth a look if you are seriously considering a solution like this one.

BigQuery itself recently started supporting streaming data import (rather than batch jobs), but given the way authorization works in BigQuery, it’s not really appropriate for client-side tracking. If you were sending data from the server-side, it would be a possibility.

Whatever tool you choose, combining Google Tag Manager with one of these data logging tools lets you collect raw interaction data for research and analysis.