Group Content By Usage Patterns With Google Analytics And LDA

March 21, 2017

There are often many ways of classifying your content. In fact, Google Analytics has five different available slots for creating content groups. These content groups allow you to see aggregated metrics for related content. However, before you can aggregate these metrics, you have to decide how you want to group the content. You may have informative directories or other information in the URL that is useful for grouping content. You can also use information on the page or push information from your CMS. In this post, I’ll discuss an alternative way of algorithmically grouping content by looking at which content items are typically viewed together.

The algorithm we are going to use is called Latent Dirichlet Allocation (LDA). Typically, this algorithm takes in the text of all of your content. It then produces a list of topics (you pick the number of topics in advance) as well as an estimate of what percentage of each content item is devoted to each topic.

For example, when we ran the LDA algorithm over the LunaMetrics blogs data, we came up with four main topics: AdWords, Google Tag Manager, SEO, and Reporting. We can also see which content items talk about these topics. For example, the algorithm tells us that:

  • 9 Ways to Ensure Ad Tags Work in Google Tag Manager is 60% about Google Tag Manager, 18% about AdWords, and 11% evenly spread over the other two topics
  • Form Engagement Tracking with Google Tag Manager is overwhelmingly (85%) about Google Tag Manager
  • What’s Missing in the Google Analytics BigQuery Export Schema? is overwhelmingly (85%) about Reporting

*As explained later in the post, these topics were named by us based on which content items were included in each topic. If your data plays nicely with the algorithm, topic names should be obvious.

Note that due to the nature of the algorithm, it’s unlikely that you will see an article with 100% assignment to a specific topic.

However, if you do not have access to an export of your content, or if some of your important content is not text-heavy (large graphics, forms, videos, etc.), you may want to consider an alternative way to implement LDA.

Focusing On Users’ Behavior

In this LDA adaptation, we will classify content based on which pages users consume together rather than on the text of each page. This adaptation assumes there are several distinct reasons why users visit your site, and that certain content items are associated with each reason. Each user may come to the site with one or more of these reasons in mind and will visit the content items associated with that intent.

The input for the algorithm will now be a list of users and which content items they viewed. The output will be a list of content groups (topics) and which URLs are assigned to each content group. Technically speaking, the algorithm will assign each URL / topic combination a weight, since any given content item might discuss multiple topics. Below, I will show you how to pick out the most prominent topic for each content item, for ease of use. However, if you want to assign multiple topics to certain content items, the algorithm easily supports this. Also, each user will be given a score of how interested they are in each content group.
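
To make that input format concrete, here is a purely illustrative sketch of the user-by-pages data we will build below (the visitor IDs and page paths are made up):

# purely illustrative: two made-up visitors, with the pages each one viewed
# collapsed into a single space-separated string per visitor
exampleInput <- data.frame(
  fullVisitorId = c("1234567890", "9876543210"),
  path = c("/blog/tag-manager-form-tracking /blog/adwords-bidding",
           "/blog/bigquery-export-schema /blog/tag-manager-form-tracking"),
  stringsAsFactors = FALSE
)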

For the rest of the post, I will show you how to run this analysis using R and BigQuery. If you do not have BigQuery, but do have the client ID reliably stored as a custom dimension, you should still be able to run this analysis.

Load Packages

I will be using the following packages in this analysis:

library(bigrquery)  # connect to Google BigQuery from R
library(tidyverse)  # data manipulation (dplyr, tidyr, etc.)
library(lda)        # collapsed Gibbs sampler for LDA

Get Your Data

Set Your Parameters

Replace the values below with your own BigQuery project ID and view ID. You should also choose a date range.

startDate <- as.Date("2016-10-01")
endDate <- as.Date("2017-01-31")
BQProject <- "YOUR-BIGQUERY-PROJECT-ID-HERE"
# this should be the ID for the view that you exported to BigQuery
viewId <- "XXXXXX"

Get Data out of BigQuery

There are several packages out there for connecting R with BigQuery. I will use the bigrquery package by Hadley Wickham.

sql <- paste("SELECT fullVisitorId, GROUP_CONCAT(hits.page.pagePath, ' ') AS path
FROM (SELECT fullVisitorId, hits.page.pagePath
      FROM (TABLE_DATE_RANGE([", viewId, ".ga_sessions_],
            TIMESTAMP('", startDate, "'), TIMESTAMP('", endDate, "')))
      WHERE hits.type = 'PAGE'
      GROUP BY fullVisitorId, hits.page.pagePath)
GROUP BY fullVisitorId", sep = "")

userPage <- query_exec(sql, project = BQProject, max_pages = Inf)

View(userPage[1:10,])

This query will pull a list of visitor IDs and the pages each visitor viewed during the time frame you selected. You will need to authenticate and decide if you want to store your credentials between sessions.

Run the Model

I will use the lda package. There is another popular R package, topicmodels, that integrates nicely with the tm text mining package. However, I found the lda package to run much faster on large sets of usage data.

Parse Your Data

First, we need to parse the data into a format that the lda package can use.

# treat each user's set of pages as a "document" and each URL as a "word"
docs <- lexicalize(userPage$path)

Run LDA Algorithm

Next we will run the LDA algorithm. You can change the number of topics to fit your needs.

numTopics <- 5

# 25 Gibbs sampling sweeps; the two .1 values are the Dirichlet priors (alpha and eta)
lda <- lda.collapsed.gibbs.sampler(docs$documents, numTopics, docs$vocab, 25, .1, .1,
                                   compute.log.likelihood = TRUE)

View Results

View & Name Content Groups

Now we can view the content groupings and the top content items associated with each group. The algorithm will not name these content groups for us, but if there are clear patterns in the usage of your site, you should be able to get a feel for the groups based on the top content items listed. If there are no clear content groups, you may want to try re-running the algorithm with a different number of topics.

# the "words" here are your URLs: view the 20 most representative URLs for each content group
top.words <- top.topic.words(lda$topics, 20, by.score = TRUE)
View(top.words)
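
If you want a rough, quantitative guide for choosing the number of topics, one option (a sketch, not part of the original analysis) is to compare the model's final log likelihood across a few candidate topic counts. Higher (less negative) values loosely suggest a better fit, but the readability of the resulting groups matters more than the number itself.

# sketch: fit the model with a few candidate topic counts and compare log likelihoods
candidateTopics <- c(3, 5, 8, 12)
finalLogLik <- sapply(candidateTopics, function(k) {
  fit <- lda.collapsed.gibbs.sampler(docs$documents, k, docs$vocab, 25, .1, .1,
                                     compute.log.likelihood = TRUE)
  tail(fit$log.likelihoods[1, ], 1)  # full log likelihood at the final iteration
})
data.frame(topics = candidateTopics, logLikelihood = finalLogLik)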

Assign Each URL to a Content Group

The LDA algorithm will assign what "percentage" of each content item falls under each content group. What we would like to do is find the content group that is most correlated with each URL. The code below will write out a csv file with each URL and its highest scoring content group. Note that the content groups are distinguished by number and not name (V1, V2, etc). You will need to do a find / replace in this file to rename the content groups with the names you chose in the previous step.

# for each URL, find the percent of time that URL was assigned to each topic
urlProportions <- t(lda$topics) / colSums(lda$topics)

# turn this into a dataframe for easier handling
urlProportions <- as.data.frame(urlProportions)

# for each URL, find the topic that URL was assigned to most often
urlTopic <- data.frame(url = rownames(urlProportions), topic = colnames(urlProportions)[max.col(urlProportions, ties.method="first")])

# view the assigned topic by URL
View(urlTopic)

# write to a csv
write.csv(urlTopic, "urlTopic.csv", row.names = FALSE)
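
If you would rather rename the groups in R instead of doing a find / replace in the csv, a small sketch like the one below works. The names are placeholders; substitute the names you chose while reviewing the top content items.

# placeholder names for each content group; replace with the names you chose above
topicNames <- c(V1 = "AdWords", V2 = "Google Tag Manager", V3 = "SEO",
                V4 = "Reporting", V5 = "Other")
urlTopic$topic <- topicNames[as.character(urlTopic$topic)]
write.csv(urlTopic, "urlTopic.csv", row.names = FALSE)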

Optional – Assign a Topic to Each of Your Users

Using this adaptation of the LDA algorithm, we can assign a content group to each user based on which types of content she views most often. This data could then be imported back into BigQuery for creating segments of users for further analysis.

# for each user, find the percent of time that user spent on each topic
userProportions <- t(lda$document_sums) / colSums(lda$document_sums)

# turn this into a dataframe for easier handling
userProportions <- as.data.frame(userProportions)

# for each user, find the topic that user was assigned to most often
userTopic <- data.frame(user = userPage$fullVisitorId, topic = colnames(userProportions)[max.col(userProportions, ties.method="first")])

# view the assigned topic by user
View(userTopic)

# write to a csv
write.csv(userTopic, "userTopic.csv", row.names = FALSE)
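
As a sketch of the "import back into BigQuery" idea, bigrquery can upload a data frame to a table you control. The dataset and table names below are placeholders, so adjust them to match your project.

# sketch: upload the user-to-content-group assignments to BigQuery
# "my_dataset" and "user_topics" are placeholder names
insert_upload_job(project = BQProject, dataset = "my_dataset", table = "user_topics",
                  values = userTopic)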

I Don’t Have Analytics 360

If you are interested in this level of in-depth analysis, you may want to consider Analytics 360 and the BigQuery integration for future projects. In this case, however, you can work around the 360 requirement if you have the user's client ID consistently stored in a custom dimension. The code below uses the Google Analytics API to pull a list of content by client ID and then formats it in a way that the lda package can handle.

As an example, I have assumed that your client ID is stored in custom dimension index 10. Make a note of which index you are actually using and replace dimension10 accordingly. You can swap this code in for the Get Data out of BigQuery section and run the rest of the code as is.

library(RGA)
userPath <- get_ga(viewId, start.date = startDate, end.date = endDate,
      dimensions="ga:pagePath,ga:dimension10", metrics = "ga:pageviews")
userPage <- userPath %>% group_by(dimension10) %>%
  summarise(path = paste(pagePath, collapse = " ")) %>%
  mutate(fullVisitorId = dimension10)

What Now?

Although this analysis is done outside of Google Analytics, it uses data we are already collecting. Where before we might have grouped content by its URL pattern or by some dimension from a content management system, here we are using actual usage data from our website to create these classifications. By grouping content based on how it is consumed, we can learn more about how our content is actually performing compared to our intent or pre-selected categories. The algorithmic classification process is informative in its own right: examine how the content is grouped and which classifications defy your assumptions. What insights can you draw from the way your content was grouped?

If we wanted to emulate content grouping inside of Google Analytics and aggregate metrics for these groups, there are a few options. It might be easiest to continue this analysis outside of Google Analytics and roll up the pageview metrics for each group. You can also use the Data Import functionality to bring these groupings into Google Analytics as custom dimensions and then view them in a custom report, keeping in mind that there are a few differences between custom dimensions and Content Groups. With either approach, we will need to update the groupings to account for new content as it is added and for continually changing user behavior.
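
For the first option, a rough sketch of rolling up pageviews by content group outside of Google Analytics might look like the following. It reuses the RGA call from the previous section and assumes the pagePath values returned by the API line up with the url values in urlTopic (you may need to normalize query strings or trailing slashes first).

# sketch: pull pageviews per URL, attach each URL's content group, and aggregate
library(RGA)
pageViews <- get_ga(viewId, start.date = startDate, end.date = endDate,
                    dimensions = "ga:pagePath", metrics = "ga:pageviews")
groupMetrics <- pageViews %>%
  inner_join(urlTopic, by = c("pagePath" = "url")) %>%  # match each URL to its content group
  group_by(topic) %>%
  summarise(pageviews = sum(pageviews)) %>%
  arrange(desc(pageviews))
View(groupMetrics)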