Subset Your Google Analytics Data With R

August 17, 2016 | Kaelin Tessier


R is a relatively easy language to use when performing statistical and graphical analyses on data. However, after choosing your dataset, you may still need to subset it. Luckily, R’s built-in functionality can be especially helpful in getting the sections of data you need.

Since Becky gave us a great introduction to querying Google Analytics data in R, I’d like to go a step further and show you how to subset your data once you have it loaded in R. Below you’ll find some quick tips and tricks:

Step 1: Download Some Packages (Optional)

There are a lot of ways to get your data out of Google Analytics; however, one of my favorite ways is to use the rga package.

If you prefer a different method, feel free to skip Steps 1 and 2. If you need a refresher, follow the steps below:

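# install devtools package for downloading packages from github
install.packages("devtools")
library(devtools)

# install curl for easier use
install.packages("curl")

# install rga package from github (repository path assumed: skardhamar/rga)
install_github("skardhamar/rga")
library(rga)

# open a Google Analytics session stored in the "ga" object (check the rga docs if this call fails)
rga.open(instance = "ga")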

# REMEMBER: R is case-sensitive.

Step 2: Pull Data Out of Google Analytics (Optional)

Once you’ve downloaded these packages, grab the ID of the view that you would like to query. This can be found under Admin > View > View Settings > Basic Settings > View ID. We’re going to store this in a variable called “viewID”. Then we’ll pull out an example data set.

viewID <- "XXXXXXX" # Put your own view ID here

# Query GA to get your example data set
# start.date and end.date bound the reporting period (here, June 2016)
gaData <- ga$getData(viewID, start.date = as.Date("2016-06-01"), end.date = as.Date("2016-06-30"),
                     metrics = "ga:sessions,ga:bounceRate,ga:sessionDuration",
                     dimensions = "ga:date,ga:browser,ga:browserVersion,ga:operatingSystem")

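Keep in mind that you can use any metrics or dimensions available in the Google Analytics interface (explore your options here). For the purposes of this example, gaData should be a data frame with one column per dimension and metric: date, browser, browserVersion, operatingSystem, sessions, bounceRate, and sessionDuration.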
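If you don't have a view to query right now, a hand-built stand-in lets you follow along with Step 3. The sketch below is only an assumption about the shape of the result: the column names mirror the dimensions and metrics in the query, and every value is made up.

```r
# Hypothetical stand-in for gaData; every value below is invented for illustration.
gaData <- data.frame(
  date            = as.Date("2016-06-01") + c(0, 0, 0, 1, 1, 1),
  browser         = c("Chrome", "Safari", "Firefox", "Chrome", "Chrome", "Internet Explorer"),
  browserVersion  = c("51.0.2704.81", "9.0", "46.0", "51.0.2704.84", "51.0.2704.84", "11.0"),
  operatingSystem = c("Android", "iOS", "Windows", "Windows", "Android", "Windows"),
  sessions        = c(12, 30, 8, 45, 10, 5),
  bounceRate      = c(50.0, 40.0, 62.5, 35.6, 60.0, 80.0),
  sessionDuration = c(600, 2100, 320, 4100, 450, 90),
  stringsAsFactors = FALSE
)

str(gaData)   # column names and types
head(gaData)  # first few rows
```

The column types in a real pull may differ (for example, the date column may not arrive as a Date), so it's worth running str() on your own gaData before subsetting.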


Step 3: Subset Your Data

Let's start by understanding some of the basic syntax of subsetting data in R by reviewing how to select the entire gaData data frame.

# Select gaData
# General pattern (pseudo code): dataframe[row indices, column indices]
workingData <- gaData[,] # Leaving both slots empty grabs all rows and columns, stores them in the "workingData" variable

When you view workingData, it should now be identical to gaData.
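To get a feel for the two slots inside the brackets, here is a minimal sketch on a tiny made-up data frame (the values are invented, and this gaData is just a stand-in for the real one). It fills in the row slot, the column slot, or both.

```r
# Tiny made-up data frame standing in for gaData
gaData <- data.frame(
  browser  = c("Chrome", "Safari", "Firefox"),
  sessions = c(45, 30, 8),
  stringsAsFactors = FALSE
)

workingData <- gaData[, ]           # both slots empty: keep every row and column
identical(workingData, gaData)      # checks whether the copy matches the original

oneCell <- gaData[2, "sessions"]    # row 2, column "sessions": a single value
oneRow  <- gaData[2, ]              # row 2, all columns: a one-row data frame
oneCol  <- gaData[, "sessions"]     # all rows, one column: a plain vector
```

Whatever goes before the comma picks rows, and whatever goes after it picks columns. Every example in the rest of this post is a variation on that pattern.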

Subsetting Your Data with Inclusion Methods (Keeping Data)

Sometimes it's more efficient to name the parts of the data that you want to keep for analysis than to list everything you want to throw away.

In this example, we will show you how to include only certain rows. Here, we will include only rows 1-5 of gaData in workingData. After each of these code blocks, workingData should contain just the first five rows of gaData, with every column intact.

# Option 1
workingData <- gaData[c(1:5),]

# Option 2
vars <- c(1:5)
workingData <- gaData[vars,]

# Option 3 (selects by row name; works because the default row names are "1", "2", "3", ...)
vars <- c("1", "2", "3", "4", "5")
workingData <- gaData[vars,]

# REMEMBER: You can look at your workingData data frame by typing its name, as below.
# Just keep in mind that every time you assign a new value to the variable, the old value is lost.
workingData
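Row positions are not the only way to keep rows. As a rough sketch on made-up data (the values and the threshold of 10 sessions are assumptions for illustration), you can also use head() for the first few rows, or put a logical condition in the row slot.

```r
# Made-up stand-in for gaData
gaData <- data.frame(
  browser  = c("Chrome", "Safari", "Firefox", "Chrome", "Chrome", "Internet Explorer"),
  sessions = c(12, 30, 8, 45, 10, 5),
  stringsAsFactors = FALSE
)

firstFive <- head(gaData, 5)                  # first five rows, like gaData[1:5, ]
busyRows  <- gaData[gaData$sessions >= 10, ]  # keep only rows that meet a condition

rownames(busyRows)  # the original row names travel with the kept rows
```

One thing to watch: after a conditional subset, the row names are no longer a clean 1, 2, 3 sequence, so selecting rows by name (as in Option 3) may not give you what you expect on an already-filtered data frame.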

Now, we can include only certain columns. This time, we will only include the "browser" and "operatingSystem" columns. Afterwards, workingData should contain just those two columns, with every row intact.

# Option 1
workingData <- gaData[,c("browser", "operatingSystem")]

# Option 2
vars <- c("browser", "operatingSystem")
workingData <- gaData[,vars]
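Two related tricks are worth knowing when you keep columns. The sketch below uses made-up data as a stand-in for gaData. It shows base R's subset() with its select argument, and the drop = FALSE argument, which stops R from turning a single column into a plain vector.

```r
# Made-up stand-in for gaData
gaData <- data.frame(
  browser         = c("Chrome", "Safari", "Firefox"),
  operatingSystem = c("Android", "iOS", "Windows"),
  sessions        = c(12, 30, 8),
  stringsAsFactors = FALSE
)

# subset() lets you name columns without quotes
twoCols <- subset(gaData, select = c(browser, operatingSystem))

# Selecting one column returns a vector unless you ask R not to drop dimensions
oneColVector <- gaData[, "browser"]                # character vector
oneColFrame  <- gaData[, "browser", drop = FALSE]  # one-column data frame
```

If later code expects a data frame (for example, because it calls names() or nrow()), drop = FALSE is the safer choice.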

Subsetting Your Data with Exclusion Methods (Omitting Data)

Other times, it's more efficient to exclude portions of your data set than to include them.

Here, we will exclude every row where "browser" is "Chrome" and every row where "operatingSystem" is "Android". Your workingData should then contain only the rows that are neither Chrome nor Android.

# Option 1
workingData <- gaData[gaData$browser!="Chrome" & gaData$operatingSystem!="Android",]

# Option 2
vars <- gaData$browser=="Chrome" | gaData$operatingSystem=="Android" # TRUE for each row to drop
workingData <- gaData[!vars,] # negate a logical vector with "!", not "-"
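It's easy to mix up "and" and "or" when excluding rows. Here is a small sketch on made-up data (the values are invented for illustration) that contrasts dropping rows that are Chrome or Android with dropping only rows that are Chrome and Android at the same time.

```r
# Made-up stand-in for gaData
gaData <- data.frame(
  browser         = c("Chrome", "Chrome", "Safari", "Firefox"),
  operatingSystem = c("Android", "Windows", "iOS", "Android"),
  sessions        = c(12, 45, 30, 8),
  stringsAsFactors = FALSE
)

isChrome  <- gaData$browser == "Chrome"
isAndroid <- gaData$operatingSystem == "Android"

# Drop a row if it is Chrome OR Android
# (same result as browser != "Chrome" & operatingSystem != "Android")
dropEither <- gaData[!(isChrome | isAndroid), ]

# Drop a row only if it is Chrome AND Android together
dropBoth <- gaData[!(isChrome & isAndroid), ]

nrow(dropEither)  # rows left after the broader exclusion
nrow(dropBoth)    # rows left after the narrower exclusion
```

Naming the conditions first, then negating them, makes it easier to see which rows you are really throwing away.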

In this next example, we will exclude the "browser" and "operatingSystem" columns. Therefore, workingData should keep every column except those two: date, browserVersion, sessions, bounceRate, and sessionDuration.

# Option 1 (by position: "browser" is column 2 and "operatingSystem" is column 4)
workingData <- gaData[,c(-2,-4)]

# Option 2
vars <- names(gaData) %in% c("browser", "operatingSystem")
workingData <- gaData[,!vars]
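Negative positions break as soon as the column order changes, so excluding by name is often safer. As a minimal sketch on made-up data (a stand-in for gaData with fewer columns), setdiff() builds the list of columns to keep from the names you want to drop.

```r
# Made-up stand-in for gaData
gaData <- data.frame(
  date            = as.Date("2016-06-01") + 0:2,
  browser         = c("Chrome", "Safari", "Firefox"),
  browserVersion  = c("51.0.2704.81", "9.0", "46.0"),
  operatingSystem = c("Android", "iOS", "Windows"),
  sessions        = c(12, 30, 8),
  stringsAsFactors = FALSE
)

# Names of every column except the ones to drop
keep <- setdiff(names(gaData), c("browser", "operatingSystem"))

byName     <- gaData[, keep]       # exclude by name
byPosition <- gaData[, c(-2, -4)]  # exclude by position, for comparison

names(byName)
```

Note that a minus sign only works with numeric positions. Writing something like gaData[, -"browser"] is an error, which is why the name-based options go through %in% or setdiff() instead.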

By no means is this a comprehensive list of methods for including or excluding rows and columns in your Google Analytics dataset. However, it should help you get your feet wet and open the door to further exploration.

If you have a favorite method of subsetting your data in R, please feel free to share with others by commenting below.