How To Find Personally Identifiable Data In Your Google Analytics
Keeping personally identifiable information, or PII, out of your GA data is vital for many reasons. From our perspective working primarily with Google Analytics, it simply should not be collected because it is against Google Analytics’ Terms of Service to collect this kind of data. In fact, it’s against the Terms of Service to send PII to Google Analytics servers, period! This means you can’t just remove PII from your GA with view filters – you should never send this information to Google Analytics in the first place.
Sam wrote about how to keep PII off of Google’s servers here. (Note that other Google tools have different terms, and it might be OK to use or store PII in them. Our focus here is on Google Analytics.)
But how do you know if your GA is collecting PII? Here, we cover a few regular expressions that you can use within the GA interface to find common PII. After all, you can’t take steps to remove PII from your data if you don’t know it’s there to begin with!
All of the following filters can be applied in the All Pages report in Google Analytics:
Get the Data into Google Sheets
We will use the Google Analytics API Add-On to populate our spreadsheet (step-by-step instructions to install and use the add-on can be found here).
To make things easier, we’re providing a customizable Google Sheet that will surface your PII trouble areas. To get a copy of this sheet, click the button below, then choose the File menu option and select Make a Copy.
You’ll need to install the Google Analytics Sheets Add-On, using the Add-On menu. Then you can run the report to populate the sheet with your own data!
Email addresses are by far the most common form of PII we find in GA data because they are so frequently passed as query parameters. To find any email addresses in your data, put the following in the table filter:
Yep – that’s it! Nothing fancy here. If a page path contains an “@”, it’s likely an email address and should not be in your data! If you want to be extra careful, you can also try filtering the table for an encoded @, which appears as “%40”.
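If you have exported your page paths (for example, from the Google Sheet above), the same check can be run outside the GA interface. The following is a minimal Python sketch of that idea, not part of the original filter set; the function name and the sample paths are hypothetical.

```python
import re

# Matches a literal "@" or its percent-encoded form "%40".
EMAIL_HINT = re.compile(r"@|%40")

def paths_with_email_hint(page_paths):
    """Return the page paths that contain '@' or the encoded '%40'."""
    return [path for path in page_paths if EMAIL_HINT.search(path)]

# Hypothetical page paths, as they might appear in the All Pages report.
sample_paths = [
    "/thank-you?email=jane@example.com",
    "/signup?user=jane%40example.com",
    "/pricing",
]

flagged = paths_with_email_hint(sample_paths)
for path in flagged:
    print(path)
```

Like the table filter, this flags anything containing the character, so a flagged path is a lead to investigate rather than proof of an email address.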
Names are often passed into GA via query parameters as well. We’re most likely to identify PII involving names by searching for some of the most common first names:
This regex focuses on common American first names; feel free to edit it to include names that would be more common for your site’s traffic. And remember, this filter isn’t meant to pull out every instance of a user’s name. It only needs to include enough common names to determine whether or not you’re including this type of PII in your GA data.
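As an illustration of the approach, here is a small Python sketch that builds a case-insensitive alternation from a short name list. The names, the function name, and the sample paths are assumptions chosen for the example and are not the list used in the original filter. The lookarounds are a Python convenience to avoid matching inside longer words such as “johnson”; the GA filter box may not accept them, in which case a plain alternation like `james|john|mary` is the simpler equivalent.

```python
import re

# Hypothetical short list of common first names; extend it for your audience.
COMMON_FIRST_NAMES = [
    "james", "john", "robert", "michael",
    "mary", "patricia", "jennifer", "linda",
]

# Match a listed name only when it is not surrounded by other letters,
# so "john" matches in "first=John" but not in "johnson".
NAME_PATTERN = re.compile(
    r"(?<![a-z])(?:" + "|".join(COMMON_FIRST_NAMES) + r")(?![a-z])",
    re.IGNORECASE,
)

def paths_with_common_names(page_paths):
    """Return the page paths that contain one of the listed first names."""
    return [path for path in page_paths if NAME_PATTERN.search(path)]

# Hypothetical page paths for demonstration.
sample_paths = [
    "/welcome?first=John&last=Smith",
    "/products/johnson-wax",
    "/profile/mary_smith",
    "/about",
]

flagged_names = paths_with_common_names(sample_paths)
```

A handful of hits is enough to tell you that names are leaking into page paths; the list does not need to be exhaustive.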
First names aren’t generally considered to be PII on their own because you likely can’t identify a specific individual who visits your site with just their first name. But if a first name is being captured, a last name is probably somewhere in your data as well, and that combination is definitely considered PII.
There are other combinations of otherwise non-PII data that, together, could be used to identify a specific user, so keep this in mind as you decide whether or not to exclude data like this from your GA data.
Phone numbers are captured less often than email addresses and names, but they still appear on occasion. To look for phone numbers written with varying digit counts, delimiters, and formats, use the following regex:
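As a stand-in for checking exported page paths, here is a minimal Python sketch of a phone-number pattern. It is an assumption rather than the post’s own filter: it targets North American style numbers (an optional country code, then groups of 3, 3, and 4 digits with optional spaces, dots, or hyphens), and the function name and sample paths are hypothetical.

```python
import re

# Optional country code, optional parentheses around the area code,
# then 3-3-4 digit groups separated by an optional space, dot, or hyphen.
PHONE_PATTERN = re.compile(
    r"(?:\+?\d{1,3}[\s.\-]?)?"   # optional country code such as +1
    r"\(?\d{3}\)?[\s.\-]?"       # area code, with or without parentheses
    r"\d{3}[\s.\-]?"             # next three digits
    r"\d{4}"                     # last four digits
)

def paths_with_phone_numbers(page_paths):
    """Return the page paths that contain a phone-like digit sequence."""
    return [path for path in page_paths if PHONE_PATTERN.search(path)]

# Hypothetical page paths for demonstration.
sample_paths = [
    "/callback?phone=555-867-5309",
    "/callback?phone=(555)867.5309",
    "/contact?tel=+1-555-867-5309",
    "/blog/2019/05/post",
    "/order?id=12345",
]

flagged_phones = paths_with_phone_numbers(sample_paths)
```

Because any run of ten digits matches, long order IDs or timestamps will show up as false positives; review the flagged paths before concluding that phone numbers are being collected.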
It can be difficult to find physical street addresses in your data because you need to search for abbreviations and short, common words, which can pull in a lot of unintended, unrelated results. Start with this filter and alter it as needed to verify that you are not collecting this type of PII:
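For exported page paths, one way to cut down on unrelated results is to require a house number before the street word. The Python sketch below takes that approach; it is an assumption and a different technique from a plain list of abbreviations, and the suffix list, function name, and sample paths are hypothetical. It treats `+`, `%20`, and whitespace as word separators, since those are how spaces commonly appear in URLs.

```python
import re

# A separator between words in a URL: "+", "%20", or literal whitespace.
SEP = r"(?:\+|%20|\s)+"

# Hypothetical list of street suffixes and abbreviations; edit as needed.
SUFFIXES = (
    r"street|st|avenue|ave|road|rd|boulevard|blvd|"
    r"drive|dr|lane|ln|court|ct"
)

# House number, up to four intermediate words, then a street suffix.
ADDRESS_PATTERN = re.compile(
    r"\d{1,5}" + SEP
    + r"(?:[a-z0-9.]+" + SEP + r"){0,4}"
    + r"(?:" + SUFFIXES + r")\b",
    re.IGNORECASE,
)

def paths_with_street_addresses(page_paths):
    """Return the page paths that look like they contain a street address."""
    return [path for path in page_paths if ADDRESS_PATTERN.search(path)]

# Hypothetical page paths for demonstration.
sample_paths = [
    "/checkout?addr=123+Main+St&city=Springfield",
    "/store?address=42%20Elm%20Avenue",
    "/blog/10-best-streets",
    "/products?page=2",
]

flagged_addresses = paths_with_street_addresses(sample_paths)
```

Requiring the leading number trades recall for precision: addresses passed without a house number will be missed, so loosen the pattern if you suspect those exist in your data.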
Credit Card Numbers
Fortunately, I have never seen credit card information being passed into GA, although I have heard of it happening. You definitely do not want to send this information to GA under any circumstances, so the regex for this PII is broadly defined:
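To show what a broadly defined check might look like on exported page paths, here is a minimal Python sketch. It is an assumption, not the post’s own filter: it flags any run of 13 to 19 digits, optionally separated by single spaces, hyphens, or plus signs. The function name and sample paths are hypothetical.

```python
import re

# 13 to 19 digits in a row, allowing one space, "+", or "-" between digits.
# The lookarounds stop the match from starting or ending mid-number.
CARD_PATTERN = re.compile(r"(?<!\d)\d(?:[ +\-]?\d){12,18}(?!\d)")

def paths_with_card_numbers(page_paths):
    """Return the page paths that contain a card-length digit sequence."""
    return [path for path in page_paths if CARD_PATTERN.search(path)]

# Hypothetical page paths; 4111 1111 1111 1111 is a well-known test number.
sample_paths = [
    "/pay?cc=4111111111111111",
    "/pay?cc=4111-1111-1111-1111",
    "/order?id=12345678",
    "/callback?phone=555-867-5309",
]

flagged_cards = paths_with_card_numbers(sample_paths)
```

A pattern this broad will also catch long IDs and millisecond timestamps. That is an acceptable trade-off here, since a missed card number is far worse than a few extra rows to review.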
If you find any of the above PII in your GA data, prioritize taking the necessary steps to exclude it. It’s better for everyone if personally identifiable information is never collected!
Here is more information about Google’s GA guidelines. If there are any more PII regex table filters you would be interested in seeing in addition to the ones covered in this post, let us know in the comments below!