November 12, 2020

Reidentification Risk of Masked Datasets: Part 2

This article is a continuation of Reidentification Risk of Masked Datasets: Part 1, where we discussed how organizations progressed from simple to sophisticated methods of data security and how, even then, they faced the challenge of reidentification. In its conclusion, we shared what companies really need to focus on while anonymizing their data.

Now, we delve into the subject of reidentification and how to go about achieving your goal, which is ultimately to reduce or eliminate the risk of reidentification.

Before we dive further into this, it helps to understand a few basic concepts.

What is reidentification and reidentification risk?

Reidentification is the recovery, directly or indirectly, of the original data from an anonymized dataset. How likely that recovery is depends on the dataset and the method of anonymization, and that likelihood is called reidentification risk.

The NY cab example that we saw in Part 1 of this article is a classic case of reidentification, where the combination of indirectly related factors led to the reidentification of seemingly anonymized personal data.

Understanding the terms data classification and direct identifiers

A data classification, or identifier, is any data element that can be used to identify an individual, either by itself or in combination with another element. Examples include name, gender, DOB, employee ID, SSN, age, phone number and ZIP code. Certain data classifications, such as employee ID, SSN and phone number, are unique or direct identifiers (i.e., they can be used on their own to uniquely identify an individual). Name and age, on the other hand, are not unique identifiers, since the same values recur across many people in a large dataset.

Understanding indirect identifiers and reidentification risk through a simple example

Let’s say you take a dataset of 100 employees and you’re tasked with finding a specific employee in her 40s. Assume that all direct identifiers have been anonymized. Now you look at indirect identifiers, such as race/ethnicity, city or, say, her bus route, and sure enough, you’ve identified her. Indirect identifiers depend on the dataset and, therefore, differ from one dataset to the next. Even when the unique identifiers have been anonymized, you can’t say for sure that an individual can never be identified, because every dataset carries indirect identifiers. That is what creates the risk of reidentification.

What are quasi-identifiers?

A quasi-identifier is a combination of data classifications that, when considered together, can uniquely identify a person or an entity. As previously mentioned, studies have found that the five-digit ZIP code, birth date and gender form a quasi-identifier that can uniquely identify 87% of the American population.
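To make the idea concrete, here is a minimal Python sketch that measures how many records become unique once several fields are considered together. The function name `unique_fraction`, the field names and the toy records are assumptions made up for illustration; they are not taken from any real dataset or product.

```python
from collections import Counter

def unique_fraction(records, quasi_identifier):
    """Share of records whose combined field values appear exactly once."""
    if not records:
        return 0.0
    keys = [tuple(r[field] for field in quasi_identifier) for r in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(records)

# Hypothetical toy data, invented for this example.
records = [
    {"zip": "10001", "dob": "1976-01-21", "gender": "F"},
    {"zip": "10001", "dob": "1976-01-21", "gender": "M"},
    {"zip": "10001", "dob": "1980-05-02", "gender": "F"},
    {"zip": "10002", "dob": "1980-05-02", "gender": "F"},
    {"zip": "10002", "dob": "1980-05-02", "gender": "F"},
]

print(unique_fraction(records, ["gender"]))                # 0.2
print(unique_fraction(records, ["zip", "dob", "gender"]))  # 0.6
```

In this toy data, gender alone singles out one record in five, while the three fields together single out three in five. That is the effect a quasi-identifier has on a much larger scale.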

Now that we are on the same page with essential terms, let’s get to the question: How do you go about choosing the right solution that minimizes or eliminates the risk of reidentification while still preserving the functionality of the data?

The answer lies in taking a risk-versus-value approach.

First, determine the reidentification risk carried by each identifier in your dataset. Identify its uniqueness and whether it is a direct or indirect identifier, then look at the possible combinations of quasi-identifiers. How high a risk does a particular identifier pose, either on its own or combined with another piece of data? Depending on how big and diverse a dataset is, there are likely to be quite a few identifiers that can, singly or in combination, reidentify someone. If you assume that each identifier is unique and anonymize each one based on the risk it poses, regardless of what was done to any other data element, you end up with anonymized datasets that are still data-rich but no longer allow the original data to be recovered.
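One possible way to carry out this risk assessment is to look at every field and every small combination of fields and record the size of the smallest group of records sharing the same values. A smallest group of 1 means at least one person can be singled out. The sketch below assumes this "smallest group size" measure as a stand-in for risk; the helper names and the sample records are hypothetical.

```python
from collections import Counter
from itertools import combinations

def smallest_group(records, fields):
    """Size of the smallest set of records sharing the same values for `fields`."""
    keys = [tuple(r[f] for f in fields) for r in records]
    return min(Counter(keys).values())

def rank_combinations(records, fields, max_size=2):
    """List field combinations, riskiest (smallest group) first."""
    results = []
    for size in range(1, max_size + 1):
        for combo in combinations(fields, size):
            results.append((combo, smallest_group(records, combo)))
    return sorted(results, key=lambda item: (item[1], len(item[0])))

# Hypothetical toy data, invented for this example.
records = [
    {"employee_id": "E1", "age_band": "40s", "city": "Austin", "gender": "F"},
    {"employee_id": "E2", "age_band": "40s", "city": "Austin", "gender": "M"},
    {"employee_id": "E3", "age_band": "30s", "city": "Dallas", "gender": "F"},
    {"employee_id": "E4", "age_band": "30s", "city": "Dallas", "gender": "M"},
]

fields = ["employee_id", "age_band", "city", "gender"]
ranked = rank_combinations(records, fields)
for combo, group_size in ranked:
    print(combo, group_size)
```

Here the direct identifier (`employee_id`) isolates a single record on its own, while `age_band` and `gender` are each shared by two people yet isolate a single record when combined. Fields and combinations at the top of the ranking are the ones that call for the strongest anonymization.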

In other words, this approach maintains demographical logic but not absolute demographics. Now that we’ve preserved data functionality, let’s address eliminating reidentification risk. Remember the quasi-identifier (ZIP Code, DOB and gender) that can uniquely identify 87% of the U.S. population? We can address this issue the same way, by maintaining demographical logic in the anonymized data. For example, Elsie, who was born on 1/21/76, can be anonymized to Annie born on 1/23/76. 

Notice that:  

  1. The gender remains the same (while the name is changed with the same number of characters).
  2. The ZIP code remains the same (while the street is changed).
  3. The date of birth changed by two days (while the month and year remain the same).
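The three observations above can be sketched in Python as follows. This is only an illustration under stated assumptions, not a description of any particular masking product: the replacement pools, the three-day shift window and the function names are all hypothetical.

```python
import calendar
import random
from datetime import date

# Hypothetical replacement pools, invented for this example.
NAME_POOL = {"F": ["Annie", "Clara", "Joan"], "M": ["Peter", "Owen"]}
STREET_POOL = ["12 Oak St", "48 Elm Ave", "7 Pine Rd"]

def shift_day_within_month(dob, rng, max_shift=3):
    """Move the day by up to `max_shift` days, keeping month and year."""
    last_day = calendar.monthrange(dob.year, dob.month)[1]
    low = max(1, dob.day - max_shift)
    high = min(last_day, dob.day + max_shift)
    candidates = [d for d in range(low, high + 1) if d != dob.day]
    return dob.replace(day=rng.choice(candidates))

def pick_replacement(original, pool, rng, same_length=False):
    """Pick a different value from `pool`, optionally with the same length."""
    candidates = [
        p for p in pool
        if p != original and (not same_length or len(p) == len(original))
    ]
    # Fall back to a placeholder of equal length if the pool has no match.
    return rng.choice(candidates) if candidates else "X" * len(original)

def mask_record(record, rng):
    masked = dict(record)  # gender and ZIP code are copied unchanged
    masked["name"] = pick_replacement(
        record["name"], NAME_POOL[record["gender"]], rng, same_length=True
    )
    masked["street"] = pick_replacement(record["street"], STREET_POOL, rng)
    masked["dob"] = shift_day_within_month(record["dob"], rng)
    return masked

original = {
    "name": "Elsie", "gender": "F", "zip": "10001",
    "street": "31 Main St", "dob": date(1976, 1, 21),
}
masked = mask_record(original, random.Random(0))
print(masked)
```

The masked record keeps the gender, ZIP code, birth month and birth year, so aggregate analytics behave the same, while the name, street and exact birth date no longer match the original person.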

The masked dataset maintains the same demographic profile, which is ideal for analytics, without giving away any PII. At the same time, it anticipates and eliminates the risk of reidentification.

A practical solution lies in keeping in mind the ways in which data can be reidentified when applying anonymization methods. The right approach maintains the dataset’s richness and value for its originally intended purpose while minimizing or eliminating the reidentification risk of your masked datasets. That makes for a comprehensive and effective data security solution for your sensitive data.

By taking a risk-based approach to data masking, you can be assured that your data has been truly anonymized.