
November 12, 2020

Reidentification Risk of Masked Datasets: Part 1

The following article is also hosted on Forbes Tech Council. 

Data masking, anonymization and pseudonymization aren’t new concepts. In fact, as the founder of a data security company, I’ve had 15 years of experience in helping my clients drive business decisions, enable transactions, allow for secure data sharing in a B2B or consumer scenario, or meet audit and financial requirements.

One constant in my experience has been the issue of data reidentification. In many instances, companies take an easy, one-size-fits-all route to data security, such as using Pretty Good Privacy (PGP) and assuming that 95% of the world won’t be able to reverse it, only to find that this assumption is incorrect.

Take, for example, cases in which customers go into an ERP application and change all Social Security numbers to the most popular combination: 999 99 9999. This is highly secure, of course, but because every record now carries the same value, it renders the data nonfunctional.

On the other hand, by staying too close to the original data, you may ensure that the data performs well, but at the expense of security. For instance, you can add 1 to the Social Security number 777 29 1234, and it becomes 777 29 1235. Or you can add 1 to the middle digits, and it becomes 777 30 1234. In both cases, the data’s functionality is preserved at the cost of its security: The algorithm can be cracked, and the masking is easily reversed.
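To make the two extremes concrete, here is a minimal sketch in Python. It is a toy illustration, not how any real masking product works: the function names (`mask_constant`, `mask_offset`, `recover_offset`, `unmask`) are hypothetical, and the sample Social Security numbers other than the one used above are made up. It assumes the attacker knows a single original value and its masked counterpart, for example their own record.

```python
# Illustrative sketch only: both masking schemes below are toy examples,
# and every function name here is hypothetical.

MODULUS = 10**9  # SSNs are treated as 9-digit numbers


def _format(number: int) -> str:
    """Render a 9-digit number in the 'AAA GG SSSS' layout used in the text."""
    s = f"{number % MODULUS:09d}"
    return f"{s[:3]} {s[3:5]} {s[5:]}"


def mask_constant(ssn: str) -> str:
    """Replace every SSN with the same placeholder: irreversible, but useless."""
    return "999 99 9999"


def mask_offset(ssn: str, offset: int = 1) -> str:
    """Add a fixed offset: keeps the data usable, but is trivially reversible."""
    return _format(int(ssn.replace(" ", "")) + offset)


def recover_offset(known_original: str, known_masked: str) -> int:
    """One known (original, masked) pair is enough to learn the offset."""
    diff = int(known_masked.replace(" ", "")) - int(known_original.replace(" ", ""))
    return diff % MODULUS


def unmask(masked: str, offset: int) -> str:
    """Undo the offset mask once the offset is known."""
    return _format(int(masked.replace(" ", "")) - offset)


ssns = ["777 29 1234", "123 45 6789", "555 12 0001"]  # invented sample values

# Constant masking collapses every record onto one value.
print(len({mask_constant(s) for s in ssns}))        # 1 distinct value left

# Offset masking keeps values distinct and well-formed...
masked = [mask_offset(s, 10_000) for s in ssns]     # adds 1 to the middle group
print(masked[0])                                    # 777 30 1234

# ...but a single known pair lets an attacker reverse the whole column.
offset = recover_offset(ssns[0], masked[0])
print([unmask(m, offset) for m in masked] == ssns)  # True
```

The constant mask cannot be reversed, but it leaves only one distinct value, so nothing that relies on Social Security numbers being different from one another can be tested. The offset mask keeps the values distinct, but one leaked pair undoes it for every record.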

Some companies have even kicked the tires on synthetic data (pseudo values generated by algorithms), but that also negatively impacts testing since non-production environments need large volumes of data that carry the complexity of real data in order to test effectively and build the right enhancements. As the name suggests, synthetic data is in no way connected to reality. 

Theoretically speaking, it may seem like a great way to go about securing your information, but no machine algorithm can ever mimic the potency of actual data an organization accrues during its lifespan. Most projects that rely on synthetic data fail for the simple reason that it doesn’t allow for complexity in the interaction between data because, unlike actual data, it’s too uniform. 

There’s always a trade-off: Either your data is highly secure but low in functionality, or vice versa. In search of alternatives, companies move to more sophisticated methods of data protection, such as anonymization and masking, to secure data without sacrificing functionality and performance.

There is a catch, however, and it’s the reason proper and complete anonymization and masking are so critical: Cross-referencing the data with other publicly available data can reidentify an individual from their metadata. As a result, private information such as PFI, PHI and contact information could end up in the public domain. In the wrong hands, this could be catastrophic.

There have been many incidents in which this has already happened. In 2014, the New York City Taxi and Limousine Commission released anonymized datasets of about 173 million taxi trips. The anonymization, however, was so inadequate that a grad student was able to cross-reference this data with Google Images results for “celebrities in taxis in Manhattan” to find out celebrities’ pickup and drop-off locations and even how much they paid their drivers.

Another example is the Netflix Prize dataset contest, in which poorly anonymized customer records were reidentified by comparing them against the Internet Movie Database. What is most alarming in situations like these is the information set that can be created by aggregating information from other sources. By combining indirectly related factors, one can reidentify seemingly anonymized personal data.
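The cross-referencing described above is often called a linkage attack. The following is a minimal sketch of the idea in Python, under stated assumptions: all records are invented, the choice of ZIP code, birth year and gender as the shared fields is my own, and the `link` function is hypothetical. It does not reproduce what was actually done with the taxi or Netflix data.

```python
from collections import defaultdict

# Fields that appear in both datasets (an assumption made for this sketch).
QUASI_IDS = ("zip", "birth_year", "gender")

# "Anonymized" release: names removed, sensitive attribute kept. Invented data.
released = [
    {"zip": "10001", "birth_year": 1980, "gender": "F", "diagnosis": "asthma"},
    {"zip": "10001", "birth_year": 1975, "gender": "M", "diagnosis": "diabetes"},
    {"zip": "10002", "birth_year": 1990, "gender": "F", "diagnosis": "flu"},
]

# Publicly available data with names attached. Invented data.
public = [
    {"name": "Alice Example", "zip": "10001", "birth_year": 1980, "gender": "F"},
    {"name": "Bob Example", "zip": "10001", "birth_year": 1975, "gender": "M"},
    {"name": "Carol Example", "zip": "10002", "birth_year": 1990, "gender": "F"},
    {"name": "Dana Example", "zip": "10002", "birth_year": 1990, "gender": "F"},
]


def link(released_rows, public_rows, keys=QUASI_IDS):
    """Return {released row index: name} for rows with exactly one public match."""
    index = defaultdict(list)
    for person in public_rows:
        index[tuple(person[k] for k in keys)].append(person["name"])

    matches = {}
    for i, row in enumerate(released_rows):
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # an unambiguous match reidentifies the row
            matches[i] = candidates[0]
    return matches


for i, name in link(released, public).items():
    print(name, "->", released[i]["diagnosis"])
# Alice Example -> asthma
# Bob Example -> diabetes
```

The third released row is not reidentified, because two people in the public data share its ZIP code, birth year and gender. That ambiguity is exactly what good anonymization tries to guarantee for every record.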

Research conducted by the Imperial College London found that “once bought, the data can often be reverse-engineered using machine learning to reidentify individuals, despite the anonymization techniques. This could expose sensitive information about personally identified individuals and allow buyers to build increasingly comprehensive personal profiles of individuals. The research demonstrates for the first time how easily and accurately this can be done — even with incomplete datasets. In the research, 99.98% of Americans were correctly reidentified in any available ‘anonymized’ dataset by using just 15 characteristics, including age, gender and marital status.” 
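To see why a handful of characteristics can be enough, it helps to count how many records become unique as attributes are added. The sketch below is not the method used in the research quoted above. It is a simple count on four invented records, and the `unique_fraction` helper is hypothetical.

```python
from collections import Counter


def unique_fraction(records, keys):
    """Share of records whose combination of `keys` appears exactly once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)


# Invented records; no names, only "harmless" characteristics.
people = [
    {"age": 34, "gender": "F", "marital": "married"},
    {"age": 34, "gender": "F", "marital": "single"},
    {"age": 34, "gender": "M", "marital": "married"},
    {"age": 51, "gender": "M", "marital": "married"},
]

print(unique_fraction(people, ("age",)))                      # 0.25
print(unique_fraction(people, ("age", "gender")))             # 0.5
print(unique_fraction(people, ("age", "gender", "marital")))  # 1.0
```

Each additional attribute splits the groups further, so the share of people who stand alone in their group only goes up. On a real population the numbers differ, but the direction is the same.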

When you try to ensure both data security and data functionality, you find yourself in a bind, choosing which to trade for the other. But does it have to be this way?

To keep your business going, of course, you’re going to have to make compromises, but in a way that doesn’t undermine the security, performance or functionality of the data, your compliance, or the privacy of the individuals you’re trying to protect.

We are now aware that data masking, anonymization and pseudonymization might not be as easy as we thought. As concepts, they may seem deceptively simple, but approaching them in a simplistic manner will limit your ability to scale your solution to fit the many ways your business will need to apply anonymization. This inability to scale masking methodologies ends in project failure.

The focus really needs to be on reducing the risk of reidentification while preserving the functionality of the data (data richness, demographics, etc.). In part two, I will talk about how you can achieve this.