April 29, 2022

6 Common Data Anonymization Mistakes Businesses Make Every Day

Data is a crucial resource for businesses today, but using data legally and ethically often requires data anonymization. Laws like the GDPR in Europe require companies to keep personal data private, limiting what companies can do with it. Data anonymization allows companies to perform critical operations—like forecasting—with data that preserves the original data's characteristics but lacks the personally identifying data points that could harm the people it describes if leaked or misused.

Despite the importance of data anonymization, there are many mistakes that companies regularly make when performing this process. These errors are not only dangerous to users, but could also subject the companies to regulatory action in a growing number of countries. Here are six of the most common data anonymization mistakes that you should avoid.

1.      Only changing obvious personal identifiers

One of the trickiest parts of anonymizing a dataset is determining what is or isn't Personally Identifiable Information (PII), the kind of information you want to ensure is kept safe. Individual fields like the date of purchase or the amount paid may not be personal information, but a credit card number or a name would be. Of course, you could go through the dataset by hand and ensure that all relevant data types are anonymized, but there's still a chance that something slips through the cracks.

For example, if data sits in an unstructured column, it may not appear in search results when you're looking for PII. Or a benign-looking column may also exist in another table, allowing bad actors to reconstruct the original user identities if they gain access to both tables. Small mistakes like these can doom an anonymization project to failure before it even begins.
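To make that linkage risk concrete, here is a minimal sketch in Python, with made-up tables and a hypothetical loyalty_id column standing in for the benign-looking shared field:

```python
# Minimal sketch of a linkage attack. The table and column names
# (purchases, accounts, loyalty_id) are hypothetical.
import pandas as pd

# The "anonymized" export: names stripped, but a shared key remains.
purchases = pd.DataFrame({
    "loyalty_id": [101, 102, 103],
    "amount": [19.99, 250.00, 5.49],
})

# A separate, non-anonymized table obtained from elsewhere.
accounts = pd.DataFrame({
    "loyalty_id": [101, 102, 103],
    "name": ["A. Shaw", "B. Singh", "C. Ortiz"],
})

# One join restores every identity the export tried to hide.
print(purchases.merge(accounts, on="loyalty_id"))
```

Neither table looks sensitive on its own; the risk only appears when you consider them together.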

2.      Mistaking synthetic data for anonymized data

Anonymizing or “masking” data takes PII in datasets and alters it so that it can’t be traced back to the original user. Another approach to data security is to instead create “synthetic” datasets. Synthetic datasets attempt to recreate the relationships between data points in the original dataset while creating an entirely new set of data points.

Synthetic data may or may not live up to its claims of preserving the original relationships. If it doesn’t, it may not be useful for your intended purposes. But even when the relationships hold, treating synthetic data as if it were anonymized data, or vice versa, can lead to mistakes in how the data is interpreted, stored, and distributed.
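To see how a naive generator can fail that test, here is a small sketch with made-up age and spend columns: sampling each column independently preserves each column's distribution but erases the relationship between them.

```python
# Minimal sketch of naive synthesis losing a relationship.
# The data (ages, spend) is entirely made up for illustration.
import random

random.seed(0)
ages = [random.randint(20, 70) for _ in range(1000)]
spend = [a * 3 + random.gauss(0, 10) for a in ages]  # spend rises with age

def corr(xs, ys):
    """Pearson correlation, computed by hand to stay dependency-free."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
    return cov / (sx * sy)

# Naive synthesis: draw each column on its own, ignoring the other.
synthetic_ages = random.sample(ages, len(ages))
synthetic_spend = random.sample(spend, len(spend))

print(f"original corr:  {corr(ages, spend):.2f}")                      # ~0.98
print(f"synthetic corr: {corr(synthetic_ages, synthetic_spend):.2f}")  # ~0.00
```

Each synthetic column passes a distribution check in isolation, which is exactly why this failure is easy to miss.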

3.      Confusing anonymization with pseudonymization

According to the EU’s GDPR, data is anonymized when it can no longer be reverse engineered to reveal the original PII. Pseudonymization, in comparison, replaces PII with different information of the same type. Pseudonymization doesn’t guarantee that the dataset cannot be reverse engineered if another dataset is brought in to fill in the blanks.

Consequently, anonymized data is generally exempt from the GDPR. Pseudonymized data is still subject to regulation, albeit with reduced obligations relative to ordinary personal data. Companies that don’t correctly categorize their data into one bucket or the other could face heavy regulatory action for violating the GDPR or other data laws worldwide.
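As a rough illustration of the distinction, consider this sketch with made-up record fields, using generalization as one possible anonymization technique:

```python
# Minimal sketch of pseudonymization vs. anonymization.
# The record and its fields are made up for illustration.
import secrets

record = {"name": "Dana Reyes", "age": 34, "city": "Lyon"}

# Pseudonymization: the name becomes a random token, but a lookup
# table exists somewhere. Whoever obtains that table (or a linkable
# dataset) can reverse the process, which is why the GDPR still
# treats the result as personal data.
token = secrets.token_hex(8)
lookup = {token: record["name"]}
pseudonymized = {**record, "name": token}

# Anonymization via generalization: drop the direct identifier and
# coarsen quasi-identifiers into ranges. There is no table to reverse.
anonymized = {"age_band": "30-39", "city": record["city"]}

print(pseudonymized)
print(anonymized)
```

The practical test is whether a mapping back to the original exists anywhere; if it does, the data is pseudonymized at best.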

4.      Only anonymizing one dataset

One of the common threats we’ve covered so far is that personal information can be reconstructed by introducing a non-anonymized database into the mix. There’s a seemingly easy solution to that problem: instead of anonymizing only one dataset, why not anonymize all of the datasets that share data? That way, it would be impossible to reconstruct the original data.

Of course, that’s not always going to be possible in a production environment; you may still need the original data for a variety of reasons. But if you ever anonymize data and send it beyond the bounds of your organization, you have to account for the web of interconnections between your databases, and that may mean that, to be safe, you need to anonymize even data you don’t release.

5.      Anonymizing data—but also destroying it

Data becomes far less valuable if the connections between its points are corrupted or weakened, and a poorly executed anonymization process can leave data with no value whatsoever. Of course, it’s not always obvious that this has happened: a casual examination may reveal nothing wrong, leading companies to draw false conclusions from their data analysis.

A good anonymization process, then, should protect user data while preserving the relationships that make the data useful, so you can be confident that the final results will be what you need.
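As one hypothetical example of this failure mode, consider a masking routine that hashes a join key with a different salt in each table: every table still looks properly masked on its own, but the cross-table relationships silently vanish.

```python
# Minimal sketch (hypothetical tables and IDs) of a masking bug that
# silently destroys value: hashing the join key with a different salt
# per table makes every cross-table join come back empty, no errors raised.
import hashlib

def mask(value: str, salt: str) -> str:
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

orders = {mask(cid, "salt-A"): total
          for cid, total in [("cust-1", 19.99), ("cust-2", 250.00)]}
refunds = {mask(cid, "salt-B"): amount
           for cid, amount in [("cust-1", 5.00)]}

# A casual check shows nothing wrong: both tables look masked and full.
# But the relationship between them is gone.
matched = [key for key in refunds if key in orders]
print(matched)  # [] -- every refund now appears to belong to no order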

6.      Applying the same anonymization technique to all problems

Sometimes when we have a problem, our natural reaction is to use a solution that worked in the past for a similar problem. However, as you can see from all the examples we’ve explored, the right solution for securing data varies greatly based on what you’re securing, why you’re securing it, and your ultimate plans for that data.

Using the same technique repeatedly can leave you more vulnerable to reverse engineering. Worse, it means that you’re not maximizing the value of each dataset and are possibly over- or under-securing much of your data.

Wrapping Up

Understanding your data is the key to unlocking its potential and keeping PII safe. Many of the issues we outlined in this article do not stem from a lack of technical prowess. Instead, the challenge of dealing with millions or even billions of discrete data points can easily turn a quick project into one that drags out for weeks or months. Or worse, projects can end up “half-completed,” weakening data analysis and security objectives.

Most companies need a program that can do the heavy lifting for them. Mage helps organizations find and catalog their data, including highlighting Personally Identifiable Information. Not only is this process automated, but it also uses Natural Language Processing to identify mislabeled PII. Plus, it can help you mask data in a static or a dynamic fashion, ensuring you’re anonymizing data in the manner that best fits your use case. Schedule a demo today to see what Mage can do to help your organization better secure its data.
