September 9, 2021
Limitations of Native and Open Source Anonymization Tools
Intro to Open Source Anonymization/Masking Tools
There are plenty of reasons why an organization might need to mask its data, in both production and non-production (or pre-production) environments. Unfortunately, masking (or other forms of data anonymization) is too often an afterthought, with developers looking for either “native” masking solutions included with their current database tools (Oracle, for example), turning to open-source solutions, or trying to write their own SQL scripts to get the job done.
Going with a native or open-source solution is a much different beast than onboarding a fully developed solution from a data security provider. It starts with a philosophical difference—after all, a data security provider focuses primarily on data security, naturally, while other solutions are mere “means to an end.” (Oracle, for example, wants to sell data storage. Open-source scripts might be written by people with any number of goals in mind.)
The difference is not just philosophical. Those different goals show up in the different ways of constructing a data anonymization solution. So, while a native or open-source tool might be fine as a low-cost option used solely for testing, a truly scalable enterprise solution is well worth the investment for organizations serious about data security.
The Three Limitations of Native and Open-Source Anonymization Solutions
In our research, we’ve found that native and open-source solutions, as a group, suffer from three big limitations. These limitations might not be obvious at first, but they will eventually have a large impact on the organization. Those limitations have to do with scalability across multiple databases, flexibility, and analytics access.
i) Scalability Across Multiple Databases
As an organization grows, there will be many different databases needing to exchange and store data that will require anonymization. It is not uncommon, for example, to have different databases in Oracle and PostgreSQL, both of which need to connect with Amazon AWS and Salesforce, all in the same organization. Add in other partners (channel partners that need to access a central customer database, for example) and the problem becomes even more complex.
Most native solutions that come with a given database work only for the data in that database. If you have Oracle’s native masking solution, for example, it works well enough, but only for that Oracle implementation. If you need to get that data out of the Oracle database and into another, there is no guarantee that the masking method will be compatible—and if the data has to be unmasked, well, that sort of defeats the purpose.
This creates a huge issue when it comes to referential integrity. Take a bit of personal data (a first name, say) that exists in multiple databases. In one database, an entry with the name “Liam” might be masked with another name (“Ajay”) or with a set of random characters (“XY6K”). (Swapping in another realistic value is usually called substitution; replacing the value with meaningless characters is closer to redaction.) But another database might mask that same entry with a different value (“Paul,” for example). There is no way, then, to compare entries across databases and match the two. This is a serious roadblock when it comes to data sharing and analytics.
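One common way to keep referential integrity is to make the mask deterministic, so that every database derives the same replacement from the same input. The sketch below is only an illustration of that idea, not how any particular product works. The key, the name pool, and the `mask_name` helper are all assumptions made up for this example, and a pool this small would produce many collisions in practice.

```python
import hashlib
import hmac

# Hypothetical shared secret; in practice it would come from a key manager.
SECRET_KEY = b"example-key-not-for-production"

# Tiny illustrative pool of replacement names (an assumption for this sketch).
NAME_POOL = ["Ajay", "Paul", "Maria", "Chen", "Omar", "Nina", "Lars", "Sofia"]


def mask_name(value, key=SECRET_KEY):
    """Deterministically map a name to a replacement from NAME_POOL."""
    normalized = value.strip().lower().encode("utf-8")
    digest = hmac.new(key, normalized, hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(NAME_POOL)
    return NAME_POOL[index]


# Two "databases" that are masked independently with the same key and function.
oracle_rows = [{"id": 1, "first_name": "Liam"}]
postgres_rows = [{"customer": 77, "first_name": "Liam"}]

masked_oracle = [{**r, "first_name": mask_name(r["first_name"])} for r in oracle_rows]
masked_postgres = [{**r, "first_name": mask_name(r["first_name"])} for r in postgres_rows]

# Both systems produce the same masked value, so the records can still be matched.
print(masked_oracle[0]["first_name"] == masked_postgres[0]["first_name"])  # True
```

Because the replacement depends only on the input and the key, it does not matter which system performs the masking. The trade-off is that anyone holding the key can recompute the mapping, so the key needs the same protection as the data itself.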
ii) Flexibility
Data masking is a potent tool for protecting critical data, but it is not the only one. Encryption and tokenization are also valid methods for de-identifying data, and all three are subtly different. There are also variations on these three methods, which means there are many options out there.
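To make the difference a little more concrete, here is a minimal sketch contrasting masking with tokenization. It is illustrative only: `mask_card` and `TokenVault` are hypothetical names invented for this example, and the in-memory dictionaries stand in for what would be a secured vault. Encryption is left out because Python's standard library has no symmetric cipher; conceptually, it differs in that the original value can be recovered by anyone holding the key, with no lookup table involved.

```python
import secrets


def mask_card(number):
    """Masking: irreversibly hide everything except the last four digits."""
    return "X" * (len(number) - 4) + number[-4:]


class TokenVault:
    """Tokenization: swap a value for a random token and keep the mapping in a vault."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value):
        # Reuse the existing token so the same value always gets the same token.
        if value not in self._value_to_token:
            token = secrets.token_hex(8)
            self._value_to_token[value] = token
            self._token_to_value[token] = value
        return self._value_to_token[value]

    def detokenize(self, token):
        # Only a party with access to the vault can reverse a token.
        return self._token_to_value[token]


card = "4111111111111111"
vault = TokenVault()
token = vault.tokenize(card)

print(mask_card(card))                  # XXXXXXXXXXXX1111
print(vault.detokenize(token) == card)  # True
```

The masked value cannot be turned back into the card number, while the token can, but only through the vault. Which property is desirable depends on the use case.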
What happens if you need one option specifically? Or more than one option? Remember, a data anonymization tool that comes from an organization that is not a security organization is, in essence, an “add on.” The company is trying to sell, or do, something else. This means that the tools you will be offered for data de-identification will be minimal.
To give you some idea of the contrast, MENTIS offers over 70 different anonymization methods, including options using all three methods (masking, encryption, and tokenization). This gives organizations the flexibility to choose different methods for different use cases. For example, encryption would work for very sensitive data with a high data value, while masking will not have much of an impact on performance.
iii) Analytics Access
Many open-source masking solutions will anonymize data using a meaningless mask—for example, the name “Samantha” becomes a series of random characters or a sequence of Xs. When the series of tokens is meaningless, you lose the ability to do any sort of interesting analysis with them.
This does not have to be the case, however. Another way to mask the data is to replace personal information with something that is comparable in the relevant dimensions, but still random. To take our “Samantha” example, this bit of personal information is a first name, most likely for a female, and most likely one living in North America. If “Samantha” were replaced with another name at random that was the same length and appropriate for a female living in North America, it would still protect the underlying personal information (as long as the name is common enough), but the token itself would still carry enough information for analytics purposes. And so replacing “Samantha” with “Jennifer” would still mask the underlying data and protect the person’s privacy, while still allowing an analyst to run analytics that grouped the data by gender and location.
This particular method works because analysts usually do not need all of the data, in every aspect; they need only a few relevant dimensions. By changing data to similar (“adjacent”) values, much of the structure—and hence, value—of the data set remains.
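The substitution described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a description of any vendor's implementation: the `REPLACEMENTS` table, the `substitute_name` helper, and the key are hypothetical, and the sketch assumes gender and region are already available as columns rather than inferred from the name.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical secret used to pick replacements deterministically.
SECRET_KEY = b"example-key-not-for-production"

# Hypothetical lookup table keyed by (gender, region); a real one would be far larger.
REPLACEMENTS = {
    ("F", "NA"): ["Jennifer", "Michelle", "Kimberly", "Amy", "Emily", "Sarah"],
    ("M", "NA"): ["Michael", "Matthew", "William", "Jon", "David", "James"],
}


def substitute_name(name, gender, region, key=SECRET_KEY):
    """Replace a name with another from the same gender/region pool."""
    pool = REPLACEMENTS[(gender, region)]
    # Prefer replacements of the same length; never return the original name.
    same_length = [n for n in pool if len(n) == len(name) and n != name]
    candidates = same_length or [n for n in pool if n != name]
    digest = hmac.new(key, name.encode("utf-8"), hashlib.sha256).digest()
    return candidates[int.from_bytes(digest[:8], "big") % len(candidates)]


records = [
    {"first_name": "Samantha", "gender": "F", "region": "NA"},
    {"first_name": "Amy", "gender": "F", "region": "NA"},
    {"first_name": "David", "gender": "M", "region": "NA"},
]

masked = [
    {**r, "first_name": substitute_name(r["first_name"], r["gender"], r["region"])}
    for r in records
]

# The names have changed, but grouping by gender and region gives the same counts.
before = Counter((r["gender"], r["region"]) for r in records)
after = Counter((r["gender"], r["region"]) for r in masked)
print(before == after)  # True
```

In this sketch, “Samantha” becomes one of the eight-letter names in the female North American pool, so an analyst can still group by gender and location even though the real name is gone.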
This is a fairly modern way of approaching data masking. Most open-source and native solutions simply do not have this capability, and thus they make doing analytics on real data sets nearly impossible.
Getting Started with a Data Security Company
For years, Mage has been helping organizations with data security, including static data masking for pre-production and non-production environments (iScramble™) and dynamic data masking for production environments (iMask™).