July 26, 2022
The Comparative Advantages of Encryption vs. Tokenization vs. Masking
Any company that handles data (especially any company that handles personal data) will need a method for de-identifying (anonymizing) that data. Any technology for doing so will involve trade-offs. The various methods of de-identification—encryption, tokenization, and masking—will navigate those trade-offs differently.
This fact has two important consequences. First, the decision of which method to use, and when, has to be made carefully. One must take into consideration the trade-offs between (for example) performance and usability. Second, companies that traffic in data all the time will want a security solution that provides all three options, allowing the organization to tailor their security solution to each use case.
We’ve previously discussed some of the main differences among encryption, tokenization, and masking; the next step is to look more closely at these trade-offs and the subsequent use cases for each type of anonymization.
The Security Trade-Off Triangle
Three of the main qualities needed in a data anonymization solution are security, usability, and performance. We can think of these as forming a triangle; as one gets closer to any one quality, one is likely going to have to trade off the other two.
Security (Data Re-Identification)
Security is, of course, the main reason for anonymizing data in the first place. The way in which the various methods differ is in the ease with which data can be de-anonymized—that is, how easy it is for a third party to take a data set and re-identify the items in that set.
A great example of such re-identification came from a news story several years ago, where data from a New York-based cab company was released according to the Freedom of Information Act. That data, which included over 173 million individual trips and information about the cab driver, had been anonymized using a common technique called hashing. A third party was able to prove that the data could be very easily re-identified—and with a little work, a clever hacker could even infer things like individual cab drivers’ salaries and where they lived.
A good way to measure the relative security of a process like encryption, tokenization, or masking, then, is to assess how difficult re-identification of the data would be.
The more that a bit of data can be changed, the less risk there is for re-identification. But this also means that the pieces of data lose any kind of relationship to each other, and hence any pattern. The more the pattern is lost, the less useful that data is when doing analysis.
Take a standard 9-digit Social Security number, for example. We could replace each digit with a single character, say XXXXXXXX or 999999999. This is highly secure, but a database full of Xs will not reveal any useful patterns. In fact, it won’t even be clear that the data are numeric.
Now consider the other extreme, where we simply increase a single digit by 1. Thus, the Social Security number 987 65 4321 becomes 987 65 4322. In this case, much of the information is preserved. Each unique Social Security number in the database will preserve its relations with other numbers and other pieces of data. The downside is that the algorithm is easily cracked, and the data becomes easily reversible.
This is a problem for non-production environments, too. Sure, one can obtain test data using pseudo values generated by algorithms. But even in testing environments, one often needs a large volume of data that has the same complexity of real-world data. Pseudo data simply does not have that kind of complexity.
Security happens in the real world, not on paper. Any step added to a data process requires compute time and storage. It is easy for such costs to add up. Having many servers running to handle encryption, for example, will quickly become costly if encryption is being used for every piece of data sent.
How Do Encryptions, Tokenization, and Masking Compare?
Again, setting the technical details aside for the moment, the major differences among these methods is the way in which they navigate the trade-offs in this triangle.
Encryption is best suited for unstructured fields (though it also supports structured), or for databases that aren’t stored in multiple systems. It is also commonly used for protecting files and exchanging data with third parties.
With encryption, performance varies depending on the time it takes to establish a TCP connection, plus the time for requesting and getting a response from the server. If these connections are being made in the same data center, or to servers that are very responsive, performance will not seem that bad. Performance will degrade, however, if the servers are remote, unresponsive, or simply busy handling a large number of requests.
Thus, while encryption is a very good method for security of more sensitive information, performance can be an issue if you try to use encryption for all your data.
Tokenization is similar to encryption, except that the data in question is replaced by a random string of values (a token) instead of modified by an algorithm. The relationship between the token and original data is preserved in a table on a secure server. When the original data is needed, the application looks up the original relationship between the token and the original data.
Tokenization always preserves the format of the data, which helps with usability, while maintaining high security. It also tends to create less of a performance hit compared to encryption, though scaling can be an issue if the size of the lookup table becomes too large. And unlike encryption, sharing data with outside parties is tricky because they, too, would need access to the same table.
There are different types of masking, so it is hard to generalize across all of them. One of the more sophisticated approaches to masking is to replace data with pseudo data that nevertheless retains many aspects of the original data, so as to preserve its analytical value without much risk of re-identification.
When done this way, masking tends to require fewer resources than encryptions, but retains the highest data usability.
Choosing on a Case-by-Case Basis
So which method is appropriate for a given organization? That depends, of course, on the needs of the organization, the resources available, and the sensitivity of the data in question. But there need not be a single answer; the method used might vary depending on the specific use case.
For example, consider a simple email system residing on internal on-premise servers. Encryption might be appropriate for this use as the data are unstructured, the servers are nearby and dedicated for this use, and the need for security might well be high for some communications.
But now consider an application in a testing environment that will need a large amount of “real-world-like” data. In this case, usability and performance are much more important, and so masking would make more sense.
And all of this might change if, for example, you find yourself having to undergo a cloud migration.
The way forward for larger organizations with many and various needs, then, is to find a vendor that can provide all three and help with applying the right techniques in the right circumstances. Here at Mage, we aim to gain an understanding of our clients’ data, its characteristics, and its use, so we can help them protect that data appropriately. For more about our anonymization and other security solutions, you can download a data sheet here.