Mage Data

Category: Blogs – TDM

  • How to Secure Your Critical Sensitive Data in Non-Production and Testing Environments

    How to Secure Your Critical Sensitive Data in Non-Production and Testing Environments

    With businesses across the world embracing digital transformation projects to adapt to modern business requirements, a new challenge has emerged: data is used for more business-critical functions than ever, and its sensitive nature must still be protected. Within the same organization, multiple functions use data in different ways to meet their objectives, adding a layer of complexity for data security professionals who aim to prevent the exposure of any sensitive data without affecting performance. This sensitive data can include employee and customer information, as well as corporate confidential data and intellectual property, all of which can have wide ramifications if it falls into the wrong hands. For organizations that depend on high-quality data for their software development processes but also want to ensure that any sensitive information it contains is not exposed, a good static data masking tool is a crucial requirement for business operations.

    Protect data in non-production environments

    A critical aspect of data protection is ensuring the security of sensitive data in development, testing and training (non-production) environments, to eliminate any risk of sensitive data exposure. The same protection methods cannot be used for production and non-production environments as the requirements for both are different. In such cases, de-identifying or masking the data is recommended as a best practice for protecting the sensitive data involved. Masking techniques secure both structured and unstructured fields in the data landscape to allow for testing or quality assurance requirements and user-based access without the risk of sensitive data disclosure.

    Maintain integrity of secured data

    While securing data, it has also become important for organizations to balance security and usability so that the data remains relevant for business analytics, application development, testing, training, and other value-added purposes. Good static data masking tools ensure that the data is anonymized in a manner that retains its usability while providing security.

    Choice of anonymization methods

    Organizations will have multiple use cases for data analysis, based on the requirements of the teams that handle this data. Some anonymization methods can prove more valuable than others, depending on the security and performance needs of the relevant teams. These methods can include encryption, tokenization, or masking, and good tools will offer several such methods so that sensitive data can be protected effectively.

    For years, Mage Data™ has been helping organizations with their data security needs, by providing solutions that include static data masking tools for securing data in non-production environments (Mage Data Static Data Masking).
    Some of the features of the Mage Data Static Data Masking tool are as follows:

    • 70+ different anonymization methods to protect any sensitive data effectively
    • Maintains referential integrity between applications through anonymization methods that give consistent results across applications and datastores
    • Anonymization methods that offer both protection and performance while maintaining data usability
    • Encrypts, tokenizes, or masks the data according to the use case that suits the organization
  • The Comparative Advantages of Encryption vs. Tokenization vs. Masking

    The Comparative Advantages of Encryption vs. Tokenization vs. Masking

    Any company that handles data (especially any company that handles personal data) will need a method for de-identifying (anonymizing) that data. Any technology for doing so will involve trade-offs. The various methods of de-identification—encryption, tokenization, and masking—will navigate those trade-offs differently.

    This fact has two important consequences. First, the decision of which method to use, and when, has to be made carefully. One must take into consideration the trade-offs between (for example) performance and usability. Second, companies that traffic in data all the time will want a security solution that provides all three options, allowing the organization to tailor their security solution to each use case.

    We’ve previously discussed some of the main differences among encryption, tokenization, and masking; the next step is to look more closely at these trade-offs and the subsequent use cases for each type of anonymization.

    The Security Trade-Off Triangle

    Three of the main qualities needed in a data anonymization solution are security, usability, and performance. We can think of these as forming a triangle; as one gets closer to any one quality, one is likely going to have to trade off the other two.

    Security (Data Re-Identification)

    Security is, of course, the main reason for anonymizing data in the first place. The way in which the various methods differ is in the ease with which data can be de-anonymized—that is, how easy it is for a third party to take a data set and re-identify the items in that set.

    A great example of such re-identification came from a news story several years ago, when trip data from the New York City Taxi and Limousine Commission was released under a freedom-of-information request. That data, which covered over 173 million individual trips and included information about the drivers, had been anonymized using a common technique called hashing. A third party was able to prove that the data could be very easily re-identified, and with a little work, a clever hacker could even infer things like individual cab drivers’ salaries and where they lived.

    A good way to measure the relative security of a process like encryption, tokenization, or masking, then, is to assess how difficult re-identification of the data would be.

    Usability (Analytics)

    The more that a bit of data can be changed, the less risk there is for re-identification. But this also means that the pieces of data lose any kind of relationship to each other, and hence any pattern. The more the pattern is lost, the less useful that data is when doing analysis.

    Take a standard 9-digit Social Security number, for example. We could replace each digit with a single character, say XXXXXXXXX or 999999999. This is highly secure, but a database full of Xs will not reveal any useful patterns. In fact, it won’t even be clear that the data are numeric.

    Now consider the other extreme, where we simply increase a single digit by 1. Thus, the Social Security number 987 65 4321 becomes 987 65 4322. In this case, much of the information is preserved. Each unique Social Security number in the database will preserve its relations with other numbers and other pieces of data. The downside is that the algorithm is easily cracked, and the data becomes easily reversible.
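
    To make the trade-off concrete, here is a minimal Python sketch (purely illustrative, not any vendor's actual algorithm) contrasting the two naive approaches just described:

      def redact(ssn: str) -> str:
          """Maximum security, minimum usability: every digit becomes an 'X'."""
          return "X" * len(ssn)

      def increment_last_digit(ssn: str) -> str:
          """Maximum usability, minimum security: trivially reversed by subtracting 1."""
          return ssn[:-1] + str((int(ssn[-1]) + 1) % 10)

      original = "987654321"
      print(redact(original))                 # XXXXXXXXX -> no patterns left to analyze
      print(increment_last_digit(original))   # 987654322 -> relationships kept, easily reversed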

    This is a problem for non-production environments, too. Sure, one can obtain test data using pseudo values generated by algorithms. But even in testing environments, one often needs a large volume of data that has the same complexity of real-world data. Pseudo data simply does not have that kind of complexity.

    Performance

    Security happens in the real world, not on paper. Any step added to a data process requires compute time and storage. It is easy for such costs to add up. Having many servers running to handle encryption, for example, will quickly become costly if encryption is being used for every piece of data sent.

    How Do Encryption, Tokenization, and Masking Compare?

    Again, setting the technical details aside for the moment, the major difference among these methods is the way in which each navigates the trade-offs in this triangle.

    Encryption

    Encryption is best suited for unstructured fields (though it also supports structured), or for databases that aren’t stored in multiple systems. It is also commonly used for protecting files and exchanging data with third parties.

    With encryption, performance varies depending on the time it takes to establish a TCP connection, plus the time for requesting and getting a response from the server. If these connections are being made in the same data center, or to servers that are very responsive, performance will not seem that bad. Performance will degrade, however, if the servers are remote, unresponsive, or simply busy handling a large number of requests.

    Thus, while encryption is a very good method for security of more sensitive information, performance can be an issue if you try to use encryption for all your data.
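
    As a rough illustration of what an encryption round trip looks like in code, the Python sketch below uses the open-source cryptography package (our choice for illustration, not necessarily the library any particular product uses). Note how the ciphertext preserves nothing of the original format:

      from cryptography.fernet import Fernet   # third-party 'cryptography' package

      key = Fernet.generate_key()              # must be generated once and stored securely
      cipher = Fernet(key)

      plaintext = b"jane.doe@example.com"
      ciphertext = cipher.encrypt(plaintext)   # unreadable without the key; format not preserved
      restored = cipher.decrypt(ciphertext)    # the exact original comes back for authorized readers

      assert restored == plaintext
      print(ciphertext[:20])                   # e.g. b'gAAAAAB...' - nothing like an email address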

    Tokenization

    Tokenization is similar to encryption, except that the data in question is replaced by a random string of values (a token) instead of modified by an algorithm. The relationship between the token and original data is preserved in a table on a secure server. When the original data is needed, the application looks up the original relationship between the token and the original data.

    Tokenization always preserves the format of the data, which helps with usability, while maintaining high security. It also tends to create less of a performance hit compared to encryption, though scaling can be an issue if the size of the lookup table becomes too large. And unlike encryption, sharing data with outside parties is tricky because they, too, would need access to the same table.
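
    The Python sketch below is a toy, in-memory illustration of the lookup-table idea; a real tokenization service would persist the vault on a hardened, access-controlled server:

      import secrets

      class TokenVault:
          """Toy tokenization vault: the stored mapping is the only way back to the original."""

          def __init__(self):
              self._token_to_value = {}
              self._value_to_token = {}

          def tokenize(self, value: str) -> str:
              if value in self._value_to_token:              # consistent: same value, same token
                  return self._value_to_token[value]
              token = value
              while token == value or token in self._token_to_value:   # keep format, avoid collisions
                  token = "".join(secrets.choice("0123456789") for _ in value)
              self._value_to_token[value] = token
              self._token_to_value[token] = value
              return token

          def detokenize(self, token: str) -> str:
              return self._token_to_value[token]             # requires access to the lookup table

      vault = TokenVault()
      card = "4111111111111111"
      token = vault.tokenize(card)
      print(token, vault.detokenize(token) == card)          # random 16-digit token, True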

    Masking

    There are different types of masking, so it is hard to generalize across all of them. One of the more sophisticated approaches to masking is to replace data with pseudo data that nevertheless retains many aspects of the original data, so as to preserve its analytical value without much risk of re-identification.

    When done this way, masking tends to require fewer resources than encryption while retaining the highest data usability.
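
    As a simplified illustration of this kind of masking, the hypothetical function below replaces most of a Social Security number with pseudo digits while keeping its length and a recognizable prefix, so the masked value still behaves like real data in tests and analytics:

      import random

      def mask_ssn(ssn: str, seed=None) -> str:
          """Replace all but the first three digits with pseudo digits of the same shape."""
          rng = random.Random(seed)
          prefix, rest = ssn[:3], ssn[3:]
          fake_rest = "".join(rng.choice("0123456789") for _ in rest)
          return prefix + fake_rest

      print(mask_ssn("987654321"))   # still 9 digits, still starts with 987, hard to reverse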

    Choosing on a Case-by-Case Basis

    So which method is appropriate for a given organization? That depends, of course, on the needs of the organization, the resources available, and the sensitivity of the data in question. But there need not be a single answer; the method used might vary depending on the specific use case.

    For example, consider a simple email system residing on internal, on-premises servers. Encryption might be appropriate for this use, as the data are unstructured, the servers are nearby and dedicated to this purpose, and the need for security might well be high for some communications.

    But now consider an application in a testing environment that will need a large amount of “real-world-like” data. In this case, usability and performance are much more important, and so masking would make more sense.

    And all of this might change if, for example, you find yourself having to undergo a cloud migration.

    The way forward for larger organizations with many and various needs, then, is to find a vendor that can provide all three and help with applying the right techniques in the right circumstances. Here at Mage Data, we aim to gain an understanding of our clients’ data, its characteristics, and its use, so we can help them protect that data appropriately. For more about our anonymization and other security solutions, you can download a data sheet here.

  • A Data Protection and Data Privacy Glossary

    A Data Protection and Data Privacy Glossary

    The vocabulary of data protection seems to change every few years. Legislators pass new regulations, user expectations rise, and new technologies become available. It’s hard enough to keep up with the jargon, never mind best practices.

    Those responsible for implementing data privacy solutions need to “talk the data privacy talk” before walking the walk. This helps ensure that you can be advised on the right set of solutions to solve the real problems at hand—and, hopefully, never have to guess whether or not you’re protected. This glossary provides definitions and explanations for 20+ data protection words, phrases, and concepts.

    Anonymization

    Data anonymization helps companies maximize the utility of data while preserving compliance. Anonymization removes personally identifiable information (PII), so the data cannot be tied to individuals if leaked or misused. Anonymizing the data eliminates privacy concerns so an organization can retain information for forecasting and other analysis. Businesses must avoid the most common data anonymization mistakes to keep their user information private.

    Big Data

    Big Data refers to data sets that are too large or complex for traditional data software solutions. Organizations are receiving and retaining increasing volumes of data, and modern data sets contain a much larger variety of information.

    California Consumer Privacy Act (CCPA)

    The CCPA is one of the most significant data privacy regulations. The regulation was signed into law in June of 2018 and went into effect at the beginning of 2020. The CCPA gives users increased data privacy rights, and it’s changing the ways businesses in the United States collect and use information.

    California Privacy Rights Act (CPRA)

    The CPRA was passed in November 2020 and goes into effect at the beginning of 2023. It extends the CCPA, bringing additional protections for consumer information and increasing fines for violations. The regulation applies to all companies doing business in California or with customers within the state.

    Database Activity Monitoring (DAM)

    Database activity monitoring is a security technology for detecting fraudulent, illegal, or otherwise inappropriate behavior within a database. DAM gives security professionals the ability to monitor access to sensitive data in real time. Immediate, ongoing reporting helps keep an organization audit-ready.

    Data Minimization

    This principle means an organization must limit the collection of personal information to what is relevant and necessary. Furthermore, organizations should retain information only for as long as needed to satisfy a specific purpose. The GDPR (defined below) was one of the first to establish guidelines for data minimization.

    Data Obfuscation

    This term is often used interchangeably with data masking. Data obfuscation is the process of modifying sensitive data to protect the privacy of individuals. The process eliminates opportunities for hackers or other unauthorized parties to derive value from the data. At the same time, data obfuscation techniques can preserve the utility of data for authorized parties and software.

    Data Privacy

    Data privacy has to do with collecting, storing, and using data responsibly. Data privacy efforts focus on ensuring that only the appropriate parties have access to information. Explore the differences between data privacy and data protection to gain a deeper understanding of each.

    Data Protection or Data Security

    People often use data protection and data security interchangeably. These terms refer to strategies for ensuring the availability and integrity of data while guarding against threats. While there is some overlap with compliance, it’s worth noting that compliance with regulations is not the same as complete data security.

    Data Retention

    The principle of data retention outlines procedures for meeting requirements around data archiving and management. Organizations must store some information for specified periods to comply with government or industry regulations. Occasionally, there is tension between data retention and data privacy.

    Data Scrambling

    This method of obfuscating or removing confidential data is irreversible. Data scrambling techniques involve the generation of randomized strings that cannot be restored to the original information.

    Data Scrubbing

    Also known as data cleaning or data cleansing, data scrubbing is the process of fixing erroneous information within a data set. Examples that require scrubbing include incomplete, incorrect, and duplicate data. Data scrubbing is a two-step process. First, identify errors in the data set. Then change, update, or remove data as needed to correct issues.

    Data Subject Access Rights

    The right of subject access says individuals are entitled to obtain copies of their data. Technologies like data subject access rights automation help organizations respond to requests more efficiently.

    De-identification

    De-identification of data is a form of data masking. This process involves stripping identifiers from collected data. Removing links between data and personal identities helps protect the privacy of individuals.

    Encryption

    Encryption is the process of encoding data to protect the information from unauthorized access. Typically, an algorithm will turn plaintext data into unreadable ciphertext. This helps when sharing data with third parties, which may then decrypt the information with the decryption key.

    General Data Protection Regulation (GDPR)

    Adopted in 2016 and effective in May 2018, the GDPR is a model for many other data privacy laws. The regulation is part of EU privacy law and human rights law. The GDPR gives individuals more control over their personal data and supersedes other data protection regulations for international business.

    Health Insurance Portability and Accountability Act (HIPAA)

    This act, passed in 1996, is a federal law in the United States. It established national standards to prohibit the disclosure of sensitive health information without the patient’s consent.

    Homomorphic Encryption

    Homomorphic encryption is a specialized type of encryption designed for data in use. Typically, encrypted data is transferred, decrypted, and then analyzed. Homomorphic encryption allows computations to be performed on data, and value to be derived from it, without decrypting it first.

    Masking

    There are multiple types of data masking. Static data masking techniques like tokenization and encryption protect data in pre-production and non-production environments. Dynamic masking protects data in production environments when it’s in transit or in use.

    Personally Identifiable Information (PII)

    PII is any personal data that relates to an identifiable person. Information such as names, addresses, and Social Security numbers are PII because they can directly identify an individual. Combinations of other information such as age, race, gender, birth date, and more can also be PII.

    Protected Health Information (PHI)

    As defined by HIPAA, PHI is any data related to an individual’s health. PHI also includes the healthcare provided to an individual or payment by the individual for said healthcare. PHI is a top consideration when developing data privacy and security programs.

    Personal Data Protection Act (PDPA)

    The PDPA is a piece of data protection legislation from Singapore. It passed in 2012 and regulates the way organizations in the private sector can process personal data.

    Privacy-Enhancing Technologies (PETs)

    PETs are technologies to maximize data privacy while empowering individuals. These technologies help organizations get more from their data without compromising privacy or security.

    Pseudonymization

    Pseudonymization is the process of replacing sensitive data with a reversible, consistent value. Because the substitution is reversible, it carries a higher risk of reidentification than full anonymization.
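
    A minimal Python illustration of the idea, using a hypothetical in-memory mapping table (real systems store and protect this mapping separately):

      import itertools

      class Pseudonymizer:
          """Toy mapping table; keeping it is what makes pseudonymization reversible."""

          def __init__(self):
              self._forward = {}
              self._reverse = {}
              self._counter = itertools.count(1)

          def pseudonymize(self, value: str) -> str:
              if value not in self._forward:
                  pseudonym = f"SUBJECT-{next(self._counter):05d}"
                  self._forward[value] = pseudonym
                  self._reverse[pseudonym] = value
              return self._forward[value]                    # consistent across the whole dataset

          def reidentify(self, pseudonym: str) -> str:
              return self._reverse[pseudonym]                # reversible, which is why risk remains

      p = Pseudonymizer()
      print(p.pseudonymize("Jane Doe"), p.pseudonymize("Jane Doe"))   # the same pseudonym both times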

    Reidentification

    The phenomenon of having personal data extracted or inferred from a source, usually as the result of bad actors attempting to steal that data. For example, a classic case of reidentification occurred when New York City released data on taxi travel, but formatted the data in such a way that it was trivial to recover personally identifiable information of drivers, such as income, home address, etc.

    Sensitive Data Discovery

    As organizations store increasing volumes of data, it becomes crucial to discover sensitive information that may be hidden or forgotten. Sensitive data discovery is the first step in any data privacy and data security strategy. After all, you can’t protect what you don’t know.

    SOX Compliance

    SOX compliance, outlined by the Sarbanes-Oxley (SOX) Act, involves annual auditing of public companies for accuracy and security in their financial reporting. To achieve SOX compliance, companies must keep data secure, track attempted breaches, keep event logs, and prove compliance for the most recent 90-day period.

    Tokenization

    Like encryption, tokenization replaces plaintext data with an algorithm-generated value or string of values. In tokenization, the original data is retained in a secure server. The generated token can be passed to that secure server to retrieve the original information.

    Getting Started With Data Privacy and Data Security

    To see how data protection and data privacy concepts fit into a comprehensive product suite, schedule a demo with Mage Data. We’re happy to address your specific questions and tailor the demo to fit your requirements.

  • What is a Zero-Trust Security Model?

    What is a Zero-Trust Security Model?

    Traditional computer security models ensure that people without the proper authorization cannot access an organization’s network. However, a single set of compromised login credentials can lead to a breach of the entire network.

    A Zero-Trust Security Model goes some way to solving this problem by requiring users to continually verify their identity, even if they’re already inside the secure digital perimeter. This approach restricts users to the minimum amount of information necessary to do their job. In the event of a breach, hackers will find it difficult or impossible to move laterally through a network and gain access to more information.

    A Zero-Trust Security Model doesn’t mean that you don’t trust the people you’re sharing data with. Instead, a zero-trust security model implements checkpoints throughout a system so you can be confident that your trust in each user is justified.

    What is a Zero-Trust Security Model?

    Imagine for a moment that a computer network is like a country. In a traditional security model, the country would have border checkpoints around its perimeter. Employees who presented the correct login info would be allowed to enter, and bad actors trying to gain access would be kept outside.

    While this is a good idea in theory, in practice, problems emerge. For example, bad actors who breached the perimeter would get much or all of the information inside the network. Likewise, employees who are past the first barrier may gain access to documents or other information that they shouldn’t see.

    These problems with the traditional model of cybersecurity drove the U.S. Department of Defense to adopt a new strategy in the early 2000s. Those responsible for network security treated their systems as though they had already been breached, and then asked the question: “Given that the system has been breached, how do we limit the collateral damage?”

    To meet that objective, they developed an approach that required users, consisting of both humans and machines, to continually prove that they were allowed to be present every time they attempted to access a new resource. To return to our metaphor from earlier, employees would have to show ID at the country’s border, and show ID every time they tried to access a new building, which in this example represents resources within the system. This approach meant that bad actors would find it harder to move through the system with a single breach, and also made it easy to restrict employees to the appropriate areas in the network based on their security clearance.

    Zero-Trust Security Comes of Age

    The external and internal benefits of a Zero-Trust Security Model quickly became clear to the private sector, too. While many businesses adapted the system for their own use, or offered it as a service to others, it wasn’t until August 2020 that the National Institute of Standards and Technology (NIST) released the first formal specification for Zero-Trust Security Model implementation.

    NIST Special Publication 800-207 details how to implement a Zero-Trust Architecture (ZTA) in a system. The Seven Tenets of Zero Trust form the core of this approach.

    1. All data sources and computing services are resources
    2. All communication is secured regardless of network location
    3. Access to individual enterprise resources is granted on a per-session basis
    4. Access to resources is determined by a dynamic policy
    5. The enterprise monitors and measures the integrity and security posture of all owned and associated assets
    6. All resource authentication and authorization are dynamic and strictly enforced before access is allowed
    7. The enterprise collects as much information as possible about the current state of assets, network infrastructure, and communications and uses that information to improve its security posture

    Of these seven tenets, two especially speak to what’s different between ZTA and more traditional approaches. Session-based access (#3) means that access permissions are reevaluated each time a new resource is accessed, or if sufficient time has passed between access requests to the same resource. This approach reduces the potential for bad actors to exploit lost devices or gain access through an unattended workstation.

    Dynamic policy controls (#4) look beyond user credentials, such as a username and password. For example, a dynamic policy may also consider factors such as the type of device, the network it is on, and possibly previous activity on the network to determine whether the request is legitimate. This kind of observation improves detection of external malicious actors, even when the correct login credentials are provided.

    Access control is run through a Policy Decision Point. The Policy Decision Point is composed of a Policy Engine, which holds the rules for granting access, and the Policy Administrator, which carries out the allowance or disallowance of access to resources when a request is made.
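
    The Python sketch below is a deliberately simplified, hypothetical illustration of how a Policy Engine and Policy Administrator might evaluate a session request on more than just credentials; it is not the NIST specification itself:

      from dataclasses import dataclass

      @dataclass
      class AccessRequest:                     # hypothetical request attributes
          user: str
          resource: str
          device_managed: bool
          network: str                         # e.g. "corporate", "home", "unknown"
          prior_denials: int

      def policy_engine(request: AccessRequest) -> bool:
          """Dynamic policy: more than a password has to look right for this session."""
          if not request.device_managed:
              return False
          if request.network == "unknown":
              return False
          if request.prior_denials > 3:
              return False
          return True

      def policy_administrator(request: AccessRequest) -> str:
          return "grant session" if policy_engine(request) else "deny and log"

      print(policy_administrator(AccessRequest("alice", "payroll-db", True, "corporate", 0)))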

    Benefits of Zero-Trust Security

    Many powerful benefits emerge when a system is set up to align with ZTA standards. Arguably, the most important of these is the compartmentalization of system resources. When resources are compartmentalized, hackers who gain access to one area of your network won’t gain access to other resources. For example, a breached email account wouldn’t give the hacker access to your project documentation or financial systems.

    Compartmentalization also holds benefits for managing your employees. With a compartmentalized system, you won’t have to (and shouldn’t) give your employees access to more resources than they need to do their jobs. This reduces the risk of an employee intentionally or accidentally viewing sensitive information. Compartmentalization also minimizes the damage done by leaks, as employees generally won’t have access to documentation beyond their immediate needs.

    Because a core policy of ZTA is the continuous collection of data about how each user behaves on the network, it becomes far easier to spot breaches. In many cases, organizations with ZTA systems detect breaches not because of failed authentication but rather because a feature of the access request, such as location, time, or type of resource requested, differs from regular operation and is flagged by the Policy Decision Point. For example, a request for a resource from Utah to a server for a company based in Virginia would raise flags, even if a bad actor provided a valid username and password.

    Zero-Trust Security Model Integration

    While Zero-Trust Security Models hold many benefits for many companies, it’s essential to acknowledge that it’s not a “plug-and-play” system. The approach differs significantly from traditional security practices. Most companies will need a total overhaul of their network to apply it. That can be a disruptive process and will likely lower productivity in the short term as new systems are implemented, and employees adapt to the new policies.

    That doesn’t make moving to a Zero-Trust system the wrong choice, but it does mean that the transition has some tradeoffs. However, if you’re looking for the absolute best industry standard for security, Zero-Trust is the way to go.

    If you’re contemplating increasing your security, you need to know exactly what data you’ll be securing. Mage Data helps organizations find and catalog their data, including highlighting Personally Identifiable Information, which you’d want to provide an extra layer of security to in a Zero-Trust system. Schedule a demo today to see what Mage Data can do to help your organization better secure its data.

  • 6 Common Data Anonymization Mistakes Businesses Make Every Day

    6 Common Data Anonymization Mistakes Businesses Make Every Day

    Data is a crucial resource for businesses today, but using data legally and ethically often requires data anonymization. Laws like the GDPR in Europe require companies to ensure that personal data is kept private, limiting what companies can do with personal data. Data anonymization allows companies to perform critical operations—like forecasting—with data that preserves the original’s characteristics but lacks the personally identifying data points that could harm its users if leaked or misused.

    Despite the importance of data anonymization, there are many mistakes that companies regularly make when performing this process. These errors are not only dangerous to their users but could also subject the companies to regulatory action in a growing number of countries. Here are six of the most common data anonymization mistakes that you should avoid.

    1. Only changing obvious personal identification indicators

    One of the trickiest parts of anonymizing a dataset is determining what is or isn’t Personally Identifiable Information (PII), the kind of information you want to ensure is kept safe. An individual field like a date of purchase or the amount paid may not be personal information, but a credit card number or a name would be. Of course, you could go through the dataset by hand and ensure that all relevant data types are anonymized, but there’s still a chance that something slips through the cracks.

    For example, if data is in an unstructured column, it may not appear on search results when you’re looking for PII. Or a benign-looking column may exist separately in another table, allowing bad actors to reconstruct the original user identities if they got access to both tables. Small mistakes like these can doom an anonymization project to failure before it even begins.
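
    As an illustration of why automated discovery matters, the toy Python scan below uses simple regular expressions to flag PII lurking in free-text columns; production tools rely on far more sophisticated detection:

      import re

      PII_PATTERNS = {                         # illustrative patterns only
          "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
          "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
      }

      def scan_rows(rows):
          findings = []
          for index, row in enumerate(rows):
              for column, value in row.items():
                  for label, pattern in PII_PATTERNS.items():
                      if pattern.search(str(value)):
                          findings.append((index, column, label))
          return findings

      rows = [{"notes": "Customer asked to update SSN 123-45-6789", "amount": 42.50}]
      print(scan_rows(rows))                   # [(0, 'notes', 'ssn')] - PII found in a 'benign' column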

    2. Mistaking synthetic data for anonymized data

    Anonymizing or “masking” data takes PII in datasets and alters it so that it can’t be traced back to the original user. Another approach to data security is to instead create “synthetic” datasets. Synthetic datasets attempt to recreate the relationships between data-points in the original dataset while creating an entirely new set of data points.

    Synthetic data may or may not live up to its claims of preserving the original relationships. If it doesn’t, it may not be useful for your intended purposes. However, even if the connections are good, treating synthesized data like it’s anonymized or vice versa can lead to mistakes in interpreting the data or ensuring that it is properly stored or distributed.

    3. Confusing anonymization with pseudonymization

    According to the EU’s GDPR, data is anonymized when it can no longer be reverse engineered to reveal the original PII. Pseudonymization, in comparison, replaces PII with different information of the same type. Pseudonymization doesn’t guarantee that the dataset cannot be reverse engineered if another dataset is brought in to fill in the blanks.

    Consequently, anonymized data is generally exempted from GDPR. Pseudonymization is still subject to regulations, albeit reduced relative to normal data. Companies that don’t correctly categorize their data into one bucket or the other could face heavy regulatory action for violating the GDPR or other data laws worldwide.

    4. Only anonymizing one data set

    One of the common threats we’ve covered so far is the threat of personal information being reconstructed by introducing a non-anonymized database to the mix. There’s an easy solution to that problem: instead of anonymizing only one dataset, why not anonymize all of the datasets that share data? That way, it would be impossible to reconstruct the original data.

    Of course, that’s not always going to be possible in a production environment. You may still need the original data for a variety of reasons. However, if you’re ever anonymizing data and sending it beyond the bounds of your organization, you have to consider the web of interconnections between databases, and that may mean that, to be safe, you need to anonymize even data you don’t release.

    5. Anonymizing data—but also destroying it

    Data becomes far less valuable if the connections between its points become corrupted or weakened. A poorly executed anonymization process can lead to data that has no value whatsoever. Of course, it’s not always obvious that this is the case. A casual examination wouldn’t reveal anything wrong, leading companies to draw false conclusions from their data analysis.

    That means that a good anonymization process should protect user data and do it in a way where you can be confident that the final results will be what you need.

    6. Applying the same anonymization technique to all problems

    Sometimes when we have a problem, our natural reaction is to use a solution that worked in the past for a similar problem. However, as you can see from all the examples we’ve explored, the right solution for securing data varies greatly based on what you’re securing, why you’re securing it, and your ultimate plans for that data.

    Using the same technique repeatedly can leave you more vulnerable to reverse engineering. Worse, it means that you’re not maximizing the value of each dataset and are possibly over- or under-securing much of your data.

    Wrapping Up

    Understanding your data is the key to unlocking its potential and keeping PII safe. Many of the issues we outlined in this article do not stem from a lack of technical prowess. Instead, the challenge of dealing with millions or even billions of discrete data points can easily turn a quick project into one that drags out for weeks or months. Or worse, projects can end up “half-completed,” weakening data analysis and security objectives.

    Most companies need a program that can do the heavy lifting for them. Mage Data helps organizations find and catalog their data, including highlighting Personally Identifiable Information. Not only is this process automated, but it also uses Natural Language Processing to identify mislabeled PII. Plus, it can help you mask data in a static or a dynamic fashion, ensuring you’re anonymizing data in the manner that best fits your use case. Schedule a demo today to see what Mage Data can do to help your organization better secure its data.

  • Database – Embedded Approach to Dynamic Data Masking – Use Cases

    Database – Embedded Approach to Dynamic Data Masking – Use Cases

    A bank wants to outsource some analytics functions to an outside company. A military contractor wants to limit sensitive information to those at a certain pay grade. An HR department wants to run efficiently without leaking sensitive employee data. What all these cases have in common is a need to make certain bits of information available to those that need it—and to keep private all the rest.

    These are just a few examples of business needs that can be addressed by Dynamic Data Masking (DDM). DDM is a relatively new technology that masks production data in real time, as the data is requested. Sensitive information is selectively “hidden” from users based on geography, role, department, and so on. Organizations can specify which bits of sensitive data to reveal, and to whom, without any serious changes to the database or the application layer.

    Perhaps the biggest advantage of dynamic data masking is its impact on performance—or rather, its lack of impact. DDM is widely preferred on production systems precisely because it does not introduce huge delays in data retrieval, if a database approach is taken (as opposed to using a proxy). It also does not require changing anything in application architecture, which can save on expensive and time-consuming release cycles.

    Here, then, are some prime examples of DDM being used in organizations to maintain privacy and security. (Note that, for privacy reasons, we are presenting hypothetical cases that are merely based on customer circumstances. We have purposely left out any identifying information, of course.)

    Offshoring Analytics Work for a U.S. Bank

    A U.S.-based bank wants to find ways to control costs. As part of that effort, they decide to offshore some of their development and analytics work to firms located in Europe and India. However, U.S. law stipulates that sensitive account information cannot leave the United States.

    That said, the bank does not really need to send the data itself. It is interested in the analytics, which gives summary statistics and relationships among data items. This is a case where a sophisticated masking approach works well: Account data can be masked with “dummy data” that nevertheless preserves the relationships among items in the data.

    For example, take address data. The bank might be interested in discovering how many services are being used by bank patrons in various ZIP codes. They do this by cross-referencing addresses tied to accounts with services being provided to each account. When sending the data, the actual addresses can be masked with fake addresses that nevertheless group together clients who are in like ZIP codes. The result is that the analytics results stay the same, but no actual addresses are transmitted, and no data leaves U.S. borders.

    And because the data is physically and logically within U.S. borders, U.S. law is respected, even though a non-U.S. firm is providing the analytics insights.
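
    A toy Python sketch of the idea, with hypothetical field names rather than any bank's actual schema: the street is replaced with dummy data while the ZIP code is preserved, so ZIP-level analytics return the same results:

      import random

      rng = random.Random(0)                   # fixed seed keeps the example repeatable

      def mask_address(record):
          masked = dict(record)
          masked["street"] = f"{rng.randint(1, 9999)} Anon St."   # dummy street; no real address leaves the U.S.
          # "zip" is deliberately left untouched so per-ZIP aggregations stay correct
          return masked

      accounts = [
          {"account": "A1", "street": "12 Main St.", "zip": "10001", "services": 3},
          {"account": "A2", "street": "98 Oak Ave.", "zip": "10001", "services": 1},
      ]
      print([mask_address(a) for a in accounts])   # both records still group under ZIP 10001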

    Restricting Private Information for Sensitive Contracts

    A top aviation company in Canada is working closely with Canada’s military, and data needs to be shared back and forth between the organizations. Certain bits of information—such as names, addresses, salaries, and designations—are considered private information, and access to that information is determined by pay grade. Some pay grades have access to all information; some can access only information for their own pay grade and lower. Some will not be able to see any private information at all.

    With dynamic data masking, sensitive information is masked according to who is requesting it from the system. This way, data integrity is kept intact at all times, but only those at certain pay grades will be able to see the most sensitive data, according to the rules set forth when the masking is implemented.

    Role-Based Access for HR within a Large Organization

    An enterprise-sized business has a large amount of employee data for things like payroll, employee communications, etc. While employees can access their own data via a company portal, no one outside of HR needs to access that information.

    Even within HR, only certain pieces of information are needed by people in certain roles. In essence, the organization wants role-based access control (RBAC). For example, the person running payroll needs to be able to access salary information and residential addresses. But the person sending communications about benefits needs only the address information; they should not see salaries.

    When employee information is requested, certain bits of information are masked depending on what is actually needed by the person making the request. This allows them to have access to what they need without introducing measures that slow down the system.
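
    A minimal, hypothetical sketch of role-based masking in Python; the roles and fields are invented for illustration:

      ROLE_VISIBLE_FIELDS = {                  # invented roles and fields
          "payroll":        {"name", "address", "salary"},
          "communications": {"name", "address"},
      }

      def fetch_employee(record: dict, role: str) -> dict:
          visible = ROLE_VISIBLE_FIELDS.get(role, set())
          return {k: (v if k in visible else "***MASKED***") for k, v in record.items()}

      employee = {"name": "Jane Doe", "address": "12 Main St.", "salary": 85000}
      print(fetch_employee(employee, "payroll"))          # payroll sees the salary
      print(fetch_employee(employee, "communications"))   # communications sees only name and address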

    Mage Data Provides a Faster Dynamic Data Masking Option

    There are many vendors that will offer dynamic data masking, but they do so by routing requests through a security proxy. While this works, it means there can be a significant effect on performance, because the speed of the system now depends on the speed of that proxy.

    Now consider what that proxy has to do with each request: It must process the query, look at the user’s access level, mask the incoming data when it is returned (according to user access levels), and send it back to the requester. While a handful of requests might be handled quickly, it is easy for such requests to pile up…and when that happens, there is a noticeable impact on system performance.

    With the Mage Data Dynamic Data Masking tool, masking rules and policies are pre-programmed into the database itself. There is no proxy serving as a “middleman”: the database itself becomes smart, knowing which pieces of data to serve as-is and which to mask. This means there is minimal impact on performance. Dynamic Data Masking works for organizations of any size and allows an incredible amount of customization when it comes to access controls and anonymization methods.

  • 5 Common Mistakes Organizations Make During Data Obfuscation

    5 Common Mistakes Organizations Make During Data Obfuscation

    What is Data Obfuscation?

    As the name suggests, data obfuscation is the process of hiding sensitive data with modified or other data to secure it. Many are often confused by the term data obfuscation and what it entails, as it is a broad term used for several data security techniques such as anonymization, pseudonymization, masking, encryption, and tokenization.

    The need for data obfuscation is omnipresent, with companies needing to achieve business objectives such as auditing, cross-border data sharing, and the like. Beyond this, the high rate of cybercrime is also a pressing reason for companies to invest in technology that can help protect their data, especially given the remote working conditions brought about by the Covid pandemic.

    Let’s look at some of the best practices you can follow for data obfuscation:

    1) Understand your options

    It is vital to understand the differences between data obfuscation techniques such as anonymization, pseudonymization, encryption, masking, and tokenization. Unless you’re knowledgeable about the various methods of data security and their benefits, you cannot make an informed choice to fulfill your data security needs.

    2) Keep in mind the purpose of your data

    Of course, the need of the hour is to secure your data. But every data element has a specific purpose. For example, if the data is needed for analytical purposes, you cannot go ahead with a simple encryption algorithm and expect good results. You need to select a technique, such as masking, that will preserve the functionality of the data while ensuring security. The method of obfuscation chosen should facilitate the purpose for which your data is intended.

    3) Enable regulatory compliance

    Of course, data security is a broader term when compared to compliance, but does being secure mean you’re compliant too? Data protection standards and laws such as HIPAA, PCI DSS, GDPR, and CCPA are limited to a defined area and aim to secure that particular information. So, it is imperative to figure out which of those laws you are required to comply with and implement procedures in place to ensure the same. Security and compliance are not the same – ensure both.

    4) Follow the principle of least privilege

    The principle of least privilege is the idea that any user, program, or process should have only the bare minimum privileges necessary to perform its function. It works by allowing only enough access to perform the required job. Apart from hiding sensitive data from those unauthorized, data obfuscation techniques like Dynamic Data Masking can also be used to provide user-based access to private information.

    5) Use repeatable and irreversible techniques

    Wherever applicable, it is advisable to use reliable techniques that produce the same results every time. And even if the data were to be seized by a hacker, it should not be reversible.
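
    One common way to get both properties is a keyed hash. The sketch below is a generic Python illustration (not a description of any specific product): the same input always yields the same masked value, yet the value cannot be reversed without brute force:

      import hashlib
      import hmac

      SECRET_KEY = b"example-key-stored-in-a-vault"        # placeholder; manage real keys securely

      def obfuscate(value: str) -> str:
          digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
          return digest[:16]                               # consistent, truncated, non-reversible token

      print(obfuscate("987-65-4321"))
      print(obfuscate("987-65-4321"))                      # identical output, so joins and lookups still work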

    Conclusion:

    While data obfuscation is important to ensure the protection of your sensitive data, security experts must ensure that they do not implement a solution just to tick a check box. Data security solutions, when implemented correctly, can go a long way toward saving millions of dollars in revenue for the organization.

  • Reidentification Risk of Masked Datasets: Part 2

    Reidentification Risk of Masked Datasets: Part 2

    This article is a continuation of Reidentification Risk of Masked Datasets: Part I, where we discussed how organizations progressed from simple to sophisticated methods of data security, and how even then, they faced the challenge of reidentification. In its conclusion, we shared what companies really need to focus on while anonymizing their data.

    Now, we delve into the subject of reidentification and how to go about achieving your goal, which is ultimately to reduce or eliminate the risk of reidentification.

    Before we dive further into this, it helps to understand a few basic concepts.

    What is reidentification and reidentification risk?

    Reidentification is the direct or indirect possibility that the original data can be deciphered from an anonymized dataset; how likely this is depends on the dataset and the method of anonymization. The associated risk is appropriately named reidentification risk.

    The NY cab example that we saw in Part 1 of this article is a classic case of reidentification, where the combination of indirectly related factors led to the reidentification of seemingly anonymized personal data.

    Understanding the terms data classification and direct identifiers

    A data classification, or identifier, is any data element that can be used to identify an individual, either by itself or in combination with another element, such as name, gender, DOB, employee ID, SSN, age, phone number, ZIP code and so forth. Certain specific data classifications — for instance, employee ID, SSN and phone number — are unique or direct identifiers (i.e., they can be used to uniquely identify an individual). Name and age, on the other hand, are not unique identifiers, since they repeat in a large dataset.

    Understanding indirect identifiers and reidentification risk through a simple example

    Let’s say you take a dataset of 100 employees and you’re tasked with finding a specific employee in her 40s. Assume that all direct identifiers have been anonymized. Now you look at indirect identifiers, such as race/ethnicity, city or, say, her bus route — and sure enough, you’ve identified her. Indirect identifiers depend on the dataset and, therefore, are distinct for different datasets. Even though the unique identifiers have been anonymized, you can’t say for sure that an individual can never be identified given that every dataset carries indirect identifiers, leading to the risk of reidentification.

    What are quasi-identifiers?

    Quasi-identifiers are a combination of data classifications that, when considered together, will be able to uniquely identify a person or an entity. As previously mentioned, studies have found that the five-digit ZIP code, birth date and gender form a quasi-identifier, which can uniquely identify 87% of the American population.
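
    A quick way to see quasi-identifier risk in practice is to count how many records share each combination. The toy Python check below, using made-up records, flags combinations that occur only once:

      from collections import Counter

      records = [                                          # made-up records for illustration
          {"zip": "30301", "birth_year": 1976, "gender": "F"},
          {"zip": "30301", "birth_year": 1976, "gender": "F"},
          {"zip": "30305", "birth_year": 1982, "gender": "M"},
      ]

      groups = Counter((r["zip"], r["birth_year"], r["gender"]) for r in records)
      at_risk = [combo for combo, size in groups.items() if size == 1]
      print(at_risk)   # [('30305', 1982, 'M')] -> a unique combination, hence easily reidentifiable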

    Now that we are on the same page with essential terms, let’s get to the question: How do you go about choosing the right solution that minimizes or eliminates the risk of reidentification while still preserving the functionality of the data?

    The answer lies in taking a risk-versus-value approach.

    First, determine the reidentification risk carried by each identifier in your dataset. Identify its uniqueness and whether it is a direct or indirect identifier, then look at the possible combinations of quasi-identifiers. How high a risk does a particular identifier pose, either on its own or combined with another piece of data? Depending on how big and diverse a dataset is, there are likely to be quite a few identifiers that can, singly or in combination, reidentify someone. If you treat each identifier on its own terms and anonymize it based on the risk it poses, regardless of what was done to other data elements, you end up with anonymized datasets that are still data-rich but from which the original data cannot be recovered.

    In other words, this approach maintains demographical logic but not absolute demographics. Now that we’ve preserved data functionality, let’s address eliminating reidentification risk. Remember the quasi-identifier (ZIP Code, DOB and gender) that can uniquely identify 87% of the U.S. population? We can address this issue the same way, by maintaining demographical logic in the anonymized data. For example, Elsie, who was born on 1/21/76, can be anonymized to Annie born on 1/23/76.

    Notice that: 

    1. The gender remains the same (while the name is changed with the same number of characters).
    2. The ZIP code remains the same (while the street is changed).
    3. The date of birth changed by two days (while the month and year remain the same).

    This dataset maintains the same demographic data, which is ideal for analytics, without giving away any PII and, at the same time, plans ahead for and eliminates the risk of reidentification.
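
    The hypothetical Python sketch below mimics the Elsie-to-Annie example: gender and ZIP code are untouched, the name is swapped for another of the same length where possible, and the date of birth is nudged by a couple of days:

      import datetime as dt
      import random

      FAKE_NAMES = {5: ["Annie", "Karen", "Diane"], 4: ["Dana", "Rose"]}
      rng = random.Random(7)

      def anonymize(person):
          dob = dt.date.fromisoformat(person["dob"])
          shifted = dob + dt.timedelta(days=rng.randint(1, 3))   # a fuller tool would also pin month and year
          return {
              "name": rng.choice(FAKE_NAMES.get(len(person["name"]), ["Anon"])),  # same length when available
              "gender": person["gender"],                        # unchanged
              "zip": person["zip"],                              # unchanged
              "dob": shifted.isoformat(),
          }

      print(anonymize({"name": "Elsie", "gender": "F", "zip": "10001", "dob": "1976-01-21"}))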

    A practical solution lies in keeping in mind the ways in which data can be reidentified when applying anonymization methods. The right approach maintains the dataset’s richness and value for its originally intended purpose while minimizing or eliminating the reidentification risk of your masked datasets, making for a comprehensive and effective data security solution for your sensitive data.

    By taking a risk-based approach to data masking, you can be assured that your data has been truly anonymized.

  • Reidentification Risk of Masked Datasets: Part 1

    Reidentification Risk of Masked Datasets: Part 1

    The following article is also hosted on Forbes Tech Council.

    Data masking, anonymization and pseudonymization aren’t new concepts. In fact, as the founder of a data security company, I’ve had 15 years of experience in helping my clients drive business decisions, enable transactions, allow for secure data sharing in a B2B or consumer scenario, or meet audit and financial requirements.

    One constant in my experience has been the issue of data reidentification. In many instances, companies take a one-size-fits-all easy route to data security, such as using Pretty Good Privacy (PGP) and assuming that 95% of the world won’t be able to reverse it, only to find that this assumption is incorrect.

    Take, for example, cases in which customers go into an ERP application and change all Social Security numbers to the most popular combination—999 99 9999. This is highly secure, of course, but renders the data nonfunctional.

    On the other hand, by staying too close to the data, you may ensure that the data performs well, but the trade-off ends up compromising security. For instance, you can add 1 to the Social Security number 777 29 1234, and it becomes 777 29 1235. Or you can add it to the digits in between, and it becomes 777 30 1234. In both cases, though, the data functionality is preserved at the cost of its security. The algorithm can be cracked, and the data is easily reversible.

    Some companies have even kicked the tires on synthetic data (pseudo values generated by algorithms), but that also negatively impacts testing since non-production environments need large volumes of data that carry the complexity of real data in order to test effectively and build the right enhancements. As the name suggests, synthetic data is in no way connected to reality.

    Theoretically speaking, it may seem like a great way to go about securing your information, but no machine algorithm can ever mimic the potency of actual data an organization accrues during its lifespan. Most projects that rely on synthetic data fail for the simple reason that it doesn’t allow for complexity in the interaction between data because, unlike actual data, it’s too uniform.

    There’s always a trade-off: Either your data is highly secure but low in functionality or vice versa. In search of alternatives, companies move to more sophisticated methods of data protection, such as anonymization and masking, to ensure data security in regard to functionality and performance.

    There is a catch, however, and it is the reason proper and complete anonymization and masking are so critical: cross-referencing the data with other publicly available data can reidentify an individual from their metadata. As a result, private information such as PFI, PHI and contact information could end up in the public domain. In the wrong hands, this could be catastrophic.

    There have been many incidents in which this has already happened. In 2014, the New York City Taxi and Limousine Commission released anonymized datasets of about 173 million taxi trips. The anonymization, however, was so inadequate that a grad student was able to cross-reference this data with Google images on “celebrities in taxis in Manhattan” to find out celebrities’ pickup and drop-off destinations and even how much they paid their drivers.

    Another example is the Netflix Prize dataset contest, in which poorly anonymized records of customer data were reversed by comparing it against the Internet Movie Database. What is most alarming in situations like these is the information set that can be created by aggregating information from other sources. By combining indirectly related factors, one can reidentify seemingly anonymized personal data.

    Research conducted by the Imperial College London found that “once bought, the data can often be reverse-engineered using machine learning to reidentify individuals, despite the anonymization techniques. This could expose sensitive information about personally identified individuals and allow buyers to build increasingly comprehensive personal profiles of individuals. The research demonstrates for the first time how easily and accurately this can be done — even with incomplete datasets. In the research, 99.98% of Americans were correctly reidentified in any available ‘anonymized’ dataset by using just 15 characteristics, including age, gender and marital status.”

    While you fight between ensuring both data security and data functionality, you find yourself in a bind, choosing what to trade for the other. But does it have to be this way?

    To keep your business going, of course, you’re going to have to make compromises — but in a way that doesn’t challenge the security, performance or functionality of the data nor the compliance or privacy of the individual you’re trying to protect.

    We are now aware that data masking, anonymization and pseudonymization might not be as easy as we thought they would be. As a concept, they may seem deceptively simple, but approaching them in a simple, straightforward manner will limit your ability to scale your solution to fit the many ways your business will need to apply anonymization. This inability to scale in terms of masking methodologies ends up in project failure.

    The focus really needs to be on reducing the risk of reidentification while preserving the functionality of the data (data richness, demographics, etc.). In part two, I will talk about how you can achieve this.

  • Data Security Challenges in Financial Service Industry

    Data Security Challenges in Financial Service Industry

    The financial services industry is the one most targeted by cybercriminals. Because of the sheer volume of sensitive financial data it handles, it serves as a hotspot for cyberattacks. In the past year, 47.5% of financial institutions were breached, while 58.5% experienced an advanced attack or saw signs of suspicious behavior in their infrastructure.

    There are also quite a few regulations that govern this industry in particular, such as PCI-DSS (Payment Card Industry Data Security Standard), GLBA (Gramm-Leach-Bliley Act, aka the Financial Services Modernization Act), and BCBS 239 (Basel Committee on Banking Supervision’s regulation number 239). Although the industry is heavily regulated, it has a significantly high average data breach cost of $5.86 million [2]. So, data falling into unauthorized hands not only results in non-compliance for the organization but also puts it at financial risk due to the high cost of data breaches.

    Challenges

    Third-party risks

    The use of third-party vendors is one of the major forces behind the cybersecurity risk that threatens financial institutions [3]. Most organizations work with hundreds or even thousands of third parties, creating new risks that must be actively handled. The financial sector has massive third-party networks that pose new weak spots in cyber defense. In the next two years, we can expect to see an exponential rise in attacks on customers, partners, and vendors [4]. Without continuous monitoring and reporting, along with the use of critical tools to do so, organizations are vulnerable to data breaches and other consequences.


    Data transfers (cross-border data exchanges)


    The most fundamental challenge is to keep your private data private. Given that the financial sector produces and utilizes a massive amount of sensitive data, and is highly regulated, cybersecurity becomes paramount. Adequate measures are needed to protect data at rest, in use, and in motion.

    Security concerns

    Problems arise with data security when employees, security officials, and others tasked with protecting sensitive information fail to provide adequate security protocols. They may become careless about leaving their credentials around at home or in public places. Other issues arise when networks and web applications provided by institutions don’t have enough safeguards to keep out hackers looking to steal data.

    According to SQN Banking Systems, the five biggest threats to a bank’s cybersecurity include:
    • Unencrypted data
    • Malware
    • Non-secure third-party services
    • Manipulated data
    • Spoofing

    Evolution of technology and the threat landscape

    Technology evolves daily; what we’re using now might be obsolete in the coming year. At the same time, cybercriminals are also equipping themselves to face technological advancements head-on. Judging by the alarming number of cybercrimes in the past year, they are often more advanced than the technology we are using. In such a scenario, where criminals are always one step ahead of the organization, blocking threats becomes a difficult task.

    Evolving customer needs and organizations

    Just as technology evolves, so do organizations and the way they function. Customer needs are ever-increasing, and customers want quick solutions. What they might not realize is that appealing technology comes with its own set of risks. Moreover, there probably isn’t a financial institution today that hasn’t explored digital and mobile platforms. As institutions continue expanding and using these platforms to cater to their consumers, they incidentally open themselves to cyber risk exposure. Retaining customer confidence while meeting their growing appetite for newer technologies becomes a complicated process.

    Remaining compliant

    Alongside dealing with the challenges mentioned above, it is imperative that financial institutions also put in the effort to stay compliant with laws such as the GDPR and CCPA to avoid hefty fines and other concerns such as revenue loss, customer loss, reputation loss, and the like.


    Solutions

    • Monitor user activity for all actions performed on sensitive data in your enterprise.
    • Choose from different methods or select a combination of techniques such as encryption, tokenization, static, and dynamic data masking to secure your data, whether it’s at rest, in use, or in motion.
    • Before this step, sensitive data discovery is a must, because if you don’t know where your data is, how will you protect it?
    • Deploy consistent and flexible data security approaches that protect sensitive data in high-risk applications without compromising the application architecture.
    • Your data security platform should be scalable and well integrated, consistent across all data sources, and able to span both production and non-production environments.
    • Finally, ensure the technology you’re implementing is well integrated with existing data protection tools for efficient compliance reporting and breach notifications.


    Conclusion


    If financial institutions think they can dodge cyberattacks with the help of mediocre data security strategies, recent heists have proved them wrong. And despite prevention and authentication efforts, many make the mistake of assuming that anomalous and unauthorized activity will cease to occur, which is unfortunately not the case. While cyber-risk is inevitable, by implementing the right tools and a well-defined approach to cybersecurity, financial institutions can be better prepared as threats evolve.

    About Mage Data


    Our data and application security platform is a single integrated platform that protects sensitive data across its lifecycle, with modules for sensitive data discovery, static and dynamic data anonymization, data monitoring, and data minimization. Our data security solutions are certified, tested, and deployed across a range of customers all over the globe. We have successfully implemented our solution in many large financial institutions: a top private bank in the Dominican Republic, one of the largest Swiss banks, a global financial services software manufacturer, the world’s largest credit rating agency, and a top commercial bank in the United Arab Emirates.

    How a Swiss Bank is effectively handling data security

    A top Swiss Bank was looking to optimize costs by offshoring IT application development while ensuring compliance and without compromising on sensitive data controls. This created a need for sensitive data discovery and masking in both production and non-production environments. In a highly regulated environment, they deployed our product suite, a sophisticated solution that met their needs and enabled secure cross-border data sharing as well as sensitive data assessment for cloud migration and compliance.

    You can download the case study here: Mage Customer Success Stories – Comprehensive Security Solution for a Top Swiss Bank

    References


    1) Bitdefender – Top security challenges for the Financial Services Industry in 2018
    2) Ponemon Institute – Cost of a Data Breach Report, 2019
    3) PwC report – Financial services technology 2020 and beyond: Embracing disruption
    4) Protiviti – The Cybersecurity Imperative: Managing Cyber Risks in a World of Rapid Digital Change
    5) NGDATA: The Ultimate Data Privacy Guide for Banks and Financial Institutions