November 11, 2020
Test Data Management Best Practices
Do your test data environments put Production data at risk of exposure?
Test data environments usually require real-world data to reproduce complex issues that fake data may not replicate, so they present one of the most significant security risks to sensitive data. Credentials may not be as secure as in Production, access may not be as stringently monitored, and in some cases too many people have access. Unauthorized access can reveal troves of production data, or other information that provides a foothold for greater access to protected data or systems.
So how do we enable effective test data management while minimizing risk?
First, everyone likely agrees that access should follow the principle of least privilege (limited access to the test environment, and nothing else). Combine that with two-factor authentication as a second line of defense. So far, no problem.
Second, don’t use real data (or mask it if you can’t avoid it).
You have some useful options to minimize the risks of loading real data into a test data environment. Both data subsetting and data virtualization minimize risks while enabling efficiency. Using test data generation enables you to avoid loading real data altogether, and finally, data masking allows you to protect the real data. Let’s take a look at these options.
- Data subsetting consists of taking a subset (usually much smaller than the whole) from one or more production databases. The small size is a significant advantage, since it makes both test data distribution and testing much faster than with a complete database clone. There are some challenges with this approach. For example, you must have a way of ensuring that your subset is representative of your entire dataset, and it must be referentially intact. And it still exposes Production data.
- Data virtualization has a similar motivation to data subsetting at its core: take large production databases and make them efficient to distribute and test. However, where data subsetting does this by reducing the amount of data, virtualization integrates data stored in different types of data models virtually. It doesn’t replicate data from source systems; it stores only the integration logic for viewing. Testers still view production data, so there’s still some risk in this method.
- Manual test data generation can be a tedious and time-consuming process; additionally, it can be difficult to manually ensure that all attributes are present in the data to make it “testable.”
- Finally, synthetic data generation breaks with data subsetting and data virtualization by not using your production data as test data at all. Instead, it allows you to create your own “synthetic” test data. This test data will look real, and will be representative of your production data, while at the same time being entirely fake. The biggest obstacle is how to achieve this while making sure your test data covers a range of relevant test cases. A secondary concern is how to avoid making the process so laborious that it loses any benefit over the manual creation of test data.
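The "referentially intact" requirement for data subsetting can be illustrated with a minimal Python sketch. The table names (`customers`, `orders`), the column names, and the helper `subset_with_integrity` are hypothetical and exist only for this example. The idea is to sample parent rows first and then keep only the child rows that point at them, so no foreign key is left dangling.

```python
import random

def subset_with_integrity(customers, orders, fraction, seed=0):
    """Sample a fraction of parent rows, then keep only child rows that reference them."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    k = max(1, round(len(customers) * fraction))
    kept_customers = rng.sample(customers, k)
    kept_ids = {c["id"] for c in kept_customers}
    # Dropping orders whose customer was not sampled keeps every foreign key valid.
    kept_orders = [o for o in orders if o["customer_id"] in kept_ids]
    return kept_customers, kept_orders

# Hypothetical data: 100 customers, 500 orders (5 per customer).
customers = [{"id": i, "region": "EU" if i % 2 else "US"} for i in range(1, 101)]
orders = [{"id": n, "customer_id": (n % 100) + 1} for n in range(1, 501)]

sub_customers, sub_orders = subset_with_integrity(customers, orders, fraction=0.1)
```

A plain random sample like this does not by itself guarantee a representative subset; that would need something like stratified sampling, which is not shown here. The rows in the subset are also still real data, so the exposure concern remains.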
Each of these options has a drawback that, when you are looking to just get the job done, may mean loading real (production) data in your test data environment. And even with data subsetting and data virtualization, you will be distributing and exposing significant quantities of production data to your testers and leaving it exposed to unauthorized access.
Anonymizing the data is the gold standard in these cases. To make anonymization (masking) successful, these key considerations must be kept in mind:
- Sensitive data discovery: apply a comprehensive discovery solution to find all of the data that needs to be masked.
- Referential integrity: ensure consistency and functionality of data instances during roll-out and consistent masking of the data itself across applications and databases.
- Data for testing: developers and testers DO NOT need to see the real data. What they do require, however, is realistic data, which preserves formats and passes validations.
- Efficiency: to ensure efficiency in the masking process, consider performance constraints, security policies, and environmental limitations.
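To make the referential integrity and realistic-format considerations concrete, here is a small Python sketch of deterministic masking. It is an illustration only, not a description of any product's method. The function `mask_value`, the key, and the sample rows are assumptions made for this example. A keyed hash (HMAC-SHA256) drives the substitution, so the same input always yields the same masked output across tables, while digits stay digits, letters stay letters, and separators are left untouched.

```python
import hashlib
import hmac
import string

def mask_value(value: str, key: bytes) -> str:
    """Deterministically replace each digit/letter while keeping the original format."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    out = []
    for i, ch in enumerate(value):
        b = digest[i % len(digest)]  # one keyed byte per character position
        if ch in string.digits:
            out.append(string.digits[b % 10])
        elif ch in string.ascii_uppercase:
            out.append(string.ascii_uppercase[b % 26])
        elif ch in string.ascii_lowercase:
            out.append(string.ascii_lowercase[b % 26])
        else:
            out.append(ch)  # separators such as "-" or "@" are preserved
    return "".join(out)

# Hypothetical key and rows; a real key would come from a secrets store.
KEY = b"demo-key-not-for-production"
customers_row = {"ssn": "123-45-6789"}
claims_row = {"ssn": "123-45-6789"}

masked_a = mask_value(customers_row["ssn"], KEY)
masked_b = mask_value(claims_row["ssn"], KEY)
```

Because both rows hold the same source value, `masked_a` and `masked_b` are identical, so joins between the two tables still work after masking. This sketch preserves format only: it is not a vetted format-preserving encryption scheme, different inputs can collide, and values with checksums may no longer pass validation.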
A note of warning: home-grown scripts for data masking are the path of least resistance but are not the most effective. They generally do not eliminate sensitive data and, worse, can cause inconsistency in masking rollouts.
Unless you are using synthetically generated data, you will need to a) find and b) anonymize any sensitive information within your test data before distributing it to your testers. This is usually achieved via comprehensive data discovery and static data masking capabilities, respectively. Dynamic data masking and encryption may also be used as ancillary capabilities to complete the toolkit. There’s no reason to expose data, even in subsets, when anonymization can create a realistic and useful test data environment.
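The "find" step can be sketched in a few lines of Python. This is a deliberately naive, pattern-based illustration and not a comprehensive discovery solution. The pattern set, the 80% threshold, the column names, and the helper `discover_sensitive_columns` are all assumptions made for this example.

```python
import re

# Hypothetical pattern set; real discovery covers many more data types and locales.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}

def discover_sensitive_columns(rows, threshold=0.8):
    """Flag columns where most non-empty values match a known sensitive pattern."""
    findings = {}
    if not rows:
        return findings
    for column in rows[0]:
        values = [str(r[column]) for r in rows if r.get(column) not in (None, "")]
        if not values:
            continue
        for label, pattern in PATTERNS.items():
            hits = sum(1 for v in values if pattern.match(v))
            if hits / len(values) >= threshold:
                findings[column] = label
                break  # first matching label wins for this column
    return findings

rows = [
    {"id": 1, "contact": "ana@example.com", "tax_id": "123-45-6789", "note": "call back"},
    {"id": 2, "contact": "li@example.org", "tax_id": "987-65-4321", "note": "vip"},
]
findings = discover_sensitive_columns(rows)
```

Here `findings` flags `contact` as an email column and `tax_id` as a US SSN column, which are the columns to hand to the masking step. Matching on values alone misses sensitive data in free text and in unexpected formats, which is part of why scripts like this fall short of a full discovery capability.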
The Mage test data management solution includes integrated and comprehensive discovery, static, and dynamic data masking solutions, along with a data subsetting option. Additionally, with Mage Identities you can create generalized data sets from internal or external data sources, a process that is a lot more efficient and functionally capable than synthetic data generation. To read more about the Test Data Management market and vendors, download the Bloor TDM market update.