Ultimate Guide to Test Data Management
- What Is Test Data Management (TDM)?
- Why Test Data Management Is Important
- What Is Data Provisioning in Test Data Management?
- What Are Some Features of Test Data Management Tools?
- Common Use Cases for TDM Tools
- What to Look for in a Test Data Management Tool
- Will Open-Source Tools Work for Test Data Management?
- How Mage Data Helps with Test Data Management
Test data management (TDM) is a critical process for businesses to master. When this process isn’t running optimally, companies can expect more frequent delays, higher costs, and reduced project management and planning efficiency. Getting test data management right, however, can improve productivity, empower workers, and increase the number of projects completed on time and on budget. While there is a lot to know about test data management, this guide is designed as a starting point for businesses looking to master the process.
What Is Test Data Management (TDM)?
Whether testing applications or performing analytics, businesses need data sets that accurately reflect the current state of their production data. In most cases, using actual production data is not realistic—the data sets are too large, and there are concerns about privacy and security. Thus, the test data used must either be a representative subset of production data, or else synthetic data that has been crafted to mimic production data (and hopefully preserve key relationships between data points).
TDM is the process of preparing data for testing and analysis. This includes reducing defects in the data (such as misleading edge cases), as well as processes for ensuring data is properly secured through means such as masking, scrambling, or pseudonymization. Because of its role in modern software development, it is sometimes referred to as “DevOps Test Data Management” or DevOps TDM.
TDM serves a critical function in the application development and testing lifecycle; improper testing can lead to more bugs in production, higher testing and development costs, and frequent production bottlenecks. And with the increasing adoption of Agile Development methodologies and Continuous Integration/Continuous Delivery (CI/CD) pipelines, testing is occurring sooner in development and more frequently, leading to a heightened need for faster and more agile provisioning of data.
Poor TDM also can mean that applications fail to meet regulatory compliance and privacy rules, running the risk of exposing sensitive private data.
Importantly, TDM isn’t a “one off” or static event. Rather, it is an ongoing effort to continually provide new data sets that reflect current realities while also ensuring that this data is fully compliant with data privacy laws.
Why Test Data Management Is Important
Put simply, good testing practices help organizations create better, more reliable software. But there is often a trade-off between the rigor of those tests and how quickly they can be conducted. Generating adequate test data is one of the bigger challenges that forces this trade-off. With good TDM, DevOps and analysis teams can speed up the process without sacrificing quality.
The need for testing rigor should be obvious to anyone who has ever interacted with software that has been released too early. The lack of quality control leads to bug fixes and rollbacks, as well as a poor user experience. It also can introduce compliance and security risks.
Preventing this scenario requires rigorous testing, including testing of edge cases. Thus, data sets used in testing must be thorough and adequately represent production data.
Most organizations, however, do not have processes in place for reliably creating (provisioning) and distributing test data in a way that is compliant and secure. Even when processes are in place, they might not allow for the agile provisioning of data, which creates a bottleneck that can lead to delays in putting new features or other changes into production. Likewise, if providing test data takes too long, the project management team might have to decide between waiting for appropriate testing, or releasing the software features without thorough testing.
TDM is especially important in larger enterprise-sized organizations, where data often is fragmented across various databases and possibly managed by different service lines. For example, customer transaction data (or other kinds of data on usage) might reside in one database, with billing information in another, and customer service info—such as help requests—might live in a third. Managing test data includes finding ways to speed up the consolidation process or developing a Single Source of Truth where all relevant data can be found.
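As a minimal sketch of that consolidation step, the snippet below joins three hypothetical tables (the schema, table names, and values are illustrative, not taken from any real system) into a single view keyed on a shared customer ID, using in-memory SQLite as a stand-in for separate production databases:

```python
import sqlite3

# In-memory stand-ins for three separate data stores (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (customer_id INTEGER, amount REAL);
    CREATE TABLE billing      (customer_id INTEGER, plan TEXT);
    CREATE TABLE service      (customer_id INTEGER, open_tickets INTEGER);
    INSERT INTO transactions VALUES (1, 19.99), (2, 5.00);
    INSERT INTO billing      VALUES (1, 'pro'), (2, 'basic');
    INSERT INTO service      VALUES (1, 0),     (2, 3);
""")

# Consolidate into a single "source of truth" view keyed on customer_id.
rows = conn.execute("""
    SELECT t.customer_id, t.amount, b.plan, s.open_tickets
    FROM transactions t
    JOIN billing  b ON b.customer_id = t.customer_id
    JOIN service  s ON s.customer_id = t.customer_id
    ORDER BY t.customer_id
""").fetchall()

print(rows)  # [(1, 19.99, 'pro', 0), (2, 5.0, 'basic', 3)]
```

In practice the three stores would live on different servers and the join would be materialized into a dedicated test database, but the shape of the problem is the same.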
What Is Data Provisioning in Test Data Management?
Data provisioning is the process of providing prepared test data sets to your workers. While this might sound like a minor process at first, it can become a bottleneck for DevOps teams, especially as they move testing earlier in the development process. Consequently, it’s crucial for companies to be intentional about TDM to avoid such bottlenecks and keep test data flowing smoothly.
While the approach to test data provisioning will vary by company, it’s important to have a documented procedure that answers some of the following questions:
- How should testers request new datasets?
- Is the process largely manual, and if so, are there ways in which it can be automated?
- How quickly should test data managers create new datasets upon receiving a request?
- When should developers/DevOps professionals use a self-service portal or request an entirely new dataset?
By documenting answers to questions like these, companies can create a policy that reduces confusion and increases productivity during the test data provisioning process.
What Are Some Features of Test Data Management Tools?
While different TDM tools will have different feature sets, we’ve compiled a list of some that we believe companies should focus on.
Data Discovery
Put simply, data discovery is the process of uncovering and organizing data sources so that all of an organization’s data is known and accessible. Accurate and comprehensive data discovery is a crucial first step in creating effective test data sets and ensuring user privacy is maintained.
For example: When a subset of production data is used for testing, the test data will inevitably contain sensitive or personal information. This sensitive data needs to be masked or removed prior to testing to ensure security and comply with data privacy laws. However, without the proper tools and management of the data, it is possible for sensitive data to go unnoticed—for example, a piece of sensitive information in a field where it is not expected, such as an address placed in a “comments” field.
Data discovery is the first step in automating the process of identifying, classifying, and protecting sensitive data. This helps give teams a clearer understanding of the organization’s data landscape and, with the right tools, can prevent sensitive data from leaking into test data sets.
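A simple form of this automated discovery is pattern-based scanning that checks every field, not just the ones where sensitive data is expected. The sketch below is illustrative only (the record, field names, and patterns are hypothetical, and real discovery tools use far richer classification than two regexes):

```python
import re

# Hypothetical record pulled from production; note the email leaked into "comments".
record = {
    "id": 1042,
    "status": "active",
    "comments": "Customer asked us to reply at jane.doe@example.com",
}

# Simple pattern-based discovery: scan EVERY string field for sensitive patterns.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def discover_sensitive(record):
    """Return (field, kind) pairs for any field containing a sensitive pattern."""
    hits = []
    for field, value in record.items():
        for kind, pattern in PATTERNS.items():
            if isinstance(value, str) and pattern.search(value):
                hits.append((field, kind))
    return hits

print(discover_sensitive(record))  # [('comments', 'email')]
```

Because the scan runs over all fields, the email address hiding in the free-text “comments” field is flagged even though no “email” column exists.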
Test Data Generation
There are many different approaches to generating test data sets, but a few of the most common are database cloning, subsetting, database virtualization, and synthetic data generation.
- Database cloning takes a pre-existing production database and makes a complete copy. This copy can then be used outside of production systems for testing, though it will still need to be secured via masking, encryption, tokenization, or a similar process. Cloning also becomes impractical for testing or analysis when large, complex databases are involved.
- Subsetting involves selecting a representative portion of a database for a particular test or analysis task. This can increase the speed and accuracy of these operations, but like cloning, it still carries risks related to data privacy because it is derived from production data.
- Data virtualization creates a virtual copy of a database that uses pointers to blocks of data in the original. This allows high-speed access to data without creating a new database or block of data. Good data virtualization tools also provide support for data masking for use in TDM.
- Synthetic data generation provides yet another option; synthetic data sets contain entirely fabricated data points while hopefully preserving the original data types and statistical relationships found in production data sets. With synthetic data, there literally are no real-world entities represented in the data—for example, no real users—and so no sensitive data to worry about. That said, synthetic data often omits important details or relationships needed for accurate prediction and testing.
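To illustrate what “preserving statistical relationships” can mean in the simplest case, the sketch below fits a linear trend between two fields in a tiny, entirely hypothetical production sample and then fabricates records that follow the same trend with added noise. Real synthetic data generators model far more structure than this:

```python
import random

random.seed(0)  # reproducible for the sketch

# Hypothetical production sample: (age, monthly_spend) pairs, spend loosely tied to age.
production = [(23, 40.0), (35, 75.0), (41, 90.0), (52, 120.0), (29, 55.0)]
ages, spends = zip(*production)

# Fit a simple linear trend so fabricated records preserve the age/spend relationship.
mean_age, mean_spend = sum(ages) / len(ages), sum(spends) / len(spends)
slope = (sum((a - mean_age) * (s - mean_spend) for a, s in production)
         / sum((a - mean_age) ** 2 for a in ages))
intercept = mean_spend - slope * mean_age

def synthetic_record():
    """Fabricate a record with realistic types and the observed trend, plus noise."""
    age = random.randint(min(ages), max(ages))
    spend = round(intercept + slope * age + random.gauss(0, 5.0), 2)
    return (age, spend)

sample = [synthetic_record() for _ in range(3)]
print(sample)
```

No record in `sample` corresponds to a real person, yet the fabricated values stay in a realistic range and keep the age/spend correlation that downstream tests may depend on.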
Data Security and Compliance
Regulations around how companies handle data have proliferated worldwide and continue to grow more burdensome as new laws emerge and old ones are updated. This means extra care must be taken to ensure that any data provisioned for testing or analysis is appropriately secured.
There are several ways to do this, but they all fall under two general categories: anonymization methods and pseudonymization methods. Anonymization is the permanent replacement of sensitive data with unrelated characters, making the data extremely difficult to re-identify. Anonymization methods include generalization, scrambling, masking, and shuffling/permutation.
With pseudonymization, sensitive data is replaced in such a way that certain characteristics of the data remain. For example, a name such as “Dave Jones” might be replaced by “John Smith.” The data is still recognizable as a masculine name, with four letters in the first name and five letters in the last name. The goal is for the data to retain enough of its characteristics that it is still useful for testing and analysis purposes. However, the data can be re-identified with the help of an identifier (additional information). Pseudonymization merely reduces the linkability of a dataset with the original identity of an individual.
Organizations can select one or more techniques for securing their data, depending on the degree of risk and the intended use of that data. Best-in-class TDM tools offer an array of techniques so that there are options for every use case.
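To make the distinction between the two categories concrete, here is a minimal Python sketch. The function names, the masking symbol, and the replacement list are all illustrative, not any particular product’s API:

```python
# Anonymization: irreversible masking; the original value cannot be recovered.
def anonymize(value):
    """Permanently replace every character with a fixed symbol."""
    return "*" * len(value)

# Pseudonymization: consistent replacement via a lookup table. The table is the
# "additional information" that allows re-identification by whoever holds it.
_pseudonym_table = {}
_REPLACEMENTS = ("John Smith", "Mary Jones", "Alex Brown")  # illustrative stand-ins

def pseudonymize(name):
    """Replace a real name with a realistic-looking stand-in, consistently."""
    if name not in _pseudonym_table:
        _pseudonym_table[name] = _REPLACEMENTS[len(_pseudonym_table) % len(_REPLACEMENTS)]
    return _pseudonym_table[name]

print(anonymize("Dave Jones"))     # '**********'
print(pseudonymize("Dave Jones"))  # 'John Smith' (same input always maps the same way)
```

The anonymized value is useless for realistic testing but carries no re-identification risk; the pseudonymized value stays test-friendly, at the cost of a mapping table that must itself be protected.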
On-Demand Data Delivery
One of the biggest bottlenecks in the testing process is organizational—who is tasked with provisioning and vetting test data? In most organizations, this task lies with a single person or small team that also has a number of other responsibilities.
Thus, one of the best ways to incorporate testing into a CI/CD pipeline—and to further empower your developers and testers—is to provide a self-service portal. The idea is to give analysts and testers the ability to provision their own data when needed for testing. This can be done by giving them access to pre-generated data sets, or by allowing them to generate data sets that meet their specific needs directly. By providing self-service access to test data, DevOps and analysis teams can avoid delays caused by waiting for test data, ensuring that projects are completed on time and within budget.
Common Use Cases for TDM Tools
Modern software is increasingly being developed and delivered using DevOps methodologies and CI/CD workflows. Critical to both is the idea that testing should “shift left,” occurring earlier in the development cycle. For this to happen, TDM tools need to enable developers to request or provision test data easily.
Thus, TDM tools are best suited for organizations where:
- Testing needs to be done more quickly and efficiently (without dramatically increasing costs),
- There is a desire for a “self service” option for data provisioning (allowing DevOps professionals to perform testing on their own machines),
- Data is required for negative path testing, as well as null, erroneous, and boundary conditions,
- More automation of the testing process is desired, or needed for the organization to scale, and/or
- Data security and compliance are an issue.
These use cases are neither mutually exclusive nor exhaustive. The more that apply to your organization, the more it can benefit from a robust TDM tool.
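The negative-path use case above is easy to illustrate. Given a constrained field, a TDM workflow can systematically generate the null, erroneous, and boundary values that manual test data often misses. The field spec and function below are a hypothetical sketch, not a real tool’s interface:

```python
# Hypothetical field spec: an integer "quantity" constrained to the range 1..100.
FIELD = {"name": "quantity", "min": 1, "max": 100}

def boundary_cases(field):
    """Generate null, erroneous, and boundary values for negative-path testing."""
    lo, hi = field["min"], field["max"]
    return {
        "null": None,                # missing value
        "below_min": lo - 1,         # just outside the valid range
        "at_min": lo,                # boundary (valid)
        "at_max": hi,                # boundary (valid)
        "above_max": hi + 1,         # just outside the valid range
        "wrong_type": "one hundred", # erroneous type
    }

cases = boundary_cases(FIELD)
print(cases)
```

Feeding each of these values through the application under test exercises the validation paths that representative “happy path” production data never touches.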
What to Look for in a Test Data Management Tool
Different companies will have different needs and thus different requirements for a TDM tool. However, we do feel there are some common features that most businesses can benefit from:
- Data security. TDM tools should have robust security features built in, helping to protect databases (or clones) from potential threats and to ensure compliance with security standards. For example, sensitive data should be masked before being viewed or used.
- Flexibility. Businesses need flexibility in their TDM approach, as processes and data layers can vary. While some solutions require multiple tools, a few (like Mage Data) enable comprehensive management with a single solution.
- Performance. The enterprise data environment is continually growing more complex. Companies today may manage billions of individual data points in a mix of on-premises and cloud databases, supporting a worldwide team of developers and analysts. TDM tools need to be capable of supporting this kind of complexity.
- Connectivity. Larger companies often need TDM tools that can integrate with third-party solutions and legacy applications and databases. Good TDM tools embrace APIs in order to allow for such connectivity. They also feature integration with CI/CD tools to help automate the testing process.
Will Open-Source Tools Work for Test Data Management?
There are a variety of open-source tools that companies can use for TDM. Such tools can be tempting precisely because they are free and readily available. But there are well-known downsides to using open-source tools for TDM.
One of the main drawbacks is the lack of dedicated support. While these tools often are developed by passionate developers, those developers are under no obligation to provide assistance. If you encounter an issue with an open-source tool, your only recourse may be to seek help on open forums, with no guarantee of a timely or effective solution.
Some companies develop open-source tools in-house and offer support as a paid add-on, but these plans can be costly, essentially requiring you to pay for development through support fees. Whether to forgo support or pay for these add-ons depends on your specific needs and risk tolerance.
How Mage Data Helps with Test Data Management
Managing test data is no longer a trivial task for DevOps teams, particularly in large organizations. Poorly managed test data comes with substantial risks, including delays in testing, higher testing costs, and compliance issues. Having the right TDM tool helps to mitigate those risks.
Mage Data’s Test Data Management solution enables businesses to automate and streamline the TDM process, unlocking powerful benefits and significant productivity gains. To learn more about the tool and its features, visit our Test Data Management Solutions Page, or if you prefer, schedule a demo with our experts.