December 13, 2022
What’s the Best Method for Generating Test Data?
Your data holds hidden advantages for your business. You can unlock them through analysis, and they can lead to cost savings, increased sales, a better understanding of your customers and their needs, and myriad other benefits.
Unfortunately, bad test data can lead companies astray. For example, IBM estimates that problems resolved during the scoping phase are 15 times less costly to fix than those that make it to production. Getting your test data right is essential to keeping costs low and avoiding unforced errors. Here’s what you need to know about creating test data to ensure your business is on the right path.
What Makes a Test Data Generation Method Good?
All data-driven business decisions require good analysis to be effective, but good analysis of bad data still produces bad results. The best test data generation method, then, is the one that consistently and efficiently produces good data for your business context. To find it, companies should weigh each method’s safety, compliance, speed, accuracy, and representation.
Safety
Companies often hold more personal data than many customers realize, and keeping that data safe is an important moral duty. Test data generation methods are rarely neutral on this front: each one either leaves personal data exposed or actively protects it.
Compliance
Each year, governments pass new data protection laws. If the moral duty to keep data secure weren’t incentive enough, fines, lawsuits, and, in some countries, prison time await companies that fail to protect user data and comply with all relevant legislation.
Speed
If you or your analysts are waiting on test data to generate, you’re losing time that could be spent on the analysis itself. Slow data generation can also breed a general unwillingness to work with the most recent or most representative historical data, which lowers the quality and potential of your analysis.
Accuracy and representation
While one might expect that all test data generation methods would result in accurate and representative data, that’s not the case. Methods vary in accuracy, and some can ultimately produce data that bears little resemblance to the truth. In those situations, your analysis can be done faithfully, but the underlying errors in your data can lead you astray.
Test Data Generation Methods
By comparing different methods of test data generation through the lens of these four categories, we can get a feel for the scenarios in which each technique succeeds or struggles and determine which approaches are best for most companies.
Database cloning
The oldest method of generating test data on our list is database cloning. The name pretty much gives away how it works: You take a database and copy it. Once you’ve made a copy, you run your analysis on the copy, knowing that any changes you make won’t affect the original data.
Unfortunately, this method has several shortcomings. For one, it does nothing to secure personal data in the original database. Running analysis can create risks for your users and sometimes get your company into legal trouble.
It also tends to suffer from speed issues. The bigger your database, the longer it takes to create a copy. Plus, edge cases may be under- or over-represented or even absent from your data, obscuring your results. While this was once the way companies generated test data, given its shortcomings, it’s a good thing that there are better alternatives.
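At its core, cloning is just a full copy of the database. As a minimal sketch of the idea, here is SQLite’s backup API copying every page into a second database; the table, columns, and values are hypothetical stand-ins for a production database, and the cost of the copy grows with database size:

```python
import sqlite3

# In-memory stand-in for a production database.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE users (id INTEGER, email TEXT)")
src.execute("INSERT INTO users VALUES (1, 'jane@example.com')")
src.commit()

# The clone: a separate database you can safely run tests against.
dst = sqlite3.connect(":memory:")
src.backup(dst)  # copies every page of the source database

# The copy holds the same data, personal details included: nothing
# about cloning protects the users in the original.
rows = dst.execute("SELECT * FROM users").fetchall()
```

Note that the clone is byte-for-byte faithful, which is exactly the problem: every piece of personal data in production is now also in your test environment.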
Database virtualization
Database virtualization isn’t a technique solely for creating test data, but it makes the process far easier than database cloning alone. Virtualized databases are unshackled from their physical hardware, making working with the underlying data extremely fast. Unfortunately, aside from that speed, virtualization has all the same shortcomings as cloning: It does nothing on its own to secure user data, and your tests can only run on the data you have, whether it’s representative or not.
Data subsetting
Data subsetting fixes some of the issues found in the previous approaches by taking a limited sample, or “subset,” of the original database. Because you’re working with a smaller sample, it tends to be faster, and sometimes using a selection instead of the full dataset can help reduce errors related to edge cases. Still, with this method you’re trading representativeness for speed, and nothing is being done to protect personal data, which is just asking for trouble.
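A minimal sketch of the subsetting idea, assuming a hypothetical orders table held as a list of Python dicts: a random 1% sample is much faster to work with, but any pattern rarer than roughly one row in a hundred may not appear in it at all.

```python
import random

random.seed(42)  # reproducible run

# Hypothetical stand-in for a production table of 10,000 orders.
orders = [
    {"id": i, "amount": round(random.uniform(5, 500), 2)}
    for i in range(10_000)
]

# A 1% random subset: quick to test against, but edge cases
# rarer than ~1 in 100 rows can vanish from it entirely.
subset = random.sample(orders, k=len(orders) // 100)
```

Real subsetting tools also have to preserve referential integrity across related tables, which a naive row sample like this one ignores.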
Anonymization
Anonymization fixes the privacy problem that pervades the previous approaches. And while it’s not a test data generation method on its own, it pairs nicely with the others. When data is anonymized, individual data points are replaced to protect anything that could identify the user who originated the data. This makes the data safer to use, especially if you’re sending it outside the company or the country for analysis.
Unfortunately, anonymization has a fatal flaw: The more anonymized the dataset is, the weaker the connection between data points. Too much anonymization will create a dataset that is useless for analysis. Of course, you could opt for less anonymization within a dataset, but then you risk reidentification if the data ever gets out. What’s a company to do?
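As a minimal sketch of point-by-point replacement (the field names and salt below are hypothetical), a salted one-way hash swaps each identifier for a stable token: joins across tables still line up because the same input always yields the same token, but the original value can’t be read back. The trade-off from above applies: every additional field you treat this way removes more analytical signal.

```python
import hashlib

def pseudonymize(value: str, salt: str) -> str:
    # One-way salted hash: the same input always maps to the same
    # token (so joins still work), but the value is not recoverable.
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

# Hypothetical record with one identifying field.
record = {"email": "jane@example.com", "amount": 42.50}
safe = {**record, "email": pseudonymize(record["email"], salt="s3cret")}
```

Note that a short salt like this is only illustrative; real pseudonymization schemes manage salts and keys carefully, since a guessable salt makes re-identification by brute force feasible.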
Synthetic data
Synthetic data is a surprisingly good solution to most of the issues with the other test data approaches. Like anonymization, it replaces data to secure the underlying personally identifiable information. But instead of working point by point, it works holistically, preserving the relationships between data points while changing the data itself in a way that can’t be reversed.
That approach confers several advantages. User privacy is protected. Synthetic datasets can be far smaller than the originals they were generated from while still representing the whole, which brings speed advantages. And the method works well even when there isn’t much data to start with, helping companies run analysis at earlier stages in the process.
Of course, it’s far more complex than other methods and can be challenging to implement. The good news is that companies don’t have to implement synthetic data on their own.
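As a toy illustration of the core idea (not any vendor’s actual method): fit a simple distribution to a hypothetical numeric column, then sample fresh values from the fit. Real synthetic data generators model joint distributions across many columns, but the principle is the same; the synthetic points preserve the statistics without copying any real record.

```python
import random
import statistics

random.seed(0)  # reproducible run

# Hypothetical real column we want to mimic without exposing real values.
real_amounts = [random.gauss(100, 15) for _ in range(1_000)]

# "Fit" the simplest possible model: the column's mean and spread.
mu = statistics.mean(real_amounts)
sigma = statistics.stdev(real_amounts)

# Draw brand-new values from the fitted distribution. Aggregate
# statistics carry over, but no synthetic point is a real record.
synthetic = [random.gauss(mu, sigma) for _ in range(1_000)]
```

A single-column Gaussian fit like this throws away the cross-column relationships that make synthetic data valuable in practice, which hints at why production-grade generation is so much harder to build.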
Which Test Data Generation Method Is Best?
The best method for a company will vary based on its needs, but given the relative performance of each approach, most companies will benefit from using synthetic data as their primary test data generation method. Mage’s approach to synthetic data can be implemented with or without an agent, meeting your data where it lives instead of shoehorning in a solution that slows everything down. And while it maintains the statistical relationships you need for meaningful analysis, you can also add noise to its synthetic datasets, letting you discover new edge cases and opportunities, even ones that have never appeared in your data before.
But that’s not all Mage can do. With both static and dynamic masking, it can protect user data in production and at rest. Plus, its role-based access controls and automation make tracking use simple. Mage is more than just a tool for solving your test data generation problems: it’s a one-platform approach to all your data privacy and security needs. Contact us today to see what Mage can do for your organization.