Mage Data

Category: Blogs – Test Data Protection and Delivery

  • How to Create a Secure Test Data Management Strategy

    How to Create a Secure Test Data Management Strategy

    Proper Test Data Management helps businesses create better products that perform more reliably on deployment. But creating test data, in the right amount and with the right kinds of relationships, can be a much more challenging process than one would think. Getting the most out of test data requires more than simply having a tool for generating or subsetting data; it requires having a clear Test Data Management strategy.

    Test Data Management might not be the first area that comes to mind when thinking about corporate strategy. But testing generally holds just as much potential as any other area to damage your business if handled incorrectly—or to propel you to further success if handled well.

    An upshot of having such a strategy is that it can help you find the best Test Data Management tools as well. After all, if the creator of a tool understands what is involved in a Test Data Management strategy, you can rest assured their tools will actually be designed to make those strategic goals a reality.

    Here, then, are the elements for a successful and secure Test Data Management strategy.

    The Core Elements of Test Data Management Strategies

    Creating a secure Test Data Management strategy starts with having a plan that makes your goals explicit, as well as the steps for getting there. After all, it doesn’t matter how secure your strategy is, if you don’t achieve the outcomes you’re looking for. All effective Test Data Management strategies rely on the following four pillars.

    Understanding Your Data

    First, it’s essential that you understand your data. Good testing data is typically composed of data points of radically different types sourced from different databases. Understanding what that data is and where it comes from is necessary to determine if it will produce a test result that reflects what your live service offers. Companies must also consider the specific test they’re running and alter the data they choose to produce the most accurate results possible.

    De-Identifying Data

    Second, producing realistic test results requires using realistic data. However, companies that are cavalier with their use of customer data in their tests put themselves at greater risk of leaks and breaches and may also run afoul of data privacy laws.

    There are many different methods for de-identifying data. Masking permanently replaces live data with dummy data with a similar structure. Tokenization replaces live data with values of a similar structure that appear real and can be reversed later. Encryption uses an algorithm to scramble information so it can’t be read without a decryption key. Whichever approach you use, ensure your personally identifiable information is protected and used per your privacy policy.
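
    To make the distinction concrete, here is a minimal Python sketch (illustrative only, not Mage Data’s implementation) that applies each of the three approaches to a single field; the field value, token format, and vault are hypothetical.

```python
# Illustrative de-identification sketch (not Mage Data's implementation).
# Requires the open-source "cryptography" package for the encryption step.
import hashlib
from cryptography.fernet import Fernet

original = "jane.doe@example.com"

# 1. Masking: permanently replace the value with structurally similar dummy data.
masked = "user_0001@example.com"

# 2. Tokenization: substitute a look-alike token and keep a tightly controlled
#    vault that maps tokens back to originals for authorized reversal.
token = hashlib.sha256(original.encode()).hexdigest()[:16] + "@example.com"
token_vault = {token: original}  # stored separately, under strict access control

# 3. Encryption: scramble the value; only holders of the key can read it.
key = Fernet.generate_key()
ciphertext = Fernet(key).encrypt(original.encode())

print(masked, token, ciphertext[:16], sep="\n")
```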

    Aggregating and Subsetting Data

    Third, your company may hold hundreds of millions, or even billions, of data points. Trying to use all of them for testing would be extremely inefficient. Subsetting, or creating a sample of your data that reflects the whole, is one proven method for efficient testing. Generally, data must also be aggregated from multiple different sources to provide all the types of data that your tests require.
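
    As a simple illustration of subsetting while keeping relationships intact, the sketch below samples a parent table and keeps only the child rows that reference the sampled keys. It assumes hypothetical customers/orders CSV files and uses pandas; it is not a complete subsetting pipeline.

```python
# Minimal subsetting sketch (table and column names are hypothetical).
import pandas as pd

customers = pd.read_csv("customers.csv")  # parent table
orders = pd.read_csv("orders.csv")        # child table with a customer_id column

# Take a 5% sample of customers, then keep only orders that belong to them
# so referential integrity is preserved in the test dataset.
customer_subset = customers.sample(frac=0.05, random_state=42)
order_subset = orders[orders["customer_id"].isin(customer_subset["customer_id"])]

customer_subset.to_csv("test_customers.csv", index=False)
order_subset.to_csv("test_orders.csv", index=False)
```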

    Refreshing and Automating Test Data in Real Time

    Finally, your company is not static. It changes and grows, and as it does, the data you hold can shift dramatically. If your test data is static, it will quickly become a poor representation of your company’s live environment and cause tests to miss critical errors. Consequently, test data must be regularly refreshed to ensure it reflects your company in the present moment. The best way to accomplish that task is to leverage automation to refresh your data regularly.
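
    A minimal sketch of that kind of automation is shown below, using the open-source schedule package to re-run a hypothetical refresh job nightly; in practice this is usually handled by a CI/CD pipeline or an orchestration tool.

```python
# Lightweight nightly refresh sketch. refresh_test_data() is a placeholder
# for whatever subsetting/masking steps your pipeline performs.
import time
import schedule  # pip install schedule

def refresh_test_data():
    # 1. Pull a fresh subset from production replicas
    # 2. Re-apply masking or synthetic generation
    # 3. Publish the refreshed dataset to the test environment
    print("test data refreshed")

schedule.every().day.at("02:00").do(refresh_test_data)

while True:
    schedule.run_pending()
    time.sleep(60)
```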

    What Makes a Test Data Management Strategy Secure?

    The reality of using test data is that, if improperly handled, it multiplies the security issues your company already has. For example, if you take one insecure dataset and create five testing datasets, you end up with at least six times as much risk.

    When your data isn’t secure to begin with, securing your test data won’t make a meaningful impact on your overall security posture. At the same time, creating test data comes with its own risks. Data will be stored in new locations, accessed by more people than usual, and used in ways that it might not be during the normal course of business. That means you need to pay special attention to your test data to keep it secure.

    The following framework provides a way to think through the new risks that test data creates.

    Who?

    First is the who. In addition to the people assembling the test datasets, other people (such as back- and front-end developers, or data analysts) will come in contact with the test data. While it’s tempting to provide all of them with the same data, the reality is that the data they need to do their job will vary from role to role. Your experienced lead developer will need a higher level of insight for troubleshooting than a junior developer on their first day on the job. To maximize your security around this data, you need a tool that can help you make these kinds of nuanced decisions about access.
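
    One simple way to picture such nuanced access is a per-role column whitelist, as in the hypothetical sketch below (role names and columns are invented for illustration, not a specific product’s configuration).

```python
# Illustrative role-based provisioning: each role sees only the columns it needs.
ROLE_COLUMNS = {
    "lead_developer": ["order_id", "customer_id", "amount", "error_trace"],
    "junior_developer": ["order_id", "amount"],
    "data_analyst": ["order_id", "amount", "region"],
}

def provision_view(rows, role):
    """Return only the columns a given role is allowed to see."""
    allowed = ROLE_COLUMNS.get(role, [])
    return [{col: row[col] for col in allowed if col in row} for row in rows]

test_rows = [{"order_id": 1, "customer_id": 42, "amount": 19.99,
              "region": "EU", "error_trace": "NullPointerException"}]
print(provision_view(test_rows, "junior_developer"))
# -> [{'order_id': 1, 'amount': 19.99}]
```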

    What?

    Knowing what data you’re using matters. With an ever-growing number of data privacy laws around the world, businesses must be able to detail how they’re using data in their operations. Using data that’s not covered by your privacy policy or using data in a manner that isn’t covered could result in serious regulatory action, possibly in multiple countries at once. Companies increasingly need to be able to prove they’re in compliance, which is most easily accomplished with robust audit logging.
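
    A bare-bones audit log can be as simple as an append-only record of who touched which dataset and why, as in this illustrative sketch (the fields shown are assumptions, not any particular regulation’s required schema).

```python
# Minimal audit-logging sketch: record who accessed which dataset and why.
import json
from datetime import datetime, timezone

def log_data_access(user, dataset, purpose, log_path="data_access_audit.log"):
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "dataset": dataset,
        "purpose": purpose,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")

log_data_access("j.smith", "orders_test_subset", "regression test for billing")
```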

    Where?

    An increasing number of countries are penalizing companies for offshoring data, especially if it isn’t declared in the enterprise’s privacy policy. Even with that in mind, running your data analysis in other countries may still make financial sense. In that situation, companies should evaluate whether masked or entirely synthetic datasets would suffice to reduce the risk of regulatory action or leaks that come with moving data across borders.

    How?

    The growing complexity of securing your Test Data Management process means that it’s no longer possible for humans to oversee every part of the process. A good policy starts with your human workers setting the rules, but then a technological solution is needed to handle the process at the scale required for modern business applications.

    Overall, Test Data Management strategies will vary from company to company. However, by following the principles in this article, companies can develop an approach that meets their testing needs while ensuring that data is kept secure.

    How Mage Data Can Help with Test Data Management

    While it would be dramatic to suggest that a poor Test Data Management strategy could doom a business, it’s not an exaggeration to suggest that a poor strategy drives up costs in a measurable way. Poor testing can easily lead to a buggier product that takes more time and costs more money to fix. And a worse product could lose customers, even as the expanded fixes hurt your bottom line. The good news is that companies don’t have to develop their Test Data Management strategy on their own. Mage Data’s suite of Test Data Management tools provides everything businesses need to build their test data pipeline while having the customization they need to make it their own. Schedule a demo today to see Mage Data in action.

  • Best Practices for Test Data Management in Agile

    Best Practices for Test Data Management in Agile

    Agile is a growing part of nearly every business’ software development process. Agile can better align teams with the most pressing customer issues, speed up development, and cut costs. However, like just-in-time manufacturing, Agile’s unique approach to development means that a delay in any part of the process can lead to a screeching halt across all of it. Testing software solutions, as the last step before deployment, is critical to ensuring that companies ship working software, as well as catching and resolving edge cases and bugs before the code goes live (and becomes far more expensive to fix). If Test Data Management is handled poorly in an Agile environment, the entire process is at risk of breaking down.

    Why Test Data Management is a Bigger Challenge in Agile

    As companies produce and consume more and more data, managing your test data is an increasing challenge. The key to success in leveraging test data is the realization that, the more your test data represents your live data, the better it will be at helping you uncover bugs and edge cases before deployment. While using your live production data in your tests would resolve this issue, that approach has serious data privacy and security concerns (and may not be legal in some jurisdictions). At the same time, the larger your dataset, the slower your tests.

    In a traditional waterfall approach to development, a “subset, mask, and copy” approach generally ensures that data is representative of your live data, small enough for efficient testing, and meets all data privacy requirements. With the testing cycle lasting weeks or months and known well in advance, it’s relatively easy to schedule data refreshes as needed to keep test data fresh.

    Agile sprints tend to be much shorter than the traditional waterfall process, so the prep time for test data is dramatically shortened. A traditional subset, mask, copy approach could severely impede operations by forcing a team to wait on test data to start development. Even worse, it could create a backlog of completed but untested features waiting for deployment, which would require companies to keep teams from starting new stories or pull people off a project to fix bugs after testing is completed. Both hurt efficiency and prevent companies from fully implementing an Agile development process.

    Best Practices for Effective Test Data Management in Agile

    Unfortunately, there are no shortcuts to Test Data Management in an Agile system. You have to do everything you would have done in a traditional approach, but significantly speed up the process to ensure it’s never the bottleneck. Implementing this system can require a change in institutional thinking. Success in this area means finding new ways to integrate your testers and data managers into the development process and providing them with the tools they need to succeed in an Agile environment.

    1. Integrate Data Managers into the Planning Process

    No matter how efficient your test data managers are, creating the right dataset for testing for a particular customer story takes at least some time. Waiting until after the planning phase is over to inform your data team of the needed data will lead to delays just from the time needed to create a dataset. However, if more esoteric data is required, the delay could be much longer than typical. By integrating your data team into the planning phase, they can leverage their expertise to help identify potential areas of concern before the start of the development phase. They can also begin working on the test datasets before the start of development, potentially providing everything needed for development and testing on the first day of development.

    2. Adopt Continuous Data Refreshing

    At most companies, data managers support multiple teams. With different customer stories requiring different amounts of time to complete, the data team must be flexible and efficient to meet sometimes unpredictable deadlines. However, that doesn’t excuse them from ensuring that data is up to date, that it’s free of personally identifiable information, or that it’s subset correctly for the test.

    The good news is that significant portions of this process can be automated with the right tools. Modern tools can identify PII in a dataset, enabling rapid, automated transformation of an insecure database into a secure one for testing. Plus, synthetic generation tools can help companies rapidly create great datasets for testing that include no reversible PII while maintaining important referential integrity. With these processes in place, testing teams will be better equipped to handle the pace of Agile while also spending more time on high-value planning operations rather than low-level data manipulation.
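
    As a rough illustration of automated PII discovery, the sketch below flags common identifier patterns with regular expressions; production tools like those described above rely on trained models and NLP rather than simple patterns.

```python
# Simplified PII scan (illustrative only).
import re

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scan_for_pii(text):
    """Return the PII categories detected in a free-text field."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]

print(scan_for_pii("Contact jane.doe@example.com or 555-867-5309"))
# -> ['email', 'phone']
```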

    3. Create a Self-Service Portal

    One thing that’s guaranteed to slow Agile teams down is a formal request process to access test data. While tracking who is accessing what data is important, access requests and tracking can largely be automated with today’s tools. This idea can be taken one step further by creating a self-service portal that includes basic datasets for common development scenarios. A self-service portal ensures that smaller teams or side projects can run meaningful tests without tying up your data manager’s resources. Just like with your primary testing datasets, these must be kept reasonably up to date, but automation can significantly help reduce this burden.

    How Mage Data Helps with Agile Test Data Management

    Agile is a process that can greatly speed up development and transform the delivery of new features to your customers. However, teams need more training and tooling to execute it effectively. Not all Test Data Management solutions are up to handling an Agile approach to development. Mage Data’s Test Data Management solution is: it provides just about everything a company could want right out of the box, along with flexible customization options that enable companies to build the test data pipeline that works best for their needs. Contact Mage Data today for a free demo and learn more about how Mage Data can help streamline your Agile Test Data Management process.

  • What Are the Best Test Data Management Tools?

    What Are the Best Test Data Management Tools?

    Evaluating Test Data Management tools can be a challenge, especially when you may not be able to get your hands on a product before making a purchase. The good news is that prioritizing the right approach and features can help businesses maximize their ROI with Test Data Management tools. Whether you’re just starting your evaluation process or need to prepare for an imminent purchase, we’ve compiled the information you need to make the best choice for your business.

    What Are the Core Elements of Successful Test Data Management?

    Before examining the best Test Data Management tools in detail, we have to consider what outcomes organizations want to drive with these tools. An effective Test Data Management program helps companies identify bugs and other flaws as early in the development process as possible to allow for quick remediation while the mistakes are small-scale and inexpensive to rectify.

    Testing of applications is almost guaranteed to fail if you manage your test data in a way that makes it not representative of the data your live applications will use. As a result, your testing will fail to uncover critical flaws, and then your company will face an ever-escalating series of expensive repairs.

    Each new application or part of an application being tested will be slightly different. As a consequence, Test Data Management must be a dynamic process. Testers need to alter datasets from test to test to ensure that each provides comprehensive testing for the feature or tool in play. They will also need to be able to customize the dataset to the specific test being performed. The way the test data is stored in a database can be different from its form when used in a front-end application, so this customization step ensures there won’t be errors related to data incompatibility.

    Testers also clean data before (or after) formatting it for the test. Data cleaning is the process of modifying data to remove incomplete data points and ensuring that data points that would skew the test results are removed or mitigated. Testers will also need to identify sensitive data and ensure the proper steps are taken to protect personal information during testing. Finally, most testers will take steps to automate and scale the process. Many tests are run repeatedly, so having pre-prepared datasets on hand can greatly speed up the process.
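
    A minimal cleaning pass might look like the following pandas sketch (file and column names are hypothetical): drop incomplete rows, trim extreme outliers, and remove duplicates before the data reaches a test.

```python
# Basic data-cleaning sketch with pandas.
import pandas as pd

df = pd.read_csv("raw_test_data.csv")

df = df.dropna(subset=["customer_id", "amount"])      # remove incomplete rows
upper = df["amount"].quantile(0.99)
df = df[df["amount"] <= upper]                        # trim extreme outliers
df = df.drop_duplicates(subset="transaction_id")      # remove duplicate records

df.to_csv("clean_test_data.csv", index=False)
```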

    What Features Do the Best Test Data Management Tools Have?

    Given the complexity of the tasks that testers need to perform for comprehensive and successful testing, getting the right Test Data Management tool is critically important. But that’s easier said than done. The tech world is constantly evolving, and while Test Data Management may seem like a mundane task, it needs to evolve to ensure that it continues to serve your testing teams’ needs. The best Test Data Management tools provide some or all of the following capabilities to ensure that teams are well-equipped for their testing projects.

    Connection to Production Data

    While you could, in theory, create a phony dataset for your testing, the reality is that the data in your production environment will always be the most representative of the challenges your live applications will face. The best Test Data Management tools make it easy for organizations to connect to their live databases and gather the information needed for their tests.

    Efficient Subsetting

    As we covered before, matching your data to the test is critical for success. Subsetting is the process of creating a dataset that matches the test parameters. Generally, this dataset is significantly smaller than your databases as a whole. As a result, testers need an efficient subsetting process that is fast, repeatable, and scalable, so they can spend more time running tests and less preparing.

    Personally Identifiable Information Detection

    An easy way to get your company into trouble with the growing number of data privacy laws online is to use data with Personally Identifiable Information (PII) in it without declaring the use explicitly in your terms of service and getting consent from your users. Consequently, using PII by accident in your testing could result in regulatory action. Test Data Management tools need to help your team avoid this all-too-common scenario by identifying PII, enabling your team to properly anonymize the dataset before it’s used.

    Synthetic Data Generation

    A synthetic dataset is one of the most effective ways to sidestep the PII problem. Synthetic data resembles your source information, but unlike anonymized datasets, it holds little to no chance of being reversed. Because it doesn’t contain PII, it’s not subject to data privacy laws. One risk of synthetic data is that its creation may lead to data points with distorted relationships, compromising testing or analysis. However, Mage Data’s synthetic data solution uses an advanced method that preserves the underlying statistical relationships, even as the data points are recreated. This approach ensures the dataset is representative of your data while guaranteeing that no personal information is put at risk.
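
    To illustrate the general idea of preserving statistical relationships (a toy example, not Mage Data’s method), the sketch below fits the mean and covariance of a few hypothetical numeric columns and samples entirely new rows with a similar correlation structure.

```python
# Toy synthetic-data sketch: sample new rows that preserve the correlations
# of the real numeric columns without copying any real record.
import numpy as np
import pandas as pd

real = pd.read_csv("customers.csv")[["age", "income", "monthly_spend"]].dropna()

mean = real.mean().to_numpy()
cov = real.cov().to_numpy()

rng = np.random.default_rng(seed=7)
synthetic = pd.DataFrame(
    rng.multivariate_normal(mean, cov, size=len(real)),
    columns=real.columns,
)

print(real.corr().round(2))       # correlations in the real data
print(synthetic.corr().round(2))  # closely matched in the synthetic data
```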

    How Do Current Test Data Management Tools Compare?

    Now that we’ve looked at what Test Data Management programs and tools need to do for success, let’s examine how Mage Data fits into the overall marketplace.
    Here we compare Mage Data’s Test Data Management capabilities against:
    • Informatica
    • Delphix
    • Oracle Enterprise Manager
    We consider not only raw capabilities, but other factors that actual users have reported as important.

    Informatica vs. Mage Data

    Informatica is a well-known Test Data Management solution that offers the core features needed for successful Test Data Management, including automated personal data discovery, data subsetting, and data masking. However, users have raised issues with the platform. Common complaints are that the tool doesn’t do anything the competition doesn’t, has a steep learning curve, and is more expensive than competing tools. According to Gartner’s verified reviews, users rated Mage Data higher in every capability category, including dynamic and static data masking, subsetting, sensitive data discovery, and performance and scalability.

    Delphix vs. Mage Data

    Unlike Informatica, Delphix generally has kept up with modern developments in Test Data Management. It receives high ratings across the board for its functionality, and if that were the entire story, it would be hard to pick an alternative. But it doesn’t have an amazing user interface, which makes daily operation uncomfortable and makes setup more challenging than it needs to be. It also isn’t as interoperable as other systems, with some data sources unable to connect, limiting the tool’s usefulness. By contrast, Mage Data embraces APIs for connecting to data sources and provides an API setup that allows companies to integrate third-party or legacy solutions for Test Data Management, ensuring that companies aren’t locked out of the functionality they need.

    Oracle Enterprise Manager vs. Mage Data

    Oracle has long been one of the kings of database tools, so it’s unsurprising that Oracle Enterprise Manager’s Test Data Management solution is generally well-regarded. It’s especially praised for both its power and its user-friendliness, which is to be expected from a company of this size. What may surprise you is that based on customer reviews, Mage Data outperforms Oracle Enterprise Manager in 6/6 capability categories, 2/2 for Evaluation & Contracting, 4/4 for Integration & Deployment, and 2/3 for Service & Support. Given Oracle Enterprise Manager’s generally high price, and Mage Data’s better performance across almost all categories, Mage Data clearly comes out on top.

    Why Choose Mage Data for Test Data Management

    For success in Test Data Management, you need a solution that provides comprehensive coverage for each use case you may face. Mage Data knows this, and that’s why its Test Data Management solution provides nearly everything you could need right out of the box, including database cloning and virtualization, data discovery and masking, data subsetting, and synthetic data generation. However, we know that not all use cases will be perfectly met with an off-the-shelf tool, so our system also allows for flexible customization for niche business needs that require a special touch.

    How Mage Data Helps with Test Data Management

    Mage Data is a best-in-class tool for Test Data Management, but that doesn’t mean the benefits stop there. Mage Data provides a suite of data protection tools, enabling businesses to protect every part of their stack from a single dashboard. And, unlike other tools, we’re happy to give you a peek under the hood before you sign on the dotted line. Schedule a demo today to learn more about what Mage Data can do for your business.

  • Four Best Practices for Protecting Private Data

    Four Best Practices for Protecting Private Data

    If you’re approaching data security for the first time, or just need to revisit your approach to protecting private data, it can be hard to get started. Dealing with (sometimes very different) data privacy laws, ensuring that your company follows procedure, and tracking down the gaps in your security can all be challenging. Here are a few concrete things you can do to turbocharge your approach to protecting private data and ensure that you’re taking care of your customers, too.

    Best Practice #1: Create a Privacy Policy

    It may seem a little strange to start with the creation of your privacy policy when protecting private data. However, there are two powerful reasons that you should make this one of your priorities. The first is that your privacy policy is a required legal document in a growing number of countries around the world. Not having one could subject you to heavy fines. Second, creating a privacy policy forces your company to document the different ways you use customer data and think critically about those uses. For example, while compiling the ways in which your company uses data, you might discover that there are processes using data that are no longer necessary. By dropping these, you can free up resources.

    It can also help uncover “shadow IT,” or processes put in place by your employees without the official sanction of the company. These processes can expose you to liability, even if you don’t intend for them to be happening. That’s not to say that your privacy policy should be set in stone. Instead, it should be a living document, able to evolve as your business needs change. At the same time, you should ensure your employees understand that they cannot change how they handle customer data without the express approval of the company. There should be a documented and clear process for requests for updates to the privacy policy to ensure that your company can remain flexible while still meeting its regulatory requirements.

    Best Practice #2: Encrypt Your Data

    One of the most important things you can do to protect private data is to encrypt it. Encryption takes useful data and turns it into scrambled, unreadable data (ciphertext). The data can only be turned back into its useful form through the use of the decryption key. Without that key, it would take as long as 13,689 trillion trillion trillion trillion years to crack the encryption if you had access to all the computing power on Earth. By that point, the data would likely no longer have any use.
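
    For example, encrypting a record at rest with a symmetric key takes only a few lines using the open-source cryptography package (a minimal sketch, not a full key-management setup):

```python
# Encrypting a record at rest with symmetric encryption.
from cryptography.fernet import Fernet  # pip install cryptography

key = Fernet.generate_key()        # store this in a key management system
cipher = Fernet(key)

record = b'{"name": "Jane Doe", "ssn": "123-45-6789"}'
ciphertext = cipher.encrypt(record)      # safe to store or back up
plaintext = cipher.decrypt(ciphertext)   # only possible with the key

assert plaintext == record
```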

    The issues that stem from a lack of encryption quickly become apparent in the event of a breach. In 2019, a security researcher discovered a cache of more than 885 million sensitive documents on First American Financial’s website that were unencrypted. Consequently, anyone with the right URLs could access any of those records. Encrypting your data at rest, or when it’s not in use, prevents this kind of leak and helps keep your customers safe.

    Best Practice #3: Discover and Classify Sensitive Data

    While all data should have some level of security, applying your maximum efforts to every piece of data can be an inefficient approach that damages your company’s productivity. For example, suppose your company handles weather data. If that’s leaked, it’s no big deal. On the other hand, social security numbers should be handled with much more care. Treating both as if they were the same incurs the unnecessary use of computer resources for encryption and decryption, and might also cost your employees time.

    The solution is to classify all of your internally generated and incoming data to ensure that it is handled correctly. Of course, this isn’t something you can do by hand, especially if your company handles millions or even billions of data points in a year. Data discovery by Mage Data uses AI and advanced Natural Language Processing to uncover all your sensitive data. It works on the databases you already have and can work incrementally as new data comes in, ensuring that you always have a complete view of what data you have so that you can secure it correctly. Because it’s driven by AI, it can also identify sensitive data with an unorthodox presentation, such as an email address with a typo or stored within header data, so that nothing slips through the cracks.

    Best Practice #4: Control Data Access

    Once your data is identified, classified, and encrypted, the next step is to control access to ensure that your data is protected. In the past, a username and password would be enough to keep data safe. However, that’s no longer the case. One of the most common ways data is accessed improperly is through a compromised set of credentials. One way to counter that problem is through the use of two-factor authentication: When your employee enters their correct username and password, a code will be sent to them via phone call, text, email, or a piece of physical hardware like a dongle. They won’t be able to log in unless they also enter the correct code. This means that even if your employee’s credentials are compromised, no one will be able to use them to log into their account.
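
    A minimal sketch of that flow, using time-based one-time passwords via the open-source pyotp package (illustrative only; enrollment, code delivery, and rate limiting are omitted):

```python
# Two-factor authentication sketch with TOTP.
import pyotp  # pip install pyotp

# Enrollment: generate a per-user secret and share it with the user's
# authenticator app (usually presented as a QR code).
secret = pyotp.random_base32()
totp = pyotp.TOTP(secret)

# Login: after the password check succeeds, require the current code.
submitted_code = totp.now()  # in reality, typed in by the user
if totp.verify(submitted_code):
    print("second factor accepted")
else:
    print("access denied")
```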

    It’s also important to restrict data access within your organization. Your accountant doesn’t need access to the same files your account manager does, and vice versa. Restricting their access to the files and resources they need to perform their jobs helps keep information safe in the event of a breach and limits the damage a single employee can do in an intentional leak.

    Controlling access at this level requires granular tools. Mage Data’s Dynamic Data Masking offers companies everything they need to manage a workforce ranging in size from very small to enterprise. Flexible rules, including role-, user-, program-, and location-based controls, allow for fine-grained tuning of the permissions process and ensure your employees will have what they need to work without having unnecessary access to sensitive information.
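
    The sketch below illustrates the general idea of rule-based dynamic masking, where the value returned depends on the requester’s role and location; the rules and field names are invented for illustration and are not Mage Data’s rule syntax.

```python
# Hypothetical rule-based dynamic masking sketch.
def mask_ssn(value: str) -> str:
    return "***-**-" + value[-4:]

def resolve_field(value, field, user):
    """Return the raw or masked value depending on who is asking, and from where."""
    if field == "ssn":
        if user["role"] == "support_lead" and user["location"] == "US":
            return value          # full value for a narrowly authorized role
        return mask_ssn(value)    # masked for everyone else
    return value

print(resolve_field("123-45-6789", "ssn", {"role": "analyst", "location": "EU"}))
# -> ***-**-6789
```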

    How Mage Data Can Help

    Over the years, Mage Data has helped companies of all sizes enhance their approach to data privacy. Having worked with so many different clients with different needs, we know how to help you create the right approach to security for your specific needs. Schedule a demo today to learn more about what Mage Data can do for you.

  • What’s the Best Method for Generating Test Data?

    What’s the Best Method for Generating Test Data?

    All data contains secret advantages for your business. You can unlock them through analysis, and they can lead to cost savings, increased sales, a better understanding of your customers and their needs, and myriad other benefits.

    Unfortunately, sometimes bad test data can lead companies astray. For example, IBM estimates that problems resolved during the scoping phase are 15 times less costly to fix than those that make it to production. Getting your test data right is essential to keeping costs low and avoiding unforced mistakes. Here’s what you need to know about creating test data to ensure your business is on the right path.

    What Makes a Test Data Generation Method Good?

    While all data-driven business decisions require good analysis to be effective, good analysis of bad data provides bad results. So, the best test data generation method will be the one that consistently and efficiently produces good data on which you can run your analysis within the context of your business. To ensure that analysis is based on good data, companies should consider the speed, compliance, safety, accuracy, and representation of the various methods to ensure they’re using the best method for their needs.

    Safety

    Companies often hold more personal data than many customers realize, and keeping that data safe is an important moral duty. However, test data generation methods are rarely neutral when it comes to safety: they generally either increase or reduce the risk to that personal data.

    Compliance

    Each year, governments pass new data protection laws. If the moral duty to keep data secure wasn’t enough of an incentive, there are fines, lawsuits, and in some countries, prison time, awaiting companies that don’t protect user data and comply with all relevant legislation.

    Speed

    If you or your analysts are waiting on the test data to generate, you’re losing time that could be spent on the analysis itself. Slow data generation can also result in a general unwillingness to work with either the most recent or the most representative historical data, which lowers the potential and quality of your analysis.

    Accuracy and representation

    While one might expect that all test data generation methods would result in accurate and representative data, that’s not the case. Methods vary in accuracy, and some can ultimately produce data that bears little resemblance to the truth. In those situations, your analysis can be done faithfully, but the underlying errors in your data can lead you astray.

    Test Data Generation Methods

    By comparing different methods of test data generation through the lens of these four categories, we can get a feel for the scenarios in which each technique would succeed or struggle and determine which approaches would be best for most companies.

    Database Cloning

    The oldest method of generating test data on our list is database cloning. The name pretty much gives away how it works: You take a database and copy it. Once you’ve made a copy, you run your analysis on the copy, knowing that any changes you make won’t affect the original data.

    Unfortunately, this method has several shortcomings. For one, it does nothing to secure personal data in the original database. Running analysis can create risks for your users and sometimes get your company into legal trouble.

    It also tends to suffer from speed issues. The bigger your database, the longer it takes to create a copy. Plus, edge cases may be under- or over-represented or even absent from your data, obscuring your results. While this was once the way companies generated test data, given its shortcomings, it’s a good thing that there are better alternatives.

    Database Virtualization

    Database virtualization isn’t a technique solely for creating test data, but it makes the process far easier than using database cloning alone. Virtualized databases are unshackled from their physical hardware, making working with the underlying data extremely fast. Unfortunately, outside of its faster speed, it has all the same shortcomings as database cloning: It does nothing on its own to secure user data, and your tests can only be run on your data, whether it’s representative or not.

    Data Subsetting

    Data subsetting fixes some of the issues found in the previous approaches by taking a limited sample or “subset” of the original database. Because you’re working with a smaller sample, it will tend to be faster, and sometimes using a selection instead of the full dataset can help reduce errors related to edge cases. Still, when using this method, you’re trading speed for representativeness, and there’s still nothing being done to ensure that personal data is protected, which is just asking for trouble.

    Anonymization

    Anonymization fixes the issue with privacy that pervades the previous approaches. And while it’s not a solution for test data generation on its own, it pairs nicely with other approaches. When data is anonymized, individual data points are replaced to protect data that could be used to identify the user who originated the data. This approach makes the data safer to use, especially if you’re sending it outside the company or the country for analysis.

    Unfortunately, anonymization has a fatal flaw: The more anonymized the dataset is, the weaker the connection between data points. Too much anonymization will create a dataset that is useless for analysis. Of course, you could opt for less anonymization within a dataset, but then you risk reidentification if the data ever gets out. What’s a company to do?

    Synthetic Data

    Synthetic data is a surprisingly good solution to most issues with other test data approaches. Like anonymization, it replaces data to secure the underlying personally identifiable information. However, instead of doing it point by point, it works holistically, preserving the individual connections between data while changing the data itself in a way that can’t be reversed.

    That approach brings a lot of advantages. User privacy is protected. Synthetic datasets can be far smaller than the originals they were generated from while still representing the whole, giving speed advantages. It also works well when there isn’t much data to start with, helping companies run analysis at earlier stages in the process.

    Of course, it’s far more complex than other methods and can be challenging to implement. The good news is that companies don’t have to implement synthetic data on their own.

    Which Test Generation Data Method is Best?

    The best method for a company will vary based on its needs, but based on the relative performance of each approach, most companies will benefit from using synthetic data as their primary test data generation method. Mage Data’s approach to synthetic data can be implemented in an agent or agentless manner, meeting your data where it lives instead of shoehorning a solution that slows everything down. And while it maintains the statistical relationships you need for meaningful analysis, you can also add noise to its synthetic datasets, allowing you to discover new edge cases and opportunities, even if they’ve never appeared in your data before.

    But that’s not all Mage Data can do. Between its static and dynamic masking, it can protect user data when it’s in production use and at rest. Plus, its role-based access controls and automation make tracking use simple. Mage Data is more than just a tool for solving your test data generation problems—it’s a one-platform approach to all your data privacy and security needs. Contact us today to see what Mage Data can do for your organization.

    Related blogs:

    Why Does Test Data Management Matter for Your Business?
    Test Data Management Best Practices

  • Why Does Test Data Management Matter for Your Business?

    Why Does Test Data Management Matter for Your Business?

    Unlocking the secrets contained in your data is a powerful way to discover new opportunities for your business. However, if a company analyzes its data in the wrong way, it can lead itself down the wrong path and waste valuable time and resources, and perhaps even risk the company’s future. Test Data Management serves as a practical control on the process to help companies ensure that the conclusions they draw from their data are grounded in reality—and are thus more likely to return positive benefits.

    What is Test Data Management?

    Test Data Management is the process of preparing data for analysis. This process reduces defects in the data, such as misleading or missing edge cases, and ensures there is enough data to produce meaningful conclusions. The Test Data Management process also involves securing data to protect user privacy through various means such as masking, anonymization, or pseudonymization. It also delivers the prepared data to the testing environments and generally incorporates automation where possible to reduce mistakes and speed up the overall process.

    Why do Companies Need to Manage Test Data?

    Companies need Test Data Management for a variety of reasons. First and foremost is the fact that you need good quality data to produce useful results. Good analyses run on bad data produce bad outcomes. In practice, improving the quality means cleaning the data.

    In most cases, your data analysts can clean data manually. However, anything you can do to automate this part of the process can save time and help your analysts spend more time on more valuable activities.

    Second, data needs to be accessible. In a modern enterprise environment, data can be fragmented across various databases, possibly managed by different service lines. For example, customer data about use may be in one database, with billing in another, and customer service info—such as help requests—might live in a third. Managing your test data will include finding ways to speed up the consolidation process or developing a Single Source of Truth where all relevant data can be found.

    Finally, Test Data Management involves security and compliance. There are an ever-increasing number of data privacy laws, and failure to comply can result in steep fines. Properly managing your test data requires taking steps to ensure that users’ privacy is protected during analysis. It also means ensuring that actions you take with data, such as moving them between countries, comply with all relevant laws.

    What is the Test Data Management Process?

    Companies generally build the Test Data Management process that works for them, meaning there is no one-size-fits-all answer to this question. However, there are a few general buckets that most actions tend to fall into.

    Planning

    The planning stage is extra important because developing a solid plan for your Test Data Management can save a ton of time. During this phase, it’s essential to understand the full scope of data required for analysis and develop plans for how that data will be accessed, backed up, and stored during analysis. Companies should also prepare a communication plan for departments that will be affected, such as in situations where a system will be inaccessible or have downtime. Finally, a written version of the plan should be created, which improves communication and helps ensure all stakeholders are aligned on the plan.

    Extraction

    Once the plan has been finalized, the next step is to extract the necessary data. Extraction may be a one-time action, but often it is ongoing per the plan to support long-term business objectives. Generally, companies use Test Data Management tools to help make extraction easier. These tools can simplify and sometimes automate the extraction of data based on an analyst’s criteria.
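
    For example, a parameterized extraction based on an analyst’s criteria might look like the following sketch, which uses Python’s built-in sqlite3 module against a hypothetical orders table:

```python
# Parameterized extraction sketch; database, table, and criteria are hypothetical.
import sqlite3

def extract_orders(db_path, start_date, region):
    """Pull only the rows an analyst's criteria require."""
    query = """
        SELECT order_id, amount, region, created_at
        FROM orders
        WHERE created_at >= ? AND region = ?
    """
    with sqlite3.connect(db_path) as conn:
        return conn.execute(query, (start_date, region)).fetchall()

rows = extract_orders("warehouse.db", "2024-01-01", "EU")
```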

    Security/Compliance

    Security and compliance actions may be taken before or after extraction, but it’s essential to take them before running analysis or taking other actions that risk exposing the data. Securing data may involve masking or a more complicated process like anonymization or pseudonymization based on the security needs of the particular use case.

    Provision

    Now that the data has been secured, it’s time to provision it into a test environment so analysts can begin their work. Generally, analysts provision the data depending on the question they’re trying to answer. However, in some cases, it may make sense to automate this step for recurring events, such as monthly reports.

    Maintenance

    Companies are constantly generating more data, so this work is never truly finished. Maintaining Test Data Management systems to ensure continued compatibility and functionality, and updating them to align with new business objectives, is an integral part of the process.

    What Do You Do When You Don’t Have Enough Data?

    In some circumstances, you may discover that your company doesn’t have enough data to answer certain questions. For example, the company could be starting a new product line and have only limited information to work with, or the question might be related to a very limited subset of customers.

    In these situations, synthetic data comes to the rescue. Synthetic data is an entirely new data set that preserves the statistical relationships between data points in an original data set. Consequently, it can be used to expand the pool of available data in situations where there isn’t enough available for analysis.

    However, it’s also ideal for data security and privacy, as it doesn’t just mask identities or create a risk of reidentification like anonymization or pseudonymization. Instead, because the data is entirely new, there’s no underlying user whose data can be exposed, making it ideal for most kinds of analysis while remaining in compliance with privacy laws.

    How Mage Data Helps with Test Data Management

    Ultimately, Test Data Management is something that companies could do manually. However, doing these things by hand can be very time-consuming and keep employees away from higher-value activities. Mage Data has the tools you need to automate most data management processes and a world-class synthetic data generation tool.

    But Mage Data is built for more than just Test Data Management. It provides a complete top-to-bottom platform for data privacy and security, taking care of your data so that you can focus on more important things. To learn more about what Mage Data can do for your business, schedule a demo today!

  • How to Secure Your Critical Sensitive Data in Non-Production and Testing Environments

    How to Secure Your Critical Sensitive Data in Non-Production and Testing Environments

    With businesses across the world embracing digital transformation projects to adapt to modern business requirements, a new challenge has emerged: data is being used for more and more business-critical functions, and its sensitive nature must be protected. Within the same organization, multiple functions use data in different ways to meet their objectives, adding a layer of complexity for data security professionals who aim to prevent the exposure of any sensitive data without hurting performance. This sensitive data can include employee and customer information, as well as corporate confidential data and intellectual property that can cause wide ramifications if it falls into the wrong hands. For organizations that depend on high-quality data for their software development processes but also want to ensure that any sensitive information contained within it is not exposed, a good static data masking tool is a crucial requirement for business operations.

    Protect data in non-production environments

    A critical aspect of data protection is ensuring the security of sensitive data in development, testing and training (non-production) environments, to eliminate any risk of sensitive data exposure. The same protection methods cannot be used for production and non-production environments as the requirements for both are different. In such cases, de-identifying or masking the data is recommended as a best practice for protecting the sensitive data involved. Masking techniques secure both structured and unstructured fields in the data landscape to allow for testing or quality assurance requirements and user-based access without the risk of sensitive data disclosure.

    Maintain integrity of secured data

    While securing data, it has also become important for organizations to balance the security and usability of data so that it is relevant enough for use in business analytics, application development, testing, training, and other value-added purposes. Good static data masking tools ensure that data is anonymized in a manner that retains its usability while providing security.
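
    One way such tools keep masked data usable is deterministic (consistent) masking: the same input always yields the same masked output, so joins across applications and datastores still line up. The sketch below illustrates the idea with a keyed hash; it is not Mage Data’s algorithm.

```python
# Illustrative consistent-masking sketch using a keyed hash.
import hashlib
import hmac

SECRET = b"masking-key-kept-outside-source-control"

def consistent_mask(value: str) -> str:
    digest = hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()
    return "CUST_" + digest[:10]

# The same customer ID masks identically in every table and system:
print(consistent_mask("customer-8841"))
print(consistent_mask("customer-8841"))  # identical output, so joins still work
```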

    Choice of anonymization methods

    Organizations will have multiple use-cases for data analysis, based on the requirements of the teams that handle this data. In such cases, some anonymization methods can prove to have more value than others depending on the security and performance needs of the relevant teams. These methods can include encryption, tokenization or masking, and good tools will offer different such methods for anonymization that can be used to protect sensitive data effectively.

    For years, Mage Data™ has been helping organizations with their data security needs, by providing solutions that include static data masking tools for securing data in non-production environments (Mage Data Static Data Masking).
    Some of the features of the Mage Data Static Data Masking tool are as follows:

    • 70+ different anonymization methods to protect any sensitive data effectively
    • Maintains referential integrity between applications through anonymization methods that give consistent results across applications and datastores
    • Anonymization methods that offer both protection and performance while maintaining data usability
    • Encrypts, tokenizes, or masks the data according to the use case that suits the organization
  • 6 Common Data Anonymization Mistakes Businesses Make Every Day

    6 Common Data Anonymization Mistakes Businesses Make Every Day

    Data is a crucial resource for businesses today, but using data legally and ethically often requires data anonymization. Laws like the GDPR in Europe require companies to ensure that personal data is kept private, limiting what companies can do with personal data. Data anonymization allows companies to perform critical operations—like forecasting—with data that preserves the original’s characteristics but lacks the personally identifying data points that could harm its users if leaked or misused.

    Despite the importance of data anonymization, there are many mistakes that companies regularly make when performing this process. These companies’ errors are not only dangerous to their users, but could also subject them to regulatory action in a growing number of countries. Here are six of the most-common data anonymization mistakes that you should avoid.

    1. Only changing obvious personal identification indicators

    One of the trickiest parts of anonymizing a dataset is determining what is or isn’t Personally Identifiable Information (PII), the kind of information you want to ensure is kept safe. Individual information like date of purchase or the amount paid may not be personal information, but a credit card number or a name would be. Of course, you could go through the dataset by hand and ensure that all relevant data types are anonymized, but there’s still a chance that something slips through the cracks.

    For example, if data is in an unstructured column, it may not appear on search results when you’re looking for PII. Or a benign-looking column may exist separately in another table, allowing bad actors to reconstruct the original user identities if they got access to both tables. Small mistakes like these can doom an anonymization project to failure before it even begins.

    2. Mistaking synthetic data for anonymized data

    Anonymizing or “masking” data takes PII in datasets and alters it so that it can’t be traced back to the original user. Another approach to data security is to instead create “synthetic” datasets. Synthetic datasets attempt to recreate the relationships between data-points in the original dataset while creating an entirely new set of data points.

    Synthetic data may or may not live up to its claims of preserving the original relationships. If it doesn’t, it may not be useful for your intended purposes. However, even if the connections are good, treating synthesized data like it’s anonymized or vice versa can lead to mistakes in interpreting the data or ensuring that it is properly stored or distributed.

    3. Confusing anonymization with pseudonymization

    According to the EU’s GDPR, data is anonymized when it can no longer be reverse engineered to reveal the original PII. Pseudonymization, in comparison, replaces PII with different information of the same type. Pseudonymization doesn’t guarantee that the dataset cannot be reverse engineered if another dataset is brought in to fill in the blanks.

    Consequently, anonymized data is generally exempted from GDPR. Pseudonymization is still subject to regulations, albeit reduced relative to normal data. Companies that don’t correctly categorize their data into one bucket or the other could face heavy regulatory action for violating the GDPR or other data laws worldwide.

    4. Only anonymizing one data set

    One of the common threats we’ve covered so far is personal information being reconstructed by introducing a non-anonymized database to the mix. There’s an easy solution to that problem. Instead of anonymizing only one dataset, why not anonymize all of the ones that share data? That way, it would be impossible to reconstruct the original data.

    Of course, that’s not always going to be possible in a production environment. You may still need the original data for a variety of reasons. However, if you ever anonymize data and send it beyond the bounds of your organization, you have to consider the variety of interconnections between databases, and that may mean that, to be safe, you need to anonymize data you don’t release.

    5. Anonymizing data—but also destroying it

    Data becomes far less valuable if the connections between its points become corrupted or weakened. A poorly executed anonymization process can lead to data that has no value whatsoever. Of course, it’s not always obvious that this is the case. A casual examination wouldn’t reveal anything wrong, leading companies to draw false conclusions from their data analysis.

    That means that a good anonymization process should protect user data and do it in a way where you can be confident that the final results will be what you need.

    6. Applying the same anonymization technique to all problems

    Sometimes when we have a problem, our natural reaction is to use a solution that worked in the past for a similar problem. However, as you can see from all the examples we’ve explored, the right solution for securing data varies greatly based on what you’re securing, why you’re securing it, and your ultimate plans for that data.

    Using the same technique repeatedly can leave you more vulnerable to reverse engineering. Worse, it means that you’re not maximizing the value of each dataset and are possibly over- or under-securing much of your data.

    Wrapping Up

    Understanding your data is the key to unlocking its potential and keeping PII safe. Many of the issues we outlined in this article do not stem from a lack of technical prowess. Instead, the challenge of dealing with millions or even billions of discrete data points can easily turn a quick project into one that drags out for weeks or months. Or worse, projects can end up “half-completed,” weakening data analysis and security objectives.

    Most companies need a program that can do the heavy lifting for them. Mage Data helps organizations find and catalog their data, including highlighting Personally Identifiable Information. Not only is this process automated, but it also uses Natural Language Processing to identify mislabeled PII. Plus, it can help you mask data in a static or a dynamic fashion, ensuring you’re anonymizing data in the manner that best fits your use case. Schedule a demo today to see what Mage Data can do to help your organization better secure its data.

  • 5 Common Mistakes Organizations Make During Data Obfuscation

    5 Common Mistakes Organizations Make During Data Obfuscation

    What is Data Obfuscation?

    As the name suggests, data obfuscation is the process of hiding sensitive data with modified or other data to secure it. Many are often confused by the term data obfuscation and what it entails, as it is a broad term used for several data security techniques such as anonymization, pseudonymization, masking, encryption, and tokenization.

    The need for data obfuscation is omnipresent, with companies needing to achieve business objectives such as auditing, cross-border data sharing, and the like. Apart from this, the high rate of cybercrime is also a pressing reason for companies to invest in technology that can help protect their data, especially now that remote work has become widespread in the wake of the Covid pandemic.

    Let’s look at some of the best practices you can follow for data obfuscation:

    1) Understand your options

    It is vital to understand the difference between different data obfuscation techniques such as anonymization and pseudonymization, and encryption, masking, and tokenization. Unless you’re knowledgeable about the various methods of data security and their benefits, you cannot make an informed choice to fulfill your data security needs.

    2) Keep in mind the purpose of your data

    Of course, the need of the hour is to secure your data. But every data element has a specific purpose. For example, if the data is needed for analytical purposes, you cannot go ahead with a simple encryption algorithm and expect good results. You need to select a technique, such as masking, that will preserve the functionality of the data while ensuring security. The method of obfuscation chosen should facilitate the purpose for which your data is intended.

    3) Enable regulatory compliance

    Of course, data security is a broader term than compliance, but does being secure mean you’re compliant too? Data protection standards and laws such as HIPAA, PCI DSS, GDPR, and CCPA are limited to a defined scope and aim to secure that particular information. So, it is imperative to figure out which of those laws you are required to comply with and put procedures in place to meet them. Security and compliance are not the same – ensure both.

    4) Follow the principle of least privilege

    The principle of least privilege is the idea that any user, program, or process should have only the bare minimum privileges necessary to perform its function. It works by allowing only enough access to perform the required job. Apart from hiding sensitive data from those unauthorized, data obfuscation techniques like Dynamic Data Masking can also be used to provide user-based access to private information.
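
    In code, least privilege often reduces to an explicit grant check before any data access, as in this illustrative sketch (principals and permission names are hypothetical).

```python
# Minimal least-privilege sketch: each process is granted only the specific
# permissions its function requires; everything else is refused.
GRANTS = {
    "reporting_job": {"read:orders_masked"},
    "billing_service": {"read:orders", "write:invoices"},
}

def authorize(principal, permission):
    if permission not in GRANTS.get(principal, set()):
        raise PermissionError(f"{principal} may not {permission}")

authorize("reporting_job", "read:orders_masked")   # allowed
# authorize("reporting_job", "read:customers")     # would raise PermissionError
```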

    5) Use repeatable and irreversible techniques

    Wherever applicable, it is advisable to use reliable techniques that produce the same results every time, and to ensure that even if the data were seized by a hacker, it could not be reversed.

    Conclusion:

    While data obfuscation is important to ensure the protection of your sensitive data, security experts must ensure that they do not implement a solution just to tick a check box. Data security solutions, when implemented correctly, can go a long way toward saving the organization millions of dollars in revenue.

  • Test Data Management Best Practices

    Test Data Management Best Practices

    Do your test data environments put Production data at risk of exposure?

    Since test data environments usually require real-world data to tackle complex issues, issues that may not be replicated with fake data, they present one of the most significant security risks to sensitive data. Credentials may not be as secure as for Production, and access may not be as stringently monitored. There’s too much access in some cases. And unauthorized access can reveal troves of production data or other information that can provide a foothold to greater access to protected data or systems.

    So how do we enable effective Test Data Management while minimizing risk?

    First of all, everyone likely agrees that access should be on the principle of least privilege (limited access to the test environment, and nothing else). Combine that with two-factor authentication as a second line of defense. So far, no problem.

    Second, don’t use real data (or mask it if you can’t avoid it).

    You have some useful options to minimize the risks of loading real data into a test data environment. Both data subsetting and data virtualization minimize risks while enabling efficiency. Using test data generation enables you to avoid loading real data altogether, and finally, data masking allows you to protect the real data. Let’s take a look at these options.

    • Data subsetting consists of taking a subset (usually much smaller than the whole) from one or more production databases. This small size is a significant advantage since it makes both test data distribution and testing much faster than a complete database clone. There are some challenges with this approach. For example, you must have a way of ensuring that your subset is representative of your entire dataset, and it must be referentially intact. And it still exposes Production data.
    • Data virtualization has a similar motivation to data subsetting, at its core: take large production databases and make them efficient to distribute and test. However, where data subsetting does this by reducing the amount of data, virtualization allows data stored in different types of data models to be integrated virtually. It doesn’t replicate data from source systems, but only stores the integration logic for viewing. So, there’s still some risk in this method.
    • Manual test data generation can be a tedious and time-consuming process; additionally, it can be difficult to manually ensure that all attributes are present in the data to make it “testable.”
    • Finally, synthetic data generation breaks with data subsetting and data virtualization by opting to disregard your production data for use as test data. Instead, it allows you to create your own “synthetic” test data. This test data will look real – and will be representative of your production data – while, at the same time, being entirely fake. The biggest obstacle is achieving this while making sure your test data covers a range of relevant test cases. A secondary concern is avoiding making the process so laborious that it loses any benefit over the manual creation of test data.

    Each of these options has a drawback that, when you are looking to just get the job done, may mean loading real (production) data in your test data environment. And even with data subsetting and data virtualization, you will be distributing and exposing significant quantities of production data to your testers and leaving it exposed to unauthorized access.

    Anonymizing the data is the gold standard in these cases. To make anonymization (masking) successful, these key considerations must be kept in mind:

    1. Sensitive data discovery: apply a comprehensive discovery solution to find all of the data that needs to be masked.
    2. Referential Integrity: ensure consistency and functionality of data instances during roll-out and consistent masking of the data itself across applications and databases.
    3. Data for testing: developers and testers DO NOT need to see the real data. What they do require, however, is realistic data, which preserves formats and passes validations (see the sketch after this list).
    4. Efficiency: to ensure efficiency in the masking process, consider performance constraints, security policies, and environmental limitations.
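
    As promised in consideration 3, here is an illustrative format-preserving masking sketch: the masked value keeps the shape real applications expect, so it still passes basic format validations during testing (this is a toy example, not a specific product’s method).

```python
# Format-preserving masking sketch for a credit card number.
import random

def mask_credit_card(number: str) -> str:
    """Replace all but the last four digits while preserving the grouping."""
    digits = [c for c in number if c.isdigit()]
    replacement = [str(random.randint(0, 9)) for _ in digits[:-4]] + digits[-4:]
    out, i = [], 0
    for c in number:
        if c.isdigit():
            out.append(replacement[i])
            i += 1
        else:
            out.append(c)
    return "".join(out)

print(mask_credit_card("4111 1111 1111 1234"))  # e.g. "7302 9945 0118 1234"
```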

    A note of warning: home-grown scripts for data masking are the path of least resistance but are not the most effective — they generally do not eliminate sensitive data and, worse, can cause inconsistency in masking rollouts.

    Conclusion

    Unless you are using synthetically generated data, you will need to a) find and b) anonymize any sensitive information within your test data before distributing it to your testers. This is usually achieved via comprehensive data Discovery and Static Data Masking capabilities, respectively. Dynamic Data Masking and encryption may also be used as ancillary capabilities to complete the toolkit. There’s no reason to expose data, even in subsets, when anonymization can create a realistic and useful test data environment.

    About Mage Data

    Mage Data’s Test Data Management solution includes integrated and comprehensive Discovery, Static, and Dynamic Data Masking solutions, along with a data subsetting option. Additionally, with Mage Data Identities, you can create generalized data sets from internal or external data sources, a process that is a lot more efficient and functionally capable than synthetic data generation. To read more about the Test Data Management market and vendors, download the Bloor TDM market update.