Mage Data strengthens its data security posture with the ISO 27001 certification. READ MORE >

February 13, 2024

Why is Referential Integrity Important in Test Data Management?

Finding the best test data management tools requires getting all the major features you need— but that doesn’t mean you can ignore the little ones, either. While maintaining referential integrity might not be the most exciting part of test data management, it can, when executed poorly, be an issue that frustrates your team and makes them less productive. Here’s what businesses need to do to ensure their testing process is as frictionless and efficient as possible.

What is Referential Integrity?

Before exploring how referential integrity errors can mislead the testing process, we must first explore what it is. While there are a few different options for storing data at scale, the most common method is the relational database. Relational databases are composed of tables, and tables are made up of rows and columns. Rows, or records, represent individual pieces of information, and each column contains an attribute of the thing. So, a “customer” table, for example, would have a row for each customer and would have columns like “first name,” “last name,” “address,” “phone number,” and so on. Every row in a table also contains a unique identifier called a “key.” Typically, the first row is assigned the key “1”, the second, “2,” and so on.

The key is important when connecting data between tables. For example, you might have a second table that stores information about purchases. Each row would be an individual transaction, and the columns would be things like the total price, the date, the location at which the purchase was made, and so on. The power of relational databases is that entries in tables can reference other tables based on keys. This approach helps eliminate ambiguity. There might be multiple “John Smiths” in your customer table, but only one will have the unique key “1,” so we can tie transactions to that customer by using their unique key rather than something that there might be multiple of, like a name. Therefore, referential integrity refers to the accuracy and consistency of the relationship between tables.

How Does Referential Integrity Affect Test Data?

Imagine a scenario in which a customer, “John Doe,” exercised his right under GDPR or CCPA to have his personal data deleted. As a result of this request, his record in the customer table would be deleted, though the transactions would likely remain, as they aren’t personal data. Now, your developers could be working on a new application that processes transactional data and pulls up user information when someone selects a certain transaction. If John’s transactions were included in the test data used, the test would result in an error whenever one of those transactions came up, as the reference included in those transactions has been deleted.

The developers’ first reaction wouldn’t necessarily be to look at the underlying data, but to instead assume that there was some sort of bug in the code they had been working on. So, they might write new code, test it, see the error again, and start over a few times before realizing that the underlying data is flawed.

While that may just sound like a waste of a few hours, this is an extremely basic example. More complex applications could be connecting data through dozens of tables, and the code might be far longer and more complicated…so it can take days for teams to recognize that there isn’t a problem with the code itself but with the data they’re using for testing. Companies need a system that can help them deal with referential integrity issues when creating test data sets, no matter what approach to generating test data they use.

Referential Integrity in Subsetting

One approach to generating test data is subsetting. Because your production databases can be very, very large, subsetting creates a copy of some of the database which is more manageable in testing. When it comes to referential integrity, subsetting faces the same issues that using a live production environment does: Someone still needs to scrub through the data and either delete records with missing references or create new dummy records to replace missing ones. This can be a time-consuming and error-prone process.

Referential Integrity in Anonymized/Pseudonymized Data Sets

Anonymization and pseudonymization are two more closely related approaches to test data generation. Pseudonymization takes personally identifiable information and changes it so that it cannot be linked to a real person without combining it with other information stored elsewhere. Anonymization also replaces PII data but does it in a way that is irreversible.

These procedures make the data safer for testing purposes, but the generation process could lead to referential integrity issues. For example, the anonymization process may obscure the relationships between tables, creating reference issues if the program doing the anonymization isn’t equipped to handle the issue across the database as a whole.

How Mage Helps with Test Data Management

The key to success with referential integrity in test data management is taking a holistic approach to your data. Mage helps companies with every aspect of their data, from data privacy and security to data subject access rights automation, to test data management. This comprehensive approach ensures that businesses can spend less time dealing with frustrating issues like broken references and more time on the tasks that make a real difference. To learn more about Mage’s test data management solution, schedule a demo today.