Mage Data

Tag: Test Data Management

  • Why is Referential Integrity Important in Test Data Management?

    Finding the best test data management tools means getting all the major features you need, but that doesn’t mean you can ignore the little ones, either. While maintaining referential integrity might not be the most exciting part of test data management, when handled poorly it becomes an issue that frustrates your team and makes them less productive. Here’s what businesses need to do to ensure their testing process is as frictionless and efficient as possible.

    What is Referential Integrity?

    Before exploring how referential integrity errors can mislead the testing process, we must first explore what it is. While there are a few different options for storing data at scale, the most common method is the relational database. Relational databases are composed of tables, and tables are made up of rows and columns. Rows, or records, represent individual pieces of information, and each column contains an attribute of the entity the table describes. A “customer” table, for example, would have a row for each customer and columns like “first name,” “last name,” “address,” “phone number,” and so on. Every row in a table also has a unique identifier called a “key.” Typically, the first row is assigned the key “1,” the second “2,” and so on.

    The key is important when connecting data between tables. For example, you might have a second table that stores information about purchases. Each row would be an individual transaction, and the columns would be things like the total price, the date, the location at which the purchase was made, and so on. The power of relational databases is that entries in one table can reference other tables by key. This approach helps eliminate ambiguity. There might be multiple “John Smiths” in your customer table, but only one will have the unique key “1,” so we can tie transactions to that customer by using their unique key rather than something that can be duplicated, like a name. In this context, referential integrity refers to the accuracy and consistency of the relationships between tables.
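
    To make this concrete, here is a minimal sketch in Python using SQLite. The table and column names are hypothetical and simplified; the point is that, with foreign-key enforcement turned on, the database itself refuses a purchase row that references a customer key that doesn’t exist:

    ```python
    import sqlite3

    # In-memory database purely for illustration; the schema is a simplified,
    # hypothetical version of the customer/purchase tables described above.
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default

    conn.execute("""
        CREATE TABLE customer (
            customer_id INTEGER PRIMARY KEY,  -- the unique key for each row
            first_name  TEXT,
            last_name   TEXT
        )""")
    conn.execute("""
        CREATE TABLE purchase (
            purchase_id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customer(customer_id),  -- link to the customer table
            total_price REAL,
            purchase_date TEXT
        )""")

    conn.execute("INSERT INTO customer VALUES (1, 'John', 'Smith')")
    conn.execute("INSERT INTO purchase VALUES (100, 1, 59.99, '2024-01-15')")  # valid reference

    # This insert breaks referential integrity: customer 999 does not exist,
    # so the database rejects the row outright.
    try:
        conn.execute("INSERT INTO purchase VALUES (101, 999, 10.00, '2024-01-16')")
    except sqlite3.IntegrityError as err:
        print("Referential integrity violation:", err)
    ```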

    How Does Referential Integrity Affect Test Data?

    Imagine a scenario in which a customer, “John Doe,” exercised his right under GDPR or CCPA to have his personal data deleted. As a result of this request, his record in the customer table would be deleted, though his transactions would likely remain, as they aren’t personal data. Now, your developers could be working on a new application that processes transactional data and pulls up customer information when someone selects a certain transaction. If John’s transactions were included in the test data, the test would produce an error whenever one of those transactions came up, because the customer record they reference no longer exists.
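
    The sketch below illustrates that failure mode, continuing the hypothetical tables above. The lookup function is stand-in application code, not any real API; the point is that the code is fine while the test data is broken:

    ```python
    # Continuing the illustrative schema: John Doe's customer row was erased for a
    # GDPR/CCPA request, but his purchases were kept. "show_purchase_details" is a
    # hypothetical stand-in for application code under test.

    customers_by_id = {}  # the erased customer no longer appears here
    purchases = [
        {"purchase_id": 200, "customer_id": 42, "total_price": 19.99},  # customer 42 was deleted
    ]

    def show_purchase_details(purchase):
        customer = customers_by_id[purchase["customer_id"]]  # orphaned reference -> KeyError
        return f"{customer['first_name']} spent {purchase['total_price']}"

    try:
        show_purchase_details(purchases[0])
    except KeyError as missing_key:
        # The application code is fine; the test data is broken. A developer who
        # doesn't know that will go hunting for a bug that isn't there.
        print("No customer record for key:", missing_key)
    ```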

    The developers’ first reaction wouldn’t necessarily be to look at the underlying data, but to instead assume that there was some sort of bug in the code they had been working on. So, they might write new code, test it, see the error again, and start over a few times before realizing that the underlying data is flawed.

    While that may just sound like a waste of a few hours, this is an extremely basic example. More complex applications could be joining data across dozens of tables, and the code might be far longer and more complicated, so it can take days for teams to recognize that the problem isn’t in the code itself but in the data they’re using for testing. Companies need a system that can help them deal with referential integrity issues when creating test data sets, no matter what approach to generating test data they use.

    Referential Integrity in Subsetting

    One approach to generating test data is subsetting. Because production databases can be very large, subsetting creates a smaller copy of the database that is more manageable for testing. When it comes to referential integrity, subsetting faces the same issues as using a live production environment: someone still needs to scrub through the data and either delete records with missing references or create new dummy records to replace the missing ones. This can be a time-consuming and error-prone process.
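
    One way to preserve referential integrity while subsetting, shown here as a rough sketch over the same hypothetical customer/purchase structure as above, is to sample the parent rows first and then keep only the child rows whose foreign keys fall inside that sample:

    ```python
    # Rough sketch: sample the parent table first, then keep only the child rows
    # whose foreign keys point into that sample. Data and sizes are hypothetical.

    customers = [{"customer_id": i, "name": f"Customer {i}"} for i in range(1, 101)]
    purchases = [
        {"purchase_id": 1000 + i, "customer_id": (i % 100) + 1, "total_price": 9.99}
        for i in range(500)
    ]

    subset_customers = customers[:10]  # e.g. a 10% sample of parent rows
    subset_ids = {c["customer_id"] for c in subset_customers}
    subset_purchases = [p for p in purchases if p["customer_id"] in subset_ids]

    # Every purchase in the subset still resolves to a customer in the subset,
    # so the test data carries no dangling references.
    assert all(p["customer_id"] in subset_ids for p in subset_purchases)
    print(len(subset_customers), "customers,", len(subset_purchases), "purchases in the subset")
    ```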

    Referential Integrity in Anonymized/Pseudonymized Datasets

    Anonymization and pseudonymization are two other, closely related approaches to test data generation. Pseudonymization takes personally identifiable information and changes it so that it cannot be linked to a real person without combining it with other information stored elsewhere. Anonymization also replaces PII, but does so in a way that is irreversible.

    These procedures make the data safer for testing purposes, but the generation process can introduce referential integrity issues of its own. For example, anonymization may obscure the relationships between tables, creating broken references if the program doing the anonymization isn’t equipped to apply changes consistently across the database as a whole.
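
    A common way tools avoid this is key-consistent pseudonymization: every occurrence of the same value maps to the same pseudonym, so cross-table references still line up. The sketch below uses a simple hash with a fixed secret purely to illustrate the idea; real tools typically use tokenization or format-preserving encryption, and the table layouts here are hypothetical:

    ```python
    import hashlib

    SECRET = b"demo-only-secret"  # hypothetical; never hard-code a real masking key

    def pseudonymize(value: str) -> str:
        # Deterministic: the same input always yields the same pseudonym.
        return hashlib.sha256(SECRET + value.encode()).hexdigest()[:12]

    customers = [{"customer_id": "C-42", "email": "john.doe@example.com"}]
    purchases = [{"purchase_id": "P-1", "customer_id": "C-42", "total_price": 19.99}]

    masked_customers = [
        {**c, "customer_id": pseudonymize(c["customer_id"]), "email": pseudonymize(c["email"])}
        for c in customers
    ]
    masked_purchases = [{**p, "customer_id": pseudonymize(p["customer_id"])} for p in purchases]

    # Because the same key maps to the same pseudonym in both tables,
    # the join still works and referential integrity survives the masking.
    assert masked_purchases[0]["customer_id"] == masked_customers[0]["customer_id"]
    ```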

    How Mage Data Helps with Test Data Management

    The key to success with referential integrity in test data management is taking a holistic approach to your data. Mage Data helps companies with every aspect of their data, from data privacy and security to data subject access rights automation, to test data management. This comprehensive approach ensures that businesses can spend less time dealing with frustrating issues like broken references and more time on the tasks that make a real difference. To learn more about Mage’s test data management solution, schedule a demo today.


  • The ROI of a Test Data Management Tool

    As software teams increasingly take a “shift left” approach to software testing, the need to reduce testing cycle times and improve the rigor of tests is growing in lockstep. This creates a conundrum: testing coverage and completeness are deeply dependent on the quality of the test dataset used, but provisioning quality test data has to take less time, not more.

    This is where Test Data Management (TDM) tools come into play, giving DevOps teams the resources to provision exactly what they need to test early and often. But, as with anything else, a quality TDM tool has a cost associated with it. How can decision makers measure the return on investment (ROI) for such a tool?

    To be clear, the issue is not how to do an ROI calculation; there is a well-defined formula for that. The challenge comes with knowing what to measure, and how to translate the functions of a TDM tool into concrete cost savings. To get started, it helps to consider the downsides of traditional testing that make TDM attractive, and then to categorize the areas where a TDM tool creates efficiencies as well as new opportunities.
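
    For reference, the formula itself is straightforward: ROI equals net benefit minus cost, divided by cost. The sketch below shows the shape of the calculation with placeholder figures that are purely hypothetical; substitute your own estimates for each category discussed in the rest of this article:

    ```python
    # Placeholder figures only; none of these numbers are benchmarks.
    tdm_tool_cost = 100_000              # annual license plus implementation (hypothetical)

    saved_storage = 20_000               # smaller subsets / virtualized copies
    saved_provisioning_labor = 60_000    # fewer person-days spent preparing test data
    developer_efficiency_gain = 50_000   # faster, earlier testing cycles
    avoided_production_defects = 40_000  # bugs caught in test instead of production
    earlier_time_to_market = 30_000      # revenue not lost to provisioning delays

    total_benefit = (saved_storage + saved_provisioning_labor + developer_efficiency_gain
                     + avoided_production_defects + earlier_time_to_market)

    roi = (total_benefit - tdm_tool_cost) / tdm_tool_cost
    print(f"Estimated ROI: {roi:.0%}")   # 100% means the net gain equals the investment
    ```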

    Traditional Software Testing without TDM—Slow, Ineffective, and Insecure

    The traditional method for generating test data is a largely manual process. A production database would be cloned for the purpose, and then an individual or team would be tasked with creating data subsets and performing other needed functions. This method is inefficient for several reasons:

    • Storage costs. Cloning an entire production database increases storage costs. Although the cost of storage is rather low today, production databases can be large; storing an entire copy is an unnecessary cost.
    • Labor and time. Cloning a database and manually preparing a subset can be a labor-intensive process. According to one survey of DevOps professionals, an average of 3.5 days and 3.8 people were needed to fulfill a request for test data that used production environment data; for 20% of the respondents, the timeframe was over a week.
    • Completeness/edge cases. Missing or misleading edge cases can skew the results of testing. A proper test data subset will need to include important edge cases, but not so many that they overwhelm test results.
    • Referential integrity. When creating a subset, that subset must be representative of the entire dataset. The data model underlying the test data must accurately define the relationships among key pieces of data. Primary keys must be properly linked, and data relationships should be based on well-defined business rules.
    • Ensuring data privacy and compliance. With the increasing number of data security and privacy laws worldwide, it’s important to ensure that your test data generation methods comply with relevant legislation.

    The goal in procuring a TDM tool is to overcome these challenges by automating large parts of the test data procurement process. Thus, the return on such an investment depends on the tool’s ability to guarantee speed, completeness, and referential integrity without consuming too many additional resources or creating compliance issues.

    Efficiency Returns—Driving Down Costs Associated with Testing

    When discussing saved costs, there are two main areas to consider: Internal costs and external ones. Internal costs reflect inefficiencies in process or resource allocation. External costs reflect missed opportunities or problems that arise when bringing a product to market. TDM can help organizations realize a return with both.

    Internal Costs and Test Data Procurement Efficiency

    There is no doubt that testing can happen faster, and sooner, when adequate data is provided more quickly with an automated process. Some industry experts report that, for most organizations, somewhere between 40% and 70% of all test data creation and provisioning can be automated.

    Part of an automated workflow should involve either subsetting the data, or virtualizing it. These steps alleviate the need to store complete copies of production databases, driving down storage costs. Even for a medium-sized organization, this can mean terabytes of saved storage space, with 80% to 90% reductions in storage space being reported by some companies.

    As for overall efficiency, team leaders say their developers are 20% to 25% more efficient when they have access to proper test data management tools.

    External Costs and Competitiveness in the Market

    Most organizations see TDM tools as a way to make testing more efficient, but just as important are the opportunity costs that accrue from slower and more error-prone manual testing. For example, the mean time to the detection of defects (MTTD) will be lower when test data is properly managed, which means software can be improved more quickly, preventing further bugs and client churn. The number of unnoticed defects is likely to decline as well. Catching an error early in development incurs only about one-tenth of the cost of fixing an error in production.

    Time-to-market (TTM) is also a factor here. Traditionally, software projects might have a TTM of six months to several years, but that timeframe is rapidly shrinking. If provisioning test data takes a week, and several testing cycles are needed, the delay in TTM due to data provisioning alone can be a full month or more. That is not only a month’s worth of lost revenue, but also ample time for a competitor to become more established.

    The Balance

    To review, the cost of any TDM tool and its implementation needs to be balanced against:

    • The cost of storage space for test data
    • The cost of personnel needs (3.8 employees, on average, over 3.5 days)
    • The benefit of an increase in efficiency of your development teams
    • Overall cost of a bug when found in production rather than in testing
    • Lost opportunity due to a slower time-to-market

    TDM Tools Achieve Positive ROI When They Solve These Challenges

    Admittedly, every organization will look different when these factors are assessed. So, while there are general considerations when it comes to the ROI of TDM tools, specific results will vary widely. We encourage readers to derive their own estimates for the numbers above.

    That said, the real question is not whether TDM tools provide an ROI. The question is which TDM tools are most likely to do so. Currently available tools differ in terms of their feature sets and ease of use. The better the tool, the higher the ROI will be.

    A tool will achieve positive ROI insofar as it can solve these challenges:

    • Ensuring referential integrity. This can be achieved through proper subsetting and pseudonymization capabilities. The proper number and kind of edge cases should be present, too.
    • Automated provisioning with appropriate security. This means being able to rapidly provision test data across the organization while also staying compliant with all major security and privacy regulations.
    • Scalability and flexibility. The more databases an organization has, the more it will need a tool that can work seamlessly across multiple data platforms. A good tool should have flexible deployment mechanisms to make scalability easy.

    These are precisely the challenges our engineers had in mind when developing Mage Data’s TDM capabilities. Our TDM solution achieves that balance, providing an ROI by helping DevOps teams test more quickly and get to market faster. For more specific numbers and case studies, you can schedule a demo and speak with our team.

  • Why Open-Source Tools Might Fall Short for Test Data Management

    You may have heard it said that the best things in life are free—but when it comes to Test Data Management (TDM), free is not always the best choice. For businesses, finding the right balance of value, security, stability, and performance is paramount. And while open-source tools can score well in those areas, there’s a chance that they’ll let you down when you need them most. Here’s what businesses need to know to evaluate open-source test data management tools before they commit.

    What Are Open-Source Tools?

    Before we dive into open-source test data management tools, we need to have a quick conversation about the term “open-source,” as it isn’t always used consistently. Upfront, it’s important to understand that not all free tools are open-source, and because open-source tools tend to be community-developed, they don’t carry the same expectations around security and customer support that closed-source tools do.

    Open-source refers to software “designed to be publicly accessible—anyone can see, modify, and distribute the code as they see fit.” Most of the software used in a business context isn’t open-source. For example, common applications like Outlook, Dropbox, or Google Workspace are closed source. The code that powers these applications isn’t published for review, and even if you got access to it, you wouldn’t be able to reuse it in your projects or modify it to run differently.

    Open-source software, by contrast, is intentionally designed so that the code is publicly available. Users are allowed to reuse or modify the code and, in some cases, even contribute new code to the project. Because of its open nature, open-source tools are often developed jointly by a community of passionate developers rather than by a single company. While most open-source tools are free to use, not all software that is free is open-source. An application may be distributed for free, but it’s not open-source if the code isn’t available for reuse, modification, or distribution.

    What are Open-Source TDM Tools Used For?

    For companies, open-source software sometimes makes a lot of sense. It may cost little to nothing to adopt, and if it has an enthusiastic community, it can receive free updates that improve functionality and security for the foreseeable future. While feature sets vary between open-source test data management tools, you could reasonably expect them to do a mixture of the following tasks:

    • Model your data structure
    • Generate test data sets by subsetting
    • Generate synthetic data
    • Provide access rules to restrict who can view data
    • Integrate with a variety of common databases

    Some popular open-source tools in the test data management space include CloudTDMS, ERBuilder, Databucket, and OpenTDM.

    Issues with Open-Source TDM Tools

    For some purposes, the above list may cover all needs. But for businesses with more serious testing needs, there are several issues that can appear when using open-source tools, especially for test data management.

    Limited Functionality and Quality

    One of the core shortcomings of open-source tools is that they’re delivered “as is” at a pace that works for their developers. Unlike software with a team of paid developers behind it, open-source software does not guarantee future support. If the application doesn’t have a feature you need now, there’s a chance you may never get it, and unlike with paid software, your requests for a new feature may carry no weight with the developers.

    With open-source test data management tools, this primarily creates issues in two areas. The first is user experience. Because these are often unpaid projects, time is a precious commodity. Consequently, development teams tend to spend more time on creating new features, with things like design and user experience being much lower priorities. Unless a designer wants to donate their time, the interfaces you use on the tool may be confusing, slow, or even broken in places.

    The second common issue is reporting. Most open-source TDM tools come with at least a limited reporting capability. For anything beyond small businesses with relatively small datasets, however, these reporting features might not be able to handle the complexity of a modern data environment. This can lead to inaccurate or misleading reporting, which can be especially damaging for businesses.

    Increased Compliance Risk

    Creating and using test data can carry substantial security and privacy risks, as it typically begins with production data containing personally identifiable information. Under most modern data privacy laws, such as the GDPR or CCPA, documenting how your data is used is necessary for compliance. While you might worry that an open-source tool could leak your data, the reality is that you’ll usually be running such tools locally.

    Instead, it’s more important to consider how well the tool integrates with your existing privacy and security workflow. Is it easy to integrate? Or does the connection keep breaking with each update? Does it provide good visibility into what and how data is being used? Or is it something of a black box? That’s not to say these tools generally have poor connectivity, just that they may not have the full range of integrations and security features you might expect from paid software.

    No Guarantee of Long-Term Availability

    When volunteers run a project, its continued existence often depends on their willingness to keep working for free. While an end to their work might not immediately remove the product from the market, it will eventually fall behind other programs’ features and security. And that means you will eventually need to make a change to avoid security issues or get the latest technology.

    Some businesses will already be planning to upgrade their TDM solution regularly, so that might not be a big deal. For others, changing to something new, even if it’s a new open-source software, means costs in terms of retraining, lost productivity, and possible delays in development during the upgrade. That can be an enormous cost, and open-source solutions are more likely to shut down without significant notice than paid ones.

    Limited Support

    Service-Level Agreements are a huge part of the modern software experience. If something breaks during an upgrade, knowing that you have both easy-to-reach support and a money-back guarantee can provide significant peace of mind. With open-source software, you’re unlikely to have significant support options beyond posting on a forum, and you can forget about an SLA. That doesn’t mean that all open-source solutions are unreliable. However, if something breaks and your team can’t fix it, there’s no way of knowing when it will be fixed.

    How Mage Data Helps with Test Data Management

    For some companies, choosing an open-source test data management system will be a great move. But some businesses need the extra layer of reliability, security, and compatibility that only paid software can provide. When evaluating these solutions, it’s important to understand the benefits and risks in order to choose the best option for your business. At Mage, we’ve built a solution designed to handle the most challenging TDM issues for organizations of every size, from small businesses to multi-billion-dollar enterprises. Contact us today to schedule a free demo to learn more about what Mage can do for you.

  • What to Look for in a Sensitive Data Discovery Tool

    Selecting the right sensitive data discovery tool for your organization can be challenging. Part of the difficulty lies in the fact that you will only get a feel for how effective your choice is after purchasing and implementing it. However, there are things you can do to help maximize your return on investment before you buy by focusing your attention on the right candidates. By selecting your finalists based on their ability to execute on the best practices for sensitive data discovery, you can significantly increase the odds that your final choice is a good fit for your needs.

    Best Practices for Sensitive Data Discovery

    Of course, you can’t effectively select for the best practices in sensitive data discovery without a deep understanding of what they are and how they impact your business. While any number of approaches could be considered “best practices,” here are four that we believe are the most impactful when implementing a new sensitive data discovery system.

    Maximize Automation

    While more automation is almost always good, when it comes to sensitive data discovery there’s a big difference between increasing automation and maximizing automation. In an ideal world, your data team would configure the rules for detecting personally identifiable information once and then spend their time on higher-value activities like monitoring and reporting. But there’s more to automation than just data types. Is the reporting automated? Does the tool integrate with the system that handles “right to be forgotten” requests? Any human-driven process is likely to fail when scaled up to millions or billions of data points. Success in this area means finding a solution that maximizes automation and minimizes the burden on your team.
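
    As a rough illustration of the “configure once, then automate” idea, the sketch below defines detection rules as data and applies them uniformly to every record. The patterns and sample values are simplified stand-ins, not production-grade detectors or any particular product’s rule format:

    ```python
    import re

    # Hypothetical rule set: each PII type is a named pattern, so the rules are
    # configured once and then applied automatically to every record scanned.
    PII_RULES = {
        "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
        "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
        "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    }

    def scan_record(record: dict) -> dict:
        """Report which configured PII types appear in each field of a record."""
        findings = {}
        for field, value in record.items():
            hits = [name for name, pattern in PII_RULES.items() if pattern.search(str(value))]
            if hits:
                findings[field] = hits
        return findings

    sample = {"note": "Call John back at 555-867-5309", "contact": "john.doe@example.com"}
    print(scan_record(sample))  # {'note': ['phone'], 'contact': ['email']}
    ```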

    Merge Classification and Discovery

    Data must be classified before its insights can be unlocked. Despite its similarities to data discovery, data classification is sometimes handled by a different department with different tools. A potential downside of that approach is that a key stakeholder gets a report from each department and asks why the numbers don’t match. As a result, your team is forced to spend time reconciling the different tools’ output—which is not a great place to expend resources. An easy way to fix this problem is to use a single tool to perform both processes. If that’s not a viable approach, ensuring the tools are integrated to produce the same results can be a great way to ensure that your company has a unified and consistent view of its data.

    Develop a Multi-Channel Approach

    One trap that companies sometimes fall into is believing that the discovery process is over once data arriving from outside the company has been identified and appropriately secured on the company network. This approach neglects one of the biggest sources of risk when it comes to data: your employees. Are you monitoring employee endpoints, like laptops and desktops, for personally identifiable information? If so, are you able to manually or automatically remedy the situation? You won’t always be able to stop employees from making risky moves with data. However, with a multi-channel approach to sensitive data discovery, you can monitor the situation and develop procedures to limit the damage.

    Create Regular Risk Assessments

    Identifying your sensitive data is only the first step in the process. To understand your company’s overall risk, you must deeply understand the relative risk that each piece of sensitive information holds. For example, data moved across borders holds significantly more risk than that in long-term cold storage. Databases that hold customer information inherently have more risk than those holding only corporate information. To meaningfully prioritize your efforts in securing data and optimizing your processes, you need regular risk assessments. At scale, this can be difficult to do on your own—so your sensitive data discovery software either needs to do it for you or have a robust integration with a program that can.

    Choosing the Right Sensitive Data Discovery Software

    While there are many possible ways to select a sensitive data discovery tool, the best practices we’ve covered offer a good starting place for most businesses. Remember that the features one software package has versus another are not necessarily as important as how those features support your business objectives. Maximizing automation, merging discovery and classification, developing a multi-channel approach, and creating regular risk assessments all have relatively little to do with the actual mechanics of data discovery, but they can all make a huge difference when building a healthy, secure company. There are a lot of different sensitive data discovery solutions that can solve your immediate problem. However, they may not do it in a way that holistically improves your business.

    Another important point is that data discovery is the first step in the data lifecycle that runs all the way to retirement. You could use a different tool for each stage of the process, but the end result would be a system with multiple independent parts that may or may not work well together. Ideally, you would be able to handle data throughout the lifecycle in one application. That’s where Mage Data comes in.

    How Mage Data Helps with Sensitive Data Discovery

    Mage Data’s approach to data security begins with robust data discovery through its patented Mage Sensitive Data Discovery tool, which is powered by artificial intelligence and natural language processing. It can identify more than 80 data types right out-of-the-box and can be configured for as many custom data types as you need.

    But that’s only the start of the process. Mage Data’s Data Masking solutions provide powerful static and dynamic masking options to keep data safe during the middle of its lifecycle, and its Data Minimization tool helps companies handle data access and erasure requests and create robust data retention rules. Having an integrated platform that handles all aspects of data security and privacy can save you money and be far simpler to operate than juggling different platforms for different operations. We believe that it shouldn’t matter whether you’re a small business or an enterprise – your data solutions should just work. To learn more about how Mage Data can help you with sensitive data discovery, schedule a demo today.