Mage Data

Category: Blogs – Intelligent Subsetting

  • Reimagining Test Data: Secure-by-Design Database Virtualization

    Reimagining Test Data: Secure-by-Design Database Virtualization

    Enterprises today are operating in an era of unprecedented data velocity and complexity. The demand for rapid software delivery, continuous testing, and seamless data availability has never been greater. At the same time, organizations face growing scrutiny from regulators, customers, and auditors to safeguard sensitive data across every environment—production, test, or development.

    This dual mandate of speed and security is reshaping enterprise data strategies. As hybrid and multi-cloud infrastructures expand, teams struggle to provision synchronized, compliant, and cost-efficient test environments fast enough to keep up with DevOps cycles. The challenge lies not only in how fast data can move, but in how securely it can be replicated, masked, and managed.

    Database virtualization was designed to solve two of the biggest challenges in Test Data Management—time and cost. Instead of creating multiple full physical copies of production databases, virtualization allows teams to provision lightweight, reusable database instances that share a common data image. This drastically reduces storage requirements and accelerates environment creation, enabling developers and QA teams to work in parallel without waiting for lengthy data refresh cycles. By abstracting data from its underlying infrastructure, database virtualization improves agility, simplifies DevOps workflows, and enhances scalability across hybrid and multi-cloud environments. In short, it brings speed and efficiency to an otherwise resource-heavy process—freeing enterprises to innovate faster.
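    Conceptually, each virtual database records only its own changes on top of a shared, read-only data image. The minimal Python sketch below is an illustration of that copy-on-write idea only (not Mage Data's implementation): many clones share one base copy while keeping their writes isolated.

    ```python
    # Illustration only: copy-on-write semantics behind database virtualization.
    # Each "clone" stores just its own changes (delta) on top of a shared,
    # read-only base image, so N clones do not require N full copies.

    class BaseImage:
        def __init__(self, rows):
            self._rows = dict(rows)          # the single, shared "golden" copy

        def get(self, key):
            return self._rows.get(key)

    class VirtualClone:
        def __init__(self, base):
            self._base = base                # shared reference, not a copy
            self._delta = {}                 # only this clone's modifications

        def read(self, key):
            # A clone sees its own changes first, then falls back to the base.
            return self._delta.get(key, self._base.get(key))

        def write(self, key, value):
            self._delta[key] = value         # the base image is never mutated

    base = BaseImage({"cust:1": "Alice", "cust:2": "Bob"})
    dev = VirtualClone(base)
    qa = VirtualClone(base)

    dev.write("cust:1", "MASKED_NAME_01")            # dev's change is invisible to QA
    print(dev.read("cust:1"), qa.read("cust:1"))     # MASKED_NAME_01 Alice
    ```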

    Database virtualization was introduced to address inefficiencies in provisioning and environment management. It promised faster test data creation by abstracting databases from their underlying infrastructure. But for many enterprises, traditional approaches have failed to evolve alongside modern data governance and privacy demands.

    Typical pain points include:

    • Storage-Heavy Architectures: Conventional virtualization still relies on partial or full data copies, consuming vast amounts of storage.
    • Slow, Manual Refresh Cycles: Database provisioning often depends on DBAs, leading to delays, inconsistent refreshes, and limited automation.
    • Fragmented Data Privacy Controls: Sensitive data frequently leaves production unprotected, exposing organizations to compliance violations.
    • Limited Integration: Many solutions don’t integrate natively with CI/CD or hybrid infrastructures, making automated delivery pipelines cumbersome.
    • Rising Infrastructure Costs: With exponential data growth, managing physical and virtual copies across clouds and data centers drives up operational expenses.

    The result is an environment that might be faster than before—but still insecure, complex, and costly. To thrive in the AI and automation era, enterprises need secure-by-design virtualization that embeds compliance and efficiency at its core.

    Modern data-driven enterprises require database virtualization that does more than accelerate. It must automate security, enforce privacy, and scale seamlessly across any infrastructure—cloud, hybrid, or on-premises.

    This is where Mage Data’s Database Virtualization (DBV) sets a new benchmark. Unlike traditional tools that treat masking and governance as secondary layers, Mage Data Database Virtualization builds them directly into the virtualization process. Every virtual database created is masked, compliant, and policy-governed by default—ensuring that sensitive information never leaves production unprotected.

    Database Virtualization's lightweight, flexible architecture enables teams to provision virtual databases in minutes, without duplicating full datasets or requiring specialized hardware. It's a unified solution that accelerates innovation while maintaining uncompromising data privacy and compliance.

    1. Instant, Secure Provisioning
      Create lightweight, refreshable copies of production databases on demand. Developers and QA teams can access ready-to-use environments instantly, reducing cycle times from days to minutes.
    2. Built-In Data Privacy and Compliance
      Policy-driven masking ensures that sensitive data remains protected during every clone or refresh. Mage Data Database Virtualization is compliance-ready with frameworks like GDPR, HIPAA, and PCI-DSS, ensuring enterprises maintain regulatory integrity across all environments.
    3. Lightweight, Flexible Architecture
      With no proprietary dependencies or hardware requirements, Database Virtualization integrates effortlessly into existing IT ecosystems. It supports on-premises, cloud, and hybrid infrastructures, enabling consistent management across environments.
    4. CI/CD and DevOps Integration
      DBV integrates natively with Jenkins, GitHub Actions, and other automation tools, empowering continuous provisioning within DevOps pipelines (see the pipeline sketch after this list).
    5. Cost and Operational Efficiency
      By eliminating full physical copies, enterprises achieve up to 99% storage savings and dramatically reduce infrastructure, cooling, and licensing costs. Automated refreshes and rollbacks further cut manual DBA effort.
    6. Time Travel and Branching (Planned)
      Upcoming capabilities will allow enterprises to rewind databases or create parallel branches, enabling faster debugging and parallel testing workflows.
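    The pipeline sketch referenced in item 4: as an illustration of what continuous provisioning can look like, the snippet below calls a provisioning REST endpoint from a pipeline step to request a masked virtual clone before tests run. The endpoint URL, payload fields, and environment variables are illustrative assumptions, not Mage Data's documented API.

    ```python
    # Hypothetical pipeline step: request a masked virtual database before tests run.
    # The endpoint, payload fields, and environment variables are illustrative only.
    import os
    import requests

    def provision_virtual_db(source_db: str, policy: str) -> str:
        """Ask the (hypothetical) DBV service for a masked clone; return its connection URI."""
        response = requests.post(
            os.environ.get("DBV_API_URL", "https://dbv.example.internal/api/v1/clones"),
            headers={"Authorization": f"Bearer {os.environ['DBV_API_TOKEN']}"},
            json={"source": source_db, "masking_policy": policy, "ttl_hours": 8},
            timeout=60,
        )
        response.raise_for_status()
        return response.json()["connection_uri"]

    if __name__ == "__main__":
        uri = provision_virtual_db("erp_prod", "gdpr_default")
        print(f"Run the test suite against: {uri}")   # e.g. exported for later pipeline stages
    ```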

    The AI-driven enterprise depends on speed—but the right kind of speed: one that doesn’t compromise security or compliance. Mage Data Database Virtualization delivers precisely that. By uniting instant provisioning, storage efficiency, and embedded privacy, it transforms database virtualization from a performance tool into a strategic enabler of governance, innovation, and trust.

    As enterprises evolve to meet the demands of accelerating development, they must modernize their entire approach to data handling—adapting for an AI era where agility, accountability, and assurance must coexist seamlessly.

    Mage Data’s Database Virtualization stands out as the foundation for secure digital transformation—enabling enterprises to accelerate innovation while ensuring privacy and compliance by design.

  • Building Trust in AI: Strengthening Data Protection with Mage Data

    Building Trust in AI: Strengthening Data Protection with Mage Data

    Artificial Intelligence is transforming how organizations analyze, process, and leverage data. Yet, with this transformation comes a new level of responsibility. AI systems depend on vast amounts of sensitive information — personal data, intellectual property, and proprietary business assets — all of which must be handled securely and ethically.

    Across industries, organizations are facing a growing challenge: how to innovate responsibly without compromising privacy or compliance. The European Commission’s General-Purpose AI Code of Practice (GPAI Code), developed under the EU AI Act, provides a structured framework for achieving this balance. It defines clear obligations for AI model providers under Articles 53 and 55, focusing on three key pillars — Safety and Security, Copyright Compliance, and Transparency.

    However, implementing these requirements within complex data ecosystems is not simple. Traditional compliance approaches often rely on manual audits, disjointed tools, and lengthy implementation cycles. Enterprises need a scalable, automated, and auditable framework that bridges the gap between regulatory expectations and real-world data management practices.

    Mage Data Solutions provides that bridge. Its unified data protection platform enables organizations to operationalize compliance efficiently — automating discovery, masking, monitoring, and lifecycle governance — while maintaining data utility and accelerating AI innovation.

    The GPAI Code establishes a practical model for aligning AI system development with responsible data governance. It is centered around three pillars that define how providers must build and manage AI systems.

    1. Safety and Security
      Organizations must assess and mitigate systemic risks, secure AI model parameters through encryption, protect against insider threats, and enforce multi-factor authentication across access points.
    2. Copyright Compliance
      Data sources used in AI training must respect intellectual property rights, including automated compliance with robots.txt directives and digital rights management. Systems must prevent the generation of copyrighted content.
    3. Transparency and Documentation
      Providers must document their data governance frameworks, model training methods, and decision-making logic. This transparency ensures accountability and allows regulators and stakeholders to verify compliance.

    These pillars form the foundation of the EU’s AI governance model. For enterprises, they serve as both a compliance obligation and a blueprint for building AI systems that are ethical, explainable, and secure.

    Mage Data’s platform directly maps its data protection capabilities to the GPAI Code’s requirements, allowing organizations to implement compliance controls across the full AI lifecycle — from data ingestion to production monitoring.

    GPAI Requirement – Mage Data Capability – Compliance Outcome

    • Safety & Security (Article 53) – Sensitive Data Discovery: Automatically identifies and classifies sensitive information across structured and unstructured datasets, ensuring visibility into data sources before training begins.
    • Safety & Security (Article 53) – Static Data Masking (SDM): Anonymizes training data using over 60 proven masking techniques, ensuring AI models are trained on de-identified yet fully functional datasets.
    • Safety & Security (Article 53) – Dynamic Data Masking (DDM): Enforces real-time, role-based access controls in production systems, aligning with Zero Trust security principles and protecting live data during AI operations.
    • Copyright Compliance (Article 55) – Data Lifecycle Management: Automates data retention, archival, and deletion processes, ensuring compliance with intellectual property and “right to be forgotten” requirements.
    • Transparency & Documentation (Article 55) – Database Activity Monitoring: Tracks every access to sensitive data, generates audit-ready logs, and produces compliance reports for regulatory or internal review.
    • Transparency & Accountability – Unified Compliance Dashboard: Provides centralized oversight for CISOs, compliance teams, and DPOs to manage policies, monitor controls, and evidence compliance in real time.

    By aligning these modules to the AI Code’s compliance pillars, Mage Data helps enterprises demonstrate accountability, ensure privacy, and maintain operational efficiency.

    Mage Data enables enterprises to transform data protection from a compliance requirement into a strategic capability. The platform’s architecture supports high-scale, multi-environment deployments while maintaining governance consistency across systems.

    Key advantages include:

    • Accelerated Compliance: Achieve AI Act alignment faster than traditional, fragmented methods.
    • Integrated Governance: Replace multiple point solutions with a unified, policy-driven platform.
    • Reduced Risk: Automated workflows minimize human error and prevent data exposure.
    • Proven Scalability: Secures over 2.5 billion data rows and processes millions of sensitive transactions daily.
    • Regulatory Readiness: Preconfigured for GDPR, CCPA, HIPAA, PCI-DSS, and EU AI Act compliance.

    This integrated approach enables security and compliance leaders to build AI systems that are both trustworthy and operationally efficient — ensuring every stage of the data lifecycle is protected and auditable.

    Mage Data provides a clear, step-by-step implementation plan for meeting each obligation.

    This structured approach takes the guesswork out of compliance and ensures organizations are always audit-ready.

    The deadlines for AI Act compliance are approaching quickly. Delaying compliance not only increases costs but also exposes organizations to risks such as:

    • Regulatory penalties that impact global revenue.
    • Data breaches that harm brand trust.
    • Missed opportunities, as competitors who comply early gain a reputation for trustworthy, responsible AI.

    By starting today, enterprises can turn compliance from a burden into a competitive advantage.

    The General-Purpose AI Code of Practice sets high standards, but meeting them doesn’t have to be slow or costly. With Mage Data’s proven platform, organizations can achieve compliance in weeks, not years — all while protecting sensitive data, reducing risks, and supporting innovation.

    AI is the future. With Mage Data, enterprises can embrace it responsibly, securely, and confidently.

    Ready to get started? Contact Mage Data for a free compliance assessment and see how we can help your organization stay ahead of the curve.

  • TDM 2.0 vs. TDM 1.0: What’s Changed?

    TDM 2.0 vs. TDM 1.0: What’s Changed?

    As digital transformation continues to evolve, test data management (TDM) plays a key role in ensuring data security, compliance, and efficiency. TDM 2.0 introduces significant improvements over TDM 1.0, building on its strengths while incorporating modern, cloud-native technologies. These advancements enhance scalability, integration, and user experience, making TDM 2.0 a more agile and accessible solution. With a focus on self-service capabilities and an intuitive conversational UI, this next-generation approach streamlines test data management, delivering notable improvements in efficiency and performance. 

    Foundation & Scalability  

    Understanding the evolution from TDM 1.0 to TDM 2.0 highlights key improvements in technology and scalability. These enhancements address past limitations and align with modern business needs. 

    Modern Tech Stack vs. Legacy Constraints 
    TDM 1.0 relied on traditional systems that, while reliable, were often constrained by expensive licensing and limited scalability. TDM 2.0 shifts to a cloud-native approach, reducing costs and increasing flexibility.
    • Eliminates reliance on costly database licenses, optimizing resource allocation. 
    • Enables seamless scalability through cloud-native architecture. 
    • Improves performance by facilitating faster updates and alignment with industry standards. 

    This transition ensures that TDM 2.0 is well-equipped to support evolving digital data management needs. 

    Enterprise-Grade Scalability vs. Deployment Bottlenecks 

    Deployment in TDM 1.0 was time-consuming, making it difficult to scale or update efficiently. TDM 2.0 addresses these challenges with modern deployment practices: 

    1. Containerization – Uses Docker for efficient, isolated environments. 
    2. Kubernetes Integration – Supports seamless scaling across distributed systems. 
    3. Automated Deployments – Reduces manual effort, minimizing errors and accelerating rollouts. 

    With these improvements, organizations can deploy updates faster and manage resources more effectively. 

    Ease of Use & Automation  

    User experience is a priority in TDM 2.0, making the platform more intuitive and less dependent on IT support. 

    Conversational UI vs. Complex Navigation 

    TDM 1.0 required multiple steps for simple tasks, creating a steep learning curve. TDM 2.0 simplifies interactions with a conversational UI: 

    • Allows users to create test data and define policies with natural language commands. 
    • Reduces training time, enabling quicker adoption. 
    • Streamlines navigation, making data management more accessible. 

    This user-friendly approach improves efficiency and overall satisfaction. 

    Self-Service Friendly vs. High IT Dependency 

    TDM 2.0 reduces IT reliance by enabling self-service capabilities: 

    1. Users can manage test data independently, freeing IT teams for strategic work. 
    2. Integrated automation tools support customized workflows. 

    Developer-Ready vs. No Test Data Generation 

    A user-friendly interface allows non-technical users to perform complex tasks with ease. These features improve productivity and accelerate project timelines. 

    Data Coverage & Security  

    Comprehensive data support and strong security measures are essential in test data management. TDM 2.0 expands these capabilities significantly. 

    Modern Data Ready vs. Limited Coverage 

    TDM 1.0 had limited compatibility with modern databases. TDM 2.0 addresses this by: 

    • Supporting both on-premises and cloud-based data storage. 
    • Integrating with cloud data warehouses. 
    • Accommodating structured and unstructured data. 

    This broad compatibility allows organizations to manage data more effectively. 

    Secure Data Provisioning with EML vs. In-Place Masking Only 

    TDM 2.0 introduces EML (Extract-Mask-Load) pipelines, offering more flexible and secure data provisioning: 

    • Secure data movement across different storage systems. 
    • Policy-driven data subsetting for optimized security. 
    • Real-time file monitoring for proactive data protection. 

    These enhancements ensure stronger data security and compliance. 
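    As a minimal sketch of the extract-mask-load idea (using SQLite stand-ins for the real source and target systems; the schema and masking rule are illustrative assumptions), sensitive values are transformed in flight so that only masked data ever reaches the provisioned environment:

    ```python
    # Minimal Extract-Mask-Load (EML) illustration: pull rows from a source,
    # mask sensitive columns in flight, then load only masked data into the target.
    # SQLite stands in for the real source/target systems; the schema is illustrative.
    import hashlib
    import sqlite3

    def mask_email(value: str) -> str:
        # Deterministic, irreversible replacement that still looks like an email.
        digest = hashlib.sha256(value.encode()).hexdigest()[:10]
        return f"user_{digest}@masked.example"

    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE customers (id INTEGER, name TEXT, email TEXT)")
    source.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                       [(1, "Alice", "alice@corp.com"), (2, "Bob", "bob@corp.com")])

    target = sqlite3.connect(":memory:")
    target.execute("CREATE TABLE customers (id INTEGER, name TEXT, email TEXT)")

    # Extract -> Mask -> Load: sensitive values never reach the target unmasked.
    for cust_id, name, email in source.execute("SELECT id, name, email FROM customers"):
        target.execute("INSERT INTO customers VALUES (?, ?, ?)",
                       (cust_id, name, mask_email(email)))

    print(list(target.execute("SELECT * FROM customers")))
    ```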

    Governance & Integration  

    Effective data governance and integration are key strengths of TDM 2.0, helping organizations maintain oversight and connectivity. 

    Built-in Data Catalog vs. Limited Metadata Management 

    TDM 2.0 improves data governance by providing a built-in data catalog: 

    1. Centralizes metadata management for easier governance. 
    2. Visualizes data lineage for better transparency. 
    3. Supports integration with existing cataloging tools. 

    This centralized approach improves data oversight and compliance. 

    API-First Approach vs. Limited API Support 

    TDM 2.0 enhances integration with an API-first approach: 

    • Connects with third-party tools, including data catalogs and security solutions. 
    • Supports single sign-on (SSO) for improved security. 
    • Ensures compatibility with various tokenization tools. 

    This flexibility allows organizations to integrate TDM 2.0 seamlessly with their existing and future technologies. 

    Future-Ready Capabilities  

    Organizations need solutions that not only meet current demands but also prepare them for future challenges. TDM 2.0 incorporates key future-ready capabilities. 

    GenAI-Ready vs. No AI/ML Support 

    Unlike TDM 1.0, which lacked AI support, TDM 2.0 integrates with AI and GenAI tools: 

    • Ensures data protection in AI training datasets. 
    • Prevents unauthorized data access. 
    • Supports AI-driven environments for innovative applications. 

    These capabilities position TDM 2.0 as a forward-thinking solution. 

    Future-Ready Capabilities 

    TDM 2.0 is built to handle future demands with: 

    1. Scalability to accommodate growing data volumes. 
    2. Flexibility to adapt to new regulations and compliance requirements. 
    3. Integration capabilities for emerging technologies. 

    By anticipating future challenges, TDM 2.0 helps organizations stay agile and ready for evolving data management needs.

  • Why is Referential Integrity Important in Test Data Management?

    Why is Referential Integrity Important in Test Data Management?

    Finding the best test data management tools requires getting all the major features you need—but that doesn’t mean you can ignore the little ones, either. While maintaining referential integrity might not be the most exciting part of test data management, it can, when executed poorly, be an issue that frustrates your team and makes them less productive. Here’s what businesses need to do to ensure their testing process is as frictionless and efficient as possible.

    What is Referential Integrity?

    Before exploring how referential integrity errors can mislead the testing process, we must first explore what it is. While there are a few different options for storing data at scale, the most common method is the relational database. Relational databases are composed of tables, and tables are made up of rows and columns. Rows, or records, represent individual pieces of information, and each column contains an attribute of that record. So, a “customer” table, for example, would have a row for each customer and columns like “first name,” “last name,” “address,” “phone number,” and so on. Every row in a table also has a unique identifier called a “key.” Typically, the first row is assigned the key “1,” the second “2,” and so on.

    The key is important when connecting data between tables. For example, you might have a second table that stores information about purchases. Each row would be an individual transaction, and the columns would be things like the total price, the date, the location at which the purchase was made, and so on. The power of relational databases is that entries in tables can reference other tables based on keys. This approach helps eliminate ambiguity. There might be multiple “John Smiths” in your customer table, but only one will have the unique key “1,” so we can tie transactions to that customer by using their unique key rather than something that there might be multiple of, like a name. Therefore, referential integrity refers to the accuracy and consistency of the relationship between tables.
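    The relationship described above can be made concrete with a small SQLite example (the schema is illustrative): the purchases table references the customer table through a foreign key, and the database can be asked to enforce that relationship.

    ```python
    # Illustrative schema: a purchases table references customers via a foreign key.
    # Referential integrity means every purchases.customer_id points at a real customer.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("PRAGMA foreign_keys = ON")          # SQLite enforces FKs only when enabled
    db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)")
    db.execute("""CREATE TABLE purchases (
                     id INTEGER PRIMARY KEY,
                     customer_id INTEGER NOT NULL REFERENCES customers(id),
                     total REAL)""")

    db.execute("INSERT INTO customers VALUES (1, 'John', 'Smith')")
    db.execute("INSERT INTO purchases VALUES (100, 1, 49.99)")       # valid: customer 1 exists

    try:
        db.execute("INSERT INTO purchases VALUES (101, 42, 19.99)")  # customer 42 does not exist
    except sqlite3.IntegrityError as err:
        print("Referential integrity violation:", err)
    ```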

    How Does Referential Integrity Affect Test Data?

    Imagine a scenario in which a customer, “John Doe,” exercised his right under GDPR or CCPA to have his personal data deleted. As a result of this request, his record in the customer table would be deleted, though the transactions would likely remain, as they aren’t personal data. Now, your developers could be working on a new application that processes transactional data and pulls up user information when someone selects a certain transaction. If John’s transactions were included in the test data used, the test would result in an error whenever one of those transactions came up, because the customer record they reference no longer exists.

    The developers’ first reaction wouldn’t necessarily be to look at the underlying data, but to instead assume that there was some sort of bug in the code they had been working on. So, they might write new code, test it, see the error again, and start over a few times before realizing that the underlying data is flawed.

    While that may just sound like a waste of a few hours, this is an extremely basic example. More complex applications could be connecting data through dozens of tables, and the code might be far longer and more complicated…so it can take days for teams to recognize that there isn’t a problem with the code itself but with the data they’re using for testing. Companies need a system that can help them deal with referential integrity issues when creating test data sets, no matter what approach to generating test data they use.

    Referential Integrity in Subsetting

    One approach to generating test data is subsetting. Because production databases can be very, very large, subsetting creates a smaller, more manageable copy of a portion of the database for testing. When it comes to referential integrity, subsetting faces the same issues that using a live production environment does: someone still needs to scrub through the data and either delete records with missing references or create new dummy records to replace missing ones. This can be a time-consuming and error-prone process.
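    One way to automate that scrub, sketched below with plain Python data structures (the column names are illustrative): after pulling a subset, find child rows whose parent keys were left out, then either drop them or generate placeholder parents so every reference still resolves.

    ```python
    # Illustrative subsetting check: find "orphaned" child rows whose parent
    # records did not make it into the subset, then repair the subset.
    customers_subset = [{"id": 1, "name": "Ava"}, {"id": 3, "name": "Raj"}]
    purchases_subset = [
        {"id": 100, "customer_id": 1, "total": 49.99},
        {"id": 101, "customer_id": 2, "total": 19.99},   # customer 2 was not subsetted
        {"id": 102, "customer_id": 3, "total": 75.00},
    ]

    known_ids = {c["id"] for c in customers_subset}
    orphans = [p for p in purchases_subset if p["customer_id"] not in known_ids]

    # Option A: drop orphaned transactions from the test subset.
    clean_purchases = [p for p in purchases_subset if p["customer_id"] in known_ids]

    # Option B: keep them, but add placeholder parents so references still resolve.
    placeholders = [{"id": p["customer_id"], "name": "DUMMY_CUSTOMER"} for p in orphans]

    print(f"{len(orphans)} orphaned row(s); {len(placeholders)} placeholder parent(s) created")
    ```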

    Referential Integrity in Anonymized/Pseudonymized datasets

    Anonymization and pseudonymization are two more closely related approaches to test data generation. Pseudonymization takes personally identifiable information and changes it so that it cannot be linked to a real person without combining it with other information stored elsewhere. Anonymization also replaces PII, but does so in a way that is irreversible.

    These procedures make the data safer for testing purposes, but the generation process could lead to referential integrity issues. For example, the anonymization process may obscure the relationships between tables, creating reference issues if the program doing the anonymization isn’t equipped to handle the issue across the database as a whole.
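    A common safeguard is deterministic masking: the same input value always maps to the same masked value, so references between tables stay aligned even after anonymization. A minimal sketch (the HMAC key and column names are illustrative assumptions, not a specific product configuration):

    ```python
    # Deterministic pseudonymization keeps cross-table references consistent:
    # the same original key always maps to the same masked key, so joins still work.
    import hmac, hashlib

    SECRET = b"illustrative-masking-key"          # in practice, a managed secret

    def mask_key(value: str) -> str:
        return hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()[:12]

    customers = [{"customer_id": "C-1001", "name": "John Doe"}]
    purchases = [{"txn_id": "T-9", "customer_id": "C-1001", "total": 49.99}]

    masked_customers = [{**c, "customer_id": mask_key(c["customer_id"]), "name": "MASKED"} for c in customers]
    masked_purchases = [{**p, "customer_id": mask_key(p["customer_id"])} for p in purchases]

    # The relationship survives masking because both tables used the same function.
    assert masked_purchases[0]["customer_id"] == masked_customers[0]["customer_id"]
    print("Masked key:", masked_purchases[0]["customer_id"])
    ```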

    How Does Mage Data Help with Test Data Management?

    The key to success with referential integrity in test data management is taking a holistic approach to your data. Mage Data helps companies with every aspect of their data, from data privacy and security to data subject access rights automation, to test data management. This comprehensive approach ensures that businesses can spend less time dealing with frustrating issues like broken references and more time on the tasks that make a real difference. To learn more about Mage’s test data management solution, schedule a demo today.


  • 6 Common Data Anonymization Mistakes Businesses Make Every Day

    6 Common Data Anonymization Mistakes Businesses Make Every Day

    Data is a crucial resource for businesses today, but using data legally and ethically often requires data anonymization. Laws like the GDPR in Europe require companies to ensure that personal data is kept private, limiting what companies can do with personal data. Data anonymization allows companies to perform critical operations—like forecasting—with data that preserves the original’s characteristics but lacks the personally identifying data points that could harm its users if leaked or misused.

    Despite the importance of data anonymization, there are many mistakes that companies regularly make when performing this process. These errors are not only dangerous to their users, but could also subject them to regulatory action in a growing number of countries. Here are six of the most common data anonymization mistakes that you should avoid.

    1. Only changing obvious personal identification indicators

    One of the trickiest parts of anonymizing a dataset is determining what is or isn’t Personally Identifiable Information (PII), the kind of information you want to ensure is kept safe. Individual data points like date of purchase or the amount paid may not be personal information, but a credit card number or a name would be. Of course, you could go through the dataset by hand and ensure that all relevant data types are anonymized, but there’s still a chance that something slips through the cracks.

    For example, if data is in an unstructured column, it may not appear on search results when you’re looking for PII. Or a benign-looking column may exist separately in another table, allowing bad actors to reconstruct the original user identities if they got access to both tables. Small mistakes like these can doom an anonymization project to failure before it even begins.
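    A simple illustration of why unstructured columns are easy to miss: pattern-based scanning can flag likely PII inside free-text fields that a column-name search would never surface. The regexes below are deliberately simplified assumptions; real discovery combines patterns with context and NLP.

    ```python
    # Simplified illustration: pattern-based scan of a free-text column for likely PII.
    # These regexes are minimal and will miss many formats; they are illustrative only.
    import re

    PII_PATTERNS = {
        "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
        "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
        "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    }

    notes_column = [
        "Customer called from 555-867-5309 about invoice 4412",
        "Refund issued; confirmation sent to jane.doe@example.com",
        "No sensitive details in this note",
    ]

    for row_num, text in enumerate(notes_column, start=1):
        hits = [label for label, pattern in PII_PATTERNS.items() if pattern.search(text)]
        if hits:
            print(f"Row {row_num}: possible PII ({', '.join(hits)})")
    ```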

    2. Mistaking synthetic data for anonymized data

    Anonymizing or “masking” data takes PII in datasets and alters it so that it can’t be traced back to the original user. Another approach to data security is to instead create “synthetic” datasets. Synthetic datasets attempt to recreate the relationships between data-points in the original dataset while creating an entirely new set of data points.

    Synthetic data may or may not live up to its claims of preserving the original relationships. If it doesn’t, it may not be useful for your intended purposes. However, even if the connections are good, treating synthesized data like it’s anonymized or vice versa can lead to mistakes in interpreting the data or ensuring that it is properly stored or distributed.

    3. Confusing anonymization with pseudonymization

    According to the EU’s GDPR, data is anonymized when it can no longer be reverse engineered to reveal the original PII. Pseudonymization, in comparison, replaces PII with different information of the same type. Pseudonymization doesn’t guarantee that the dataset cannot be reverse engineered if another dataset is brought in to fill in the blanks.

    Consequently, anonymized data is generally exempted from GDPR. Pseudonymization is still subject to regulations, albeit reduced relative to normal data. Companies that don’t correctly categorize their data into one bucket or the other could face heavy regulatory action for violating the GDPR or other data laws worldwide.

    4. Only anonymizing one data set

    One of the common threats we’ve covered so far is the threat of personal information being reconstructed by introducing a non-anonymized database to the mix. There’s an easy solution to that problem: instead of anonymizing only one dataset, why not anonymize all of the ones that share data? That way, it would be impossible to reconstruct the original data.

    Of course, that’s not always going to be possible in a production environment. You may still need the original data for a variety of reasons. However, if you’re ever anonymizing data and sending it beyond the bounds of your organization, you have to consider the many interconnections between databases, and that may mean that, to be safe, you need to anonymize data you don’t release.

    5. Anonymizing data—but also destroying it

    Data becomes far less valuable if the connections between its points become corrupted or weakened. A poorly executed anonymization process can lead to data that has no value whatsoever. Of course, it’s not always obvious that this is the case. A casual examination wouldn’t reveal anything wrong, leading companies to draw false conclusions from their data analysis.

    That means that a good anonymization process should protect user data and do it in a way where you can be confident that the final results will be what you need.

    6. Applying the same anonymization technique to all problems

    Sometimes when we have a problem, our natural reaction is to use a solution that worked in the past for a similar problem. However, as you can see from all the examples we’ve explored, the right solution for securing data varies greatly based on what you’re securing, why you’re securing it, and your ultimate plans for that data.

    Using the same technique repeatedly can leave you more vulnerable to reverse engineering. Worse, it means that you’re not maximizing the value of each dataset and are possibly over- or under-securing much of your data.

    Wrapping Up

    Understanding your data is the key to unlocking its potential and keeping PII safe. Many of the issues we outlined in this article do not stem from a lack of technical prowess. Instead, the challenge of dealing with millions or even billions of discrete data points can easily turn a quick project into one that drags out for weeks or months. Or worse, projects can end up “half-completed,” weakening data analysis and security objectives.

    Most companies need a program that can do the heavy lifting for them. Mage Data helps organizations find and catalog their data, including highlighting Personally Identifiable Information. Not only is this process automated, but it also uses Natural Language Processing to identify mislabeled PII. Plus, it can help you mask data in a static or a dynamic fashion, ensuring you’re anonymizing data in the manner that best fits your use case. Schedule a demo today to see what Mage Data can do to help your organization better secure its data.

  • 5 Common Mistakes Organizations Make During Data Obfuscation

    5 Common Mistakes Organizations Make During Data Obfuscation

    What is Data Obfuscation?

    As the name suggests, data obfuscation is the process of hiding sensitive data with modified or other data to secure it. Many are often confused by the term data obfuscation and what it entails, as it is a broad term used for several data security techniques such as anonymization, pseudonymization, masking, encryption, and tokenization.

    The need for data obfuscation is omnipresent, with companies needing to achieve business objectives such as auditing, cross-border data sharing, and the like. Apart from this, the high rate of cybercrime is also a pressing reason for companies to invest in technology that can help protect their data, especially now, given the shift to remote work during the Covid pandemic.

    Let’s look at some of the best practices you can follow for data obfuscation:

    1) Understand your options

    It is vital to understand the differences between data obfuscation techniques such as anonymization, pseudonymization, encryption, masking, and tokenization. Unless you’re knowledgeable about the various methods of data security and their benefits, you cannot make an informed choice to fulfill your data security needs.

    2) Keep in mind the purpose of your data

    Of course, the need of the hour is to secure your data. But every data element has a specific purpose. For example, if the data is needed for analytical purposes, you cannot go ahead with a simple encryption algorithm and expect good results. You need to select a technique, such as masking, that will preserve the functionality of the data while ensuring security. The method of obfuscation chosen should facilitate the purpose for which your data is intended.

    3) Enable regulatory compliance

    Of course, data security is a broader term than compliance, but does being secure mean you’re compliant too? Data protection standards and laws such as HIPAA, PCI DSS, GDPR, and CCPA each cover a defined scope and aim to secure that particular category of information. So, it is imperative to figure out which of those laws you are required to comply with and to implement procedures that ensure you do. Security and compliance are not the same – ensure both.

    4) Follow the principle of least privilege

    The principle of least privilege is the idea that any user, program, or process should have only the bare minimum privileges necessary to perform its function. It works by allowing only enough access to perform the required job. Apart from hiding sensitive data from those unauthorized, data obfuscation techniques like Dynamic Data Masking can also be used to provide user-based access to private information.
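    As a rough illustration of how dynamic masking can enforce least privilege, the sketch below returns either the real value or a masked view depending on the caller’s role. The roles and masking rules are assumptions for illustration, not a specific product configuration.

    ```python
    # Illustration of least privilege via dynamic masking: the stored value never
    # changes, but what a caller sees depends on their role.
    RECORD = {"name": "Jane Doe", "ssn": "777-29-1234", "balance": 1523.75}

    def masked_view(record: dict, role: str) -> dict:
        if role == "billing_admin":
            return dict(record)                                        # full access
        if role == "support_agent":
            return {**record, "ssn": "XXX-XX-" + record["ssn"][-4:]}   # partial reveal
        return {**record, "ssn": "********", "name": "REDACTED"}       # everyone else

    for role in ("billing_admin", "support_agent", "analyst"):
        print(role, "->", masked_view(RECORD, role))
    ```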

    5) Use repeatable and irreversible techniques

    Wherever applicable, it is advisable to use reliable techniques that produce the same results every time. And even if the data were seized by an attacker, it should not be reversible.
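    Both properties can be seen in a keyed-hash approach, sketched below: the same input always yields the same output (repeatable), and because the transform is one-way, the output cannot be decrypted back to the input (irreversible). The key name is an illustrative assumption.

    ```python
    # Repeatable: the same input always produces the same masked value, so joins,
    # lookups, and test runs stay consistent across environments.
    # Irreversible: the keyed hash is a one-way function, so the output cannot be
    # decrypted back to the original value.
    import hmac, hashlib

    MASKING_KEY = b"illustrative-secret"   # in practice, stored in a secrets manager

    def obfuscate(value: str) -> str:
        return hmac.new(MASKING_KEY, value.encode(), hashlib.sha256).hexdigest()

    print(obfuscate("777-29-1234"))
    print(obfuscate("777-29-1234") == obfuscate("777-29-1234"))   # True: repeatable
    ```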

    Conclusion:

    While data obfuscation is important to ensure the protection of your sensitive data, security experts must ensure that they do not implement a solution just to tick a checkbox. Data security solutions, when implemented correctly, can go a long way toward saving organizations millions of dollars in revenue.

  • Reidentification Risk of Masked Datasets: Part 2

    Reidentification Risk of Masked Datasets: Part 2

    This article is a continuation of Reidentification Risk of Masked Datasets: Part I, where we discussed how organizations progressed from simple to sophisticated methods of data security, and how even then, they faced the challenge of reidentification. In its conclusion, we shared what companies really need to focus on while anonymizing their data.

    Now, we delve into the subject of reidentification and how to go about achieving your goal, which is ultimately to reduce or eliminate the risk of reidentification.

    Before we dive further into this, it helps to understand a few basic concepts.

    What is reidentification and reidentification risk?

    Depending on the dataset and the method of anonymization, there is a direct or indirect possibility that the original data could be deciphered. This is called reidentification, and the associated risk is appropriately named reidentification risk.

    The NY cab example that we saw in Part 1 of this article is a classic case of reidentification, where the combination of indirectly related factors led to the reidentification of seemingly anonymized personal data.

    Understanding the terms data classification and direct identifiers

    A data classification, or identifier, is any data element that can be used to identify an individual, either by itself or in combination with another element — such as name, gender, DOB, employee ID, SSN, age, phone number, ZIP code and so forth. Certain specific data classifications — for instance, employee ID, SSN and phone number — are unique or direct identifiers (i.e., they can be used to uniquely identify an individual). Name and age, on the other hand, are not unique identifiers, since they repeat within a large dataset.

    Understanding indirect identifiers and reidentification risk through a simple example

    Let’s say you take a dataset of 100 employees and you’re tasked with finding a specific employee in her 40s. Assume that all direct identifiers have been anonymized. Now you look at indirect identifiers, such as race/ethnicity, city or, say, her bus route — and sure enough, you’ve identified her. Indirect identifiers depend on the dataset and, therefore, are distinct for different datasets. Even though the unique identifiers have been anonymized, you can’t say for sure that an individual can never be identified given that every dataset carries indirect identifiers, leading to the risk of reidentification.

    What are quasi-identifiers?

    Quasi-identifiers are a combination of data classifications that, when considered together, will be able to uniquely identify a person or an entity. As previously mentioned, studies have found that the five-digit ZIP code, birth date and gender form a quasi-identifier, which can uniquely identify 87% of the American population.
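    The risk a quasi-identifier carries can be estimated directly from the data: count how many records share each (ZIP, birth date, gender) combination, and any combination seen only once points to exactly one person. A minimal sketch with made-up rows:

    ```python
    # Estimate quasi-identifier risk: how many records share each (ZIP, birth date,
    # gender) combination? Combinations seen only once identify a single person.
    from collections import Counter

    records = [
        {"zip": "10001", "dob": "1976-01-21", "gender": "F"},
        {"zip": "10001", "dob": "1976-01-21", "gender": "F"},
        {"zip": "60601", "dob": "1983-07-04", "gender": "M"},   # unique combination
    ]

    combos = Counter((r["zip"], r["dob"], r["gender"]) for r in records)
    unique = sum(1 for count in combos.values() if count == 1)
    print(f"{unique} of {len(records)} records are uniquely identified by (ZIP, DOB, gender)")
    ```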

    Now that we are on the same page with essential terms, let’s get to the question: How do you go about choosing the right solution that minimizes or eliminates the risk of reidentification while still preserving the functionality of the data?

    The answer lies in taking a risk-versus-value approach.

    First, determine the reidentification risk carried by each identifier in your dataset. Identify its uniqueness and whether it is a direct or indirect identifier, then look at the possible combinations of quasi-identifiers. How high of a risk does a particular identifier pose, either on its own or combined with another piece of data? Depending on how big and diverse a dataset is, there are likely to be quite a few identifiers that can, singly or in combination, reidentify someone. If we treat each identifier as unique and anonymize it based on the risk it poses, regardless of what was done to other data elements, you end up with anonymized datasets that remain data-rich but remove the possibility of recovering the original data.

    In other words, this approach maintains demographical logic but not absolute demographics. Now that we’ve preserved data functionality, let’s address eliminating reidentification risk. Remember the quasi-identifier (ZIP Code, DOB and gender) that can uniquely identify 87% of the U.S. population? We can address this issue the same way, by maintaining demographical logic in the anonymized data. For example, Elsie, who was born on 1/21/76, can be anonymized to Annie born on 1/23/76.

    Notice that: 

    1. The gender remains the same (while the name is changed with the same number of characters).
    2. The ZIP code remains the same (while the street is changed).
    3. The date of birth changed by two days (while the month and year remain the same).

    This dataset maintains the same demographic data, which is ideal for analytics, without giving away any PII and, at the same time, plans ahead for and eliminates the risk of reidentification.
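    A minimal sketch of that demographically consistent transform, in the spirit of the Elsie-to-Annie example: keep gender and ZIP, swap the name for another of the same gender and length, and nudge the birth date by a couple of days. The replacement name list and offset are illustrative choices, not a prescribed algorithm.

    ```python
    # Illustration of demographically consistent anonymization: gender and ZIP are
    # preserved, the name is replaced with one of the same gender and length, and
    # the birth date shifts by a small offset (month and year preserved here).
    from datetime import date, timedelta

    FEMALE_NAMES_BY_LENGTH = {5: ["Annie", "Carla", "Maeve"]}   # illustrative lookup

    def anonymize(person: dict, day_offset: int = 2) -> dict:
        candidates = FEMALE_NAMES_BY_LENGTH.get(len(person["name"]), ["Jane"])
        return {
            "name": candidates[0],                               # same length, same gender
            "gender": person["gender"],                          # unchanged
            "zip": person["zip"],                                # unchanged
            "dob": person["dob"] + timedelta(days=day_offset),   # small, controlled shift
        }

    elsie = {"name": "Elsie", "gender": "F", "zip": "30301", "dob": date(1976, 1, 21)}
    print(anonymize(elsie))   # name becomes Annie, dob becomes 1976-01-23
    ```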

    A practical solution lies in keeping the ways in which data can be reidentified in mind when applying anonymization methods. The right approach maintains the dataset’s richness and value for the purpose it was originally intended, while minimizing or eliminating the reidentification risk of your masked datasets — an approach that makes for a comprehensive and effective data security solution for your sensitive data.

    By taking a risk-based approach to data masking, you can be assured that your data has been truly anonymized.

  • Reidentification Risk of Masked Datasets: Part 1

    Reidentification Risk of Masked Datasets: Part 1

    The following article is also hosted on Forbes Tech Council.

    Data masking, anonymization and pseudonymization aren’t new concepts. In fact, as the founder of a data security company, I’ve had 15 years of experience in helping my clients drive business decisions, enable transactions, allow for secure data sharing in a B2B or consumer scenario, or meet audit and financial requirements.

    One constant in my experience has been the issue of data reidentification. In many instances, companies take a one-size-fits-all easy route to data security, such as using Pretty Good Privacy (PGP) and thinking that 95% of the world won’t be able to reverse it, only to find that this assumption is incorrect.

    Take, for example, cases in which customers go into an ERP application and change all Social Security numbers to the most popular combination—999 99 9999. This is highly secure, of course, but renders the data nonfunctional.

    On the other hand, by staying too close to the data, you may ensure that the data performs well, but the trade-off ends up compromising security. For instance, you can add 1 to the Social Security number 777 29 1234, and it becomes 777 29 1235. Or you can add it to the digits in between, and it becomes 777 30 1234. In both cases, though, the data functionality is preserved at the cost of its security. The algorithm can be cracked, and the data is easily reversible.

    Some companies have even kicked the tires on synthetic data (pseudo values generated by algorithms), but that also negatively impacts testing since non-production environments need large volumes of data that carry the complexity of real data in order to test effectively and build the right enhancements. As the name suggests, synthetic data is in no way connected to reality.

    Theoretically speaking, it may seem like a great way to go about securing your information, but no machine algorithm can ever mimic the potency of actual data an organization accrues during its lifespan. Most projects that rely on synthetic data fail for the simple reason that it doesn’t allow for complexity in the interaction between data because, unlike actual data, it’s too uniform.

    There’s always a trade-off: Either your data is highly secure but low in functionality or vice versa. In search of alternatives, companies move to more sophisticated methods of data protection, such as anonymization and masking, to ensure data security in regard to functionality and performance.

    There is a catch, however, and it’s the reason why proper and complete anonymization and masking are critical: cross-referencing the data with other publicly available data can reidentify an individual from their metadata. As a result, private information such as PFI, PHI and contact information could end up in the public domain. In the wrong hands, this could be catastrophic.

    There have been many incidents in which this has already happened. In 2014, the New York City Taxi and Limousine Commission released anonymized datasets of about 173 million taxi trips. The anonymization, however, was so inadequate that a grad student was able to cross-reference this data with Google images on “celebrities in taxis in Manhattan” to find out celebrities’ pickup and drop-off destinations and even how much they paid their drivers.

    Another example is the Netflix Prize dataset contest, in which poorly anonymized records of customer data were reversed by comparing it against the Internet Movie Database. What is most alarming in situations like these is the information set that can be created by aggregating information from other sources. By combining indirectly related factors, one can reidentify seemingly anonymized personal data.

    Research conducted by the Imperial College London found that “once bought, the data can often be reverse-engineered using machine learning to reidentify individuals, despite the anonymization techniques. This could expose sensitive information about personally identified individuals and allow buyers to build increasingly comprehensive personal profiles of individuals. The research demonstrates for the first time how easily and accurately this can be done — even with incomplete datasets. In the research, 99.98% of Americans were correctly reidentified in any available ‘anonymized’ dataset by using just 15 characteristics, including age, gender and marital status.”

    While you fight between ensuring both data security and data functionality, you find yourself in a bind, choosing what to trade for the other. But does it have to be this way?

    To keep your business going, of course, you’re going to have to make compromises — but in a way that doesn’t challenge the security, performance or functionality of the data nor the compliance or privacy of the individual you’re trying to protect.

    We now know that data masking, anonymization and pseudonymization might not be as easy as we thought they would be. As a concept, it may seem deceptively simple, but approaching it in a simple, straightforward manner will limit your ability to scale your solution to fit the many ways your business will need to apply anonymization. This inability to scale across masking methodologies often ends in project failure.

    The focus really needs to be on reducing the risk of reidentification while preserving the functionality of the data (data richness, demographics, etc.). In part two, I will talk about how you can achieve this.

  • Differences between Anonymization and Pseudonymization

    Differences between Anonymization and Pseudonymization

    Under the umbrella of various data protection methods are anonymization and pseudonymization. More often than not, these terms are used interchangeably. But with the introduction of laws such as the GDPR, it becomes necessary to distinguish the two techniques clearly, as anonymized data and pseudonymized data fall under different categories of the regulation. Moreover, this knowledge also helps organizations make an informed choice when selecting data protection methods.

    So, let’s break it down. Anonymization is the permanent replacement of sensitive data with unrelated characters, which means that data, once anonymized, cannot be re-identified; herein lies the difference between the two methods. In pseudonymization, the sensitive data is replaced in such a way that it can be re-identified with the help of an identifier (additional information). In short, while anonymization eliminates direct re-identification risk, pseudonymization substitutes the identifiable data with a reversible, consistent value.
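    The difference can be shown in a few lines: anonymization discards any way back to the original value, while pseudonymization keeps a consistent token plus a separately protected mapping that can restore it. The names and values below are illustrative.

    ```python
    # Contrast (illustrative): anonymization keeps no path back to the original,
    # while pseudonymization keeps a separately protected mapping that can restore it.
    import secrets
    import string
    import uuid

    original = "John Smith"

    # Anonymization: permanent replacement with unrelated characters, no mapping kept.
    anonymized = "".join(secrets.choice(string.ascii_uppercase) for _ in range(len(original)))

    # Pseudonymization: a consistent token plus a lookup table (the "additional
    # information") that must be stored and protected separately.
    pseudonym_map = {}
    token = pseudonym_map.setdefault(original, f"PSN-{uuid.uuid4().hex[:8]}")
    reverse_map = {value: key for key, value in pseudonym_map.items()}

    print("Anonymized :", anonymized)                        # nothing retained to reverse it
    print("Pseudonym  :", token, "->", reverse_map[token])   # reversible via the protected map
    ```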

    However, it is essential to note that anonymization may sometimes carry the risk of indirect re-identification. For example, let’s say you picked up the novel The Open Window. The author’s name on the book is Saki. But this is a pen name. If you were to pick up another book of his, called The Chronicles of Clovis, you would notice that he has used his real name there, which is H. H. Munro, and that the writing style was similar. Hence, even though you didn’t know that the book was by Munro, you could put two and two together and find out that this is also a book by Saki based on the style of writing.

    The same example could also apply to a shopping experience, where you may not know the name of the customer who made the purchase but may be able to find out who it is if you can identify that this customer has consistent buying behavior. Every day for the past year, Alex has visited the Starbucks at 1500 Broadway at 10:10 am and ordered the same Tall Mocha Frappuccino. Hence, even if his personally identifiable information, such as name, address, etc., has been anonymized or eliminated, his buying behavior still allows you to re-identify him. Therefore, organizations should be meticulous when they anonymize sensitive data, careful to hide any additional information that might aid re-identification.

    There are a variety of methods available to anonymize data, such as directory replacement (modifying the individual’s name while maintaining consistency between values), scrambling (obfuscation; the process can sometimes be reversible), masking (hiding a part of the data with random characters; for example, pseudonymization with identities), personalized anonymization (custom anonymization) and blurring (making data values meaningless or impossible to re-identify). Pseudonymization methods include data encryption (changing original data into ciphertext that can be reversed with a decryption key) and data masking (masking data while maintaining its usability for different functions). Organizations can select one or more techniques depending on the degree of risk and the intended use of the data.

    Mage Data approaches anonymization and pseudonymization with its leading-edge solutions, named Customers’ Choice 2020 by Gartner Peer Insights. To read more, visit Mage Data Static Data Masking and Mage Data Dynamic Data Masking.