Mage Data

Category: Blogs – SDM

  • What Are the Best Test Data Management Tools?

    What Are the Best Test Data Management Tools?

    Evaluating Test Data Management tools can be a challenge, especially when you may not be able to get your hands on a product before making a purchase. The good news is that prioritizing the right approach and features can help businesses maximize their ROI with Test Data Management tools. Whether you’re just starting your evaluation process or need to prepare for an imminent purchase, we’ve compiled the information you need to make the best choice for your business.

    What Are the Core Elements of Successful Test Data Management?

    Before examining the best Test Data Management tools in detail, we have to consider what outcomes organizations want to drive with these tools. An effective Test Data Management program helps companies identify bugs and other flaws as early in the development cycle as possible, allowing for quick remediation while the mistakes are still small-scale and inexpensive to rectify.

    Testing is almost guaranteed to fail if you manage your test data in a way that leaves it unrepresentative of the data your live applications will use. Your testing will miss critical flaws, and your company will then face an ever-escalating series of expensive repairs.

    Each new application or part of an application being tested will be slightly different. As a consequence, Test Data Management must be a dynamic process. Testers need to alter datasets from test to test to ensure that each provides comprehensive testing for the feature or tool in play. They will also need to be able to customize the dataset to the specific test being performed. The way the test data is stored in a database can be different from its form when used in a front-end application, so this customization step ensures there won’t be errors related to data incompatibility.

    Testers also clean data before (or after) formatting it for the test. Data cleaning is the process of removing incomplete data points and removing or mitigating data points that would skew the test results. Testers will also need to identify sensitive data and ensure the proper steps are taken to protect personal information during testing. Finally, most testers will take steps to automate and scale the process. Many tests are run repeatedly, so having pre-prepared datasets on hand can greatly speed up the process.
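
    To make that cleaning step concrete, here is a minimal sketch in Python using pandas; the table and column names are hypothetical, and a real Test Data Management pipeline would apply far richer rules.

    import pandas as pd

    # Hypothetical extract pulled from a production copy for testing.
    orders = pd.DataFrame({
        "order_id": [1, 2, 3, 4, 5],
        "customer_email": ["a@example.com", None, "c@example.com", "d@example.com", "e@example.com"],
        "amount": [25.0, 30.0, None, 1_000_000.0, 42.5],
    })

    # Remove incomplete data points: rows missing any field the test requires.
    cleaned = orders.dropna(subset=["customer_email", "amount"])

    # Mitigate data points that would skew results: drop extreme outliers.
    upper = cleaned["amount"].quantile(0.99)
    cleaned = cleaned[cleaned["amount"] <= upper]

    print(cleaned)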

    What Features Do the Best Test Data Management Tools Have?

    Given the complexity of the tasks that testers need to perform for comprehensive and successful testing, getting the right Test Data Management tool is critically important. But that’s easier said than done. The tech world is constantly evolving, and while Test Data Management may seem like a mundane task, it needs to evolve to ensure that it continues to serve your testing teams’ needs. The best Test Data Management tools provide some or all of the following capabilities to ensure that teams are well-equipped for their testing projects.

    Connection to Production Data

    While you could, in theory, create a phony dataset for your testing, the reality is that the data in your production environment will always be the most representative of the challenges your live applications will face. The best Test Data Management tools make it easy for organizations to connect to their live databases and gather the information needed for their tests.

    Efficient Subsetting

    As we covered before, matching your data to the test is critical for success. Subsetting is the process of creating a dataset that matches the test parameters. Generally, this dataset is significantly smaller than your databases as a whole. As a result, testers need an efficient subsetting process that is fast, repeatable, and scalable, so they can spend more time running tests and less time preparing them.
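
    As a rough sketch of what an efficient, repeatable subsetting step can look like (the table, columns, and filter values below are hypothetical):

    import pandas as pd

    # Hypothetical production extract; in practice this comes from the source database.
    transactions = pd.DataFrame({
        "txn_id": range(1, 9),
        "region": ["EU", "US", "EU", "EU", "US", "EU", "APAC", "EU"],
        "channel": ["mobile", "web", "mobile", "web", "mobile", "mobile", "web", "mobile"],
        "amount": [10, 20, 15, 40, 55, 5, 70, 25],
    })

    # Keep only the slice this particular test exercises.
    subset = transactions[(transactions["region"] == "EU") & (transactions["channel"] == "mobile")]

    # Sample with a fixed seed so the subset is repeatable from run to run.
    subset = subset.sample(n=min(3, len(subset)), random_state=42)

    print(subset)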

    Personally Identifiable Information Detection

    An easy way to get your company into trouble with the growing number of data privacy laws is to use data containing Personally Identifiable Information (PII) without declaring that use explicitly in your terms of service and getting consent from your users. Consequently, using PII by accident in your testing could result in regulatory action. Test Data Management tools need to help your team avoid this all-too-common scenario by identifying PII, enabling your team to properly anonymize the dataset before it’s used.
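
    A minimal sketch of pattern-based PII detection is shown below; commercial tools combine many more techniques (dictionaries, machine learning, metadata analysis), and the patterns and column names here are purely illustrative.

    import re
    import pandas as pd

    # Two illustrative PII patterns; real scanners use far richer detection logic.
    PII_PATTERNS = {
        "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
        "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    }

    def flag_pii_columns(df: pd.DataFrame) -> dict:
        """Return a mapping of column name -> PII types detected in that column."""
        findings = {}
        for column in df.columns:
            values = df[column].astype(str)
            hits = [name for name, pattern in PII_PATTERNS.items()
                    if values.str.contains(pattern).any()]
            if hits:
                findings[column] = hits
        return findings

    sample = pd.DataFrame({
        "note": ["call back tomorrow", "customer email: jane@example.com"],
        "tax_id": ["123-45-6789", "987-65-4321"],
        "amount": [10.5, 20.0],
    })
    print(flag_pii_columns(sample))  # {'note': ['email'], 'tax_id': ['ssn']}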

    Synthetic Data Generation

    A synthetic dataset is one of the most effective ways to sidestep the PII problem. Synthetic data resembles your source information, but unlike anonymized datasets, it holds little to no chance of being reversed. Because it doesn’t contain PII, it’s not subject to data privacy laws. One risk of synthetic data is that its creation may lead to data points with distorted relationships, compromising testing or analysis. However, Mage Data’s synthetic data solution uses an advanced method that preserves the underlying statistical relationships, even as the data points are recreated. This approach ensures the dataset is representative of your data while guaranteeing that no personal information is put at risk.
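
    The sketch below is not Mage Data’s method; it is only a minimal illustration of the general idea, fitting a joint distribution to two correlated numeric fields and sampling brand-new records that preserve their statistical relationship.

    import numpy as np

    rng = np.random.default_rng(7)

    # Pretend source data: two correlated numeric fields (say, age and annual spend).
    age = rng.normal(45, 12, size=5_000)
    spend = 200 + 30 * age + rng.normal(0, 150, size=5_000)
    source = np.column_stack([age, spend])

    # Fit the joint distribution (here, just the mean vector and covariance matrix)...
    mean = source.mean(axis=0)
    cov = np.cov(source, rowvar=False)

    # ...and draw entirely new records that keep the relationship between the fields
    # without reproducing any original data point.
    synthetic = rng.multivariate_normal(mean, cov, size=5_000)

    print(np.corrcoef(source, rowvar=False)[0, 1])     # correlation in the source data
    print(np.corrcoef(synthetic, rowvar=False)[0, 1])  # roughly the same in the synthetic data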

    How Do Current Test Data Management Tools Compare?

    Now that we’ve looked at what Test Data Management programs and tools need to do for success, let’s examine how Mage Data fits into the overall marketplace.
    Here we compare Mage Data’s Test Data Management capabilities against:
    • Informatica
    • Delphix
    • Oracle Enterprise Manager
    We consider not only raw capabilities, but other factors that actual users have reported as important.

    Informatica vs. Mage Data

    Informatica is a well-known Test Data Management solution that offers the core features needed for successful Test Data Management, including automated personal data discovery, data subsetting, and data masking. Users have raised issues with the platform, though. Common customer complaints are that the tool doesn’t do anything the competition doesn’t, has a steep learning curve, and is more expensive than competing tools. According to Gartner’s verified reviews, users rated Mage Data higher in every capability category, including dynamic and static data masking, subsetting, sensitive data discovery, and performance and scalability.

    Delphix vs. Mage Data

    Unlike Informatica, Delphix generally has kept up with modern developments in Test Data Management. It receives high ratings across the board for its functionality, and if that were the entire story, it would be hard to pick an alternative. But it doesn’t have an amazing user interface, which makes daily operation uncomfortable and makes setup more challenging than it needs to be. It also isn’t as interoperable as other systems, with some data sources unable to connect, limiting the tool’s usefulness. Contrastingly, Mage Data embraces APIs for connecting to data sources and provides an API setup that allows companies to integrate third-party or legacy solutions for Test Data Management, ensuring that companies aren’t locked out of the functionality they need.

    Oracle Enterprise Manager vs. Mage Data

    Oracle has long been one of the kings of database tools, so it’s unsurprising that Oracle Enterprise Manager’s Test Data Management solution is generally well-regarded. It’s especially praised for both its power and its user-friendliness, which is to be expected from a company of this size. What may surprise you is that based on customer reviews, Mage Data outperforms Oracle Enterprise Manager in 6/6 capability categories, 2/2 for Evaluation & Contracting, 4/4 for Integration & Deployment, and 2/3 for Service & Support. Given Oracle Enterprise Manager’s generally high price and Mage Data’s better performance across almost all categories, Mage Data clearly comes out on top.

    Why Choose Mage Data for Test Data Management

    For success in Test Data Management, you need a solution that provides comprehensive coverage for each use case you may face. Mage Data knows this, and that’s why its Test Data Management solution provides nearly everything you could need right out of the box, including database cloning and virtualization, data discovery and masking, data subsetting, and synthetic data generation. However, we know that not all use cases will be perfectly met with an off-the-shelf tool, so our system also allows for flexible customization for niche business needs that require a special touch.

    How Mage Data Helps with Test Data Management

    Mage Data is a best-in-class tool for Test Data Management, but that doesn’t mean the benefits stop there. Mage Data provides a suite of data protection tools, enabling businesses to protect every part of their stack from a single dashboard. And, unlike other tools, we’re happy to give you a peek under the hood before you sign on the dotted line. Schedule a demo today to learn more about what Mage Data can do for your business.

  • Cloud Migration and Data Security: Understanding What Needs to Be Done

    Cloud Migration and Data Security: Understanding What Needs to Be Done

    The cloud industry has been experiencing meteoric growth, thanks in no small part to the global pandemic. Companies that were already migrating to the cloud suddenly had to accelerate those plans to continue operating and remain competitive in the shift to a remote workforce. Companies that had resisted the change had to play catch-up, and too often rushed their cloud migration.

    Unfortunately, in that push to move to the cloud, data security can often be a casualty. Even migrating to one of the leading cloud platforms—platforms known for offering industry-leading security, like Azure or AWS—doesn’t automatically guarantee your data will be secure.

    For example, one area where organizations must be particularly careful is migrating medical data to the cloud and remaining compliant with the Health Insurance Portability and Accountability Act (HIPAA) requirements for patient privacy and security. Virtually all of the major cloud providers advertise being HIPAA-compliant, but the burden is still on individual clients to ensure they properly utilize the security tools provided by their chosen cloud provider.

    There are several steps organizations can take to ensure their cloud migration goes as smoothly as possible while providing the needed data security.

    Establish Strong Security Measures First

    One of the biggest steps organizations can take is establishing strong security measures before beginning a cloud migration.

    While data security should be part of any organization’s fundamental operations, migrating to the cloud opens a whole new realm of security threats, including multiple attack vectors, potential security issues with third-party APIs, denial-of-service attacks, account hijacking, and more. Unless an organization takes the time to secure their operations before migrating, they can quickly find themselves overwhelmed by the challenges of storing their data in the cloud.

    Adopt Zero-Trust Security

    On-premise security often focuses on keeping intruders out, with little to no secondary security, should a breach be successful.

    By contrast, proper cloud security focuses on a zero-trust approach, emphasizing security and containment throughout the entire platform (and not just “at the gates”). For example, with a traditional on-premise network, security plans often emphasize strong firewalls designed to keep bad actors out. Because the majority of employees access the company’s network from trusted onsite computers, there’s much less concern about those devices.

    With cloud computing, however, there is no one, single point of entry. Because of its decentralized nature, there are any number of ways a person could gain access to an organization’s resources, making it imperative to establish strong security at every layer of an organization’s cloud presence, trusting no one and no device.

    Data Privacy and HIPAA Compliance

    Data privacy adds another layer of complexity when it comes to cloud migration. A great example of this is the handling of personal data in compliance with HIPAA laws in the U.S. HIPAA encourages the use of electronic data storage but also includes strict requirements for how that data is managed and secured.

    As a result, a number of factors must be considered in order to remain compliant.

    1) Platform vs. Data. One of the most important things to keep in mind when moving HIPAA-sensitive data to the cloud is the distinction between platform and data. While AWS, Azure, and others market their platforms as HIPAA-compliant, each client company is still responsible for making sure they properly secure the data they upload to those platforms. This is why AWS uses terms like “shared responsibility” when describing its HIPAA compliance.

    2) Access Control. Just as an organization must control who has access to data locally, proper access control must be maintained in the cloud, ensuring only authorized parties are able to access sensitive information.

    3) Firewalls. Firewalls are a vital part of a cybersecurity plan, especially one involving cloud-based HIPAA data—a requirement for remaining compliant. In addition, the firewall should provide robust logging, which can be used to identify attackers and assist any law enforcement efforts.

    4) Encryption. Another requirement for properly secured data is end-to-end encryption (E2EE). E2EE ensures that no third parties can snoop on data in transit, or when it is being stored.

    5) Data De-Identification. Data de-identification is another important step that can and should be taken to protect sensitive data. De-identification removes identifiable information so the data can be accessed and analyzed without risking patient privacy. Hybrid data masking, in particular, can be a powerful tool in this regard. Setting up data masking in the new cloud environment should be a priority for safeguarding personal data (a minimal sketch of the idea appears after this list).

    6) File Integrity Monitoring. File integrity monitoring is designed to monitor files and flag them if they have been altered or deleted. This can be an invaluable step in catching errors or intrusions as early as possible, and thus mitigating potential damage.

    7) Notification Protocols. Modern data laws require organizations to notify customers of HIPAA-related data breaches. In order to do so in a timely fashion, and prevent further fines, organizations must ensure they have a system in place to quickly and efficiently notify individuals in the event of a breach.
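
    Returning to item 5, the sketch below shows the basic shape of de-identifying a record before it reaches a cloud analytics or test environment. It is a toy example with hypothetical fields; HIPAA de-identification in practice follows the Safe Harbor or Expert Determination standards and covers far more identifiers.

    import hashlib

    # Hypothetical patient record headed for a cloud test or analytics environment.
    record = {
        "patient_id": "P-1002",
        "name": "Jane Smith",
        "dob": "1976-01-21",
        "diagnosis_code": "E11.9",
    }

    def deidentify(rec: dict, salt: str) -> dict:
        """Replace direct identifiers while keeping the fields analysts actually need."""
        return {
            # Consistent pseudonym: the same patient always maps to the same token.
            "patient_token": hashlib.sha256((salt + rec["patient_id"]).encode()).hexdigest()[:12],
            # Generalize the date of birth to a year to reduce reidentification risk.
            "birth_year": rec["dob"][:4],
            # Clinical fields needed for analysis are left intact.
            "diagnosis_code": rec["diagnosis_code"],
        }

    print(deidentify(record, salt="rotate-me-regularly"))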

    Prepare for the Data Security of Your Cloud Migration

    Given the sheer number of security risks and privacy issues involved in cloud migration (and the stakes involved in remaining compliant), many organizations are choosing to outsource some or all of their migration to experienced experts.

    Mage Data has a long history of helping companies achieve the data security they need, even during transitional periods like cloud migration. Learn more about our cloud security offerings, or request a demo for our industry-leading Data Security Solutions.

    Related Blogs

    What is HIPAA Compliance?
    What is a Zero-Trust Security Model?

  • What is Homomorphic Encryption and How It’s Used

    What is Homomorphic Encryption and How It’s Used

    Most data encryption is for data that is either at rest or in transit. Most security experts do not consider encryption a viable option for data in use because it’s hard to process and analyze encrypted data. As the need for privacy and security increases, however, there is a perceived need to encrypt data even when it is in use. To encrypt data and use it at the same time is not an easy task. Enter homomorphic encryption.

    What Is Homomorphic Encryption?

    Homomorphic encryption is an emerging type of encryption that allows users or systems to perform operations using encrypted data (without decrypting it first). The result of the operation is also encrypted. Once the result is decrypted, however, it will be exactly the same as it would have been were it computed with the unencrypted data.
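
    A quick way to see this property in action is with unpadded “textbook” RSA, which happens to be multiplicatively homomorphic. The sketch below is deliberately tiny and insecure; it only illustrates the definition above.

    # Toy demonstration of the homomorphic property with unpadded "textbook" RSA.
    # Insecure parameters; for illustration only.
    p, q = 61, 53
    n = p * q                            # public modulus
    e = 17                               # public exponent
    d = pow(e, -1, (p - 1) * (q - 1))    # private exponent

    def encrypt(m: int) -> int:
        return pow(m, e, n)

    def decrypt(c: int) -> int:
        return pow(c, d, n)

    a, b = 7, 6
    c1, c2 = encrypt(a), encrypt(b)

    # Multiply the ciphertexts without ever decrypting them...
    c_product = (c1 * c2) % n

    # ...and the decrypted result equals the product of the plaintexts.
    assert decrypt(c_product) == (a * b) % n
    print(decrypt(c_product))  # 42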

    When Should Homomorphic Encryption Be Used?

    Thanks to homomorphic encryption, organizations are able to use cloud computing in external environments while keeping the data there encrypted the entire time. That is, third parties can handle sensitive data without compromising the security or privacy of that data. If the third party becomes compromised in any way, the data will still be secure, because it is never decrypted while it is with the third party.

    Before, it was impossible to outsource certain data processing tasks because of privacy concerns. Because it was necessary to decrypt data to perform computations, the data would be exposed while in use. Homomorphic encryption addresses those concerns. This is a game changer for organizations in a wide variety of industries.

    For example, homomorphic encryption allows healthcare providers to outsource private medical data for computation and analysis. The benefits of homomorphic encryption are not limited to healthcare. As regulations like GDPR become more common and more strict, it becomes crucial to protect personal data at all times, even while performing data analysis on it.

    Is Homomorphic Encryption Practical?

    Homomorphic encryption has been theoretically possible for a long time. The first fully homomorphic encryption schemes are already more than 10 years old. The problem is that the process requires an immense amount of computing power. The herculean effort that goes into this particular type of encryption has prevented it from becoming a viable option for most organizations.

    Now, though, an immense amount of computing power is not as hard to come by as it used to be. We are still not seeing much homomorphic encryption adoption just yet, but more organizations are taking interest.

    Expect to see it become a hot new opportunity in cybersecurity circles as homomorphic encryption becomes more necessary and more attainable at the same time. (The increased necessity is because of strict new rules for data privacy, while the increased attainability is from the exponential growth of computing power).

    Partially Homomorphic vs. Fully Homomorphic Encryption

    There are multiple types of homomorphic encryption schemes. At two ends of the spectrum, cybersecurity experts classify these schemes as partially homomorphic or fully homomorphic. As this type of encryption becomes more viable, people are finding new ways to classify it, introducing new categories between partially and fully homomorphic.

    Currently, we talk about homomorphic encryption in the following ways:

    • Partially Homomorphic Encryption – The lowest level; supports only one type of operation (for example, only addition or only multiplication)
    • Somewhat Homomorphic Encryption – Supports both addition and multiplication, but only for a limited number of operations
    • Fully Homomorphic Encryption – Supports an unlimited number of computations on any number of ciphertexts

    As applications of homomorphic encryption become more plausible, expect to see greater nuance emerge. We will see pros and cons of homomorphic encryption that may not be apparent until there are more case studies.

    Potential Vulnerabilities of Homomorphic Encryption

    In March 2022, academics from North Carolina State University and Dokuz Eylul University collaborated to identify a vulnerability in homomorphic encryption. Specifically, researchers showed they could steal data during homomorphic encryption by using a side-channel attack.

    “We were not able to crack homomorphic encryption using mathematical tools,” said Aydin Aysu, an assistant professor of computer engineering at North Carolina State University. “Instead, we used side-channel attacks. Basically, by monitoring power consumption in a device that is encoding data for homomorphic encryption, we are able to read the data as it is being encrypted. This demonstrates that even next generation encryption technologies need protection against side-channel attacks.”

    Is Homomorphic Encryption Safe?

    Before this study scares you away from the potential of homomorphic encryption, it is worth noting a few things:

    1. The vulnerability discovered was only in Microsoft SEAL, an open-source implementation of homomorphic encryption technology.
    2. The researchers were studying versions of Microsoft SEAL released before December 3, 2020. Later versions of the product have replaced the algorithm that created the vulnerability.
    3. The academics did not conclude that this type of homomorphic encryption was entirely unsafe, only that it needed protection from side-channel attacks. And there are established ways to protect against side-channel attacks.

    Does this mean modern homomorphic encryption is necessarily impermeable? No. However, the results of this study are not cause for excessive concern. One big takeaway is that the vulnerability in software from 2020 was not discovered until 2022, when newer versions had already corrected the problem. With commitment to an evolving cybersecurity plan, companies can stay a step ahead of hackers (and academic researchers).

    Assistant Professor Aysu seems confident about the future of homomorphic encryption, as long as organizations also take additional precautions. “As homomorphic encryption moves forward, we need to ensure that we are also incorporating tools and techniques to protect against side-channel attacks,” he says.

    How to Use Homomorphic Encryption

    There are multiple open source homomorphic encryption libraries, and Microsoft SEAL is the most common. It was developed by the Cryptography Research Group at Microsoft Research. More cybersecurity experts are becoming interested in homomorphic encryption, and it is getting faster.

    For now, though, it still is not the best option for most organizations. Upon comparing the differences between encryption, tokenization, and masking, most find that masking is currently the best option for data in use.

    Related Blogs

    The Comparative Advantages of Encryption vs. Tokenization vs. Masking

  • What’s the Best Method for Generating Test Data?

    What’s the Best Method for Generating Test Data?

    All data contains secret advantages for your business. You can unlock them through analysis, and they can lead to cost savings, increased sales, a better understanding of your customers and their needs, and myriad other benefits.

    Unfortunately, sometimes bad test data can lead companies astray. For example, IBM estimates that problems resolved during the scoping phase are 15 times less costly to fix than those that make it to production. Getting your test data right is essential to keeping costs low and avoiding unforced mistakes. Here’s what you need to know about creating test data to ensure your business is on the right path.

    What Makes a Test Data Generation Method Good?

    While all data-driven business decisions require good analysis to be effective, good analysis of bad data produces bad results. So, the best test data generation method will be the one that consistently and efficiently produces good data on which you can run your analysis within the context of your business. To ensure that analysis is based on good data, companies should weigh the safety, compliance, speed, and accuracy and representation of the various methods to ensure they’re using the best one for their needs.

    Safety

    Companies often hold more personal data than many customers realize, and keeping that data safe is an important moral duty. However, test data generation methods are rarely neutral when it comes to safety: each method either increases or reduces the risk to the personal data involved.

    Compliance

    Each year, governments pass new data protection laws. If the moral duty to keep data secure wasn’t enough of an incentive, there are fines, lawsuits, and in some countries, prison time, awaiting companies that don’t protect user data and comply with all relevant legislation.

    Speed

    If you or your analysts are waiting on the test data to generate, you’re losing time that could be spent on the analysis itself. Slow data generation can also result in a general unwillingness to work with either the most recent or the most representative historical data, which lowers the potential and quality of your analysis.

    Accuracy and Representation

    While one might expect that all test data generation methods would result in accurate and representative data, that’s not the case. Methods vary in accuracy, and some can ultimately produce data that bears little resemblance to the truth. In those situations, your analysis can be done faithfully, but the underlying errors in your data can lead you astray.

    Test Data Generation Methods

    By comparing different methods of test data generation through the lens of these four categories, we can get a feel for the scenarios in which each technique would succeed or struggle and determine which approaches would be best for most companies.

    Database Cloning

    The oldest method of generating test data on our list is database cloning. The name pretty much gives away how it works: You take a database and copy it. Once you’ve made a copy, you run your analysis on the copy, knowing that any changes you make won’t affect the original data.

    Unfortunately, this method has several shortcomings. For one, it does nothing to secure personal data in the original database. Running analysis can create risks for your users and sometimes get your company into legal trouble.

    It also tends to suffer from speed issues. The bigger your database, the longer it takes to create a copy. Plus, edge cases may be under- or over-represented or even absent from your data, obscuring your results. While this was once the way companies generated test data, given its shortcomings, it’s a good thing that there are better alternatives.

    Database Virtualization

    Database virtualization isn’t a technique solely for creating test data, but it makes the process far easier than using database cloning alone. Virtualized databases are unshackled from their physical hardware, making working with the underlying data extremely fast. Unfortunately, outside of its faster speed, it has all the same shortcomings as database cloning: It does nothing on its own to secure user data, and your tests can only be run on your data, whether it’s representative or not.

    Data Subsetting

    Data subsetting fixes some of the issues found in the previous approaches by taking a limited sample or “subset” of the original database. Because you’re working with a smaller sample, it will tend to be faster, and sometimes using a selection instead of the full dataset can help reduce errors related to edge cases. Still, when using this method, you’re trading representativeness for speed, and there’s still nothing being done to ensure that personal data is protected, which is just asking for trouble.

    Anonymization

    Anonymization fixes the issue with privacy that pervades the previous approaches. And while it’s not a solution for test data generation on its own, it pairs nicely with other approaches. When data is anonymized, individual data points are replaced to protect data that could be used to identify the user who originated the data. This approach makes the data safer to use, especially if you’re sending it outside the company or the country for analysis.

    Unfortunately, anonymization has a fatal flaw: The more anonymized the dataset is, the weaker the connection between data points. Too much anonymization will create a dataset that is useless for analysis. Of course, you could opt for less anonymization within a dataset, but then you risk reidentification if the data ever gets out. What’s a company to do?

    Synthetic Data

    Synthetic data is a surprisingly good solution to most issues with other test data approaches. Like anonymization, it replaces data to secure the underlying personally identifiable information. However, instead of doing it point by point, it works holistically, preserving the individual connections between data while changing the data itself in a way that can’t be reversed.

    That approach gives a lot of advantages. User privacy is protected. Synthetic datasets can be far smaller than the original ones they were generated from, but still represent the whole, giving speed advantages. And, it works well when there’s not a ton of data to be used, either, helping companies run analysis at earlier stages in the process.

    Of course, it’s far more complex than other methods and can be challenging to implement. The good news is that companies don’t have to implement synthetic data on their own.

    Which Test Data Generation Method Is Best?

    The best method for a company will vary based on its needs, but based on the relative performance of each approach, most companies will benefit from using synthetic data as their primary test data generation method. Mage Data’s approach to synthetic data can be implemented in an agent or agentless manner, meeting your data where it lives instead of shoehorning in a solution that slows everything down. And while it maintains the statistical relationships you need for meaningful analysis, you can also add noise to its synthetic datasets, allowing you to discover new edge cases and opportunities, even if they’ve never appeared in your data before.

    But that’s not all Mage Data can do. Between its static and dynamic masking, it can protect user data when it’s in production use and at rest. Plus, its role-based access controls and automation make tracking use simple. Mage Data is more than just a tool for solving your test data generation problems—it’s a one-platform approach to all your data privacy and security needs. Contact us today to see what Mage Data can do for your organization.

    Related blogs:

    Why Does Test Data Management Matter for Your Business?
    Test Data Management Best Practices

  • Data Retention vs. Data Privacy: What Should Employers Do?

    Data Retention vs. Data Privacy: What Should Employers Do?

    Imagine this scenario: An ex-employee comes to your organization and demands that you delete certain sensitive information from the company database. The head of HR politely explains that, due to certain laws in the U.S., those records need to be kept for three years. The ex-employee threatens to take legal action to have the records deleted, citing current data privacy laws.

    This is not a far-fetched scenario at all. There has always been a tension in the law between requirements for data retention—that is, how long records need to be kept to stay within compliance—and data privacy.

    But the tension has been on people’s minds recently because of “The Great Resignation.” More workers now are leaving their current jobs than at any other time over the past two decades. The U.S. Department of Labor, for example, has been reporting record-high resignation numbers for months, with the latest record of a 3.0% quit rate happening in September 2021.

    Let’s leave aside, for the moment, why people are quitting and how companies are responding. The glaring issue here is that companies now have record numbers of ex-employees. And this is bringing the issue of retaining sensitive employee information to the fore. Combine this with stricter privacy laws and penalties for over-retention, and it’s no wonder data retention has become one of the biggest topics when it comes to data security and data privacy.

    Here at Mage Data, we are not legal experts and do not pretend to give legal advice. But we can say something about the ways in which data should be protected, and how access should be carefully controlled, to satisfy both data retention needs and privacy concerns.

    What Counts as Private Employee Data?

    The first thing to be clear on is that there is no one universal definition, legal or otherwise, for what counts as private or sensitive employee data. But there are clearly some things that everyone agrees fall under this category:

    • Employee addresses/places of residence
    • Social Security numbers
    • Dates of birth
    • Salary information
    • Insurance information
    • Medical records
    • Bank account information

    In general, sensitive data includes anything that an employee would have a “reasonable expectation” would be kept confidential and used only for the employee’s benefit. Thus, it includes the types of information that are regularly gathered by employers to process payroll, manage employee benefit plans, etc.

    The Tension Between Data Privacy and Records Keeping

    Data privacy runs into an issue when it comes to data retention and records keeping. For example, under the U.S. Fair Labor Standards Act (FLSA), employers above a certain size must keep payroll records for at least three years, even after an employee has left the company.

    Now imagine what needs to happen for a company to be in compliance with, say, the European Union’s GDPR (which is any company doing business in the EU, regardless of whether they have an EU location). Under the GDPR, employees must be informed about:

    • What data of theirs is collected
    • Who owns or controls that data
    • Any third parties that receive their data (such as payroll providers or benefits providers)
    • Their rights and protections under the GDPR

    Because records must be kept for three years, some companies will have a significant amount of sensitive data relating to ex-employees. Thus, these ex-employees will have to be informed about their data and its use, too.

    The GDPR also comes with something called “The Right to Be Forgotten.” In plain English, this amounts to the right to request that personal information be removed from a system. Thus, a former employee can request of a company that any personal data collected during their employment tenure be removed.

    It gets worse. What happens if a company wants to run analytics on, say, benefits use? This will require company data on current and past employees. But the company may very well want to outsource these analytics to a third party. Passing the actual data to an analytics company would trigger a series of steps to stay in compliance with privacy laws—and never mind the hornet’s nest that data stirs up crossing international borders.

    Best Practices for Data Privacy of Ex-Employees

    So can an employee really come and demand that you erase their data? Yes and no.

    The GDPR, for example, clearly states that there are circumstances where an employer can refuse to comply with a request to be forgotten—for example, where that data or its processing is required to be retained by law, or is needed for an ongoing legal case. So, if there is a clear law requiring data retention, this should be followed.

    Things get trickier if the data is beyond the window where retention is required by law. For this reason, many companies are turning to automated solutions for destroying data records according to a pre-ordained schedule (such as Mage Data’s own Data Minimization solution).

    And for data that is within the retention window, care still needs to be taken. Take the analytics example given above. The transfer of data to third parties is a sensitive undertaking, and the risk of a data breach is much higher. Instead of transmitting sensitive data, it makes more sense to send masked data using a tool that preserves the relationships between data items. This allows third parties to provide useful analytics without having direct access to personal information.

    Finally, it pays to do a regular audit of your data to discover where sensitive employee data lives. Chances are good that a significant amount of employee data “lives” in places that might be missed by routine records deletion. This can create a problem in terms of data privacy. By doing sensitive data discovery, an organization can “plug the holes” when it comes to data privacy laws, either deleting the information or masking it (if it is part of current business processes).

    For more on how Mage Data can help strike a balance between data retention and data privacy, see:

    Dynamic Data Masking with Mage Data
    Sensitive Data Discovery with Mage Data
    Data Minimization with Mage Data

  • How to Secure Your Critical Sensitive Data in Non-Production and Testing Environments

    How to Secure Your Critical Sensitive Data in Non-Production and Testing Environments

    With businesses across the world embracing digital transformation projects to adapt to modern business requirements, a new challenge has emerged: data is being used more and more for business-critical functions, and its sensitive nature must be protected. Within the same organization, multiple functions use data in different ways to meet their objectives, adding a layer of complexity for data security professionals who aim to prevent the exposure of any sensitive data without affecting performance. This sensitive data can include employee and customer information, as well as corporate confidential data and intellectual property that can cause wide ramifications if it falls into the wrong hands. For organizations that depend on high-quality data for their software development processes but also want to ensure that any sensitive information contained within it is not exposed, a good static data masking tool is a crucial requirement for business operations.

    Protect data in non-production environments

    A critical aspect of data protection is ensuring the security of sensitive data in development, testing and training (non-production) environments, to eliminate any risk of sensitive data exposure. The same protection methods cannot be used for production and non-production environments as the requirements for both are different. In such cases, de-identifying or masking the data is recommended as a best practice for protecting the sensitive data involved. Masking techniques secure both structured and unstructured fields in the data landscape to allow for testing or quality assurance requirements and user-based access without the risk of sensitive data disclosure.

    Maintain integrity of secured data

    While securing data, it has also become important for organizations to balance the security and usability of data so that it remains relevant enough for use in business analytics, application development, testing, training, and other value-added purposes. Good static data masking tools ensure that the data is anonymized in a manner that retains its usability while providing data security.

    Choice of anonymization methods

    Organizations will have multiple use-cases for data analysis, based on the requirements of the teams that handle this data. In such cases, some anonymization methods can prove to have more value than others depending on the security and performance needs of the relevant teams. These methods can include encryption, tokenization or masking, and good tools will offer different such methods for anonymization that can be used to protect sensitive data effectively.

    For years, Mage Data™ has been helping organizations with their data security needs, by providing solutions that include static data masking tools for securing data in non-production environments (Mage Data Static Data Masking).
    Some of the features of the Mage Data Static Data Masking tool are as follows:

    • 70+ different anonymization methods to protect any sensitive data effectively
    • Maintains referential integrity between applications through anonymization methods that give consistent results across applications and datastores
    • Anonymization methods that offer both protection and performance while maintaining its usability
    • Encrypts, tokenizes, or masks the data according to the use case that suits the organization

  • The Comparative Advantages of Encryption vs. Tokenization vs. Masking

    The Comparative Advantages of Encryption vs. Tokenization vs. Masking

    Any company that handles data (especially any company that handles personal data) will need a method for de-identifying (anonymizing) that data. Any technology for doing so will involve trade-offs. The various methods of de-identification—encryption, tokenization, and masking—will navigate those trade-offs differently.

    This fact has two important consequences. First, the decision of which method to use, and when, has to be made carefully. One must take into consideration the trade-offs between (for example) performance and usability. Second, companies that traffic in data all the time will want a security solution that provides all three options, allowing the organization to tailor their security solution to each use case.

    We’ve previously discussed some of the main differences among encryption, tokenization, and masking; the next step is to look more closely at these trade-offs and the subsequent use cases for each type of anonymization.

    The Security Trade-Off Triangle

    Three of the main qualities needed in a data anonymization solution are security, usability, and performance. We can think of these as forming a triangle; as one gets closer to any one quality, one is likely going to have to trade off the other two.

    Security (Data Re-Identification)

    Security is, of course, the main reason for anonymizing data in the first place. The way in which the various methods differ is in the ease with which data can be de-anonymized—that is, how easy it is for a third party to take a data set and re-identify the items in that set.

    A great example of such re-identification came from a news story several years ago, where data from a New York-based cab company was released according to the Freedom of Information Act. That data, which included over 173 million individual trips and information about the cab driver, had been anonymized using a common technique called hashing. A third party was able to prove that the data could be very easily re-identified—and with a little work, a clever hacker could even infer things like individual cab drivers’ salaries and where they lived.

    A good way to measure the relative security of a process like encryption, tokenization, or masking, then, is to assess how difficult re-identification of the data would be.

    Usability (Analytics)

    The more that a bit of data can be changed, the less risk there is for re-identification. But this also means that the pieces of data lose any kind of relationship to each other, and hence any pattern. The more the pattern is lost, the less useful that data is when doing analysis.

    Take a standard 9-digit Social Security number, for example. We could replace each digit with a single character, say XXXXXXXXX or 999999999. This is highly secure, but a database full of Xs will not reveal any useful patterns. In fact, it won’t even be clear that the data are numeric.

    Now consider the other extreme, where we simply increase a single digit by 1. Thus, the Social Security number 987-65-4321 becomes 987-65-4322. In this case, much of the information is preserved. Each unique Social Security number in the database will preserve its relations with other numbers and other pieces of data. The downside is that the algorithm is easily cracked, and the data becomes easily reversible.
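
    A few lines of code make the trade-off plain. The first two transformations below are the extremes just described; the third is a rough middle ground, a format-preserving random replacement that keeps the shape of the value without a simple rule that reverses it.

    import random

    ssn = "987-65-4321"

    # Extreme 1: full redaction. Very safe, but all analytical shape is lost.
    redacted = "XXX-XX-XXXX"

    # Extreme 2: bump the last digit. Shape preserved, but trivially reversible.
    reversible = ssn[:-1] + str((int(ssn[-1]) + 1) % 10)

    # Middle ground: format-preserving random replacement. Still looks like an SSN,
    # but there is no simple transformation that maps it back to the original.
    rng = random.Random(42)
    masked = "".join(str(rng.randint(0, 9)) if ch.isdigit() else ch for ch in ssn)

    print(redacted, reversible, masked)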

    This is a problem for non-production environments, too. Sure, one can obtain test data using pseudo values generated by algorithms. But even in testing environments, one often needs a large volume of data that has the same complexity of real-world data. Pseudo data simply does not have that kind of complexity.

    Performance

    Security happens in the real world, not on paper. Any step added to a data process requires compute time and storage. It is easy for such costs to add up. Having many servers running to handle encryption, for example, will quickly become costly if encryption is being used for every piece of data sent.

    How Do Encryption, Tokenization, and Masking Compare?

    Again, setting the technical details aside for the moment, the major difference among these methods is the way in which they navigate the trade-offs in this triangle.

    Encryption

    Encryption is best suited for unstructured fields (though it also supports structured), or for databases that aren’t stored in multiple systems. It is also commonly used for protecting files and exchanging data with third parties.

    With encryption, performance varies depending on the time it takes to establish a TCP connection, plus the time for requesting and getting a response from the server. If these connections are being made in the same data center, or to servers that are very responsive, performance will not seem that bad. Performance will degrade, however, if the servers are remote, unresponsive, or simply busy handling a large number of requests.

    Thus, while encryption is a very good method for security of more sensitive information, performance can be an issue if you try to use encryption for all your data.

    Tokenization

    Tokenization is similar to encryption, except that the data in question is replaced by a random string of values (a token) instead of modified by an algorithm. The relationship between the token and original data is preserved in a table on a secure server. When the original data is needed, the application looks up the original relationship between the token and the original data.

    Tokenization always preserves the format of the data, which helps with usability, while maintaining high security. It also tends to create less of a performance hit compared to encryption, though scaling can be an issue if the size of the lookup table becomes too large. And unlike encryption, sharing data with outside parties is tricky because they, too, would need access to the same table.
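
    A minimal sketch of the lookup-table idea is below. It is not any vendor’s implementation; a production token vault would live in a hardened, access-controlled service rather than in memory.

    import secrets

    class TokenVault:
        """Minimal in-memory token vault; real systems back this with a secured datastore."""

        def __init__(self):
            self._token_to_value = {}
            self._value_to_token = {}

        def tokenize(self, value: str) -> str:
            # Reuse the existing token so the same value always maps to the same token.
            if value in self._value_to_token:
                return self._value_to_token[value]
            token = secrets.token_hex(8)  # random token, no mathematical link to the original
            self._token_to_value[token] = value
            self._value_to_token[value] = token
            return token

        def detokenize(self, token: str) -> str:
            # Only systems with access to the vault can recover the original value.
            return self._token_to_value[token]

    vault = TokenVault()
    t = vault.tokenize("4111-1111-1111-1111")
    print(t, vault.detokenize(t))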

    Masking

    There are different types of masking, so it is hard to generalize across all of them. One of the more sophisticated approaches to masking is to replace data with pseudo data that nevertheless retains many aspects of the original data, so as to preserve its analytical value without much risk of re-identification.

    When done this way, masking tends to require fewer resources than encryption while retaining the highest data usability.

    Choosing on a Case-by-Case Basis

    So which method is appropriate for a given organization? That depends, of course, on the needs of the organization, the resources available, and the sensitivity of the data in question. But there need not be a single answer; the method used might vary depending on the specific use case.

    For example, consider a simple email system residing on internal on-premise servers. Encryption might be appropriate for this use as the data are unstructured, the servers are nearby and dedicated for this use, and the need for security might well be high for some communications.

    But now consider an application in a testing environment that will need a large amount of “real-world-like” data. In this case, usability and performance are much more important, and so masking would make more sense.

    And all of this might change if, for example, you find yourself having to undergo a cloud migration.

    The way forward for larger organizations with many and various needs, then, is to find a vendor that can provide all three and help with applying the right techniques in the right circumstances. Here at Mage Data, we aim to gain an understanding of our clients’ data, its characteristics, and its use, so we can help them protect that data appropriately. For more about our anonymization and other security solutions, you can download a data sheet here.

  • 6 Common Data Anonymization Mistakes Businesses Make Every Day

    6 Common Data Anonymization Mistakes Businesses Make Every Day

    Data is a crucial resource for businesses today, but using data legally and ethically often requires data anonymization. Laws like the GDPR in Europe require companies to ensure that personal data is kept private, limiting what companies can do with personal data. Data anonymization allows companies to perform critical operations—like forecasting—with data that preserves the original’s characteristics but lacks the personally identifying data points that could harm its users if leaked or misused.

    Despite the importance of data anonymization, there are many mistakes that companies regularly make when performing this process. These companies’ errors are not only dangerous to their users, but could also subject them to regulatory action in a growing number of countries. Here are six of the most-common data anonymization mistakes that you should avoid.

    1. Only changing obvious personal identification indicators

    One of the trickiest parts of anonymizing a dataset is determining what is or isn’t Personally Identifiable Information (PII), the kind of information you want to ensure is kept safe. Individual fields like date of purchase or the amount paid may not be personal information, but a credit card number or a name would be. Of course, you could go through the dataset by hand and ensure that all relevant data types are anonymized, but there’s still a chance that something slips through the cracks.

    For example, if data is in an unstructured column, it may not appear on search results when you’re looking for PII. Or a benign-looking column may exist separately in another table, allowing bad actors to reconstruct the original user identities if they got access to both tables. Small mistakes like these can doom an anonymization project to failure before it even begins.

    2. Mistaking synthetic data for anonymized data

    Anonymizing or “masking” data takes PII in datasets and alters it so that it can’t be traced back to the original user. Another approach to data security is to instead create “synthetic” datasets. Synthetic datasets attempt to recreate the relationships between data-points in the original dataset while creating an entirely new set of data points.

    Synthetic data may or may not live up to its claims of preserving the original relationships. If it doesn’t, it may not be useful for your intended purposes. However, even if the connections are good, treating synthesized data like it’s anonymized or vice versa can lead to mistakes in interpreting the data or ensuring that it is properly stored or distributed.

    3. Confusing anonymization with pseudonymization

    According to the EU’s GDPR, data is anonymized when it can no longer be reverse engineered to reveal the original PII. Pseudonymization, in comparison, replaces PII with different information of the same type. Pseudonymization doesn’t guarantee that the dataset cannot be reverse engineered if another dataset is brought in to fill in the blanks.

    Consequently, anonymized data is generally exempted from GDPR. Pseudonymization is still subject to regulations, albeit reduced relative to normal data. Companies that don’t correctly categorize their data into one bucket or the other could face heavy regulatory action for violating the GDPR or other data laws worldwide.

    4. Only anonymizing one data set

    One of the common threats we’ve covered so far is the threat of personal information being reconstructed by introducing a non-anonymized database to the mix. There’s an easy solution to that problem: instead of anonymizing only one dataset, why not anonymize all of the datasets that share data? That way, it would be impossible to reconstruct the original data.

    Of course, that’s not always going to be possible in a production environment. You may still need the original data for a variety of reasons. However, if you’re ever anonymizing data and sending it beyond the bounds of your organization, you have to consider the interconnections between your databases, and that may mean that, to be safe, you need to anonymize data you don’t release.

    5. Anonymizing data—but also destroying it

    Data becomes far less valuable if the connections between its points become corrupted or weakened. A poorly executed anonymization process can lead to data that has no value whatsoever. Of course, it’s not always obvious when this has happened. A casual examination wouldn’t reveal anything wrong, leading companies to draw false conclusions from their data analysis.

    That means that a good anonymization process should protect user data and do it in a way where you can be confident that the final results will be what you need.

    6. Applying the same anonymization technique to all problems

    Sometimes when we have a problem, our natural reaction is to use a solution that worked in the past for a similar problem. However, as you can see from all the examples we’ve explored, the right solution for securing data varies greatly based on what you’re securing, why you’re securing it, and your ultimate plans for that data.

    Using the same technique repeatedly can leave you more vulnerable to reverse engineering. Worse, it means that you’re not maximizing the value of each dataset and are possibly over- or under-securing much of your data.

    Wrapping Up

    Understanding your data is the key to unlocking its potential and keeping PII safe. Many of the issues we outlined in this article do not stem from a lack of technical prowess. Instead, the challenge of dealing with millions or even billions of discrete data points can easily turn a quick project into one that drags out for weeks or months. Or worse, projects can end up “half-completed,” weakening data analysis and security objectives.

    Most companies need a program that can do the heavy lifting for them. Mage Data helps organizations find and catalog their data, including highlighting Personally Identifiable Information. Not only is this process automated, but it also uses Natural Language Processing to identify mislabeled PII. Plus, it can help you mask data in a static or a dynamic fashion, ensuring you’re anonymizing data in the manner that best fits your use case. Schedule a demo today to see what Mage Data can do to help your organization better secure its data.

  • 5 Common Mistakes Organizations Make During Data Obfuscation

    5 Common Mistakes Organizations Make During Data Obfuscation

    What is Data Obfuscation?

    As the name suggests, data obfuscation is the process of hiding sensitive data with modified or other data to secure it. Many are often confused by the term data obfuscation and what it entails, as it is a broad term used for several data security techniques such as anonymization, pseudonymization, masking, encryption, and tokenization.

    The need for data obfuscation is omnipresent, with companies needing to achieve business objectives such as auditing, cross-border data sharing, and the like. Apart from this, the high rate of cybercrime is also a pressing reason for companies to invest in technology that can help protect their data, especially now, given the remote working conditions brought on by the Covid-19 pandemic.

    Let’s look at some of the best practices you can follow for data obfuscation:

    1) Understand your options

    It is vital to understand the differences between data obfuscation techniques such as anonymization, pseudonymization, encryption, masking, and tokenization. Unless you’re knowledgeable about the various methods of data security and their benefits, you cannot make an informed choice to fulfill your data security needs.

    2) Keep in mind the purpose of your data

    Of course, the need of the hour is to secure your data. But every data element has a specific purpose. For example, if the data is needed for analytical purposes, you cannot go ahead with a simple encryption algorithm and expect good results. You need to select a technique, such as masking, that will preserve the functionality of the data while ensuring security. The method of obfuscation chosen should facilitate the purpose for which your data is intended.

    3) Enable regulatory compliance

    Of course, data security is a broader term when compared to compliance, but does being secure mean you’re compliant too? Data protection standards and laws such as HIPAA, PCI DSS, GDPR, and CCPA are limited to a defined area and aim to secure that particular information. So, it is imperative to figure out which of those laws you are required to comply with and implement procedures in place to ensure the same. Security and compliance are not the same – ensure both.

    4) Follow the principle of least privilege

    The principle of least privilege is the idea that any user, program, or process should have only the bare minimum privileges necessary to perform its function. It works by allowing only enough access to perform the required job. Apart from hiding sensitive data from those unauthorized, data obfuscation techniques like Dynamic Data Masking can also be used to provide user-based access to private information.
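
    As a minimal sketch of that idea (the roles and rules below are hypothetical), dynamic masking keeps a single stored value and varies only the view each role receives:

    # The stored value never changes; only the view returned to each caller does.
    MASKING_RULES = {
        "support_agent": lambda ssn: "***-**-" + ssn[-4:],   # partial view for support work
        "analyst":       lambda ssn: "***-**-****",          # fully masked for analytics users
        "payroll_admin": lambda ssn: ssn,                     # this role's job requires the full value
    }

    def read_ssn(stored_ssn: str, role: str) -> str:
        # Unknown roles default to the most restrictive view (least privilege).
        rule = MASKING_RULES.get(role, lambda _: "***-**-****")
        return rule(stored_ssn)

    print(read_ssn("987-65-4321", "support_agent"))  # ***-**-4321
    print(read_ssn("987-65-4321", "analyst"))        # ***-**-****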

    5) Use repeatable and irreversible techniques

    For the most part, wherever applicable, it would be advisable to use reliable techniques that produce the same results every time. And even if the data were to be seized by a hacker, it shouldn’t be reversible.

    Conclusion:

    While data obfuscation is important to ensure the protection of your sensitive data, security experts must ensure that they do not implement a solution just to tick a check box. Data security solutions, when implemented correctly, can go a long way toward saving organizations millions of dollars in revenue.

  • Reidentification Risk of Masked Datasets: Part 2

    Reidentification Risk of Masked Datasets: Part 2

    This article is a continuation of Reidentification Risk of Masked Datasets: Part I, where we discussed how organizations progressed from simple to sophisticated methods of data security, and how even then, they faced the challenge of reidentification. In its conclusion, we shared what companies really need to focus on while anonymizing their data.

    Now, we delve into the subject of reidentification and how to go about achieving your goal, which is ultimately to reduce or eliminate the risk of reidentification.

    Before we dive further into this, it helps to understand a few basic concepts.

    What is reidentification and reidentification risk?

    Reidentification is the direct or indirect possibility that the original data could be deciphered from an anonymized dataset; how likely that is depends on the dataset and the method of anonymization. The associated risk is appropriately named reidentification risk.

    The NY cab example that we saw in Part 1 of this article is a classic case of reidentification, where the combination of indirectly related factors led to the reidentification of seemingly anonymized personal data.

    Understanding the terms data classification and direct identifiers

    A data classification, or identifier, is any data element that can be used to identify an individual, either by itself or in combination with another element, such as name, gender, DOB, employee ID, SSN, age, phone number, ZIP code, and so forth. Certain specific data classifications, for instance employee ID, SSN, and phone number, are unique or direct identifiers (i.e., they can be used to uniquely identify an individual on their own). Name and age, on the other hand, are not unique identifiers, since they repeat within a large dataset.

    Understanding indirect identifiers and reidentification risk through a simple example

    Let’s say you take a dataset of 100 employees and you’re tasked with finding a specific employee in her 40s. Assume that all direct identifiers have been anonymized. Now you look at indirect identifiers, such as race/ethnicity, city or, say, her bus route — and sure enough, you’ve identified her. Indirect identifiers depend on the dataset and, therefore, are distinct for different datasets. Even though the unique identifiers have been anonymized, you can’t say for sure that an individual can never be identified given that every dataset carries indirect identifiers, leading to the risk of reidentification.

    What are quasi-identifiers?

    Quasi-identifiers are a combination of data classifications that, when considered together, will be able to uniquely identify a person or an entity. As previously mentioned, studies have found that the five-digit ZIP code, birth date and gender form a quasi-identifier, which can uniquely identify 87% of the American population.
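
    One simple way to see quasi-identifier risk in your own data is to count how many records share each combination of quasi-identifying fields; any combination shared by only one person points to a reidentification risk. A minimal sketch with made-up data follows.

    import pandas as pd

    # Hypothetical dataset with direct identifiers already removed.
    people = pd.DataFrame({
        "zip":    ["02139", "02139", "02139", "94105", "94105"],
        "dob":    ["1976-01-21", "1980-03-02", "1976-01-21", "1991-07-14", "1991-07-14"],
        "gender": ["F", "F", "M", "M", "M"],
    })

    # Count how many people share each (ZIP, DOB, gender) combination.
    group_sizes = people.groupby(["zip", "dob", "gender"]).size()

    # Combinations shared by a single person are the riskiest records.
    print(group_sizes[group_sizes == 1])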

    Now that we are on the same page with essential terms, let’s get to the question: How do you go about choosing the right solution that minimizes or eliminates the risk of reidentification while still preserving the functionality of the data?

    The answer lies in taking a risk-versus-value approach.

    First, determine the reidentification risk carried by each identifier in your dataset. Identify its uniqueness and whether it is a direct or indirect identifier, then look at the possible combinations of quasi-identifiers. How high a risk does a particular identifier pose, either on its own or combined with another piece of data? Depending on how big and diverse a dataset is, there are likely to be quite a few identifiers that can, singly or in combination, reidentify someone. If you treat each dataset as unique and anonymize each identifier based on the risk it poses, regardless of what was done to another data element, you end up with anonymized datasets that are all data-rich but have removed the possibility of getting back the original data.

    In other words, this approach maintains demographical logic but not absolute demographics. Now that we’ve preserved data functionality, let’s address eliminating reidentification risk. Remember the quasi-identifier (ZIP Code, DOB and gender) that can uniquely identify 87% of the U.S. population? We can address this issue the same way, by maintaining demographical logic in the anonymized data. For example, Elsie, who was born on 1/21/76, can be anonymized to Annie born on 1/23/76.

    Notice that: 

    1. The gender remains the same (while the name is changed with the same number of characters).
    2. The ZIP code remains the same (while the street is changed).
    3. The date of birth changed by two days (while the month and year remain the same).

    This dataset maintains the same demographic data, which is ideal for analytics, without giving away any PII, and at the same time it anticipates and eliminates the risk of reidentification.
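
    A minimal sketch of this kind of demographically consistent anonymization is shown below; the replacement name list and the day-shift logic are purely illustrative, not Mage Data’s algorithm.

    from datetime import date, timedelta
    import random

    rng = random.Random(7)

    # Hypothetical record to anonymize.
    record = {"name": "Elsie", "gender": "F", "zip": "02139", "dob": date(1976, 1, 21)}
    REPLACEMENT_NAMES = ["Annie", "Maria", "Karen", "Diane"]  # same gender, same length

    def anonymize(rec: dict) -> dict:
        return {
            # Replace the name with a different one of the same gender and length.
            "name": rng.choice([n for n in REPLACEMENT_NAMES if n != rec["name"]]),
            # Keep gender and ZIP so demographic analysis still works.
            "gender": rec["gender"],
            "zip": rec["zip"],
            # Shift the day of birth slightly; month and year stay the same here.
            "dob": rec["dob"] + timedelta(days=rng.randint(1, 3)),
        }

    print(anonymize(record))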

    A practical solution lies in keeping the ways in which data can be reidentified in mind when applying anonymization methods. The right approach maintains the dataset’s richness and value for the purpose it was originally intended while minimizing or eliminating the reidentification risk of your masked datasets, making for a comprehensive and effective data security solution for your sensitive data.

    By taking a risk-based approach to data masking, you can be assured that your data has been truly anonymized.