Mage Data

Author: Alex Ramaiah

  • Reidentification Risk of Masked Datasets: Part 1

    Reidentification Risk of Masked Datasets: Part 1

    The following article is also hosted on Forbes Tech Council.

    Data masking, anonymization and pseudonymization aren’t new concepts. In fact, as the founder of a data security company, I’ve had 15 years of experience in helping my clients drive business decisions, enable transactions, allow for secure data sharing in a B2B or consumer scenario, or meet audit and financial requirements.

    One constant in my experience has been the issue of data reidentification. In many instances, companies take an easy, one-size-fits-all route to data security, such as using Pretty Good Privacy (PGP) and assuming that 95% of the world won’t be able to reverse it, only to find that this assumption is incorrect.

    Take, for example, cases in which customers go into an ERP application and change all Social Security numbers to the most popular combination, 999-99-9999. This is highly secure, of course, but renders the data nonfunctional.

    On the other hand, by staying too close to the data, you may ensure that the data performs well, but the trade-off ends up compromising security. For instance, you can add 1 to the Social Security number 777-29-1234, and it becomes 777-29-1235. Or you can add it to the digits in between, and it becomes 777-30-1234. In both cases, though, the data functionality is preserved at the cost of its security. The algorithm can be cracked, and the data is easily reversible.
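    To make this concrete, here is a minimal Python sketch (using a fictitious, hyphenated SSN-like value) showing that a fixed-offset transformation can be undone by anyone who learns or guesses the offset. The function names are illustrative, not part of any product.

```python
# Illustrative only: a fixed "+1" offset applied to the serial portion of an
# SSN-like value. The format survives, but the transformation is trivially
# reversible by anyone who learns or guesses the offset.

def naive_mask(ssn: str, offset: int = 1) -> str:
    """Add a fixed offset to the last group of digits, preserving the format."""
    area, group, serial = ssn.split("-")
    return f"{area}-{group}-{int(serial) + offset:04d}"

def unmask(masked_ssn: str, offset: int = 1) -> str:
    """Undo the 'masking' by subtracting the same offset."""
    area, group, serial = masked_ssn.split("-")
    return f"{area}-{group}-{int(serial) - offset:04d}"

original = "777-29-1234"           # fictitious value from the example above
masked = naive_mask(original)      # '777-29-1235'
assert unmask(masked) == original  # the data is easily reversible
```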

    Some companies have even kicked the tires on synthetic data (pseudo values generated by algorithms), but that also negatively impacts testing since non-production environments need large volumes of data that carry the complexity of real data in order to test effectively and build the right enhancements. As the name suggests, synthetic data is in no way connected to reality.

    Theoretically speaking, it may seem like a great way to secure your information, but no machine algorithm can ever mimic the richness of the actual data an organization accrues during its lifespan. Most projects that rely on synthetic data fail for the simple reason that it doesn’t capture the complex interactions present in real data; unlike actual data, it’s too uniform.

    There’s always a trade-off: Either your data is highly secure but low in functionality or vice versa. In search of alternatives, companies move to more sophisticated methods of data protection, such as anonymization and masking, to ensure data security without sacrificing functionality and performance.

    There is a catch, however, and it’s the reason proper and complete anonymization and masking are so critical: Cross-referencing the data with other publicly available data can reidentify an individual from their metadata. As a result, private information such as PFI, PHI and contact information could end up in the public domain. In the wrong hands, this could be catastrophic.

    There have been many incidents in which this has already happened. In 2014, the New York City Taxi and Limousine Commission released anonymized datasets of about 173 million taxi trips. The anonymization, however, was so inadequate that a grad student was able to cross-reference this data with Google images on “celebrities in taxis in Manhattan” to find out celebrities’ pickup and drop-off destinations and even how much they paid their drivers.

    Another example is the Netflix Prize dataset contest, in which poorly anonymized records of customer data were reversed by comparing them against the Internet Movie Database. What is most alarming in situations like these is the information set that can be created by aggregating information from other sources. By combining indirectly related factors, one can reidentify seemingly anonymized personal data.
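    As a concrete, hypothetical illustration of such a linkage attack, here is a minimal Python sketch using pandas. The records and quasi-identifiers (ZIP code, birth year, gender) are entirely invented; the point is that a simple join against a public dataset can re-attach names to “anonymized” rows.

```python
# Hypothetical linkage attack: join an "anonymized" dataset to public records
# on quasi-identifiers. All values below are invented for illustration.
import pandas as pd

anonymized = pd.DataFrame({          # direct identifiers removed
    "zip": ["10001", "10001", "94105"],
    "birth_year": [1985, 1990, 1985],
    "gender": ["F", "M", "F"],
    "diagnosis": ["asthma", "diabetes", "hypertension"],
})

public = pd.DataFrame({              # e.g. a voter roll or social profile
    "name": ["Jane Roe", "Mary Major"],
    "zip": ["10001", "94105"],
    "birth_year": [1985, 1985],
    "gender": ["F", "F"],
})

# Rows that match on all three quasi-identifiers are re-identified.
reidentified = anonymized.merge(public, on=["zip", "birth_year", "gender"])
print(reidentified[["name", "diagnosis"]])
```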

    Research conducted by Imperial College London found that “once bought, the data can often be reverse-engineered using machine learning to reidentify individuals, despite the anonymization techniques. This could expose sensitive information about personally identified individuals and allow buyers to build increasingly comprehensive personal profiles of individuals. The research demonstrates for the first time how easily and accurately this can be done — even with incomplete datasets. In the research, 99.98% of Americans were correctly reidentified in any available ‘anonymized’ dataset by using just 15 characteristics, including age, gender and marital status.”

    As you fight to ensure both data security and data functionality, you find yourself in a bind, forced to trade one for the other. But does it have to be this way?

    To keep your business going, of course, you’re going to have to make compromises — but in a way that doesn’t challenge the security, performance or functionality of the data nor the compliance or privacy of the individual you’re trying to protect.

    We now know that data masking, anonymization and pseudonymization might not be as easy as we thought they would be. As a concept, each may seem deceptively simple, but approaching it in a simplistic, straightforward manner will limit your ability to scale your solution to fit the many ways your business will need to apply anonymization. This inability to scale masking methodologies ends up causing project failure.

    The focus really needs to be on reducing the risk of reidentification while preserving the functionality of the data (data richness, demographics, etc.). In part two, I will talk about how you can achieve this.

  • Data Security Challenges in the Financial Services Industry

    Data Security Challenges in the Financial Services Industry

    The industry most targeted by cybercriminals is the financial services industry. Because of the sheer volume of sensitive financial data carried by this industry, it serves as a hotspot for cyberattacks. 47.5% of financial institutions were breached in the past year, while 58.5% have experienced an advanced attack or seen signs of suspicious behavior in their infrastructure.

    There are also quite a few regulations that govern this industry in particular, such as PCI-DSS (Payment Card Industry Data Security Standard), GLBA (Gramm-Leach-Bliley Act, aka the Financial Services Modernization Act), and BCBS 239 (Basel Committee on Banking Supervision’s regulation number 239). Although the industry is heavily regulated, it has a notably high average data breach cost of $5.86 million [2]. So, data falling into unauthorized hands not only results in non-compliance for the organization but also puts it at financial risk due to the high cost of data breaches.

    Challenges

    Third-party risks

    The use of third-party vendors is one of the major forces behind the cybersecurity risk that threatens financial institutions [3]. Most organizations work with hundreds or even thousands of third parties, creating new risks that must be actively handled. The financial sector has massive third-party networks that pose new weak spots in cyber defense. In the next two years, we can expect to see an exponential rise in attacks on customers, partners, and vendors [4]. Without continuous monitoring and reporting, along with the use of critical tools to do so, organizations are vulnerable to data breaches and other consequences.


    Data transfers (cross-border data exchanges)


    The most fundamental challenge is to keep your private data private. Given that the financial sector produces and utilizes a massive amount of sensitive data, and is highly regulated, cybersecurity becomes paramount. Adequate measures are needed to protect data at rest, in use, and in motion.

    Security concerns

    Problems arise with data security when employees, security officials, and others tasked with protecting sensitive information fail to provide adequate security protocols. They may become careless about leaving their credentials around at home or in public places. Other issues arise when networks and web applications provided by institutions don’t have enough safeguards to keep out hackers looking to steal data.

    According to SQN Banking Systems, the five biggest threats to a bank’s cybersecurity include:
    • Unencrypted data
    • Malware
    • Non-secure third-party services
    • Manipulated data
    • Spoofing

    Evolution of technology and the threat landscape

    Technology evolves daily; what we’re using now might be obsolete in the coming year. At the same time, cybercriminals are also equipping themselves to face technological advancements head-on. Judging by the alarming number of cybercrimes in the past year, attackers are often more advanced than the technology we are using to defend against them. In a scenario where criminals are always one step ahead of the organization, blocking threats becomes a difficult task.

    Evolving customer needs and organizations

    Just as technology evolves, so do organizations and the way they function. Customer needs are ever-increasing, and customers want quick solutions. What they might not realize is that appealing technology comes with its own set of risks. Moreover, there probably isn’t a financial institution today that hasn’t explored digital and mobile platforms. As institutions continue to expand and use these platforms to serve their customers, they inadvertently increase their cyber risk exposure. Retaining customer confidence while meeting growing demand for newer technologies becomes a complicated process.

    Remaining compliant

    Alongside dealing with the challenges mentioned above, it is imperative that financial institutions also put in the effort to stay compliant with laws such as the GDPR and CCPA to avoid hefty fines and other consequences such as revenue loss, customer loss, reputation loss, and the like.


    Solutions

    • Monitor user activity for all actions performed on sensitive data in your enterprise.
    • Choose from different methods or select a combination of techniques such as encryption, tokenization, and static and dynamic data masking to secure your data, whether it’s at rest, in use, or in motion (a minimal sketch of consistent masking follows this list).
    • Before selecting techniques, perform sensitive data discovery, because if you don’t know where your data is, how will you protect it?
    • Deploy consistent and flexible data security approaches that protect sensitive data in high-risk applications without compromising the application architecture.
    • Your data security platform should be scalable and well integrated, consistent across all data sources, and able to span both production and non-production environments.
    • Finally, ensure the technology you’re implementing is well integrated with existing data protection tools for efficient compliance reporting and breach notifications.
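    To illustrate the “consistent masking” idea referenced in the list above, here is a minimal sketch of deterministic masking using a keyed hash. The key handling, field names, and token format are assumptions for illustration only; a real deployment would manage keys in a vault or HSM and use a purpose-built masking or tokenization product.

```python
# A minimal sketch of deterministic (consistent) masking: the same input
# always yields the same masked output, so joins and analytics still work,
# but the original value is not stored in non-production systems.
import hashlib
import hmac

MASKING_KEY = b"replace-with-a-securely-managed-key"  # illustrative only

def mask_account_number(account_number: str, keep_last: int = 4) -> str:
    """Replace an account number with a keyed digest, keeping the last few
    digits so the value stays recognizable for support and testing."""
    digest = hmac.new(MASKING_KEY, account_number.encode(), hashlib.sha256)
    token = digest.hexdigest()[:12].upper()
    return f"TOK-{token}-{account_number[-keep_last:]}"

# The same account always maps to the same token (referential consistency),
# and two different accounts map to different tokens.
assert mask_account_number("4111111111111111") == mask_account_number("4111111111111111")
assert mask_account_number("4111111111111111") != mask_account_number("5500005555555559")
```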


    Conclusion


    If financial institutions think they can dodge cyberattacks with mediocre data security strategies, recent heists have proved them wrong. And despite prevention and authentication efforts, many make the mistake of assuming that anomalous and unauthorized activity will cease to occur, which is unfortunately not the case. While cyber risk is inevitable, by implementing the right tools and a well-defined approach to cybersecurity, financial institutions can be better prepared as threats evolve.

    About Mage Data


    Our data and application security platform is a single integrated platform that protects sensitive data across its lifecycle, with modules for sensitive data discovery, static and dynamic data anonymization, data monitoring, and data minimization. Our solutions in data security are certified, tested, and deployed across a range of customers all over the globe. We have successfully implemented our solution in many large financial institutions, including a top private bank in the Dominican Republic, one of the largest Swiss banks, a global financial services software manufacturer, the world’s largest credit rating agency, and a top commercial bank in the United Arab Emirates.

    How a Swiss Bank is effectively handling data security

    A top Swiss Bank was looking to optimize costs by offshoring IT application development while ensuring compliance without compromising on sensitive data controls. This created a need for sensitive data discovery and masking across both production and non-production environments. In a highly regulated environment, they deployed our product suite, a sophisticated solution that met their needs, and were able to achieve secure cross-border data sharing and sensitive data assessment for cloud migration and compliance.

    You can download the case study here: Mage Customer Success Stories – Comprehensive Security Solution for a Top Swiss Bank

    References


    1) Bitdefender – Top security challenges for the Financial Services Industry in 2018
    2) Ponemon Institute – Cost of a Data Breach Report, 2019
    3) PwC report – Financial services technology 2020 and beyond: Embracing disruption
    4) Protiviti – The Cybersecurity Imperative: Managing Cyber Risks in a World of Rapid Digital Change
    5) NGDATA: The Ultimate Data Privacy Guide for Banks and Financial Institutions

  • Data Security Challenges in the Healthcare Industry

    Data Security Challenges in the Healthcare Industry

    The healthcare industry is constantly innovating, and has made significant improvements in technology over the last couple of years to enhance patient treatment. For example, the use of AI has changed the game for hospitals around the world, helping physicians make smarter decisions at the point of care, improving the ease and accuracy of viewing patient scans, and reducing physician burnout. [1]

    As more and more of the healthcare sector goes digital, it unfortunately becomes a tempting target for cybercriminals. Just like any other industry, the healthcare industry has faced its share of mega data breaches, such as those at Banner Health and Newkirk Products, where close to 4 million people were affected. [2] But unlike other industries, the healthcare industry faces the highest data breach cost, at $6.45 million. What’s even more alarming is that, at 329 days, it also takes the longest to identify and contain a breach. [3]

    Even with laws like HIPAA, which mandates strict standards and processes for the protection and confidential handling of PHI, compliance doesn’t ensure one hundred percent data security, and hence isn’t by itself enough to protect hospitals from cybercrime.

    Challenges

    Let’s go through some of the major data security challenges faced by medical institutions:

    • Transfer of Electronic Health Records (EHRs)

    The Health Information Technology for Economic and Clinical Health (HITECH) Act encourages healthcare providers to adopt EHRs and Health Information Exchanges (HIEs) so doctors can easily share data with their patients. However, this network of limitless medical information between numerous providers serves as a hotspot for hackers if not protected properly.

    • Maintaining compliance

    The HITECH Act offers incentives for EHR and HIE adoption. At the same time, it creates the responsibility of maintaining compliance. For instance, healthcare providers are required to notify their patients if there’s a breach of their unsecured data. In addition, healthcare institutions also have to comply with laws like HIPAA, as well as other data protection regulations like the GDPR or the CCPA, whichever applies to them.

    • Inability of end-user to protect medical information

    Apart from medical providers having to maintain compliance, the adoption of EHRs also poses a burden in terms of end-user errors. Once users access their medical data from the provider’s portal, the privacy of those records also becomes their responsibility. By sending unsecured data to anyone else, a user opens an easy path for hackers to get through. While healthcare organizations are bound by data security laws, the same cannot be said for users, who often, as an oversight, do not follow data security best practices.

    • The adoption of digital platforms to store, access and transfer data

    The digital progression is evident as a greater number of hospitals move their resources to the cloud and to mobile platforms. The COVID-19 pandemic has also fundamentally changed the face of care provision across the world. Telehealth adoption in the US, for instance, has grown around 3,000% since the start of the crisis, taking much of primary care to people’s homes rather than tying it to a doctor’s office or hospital. [4]

    • Inefficient IT infrastructure

    Nobody said running a hospital would be cost-efficient. In an episode of one of my favorite TV shows, Grey’s Anatomy, the chief of the hospital decides to cut back on fundamental necessities because the money went to expensive medical tech. Sadly, this is true for hospitals in the real world too. While spending adequate money on something like IT infrastructure may seem like a tough decision, or unimportant compared to all the other crucial activities that go on in a healthcare organization, it is better than facing the cost of a data breach.

    • Evolution of technology vis-à-vis the threat landscape

    As the healthcare sector continues to offer life-critical services while working to improve treatment and patient care with new technologies, criminals and cyber threat actors look to exploit the vulnerabilities that come with these changes. Apart from data breaches, the following are some of the sources of frustration for healthcare IT and cybersecurity specialists: [5]

    • Ransomware
    • DDoS attacks
    • Insider threats
    • Business email compromise
    • Fraud scams

    As healthcare institutions keep enhancing their technology, they inadvertently increase their cyber risk exposure. The COVID-19 outbreak has not provided any relief in this matter. The INTERPOL Cybercrime Threat Response team has detected a significant increase in the number of attempted ransomware attacks against key organizations and infrastructure engaged in the virus response. [6]

    Solutions

    Technology is largely what enables cybercrime, but technology is also what is needed to thwart it. People and processes are not enough; organizations should put the right technology in place to build a strong data security posture.

    • Monitor user activity for all actions performed on sensitive data in your enterprise.
    • Choose from different methods or select a combination of techniques such as encryption, tokenization, and static and dynamic data masking to secure your data, whether it’s at rest, in use, or in motion (a minimal sketch of the dynamic masking idea follows this list). Before this step, sensitive data discovery is a must, because if you don’t know where your data is, how will you protect it?
    • Deploy consistent and flexible data security approaches that protect sensitive data in high-risk applications without compromising the application architecture.
    • Your data security platform should be scalable and well integrated, consistent across all data sources, and able to span both production and non-production environments.
    • Finally, ensure the technology you’re implementing is well-integrated with existing data protection tools for efficient compliance reporting and breach notifications.
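    To make the dynamic masking idea referenced in the list above concrete, here is a minimal, hypothetical sketch: the stored record is left intact, and sensitive fields are redacted at read time based on the requester’s role. The roles, field names, and masking string are invented for illustration and do not describe any specific product’s behavior.

```python
# A minimal sketch of the dynamic data masking idea: the stored record is
# untouched, and sensitive fields are redacted at read time depending on
# the requester's role. Field names and roles are invented for illustration.

SENSITIVE_FIELDS = {"ssn", "diagnosis"}

def read_patient_record(record: dict, role: str) -> dict:
    """Return a view of the record; non-privileged roles see masked values."""
    if role == "physician":
        return dict(record)  # full, unmasked view for a privileged role
    masked = {}
    for field, value in record.items():
        masked[field] = "***MASKED***" if field in SENSITIVE_FIELDS else value
    return masked

record = {"patient_id": "P-1001", "ssn": "777-29-1234", "diagnosis": "asthma"}
print(read_patient_record(record, role="physician"))      # full record
print(read_patient_record(record, role="billing_clerk"))  # sensitive fields masked
```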

    Conclusion

    Cybercrime is a menacing threat for any industry, but more so for the healthcare sector, given the high cost of data breaches and the long duration it takes to identify a breach. The outcome of information theft is too great a risk, especially due to the ethical commitment medical providers share with their patients. Building a robust data security platform should be a principal goal of any hospital.

    About Mage Data

    The Mage Data platform comprises a comprehensive solution that protects sensitive data along its lifecycle in the customer’s systems, providing capabilities from Sensitive Data Discovery, masking, and monitoring to data retirement. Engineered with unique, scalable architecture and built-in separation of duties, it delivers comprehensive, consistent, and reliable data and application security across various data sources (mainframe, relational databases, unstructured data, big data, on-premises, and cloud).

    How a leading healthcare company in the US is effectively handling data security

    A leading provider of hospital medicine and related facility-based services had an Oracle environment storing information for more than 2,000 providers in 1,500 facilities. Due to the time required to manage the Oracle data masking tool that had been in place for two and a half years, they looked to the market for a data masking solution that offered ease of use and full automation.

    The organization noted several advantages to using the Mage Data Platform instead of Oracle DM, one of the main advantages being the time required to implement and run the software. Apart from a fully automated anonymization solution, the organization was also able to discover many hidden sensitive data locations with the Mage Data sensitive data discovery tool.

    Click here to read more: Mage Data Customer Success Stories

    References

    1 Cleveland Clinic Newsroom – Cleveland Clinic Unveils Top 10 medical Innovations for 2019
    2 Digital Guardian Data Insider – Top 10 Biggest Healthcare Data Breaches of All Time
    3 Ponemon Institute – Cost of a Data Breach Report, 2019
    4 Healthcare IT News – Digital transformation in the time of COVID-19
    5 Center for Internet Security (CIS) – Cyber Attacks: In the Healthcare Sector
    6 Forbes – Cyber Attacks Against Hospitals Have ‘Significantly Increased’ As Hackers Seek To Maximize Profits

  • Test Data Management Best Practices

    Test Data Management Best Practices

    Do your test data environments put Production data at risk of exposure?

    Since test data environments usually require real-world data to tackle complex issues, issues that may not be replicated with fake data, they present one of the most significant security risks to sensitive data. Credentials may not be as secure as for Production, and access may not be as stringently monitored. There’s too much access in some cases. And unauthorized access can reveal troves of production data or other information that can provide a foothold to greater access to protected data or systems.

    So how do we enable effective Test Data Management while minimizing risk?

    First of all, everyone likely agrees that access should be granted on the principle of least privilege (limited access to the test environment, and nothing else). Combine that with two-factor authentication as a second line of defense. So far, no problem.

    Second, don’t use real data (or mask it if you can’t avoid it).

    You have some useful options to minimize the risks of loading real data into a test data environment. Both data subsetting and data virtualization minimize risks while enabling efficiency. Using test data generation enables you to avoid loading real data altogether, and finally, data masking allows you to protect the real data. Let’s take a look at these options.

    • Data subsetting consists of taking a subset (usually much smaller than the whole) from one or more production databases. This small size is a significant advantage since it makes both test data distribution and testing much faster than a complete database clone. There are some challenges with this approach. For example, you must have a way of ensuring that your subset is representative of your entire dataset, and it must be referentially intact. And it still exposes Production data.
    • Data virtualization has a similar motivation to data subsetting at its core: take large production databases and make them efficient to distribute and test. However, while data subsetting does this by reducing the amount of data, virtualization leaves data in place, stored in different types of data models, and integrates it virtually. It doesn’t replicate data from source systems, but only stores the integration logic for viewing. So, there’s still some risk in this method.
    • Manual test data generation can be a tedious and time-consuming process; additionally, it can be difficult to manually ensure that all attributes are present in the data to make it “testable.”
    • Finally, synthetic data generation breaks with data subsetting and data virtualization by opting to disregard your production data for use as test data. Instead, it allows you to create your own “synthetic” test data. This test data will look real – and will be representative of your production data – while, at the same time, being entirely fake. The biggest obstacle is how to achieve this while making sure your test data covers a range of relevant test cases (a minimal generation sketch follows this list). A secondary concern is avoiding making the process so laborious that it loses any benefit over the manual creation of test data.
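    As a minimal sketch of synthetic test data generation, assuming the third-party Faker package (pip install Faker) and an invented customer schema, the idea is that records look realistic and pass format validations while corresponding to no real person.

```python
# Synthetic test data generation sketch, assuming the third-party Faker
# package. The schema is invented; every value is fake but realistic.
from faker import Faker

fake = Faker()
Faker.seed(42)  # reproducible data for repeatable test runs

def synthetic_customer() -> dict:
    return {
        "name": fake.name(),
        "email": fake.email(),
        "ssn": fake.ssn(),
        "address": fake.address().replace("\n", ", "),
        "signup_date": fake.date_between(start_date="-3y", end_date="today"),
    }

test_customers = [synthetic_customer() for _ in range(100)]
```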

    Each of these options has a drawback that, when you are looking to just get the job done, may mean loading real (production) data in your test data environment. And even with data subsetting and data virtualization, you will be distributing and exposing significant quantities of production data to your testers and leaving it exposed to unauthorized access.

    Anonymizing the data is the gold standard in these cases. To make anonymization (masking) successful, these key considerations must be kept in mind:

    1. Sensitive data discovery: apply a comprehensive discovery solution to find all of the data that needs to be masked.
    2. Referential integrity: ensure data instances remain consistent and functional during roll-out, and mask the data itself consistently across applications and databases (see the sketch after this list).
    3. Data for testing: developers and testers DO NOT need to see the real data. What they do require, however, is realistic data, which preserves formats and passes validations.
    4. Efficiency: to ensure efficiency in the masking process, consider performance constraints, security policies, and environmental limitations.
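    Here is a minimal sketch of consideration #2: applying the same deterministic masking function to the key column in every table, so masked rows still join correctly. The table contents are invented, and an unkeyed hash is used only for brevity; in practice a keyed or tokenized approach is preferable so the masked key cannot be brute-forced back to the original.

```python
# Referential integrity under masking: the same deterministic function is
# applied to the key column in every table, so masked rows still join.
# Table contents are invented; an unkeyed hash is used only for brevity.
import hashlib

def mask_key(value: str) -> str:
    # Deterministic: identical inputs always produce identical masked keys.
    return "CUST_" + hashlib.sha256(value.encode()).hexdigest()[:10]

customers = [{"customer_id": "C001", "name": "Jane Roe"},
             {"customer_id": "C002", "name": "John Doe"}]
orders = [{"order_id": "O-9", "customer_id": "C001", "amount": 120.0},
          {"order_id": "O-10", "customer_id": "C002", "amount": 75.5}]

masked_customers = [{**c, "customer_id": mask_key(c["customer_id"]),
                     "name": "MASKED"} for c in customers]
masked_orders = [{**o, "customer_id": mask_key(o["customer_id"])} for o in orders]

# Every masked order still resolves to exactly one masked customer.
customer_keys = {c["customer_id"] for c in masked_customers}
assert all(o["customer_id"] in customer_keys for o in masked_orders)
```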

    A note of warning: home-grown scripts for data masking are the path of least resistance but are not the most effective — they generally do not eliminate sensitive data and, worse, can cause inconsistency in masking rollouts.

    Conclusion

    Unless you are using synthetically generated data, you will need to a) find and b) anonymize any sensitive information within your test data before distributing it to your testers. This is usually achieved via comprehensive data Discovery and Static Data Masking capabilities, respectively. Dynamic Data Masking and encryption may also be used as ancillary capabilities to complete the toolkit. There’s no reason to expose data, even in subsets, when anonymization can create a realistic and useful test data environment.

    About Mage Data

    The Mage Data Test Data Management solution includes integrated and comprehensive Discovery, Static, and Dynamic Data Masking solutions, along with a data subsetting option. Additionally, with Mage Data Identities you can create generalized data sets from internal or external data sources, a process that is a lot more efficient and functionally capable than synthetic data generation. To read more about the Test Data Management market and vendors, download the Bloor TDM market update.

  • Differences between Anonymization and Pseudonymization

    Differences between Anonymization and Pseudonymization

    Under the umbrella of various data protection methods are anonymization and pseudonymization. More often than not, these terms are used interchangeably. But with the introduction of laws such as the GDPR, it becomes necessary to distinguish the two techniques clearly, as anonymized data and pseudonymized data fall under different categories of the regulation. Moreover, this knowledge also helps organizations make an informed choice in the selection of data protection methods.

    So, let’s break it down. Anonymization is the permanent replacement of sensitive data with unrelated values, which means that data, once anonymized, cannot be re-identified. Therein lies the difference between the two methods: in pseudonymization, the sensitive data is replaced in such a way that it can be re-identified with the help of an identifier (additional information). In short, while anonymization eliminates direct re-identification risk, pseudonymization substitutes the identifiable data with a reversible, consistent value.

    However, it is essential to note that anonymization may sometimes carry the risk of indirect re-identification. For example, let’s say you picked up the story The Open Window. The author’s name on the book is Saki. But this is a pen name. If you were to pick up another book of his, called The Chronicles of Clovis, you would notice that he has used his real name there, which is H. H. Munro, and that the writing style was similar. Hence, even though you didn’t know that the book was by Munro, you could put two and two together and find out that this is also a work by Saki based on the style of writing.

    The same example could also apply to a shopping experience, where you may not know the name of the customer who made a purchase but may be able to find out who it is if you can identify that this customer has had consistent buying behavior. Every day for the past year, Alex has visited the Starbucks at 1500 Broadway at 10:10 a.m. and ordered the same Tall Mocha Frappuccino. Hence, even if his personally identifiable information, such as name, address, etc., has been anonymized or eliminated, his buying behavior still allows you to re-identify him. Therefore, organizations should be meticulous when they anonymize sensitive data, and careful to hide any additional information that might aid re-identification.

    There are a variety of methods available to anonymize data, such as directory replacement (modifying the individual’s name while maintaining consistency between values), scrambling (obfuscation; the process can sometimes be reversible), masking (hiding part of the data with random characters; for example, pseudonymization with identities), personalized anonymization (custom anonymization), and blurring (making data values approximate so that their exact meaning is lost or re-identification becomes impossible). Pseudonymization methods include data encryption (changing original data into ciphertext, which can be reversed with a decryption key) and data masking (masking data while maintaining its usability for different functions). Organizations can select one or more techniques depending on the degree of risk and the intended use of the data.
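    A minimal Python sketch contrasting the two approaches (values and token formats are invented): pseudonymization retains the additional information needed to reverse the substitution, while anonymization retains nothing that links back to the original.

```python
# Contrasting pseudonymization (reversible with additional information)
# and anonymization (irreversible). Values and formats are illustrative.
import secrets

pseudonym_table = {}  # the "additional information"; a real system would
                      # store and protect this mapping (or key) separately

def pseudonymize(name: str) -> str:
    token = "PSN-" + secrets.token_hex(4)
    pseudonym_table[token] = name          # mapping retained
    return token

def reidentify(token: str) -> str:
    return pseudonym_table[token]          # only possible with the mapping

def anonymize(name: str) -> str:
    return "ANON-" + secrets.token_hex(4)  # no mapping retained: irreversible

token = pseudonymize("Jane Roe")
assert reidentify(token) == "Jane Roe"     # pseudonymized data is re-identifiable
anonymized_value = anonymize("Jane Roe")   # nothing links this back to the original
```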

    Mage Data approaches anonymization and pseudonymization with its leading-edge solutions, named Customers’ Choice 2020 by Gartner Peer Insights. To read more, visit Mage Data Static Data Masking and Mage Data Dynamic Data Masking.