July 15, 2022
Discovering 100% of Sensitive Data: Fact or Myth?
Discovering 100 percent of sensitive data is a goal that all companies should strive for. Handling sensitive data carelessly is like running a food processing line that removes only 99 percent of a contaminant from its products. Would you want to eat something that was 1 percent contaminated? Likewise, if even 1 percent of your sensitive data is missed and then improperly processed or stored, your company can be subject to heavy fines, and the risk of a data breach rises sharply.
To protect yourself and your company, you need a rock-solid sensitive data discovery process so you can be confident that you’re taking the necessary steps to maintain legal compliance and protect your users.
What is sensitive data discovery?
One common mistake that companies make when first approaching data discovery is treating it as a tool. Instead, it’s better to think of data discovery as a process. In the normal course of business, data will be collected, categorized, and sorted into a database for storage. In an ideal world, all sensitive data would be identified at this point in the process.
However, sensitive data can be missed or miscategorized for many different reasons. An incorrect setup is a prime suspect, but errors, like a mistake in a user entry or a data point that looks different enough from the typical one of its kind, can also cause data to go unnoticed. For example, an email address that a user mistypes as “john2email.com”, with a “2” in place of the “@”, might bypass an initial screening but would still be obvious to a hacker or anyone else with unauthorized access to the data.
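To make the mistyped-address example concrete, here is a minimal Python sketch of an initial screening step. The pattern `EMAIL_PATTERN` and the function `looks_like_email` are hypothetical names chosen for illustration, and the pattern is deliberately simple; real screening rules vary from one setup to another.

```python
import re

# Hypothetical, deliberately simple screening rule:
# some text, an "@", some text, a ".", some text.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def looks_like_email(value: str) -> bool:
    """Return True if the whole value matches the simple screening rule."""
    return EMAIL_PATTERN.fullmatch(value.strip()) is not None


# One well-formed address (illustrative) and the mistyped one from the text.
entries = ["jane@example.org", "john2email.com"]

flagged = [entry for entry in entries if looks_like_email(entry)]
missed = [entry for entry in entries if not looks_like_email(entry)]

print("flagged as sensitive:", flagged)
print("slipped through:", missed)
```

In this sketch the mistyped entry lands in `missed`, even though a person reading the record would recognize it as an email address right away.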
So, data discovery is the process of identifying all sensitive data, including that which slipped through the cracks, so that it can be properly labeled, categorized, and protected.
How do companies “do” data discovery?
Unfortunately, some companies fail to do data discovery at all. These companies often have more serious data breaches or, at the very least, fines for noncompliance. For example, in 2018, VTech, a children’s toy manufacturer, settled with the FTC for $650,000 over allegations that it had collected personal information from children without parental consent. And in 2017, Vizio paid $2.2 million to settle charges over pre-installed software that shared user information with third parties without first obtaining consent.
While you may not feel that you’re doing as poorly in data security as these companies, your methods may still not be pristine. For example, some companies use RegEx or custom scripts for their data discovery. This method may sometimes work, but as you move toward discovering sensitive data at the enterprise level, with millions or billions of data points, these methods break down.
A practically unlimited number of edge cases can occur when dealing with data, and it’s unlikely that a homegrown data discovery solution deals with them all. To deal with that problem, companies using RegEx must either constantly evaluate their results and update their RegEx, which is expensive in terms of labor and time, or accept that they’re letting some data through, which can be a costly mistake.
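One way to picture the “constantly evaluate” step is to run a RegEx rule against a small hand-labeled sample and measure how much sensitive data it misses. The sketch below is only an illustration: the pattern, the sample values, and the helper names `recall` and `missed_values` are all hypothetical.

```python
import re

# Hypothetical homegrown rule for spotting email addresses.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

# Hypothetical hand-labeled sample: (value, is_sensitive).
LABELED_SAMPLE = [
    ("jane@example.org", True),
    ("john2email.com", True),         # mistyped address, still personal data
    ("jane (at) example.org", True),  # obfuscated address
    ("order-1042", False),
    ("blue widget", False),
]


def recall(pattern, sample):
    """Share of truly sensitive values that the pattern flags."""
    sensitive = [value for value, is_sensitive in sample if is_sensitive]
    if not sensitive:
        return 1.0  # nothing to find, so nothing was missed
    caught = [value for value in sensitive if pattern.search(value)]
    return len(caught) / len(sensitive)


def missed_values(pattern, sample):
    """Sensitive values that the pattern fails to flag."""
    return [
        value
        for value, is_sensitive in sample
        if is_sensitive and not pattern.search(value)
    ]


print(f"recall: {recall(EMAIL_PATTERN, LABELED_SAMPLE):.0%}")
print("missed:", missed_values(EMAIL_PATTERN, LABELED_SAMPLE))
```

Every value in the missed list is an edge case that someone has to notice, write a new rule for, and re-test, which is where the labor cost comes from.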
Finally, most companies use an externally-sourced tool for handling their data discovery. Once your dataset grows so large that humans can’t review it, it’s no longer possible to be 100 percent confident that you’re catching every possible piece of sensitive data. However, your actions can increase or decrease your confidence that you’re hitting that goal. In that context, your sensitive data discovery tool can make your company’s data as secure as possible or merely mislead you into believing that’s the case.
How to evaluate data discovery tools
Because data discovery tools can look like they’re doing a good job even when they’re not, evaluating them beforehand to choose the right one for your business is incredibly important. While data discovery tools can come with a variety of different features, these are some that help make your life easier:
Automatic data identification
One of the biggest challenges with any new data discovery tool is making sure it identifies all personal data. Some tools automatically identify some data types right out of the box, making it faster and easier to get started.
Compliance reporting
Do you deal with HIPAA, GDPR, or the CCPA? If so, you’ll want a data discovery tool that both identifies the related data points and can generate a report showing that you’re maintaining compliance.
Scalability
If your company is growing, you need a data discovery tool that scales with you. What works quickly with hundreds of thousands of data points could slow to a crawl when dealing with millions or billions of data points. When you’re scaling, make sure you have a tool ready to scale with you.
Scheduled scanning
Part of the power of data discovery tools is that the best ones run without you having to manually request scans. If you or your employees have to trigger scans of your data manually, you’re losing time that could be better spent on other tasks. Additionally, some data discovery programs have the option to use incremental scans that only run on new data added since the last scan, saving resources and time.
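As a rough illustration of the incremental scan idea, the Python sketch below only looks at records created since the previous scan and remembers a timestamp “watermark” between runs. The `Record` class, the `incremental_scan` function, and the simple email pattern are assumptions made for this example, not a description of how any particular product works.

```python
import re
from dataclasses import dataclass
from datetime import datetime

# Hypothetical detection rule, kept simple for the example.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


@dataclass
class Record:
    record_id: int
    created_at: datetime
    text: str


def incremental_scan(records, last_scan_at):
    """Scan only records created after last_scan_at.

    Returns the ids of flagged records and the new watermark.
    Passing None as last_scan_at scans everything.
    """
    new_records = [
        r for r in records
        if last_scan_at is None or r.created_at > last_scan_at
    ]
    flagged = [r.record_id for r in new_records if EMAIL_PATTERN.search(r.text)]
    # Keep the old watermark if no new records arrived.
    watermark = max((r.created_at for r in new_records), default=last_scan_at)
    return flagged, watermark


records = [
    Record(1, datetime(2022, 7, 1), "contact jane@example.org"),
    Record(2, datetime(2022, 7, 10), "order shipped"),
    Record(3, datetime(2022, 7, 14), "reply to sam@example.org"),
]

# First run: no watermark yet, so every record is scanned.
first_flagged, watermark = incremental_scan(records, None)

# New data arrives; the second run only touches record 4.
records.append(Record(4, datetime(2022, 7, 15), "new signup: lee@example.org"))
second_flagged, watermark = incremental_scan(records, watermark)

print("first scan flagged:", first_flagged)
print("second scan flagged:", second_flagged)
```

A scheduler would call a function like this on a timer, which is what removes the need for anyone to trigger scans by hand.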
Artificial Intelligence/Natural Language Processing
One of the biggest weaknesses of traditional sensitive data discovery tools is handling edge cases. With Artificial Intelligence, data discovery tools can better identify sensitive data, even when it doesn’t perfectly match a previously seen pattern. And with natural language processing, these tools can find sensitive data in unstructured sources, like log files, that would otherwise go undetected.
Doing sensitive data discovery right
So many new data privacy laws have emerged in the last five years that if your data discovery platform is just a few years old, it may already be out of date. And if that’s the case, then discovering 100 percent of your sensitive data is a myth. It won’t happen, even if it looks like it’s happening on the surface.
The growing complexity of data privacy regulation means that you need a tool that is designed for proactive responses to future changes. Here at Mage, we know that you need to be confident that your data discovery tool is finding 100% of all sensitive data. We’ve built a robust tool with automatic data identification, compliance report generation, scheduled scanning, AI and NLP tools, and the ability to scale with your business. And not only do we offer GDPR, CCPA, and HIPAA compliance, but we already support compliance with regulations like PDPA and LGPD, with more to come as new laws are passed.
Don’t wait until something terrible happens to discover if your data discovery process is behind the times! Schedule a demo today to see what Mage can do to help keep your company and your users safe.