
January 3, 2023

Your Data Protection Journey Should Start with True Data Discovery

One of the key services we offer here at Mage is robust data discovery, and for good reason. We have found that too many organizations try to unearth sensitive data simply by running regular expressions against their databases, and they inevitably come to us when those efforts fail.

So, for those organizations out there starting their data protection journeys, please learn from the lessons of others. You can save a lot of time and expense by using a robust data discovery tool. While it might be tempting to try a DIY solution, such solutions are likely to prove inadequate, for some well-known reasons.

Regular Expressions: The Basics

A regular expression, or regex, is simply a sequence of characters that specifies a search pattern. With one, a user can define rules that determine which strings the system should return. For example, regular expressions could be used to search a database to find:

  • All records with a last name
  • All records with a last name of “Smith”
  • All records with a last name beginning with “S”
  • All records with a last name beginning with “S” except those that are “Smith”
  • …and so on.
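In Python, for instance, the last two bullets above might look like this. This is a minimal sketch using a hypothetical list of last names:

```python
import re

records = ["Smith", "Stone", "Jones", "Sanchez"]

# All last names beginning with "S"
starts_with_s = [n for n in records if re.fullmatch(r"S\w*", n)]

# Beginning with "S" except "Smith" (a negative lookahead excludes it)
s_but_not_smith = [n for n in records if re.fullmatch(r"(?!Smith$)S\w*", n)]

print(starts_with_s)    # ['Smith', 'Stone', 'Sanchez']
print(s_but_not_smith)  # ['Stone', 'Sanchez']
```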

Naturally, these are not limited to names. Regular expressions can be used to query phone numbers, Social Security numbers, email addresses, credit card numbers…any data that is stored in a database.

This is why they sometimes get used in data discovery. Take Social Security numbers. Knowing the format of Social Security numbers, one could use a regular expression to find all such numbers in a data set and flag them for data masking. That will hide sensitive information, but only if you can catch all (and only) Social Security numbers.
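As a rough sketch, that find-and-mask step might look something like this in Python. The input text and the masking token here are hypothetical:

```python
import re

# Hypothetical masking step: replace each SSN-shaped match
# with a fixed token before the data is shared further.
text = "Customer 123-45-6789 called about order 4471."
masked = re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "XXX-XX-XXXX", text)

print(masked)  # Customer XXX-XX-XXXX called about order 4471.
```

Note that this only works for values that match the pattern exactly, which is precisely where the trouble starts.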

So regular expressions can be thought of as their own kind of programming language, albeit a very specialized language with a very limited number of operators. That simplicity makes them powerful, but also limited in many ways.

Using Regular Expressions (Regex) to Find Sensitive Data Patterns

If regex can be used to find any arbitrary string of data, how could it possibly be limited? There are two ways.

First, the user who created the regex query has to know what they are doing in order to capture all the relevant data. Take U.S. Social Security numbers (SSNs). One might be tempted to simply create a regular expression that captures any nine-digit number. But what if some of the SSNs have dashes and others do not? And what if some values are not valid SSNs at all? For example, no legitimate SSN begins with the digits 666 or 989. The person who writes the regular expression needs to account for all the different formats and combinations possible in the data.
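A sketch of what a more careful pattern might look like in Python, with optional dashes allowed and a few known-invalid area numbers excluded. This is still a simplification; the real SSN rules exclude other values as well:

```python
import re

# Hypothetical SSN pattern: nine digits with optional dashes,
# excluding invalid area numbers (000, 666, and 900-999).
SSN_RE = re.compile(r"\b(?!000|666|9\d\d)\d{3}-?\d{2}-?\d{4}\b")

print(bool(SSN_RE.search("SSN: 123-45-6789")))  # True
print(bool(SSN_RE.search("SSN: 123456789")))    # True
print(bool(SSN_RE.search("SSN: 666-12-3456")))  # False (invalid area number)
```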

That leads to the second problem: false positives. Other kinds of data can share the format of an SSN once all the variations are taken into account. For example, a telephone number missing a digit, or a national ID number, can also fit the pattern of an SSN. This leads to many pieces of data being flagged as sensitive when they really are not.
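To see the problem concretely, here is a minimal Python sketch: a plain nine-digit pattern happily matches values that are not SSNs at all. The sample values below are hypothetical:

```python
import re

# A simplified nine-digit pattern with optional dashes
NINE_DIGITS = re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b")

# A customer reference and a phone number missing a digit
# both match, even though neither is an SSN: false positives.
for candidate in ["Order ref: 412-78-9034", "Call 555123467"]:
    print(bool(NINE_DIGITS.search(candidate)))  # True for both
```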

But why are false positives a bad thing? If the organization is finding all of the sensitive data, what does it matter if some non-sensitive data is flagged as sensitive, too?

False Positives, Human Intervention, and the Investment of Time

In data discovery, one wants to decrease the number of false positives as much as possible. Too many false positives will overwhelm the search as your team loses the ability to discern actual sensitive data from false positives that live in your data set.

In fact, most organizations require additional human intervention to sift through the results and separate false positives from actual hits. This additional human effort takes time.

For example, suppose that a search using regex takes a full day to find the appropriate data. That might seem pretty speedy, but the human effort to weed out false positives might take 10 days thereafter. The data is ready, then, in 11 days.

Compare this to our more robust data discovery, which uses artificial intelligence and natural language processing to understand the context surrounding a piece of data, and which can discover sensitive data in unstructured fields as well. The process sounds like it will take longer: five days instead of one day with regex. But when the data discovery process is done, the human team needs only a single day to sort through the results and find the few remaining false positives. A process that used to take 11 days overall now takes only six.


Method                        Initial scan (hypothetical)   Human review (hypothetical)   Total time
Regular expressions (regex)   1 day                         10 days                       11 days
Mage Data Discovery           5 days                        1 day                         6 days
Difference                    +4 days                       −9 days                       5 days saved


In short, the quality of the data discovery process directly determines the time spent refining the results and weeding out false positives: the better the discovery, the less time the cleanup takes.

Other Issues with Regex

While time is certainly a factor, there are other issues with homebrew regex as well:

Lost distinctions. Regex cannot make distinctions that are not already spelled out. For example, it might be able to return possible credit card numbers, but it won’t specify whether they are Visa or Mastercard numbers. Yet there are plenty of reasons why one might want to know this—for example, appropriately masking data for further analytics.
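For example, a bare regex can locate 16-digit numbers, but telling Visa from Mastercard requires extra prefix rules layered on top of the pattern match. A minimal Python sketch, simplified to two rules (Visa numbers begin with 4; most Mastercard numbers begin with 51 through 55), using standard test card numbers:

```python
import re

# Find 16-digit candidates, then classify by issuer prefix
CARD_RE = re.compile(r"\b\d{16}\b")

def network(number: str) -> str:
    """Simplified issuer lookup based on leading digits."""
    if number.startswith("4"):
        return "Visa"
    if 51 <= int(number[:2]) <= 55:
        return "Mastercard"
    return "unknown"

for match in CARD_RE.finditer("4111111111111111 5500005555555559"):
    print(network(match.group()))  # Visa, then Mastercard
```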

Bad with unstructured data. Similarly, regex does not do well with unstructured data, or with data that lacks appropriate context. For example, there might be a mountain of sensitive data sitting within your email system, but regex would do a poor job of uncovering it, because the surrounding context is not available in a form a pattern match can use.

Where, not how. Regex can find sensitive data and show a user where that data “lives,” but it cannot uncover where the sensitive data came from, nor how it is flowing through the organization, unless that information is contained within the data, too (which it probably isn’t). More robust discovery tools can uncover this flow of data to anticipate future sources of sensitive data.

Skip the Homegrown Scripts

When sensitive data and compliance become an issue for an organization, it is all too common to bring more people onto a team to write scripts using regex for data discovery. The more complex and multi-dimensional the data, the more likely those homegrown scripts will fail.

Whether this has happened to your organization, or you are still considering the possibility, know that there is another alternative.

While contextual data discovery might sound costly and time-consuming at first, it will, in the end, save the entire organization time and money.

If you want to find out more, contact us today.
