Organizations are in a race to understand their risk from regulations for the protection and use of personal data, which are on the rise worldwide. With the GDPR in Europe, the CCPA in California, KVKK in Turkey, and several other regulatory actions being enacted around the world, corporations now face serious fiscal consequences, as well as reputational damage, in the event of a data breach.
At the same time, other requirements, such as the GDPR's data subject access request regulation, freedom of information requests in the US, corporate governance rules, and data loss prevention measures, require corporations to allow for discovery of this sensitive data quickly and securely.
Potentially risky personally identifiable information (PII) can live in the obvious unstructured repositories, such as file systems and SharePoint, but it can also be held as objects inside structured databases.
This all leads to the need for a process in which PII files can be easily identified, indexed, and retrieved.
The ideal end-to-end process should look something like this:
- Discover repositories and identify files.
- Extract all the metadata and, more importantly, the content from the file.
- Analyze file content and metadata for specific entities and/or classify the file based on conceptual content.
- Apply business rules on results of the analysis to place the file in a defined category.
- Take action on the file according to the policies defined for that category.
The overall process is complex. Focus on its central part, file analysis, and in particular on the ability to locate PII entities within file contents, and you'll be one step closer. Here's how.
Understand grammars for entities
Grammars are used to describe the entities you are trying to identify, with two basic types available: curated and user-generated.
Here are the grammars to focus on:
- PII—Personally identifiable information, including 13 categories of entities across 38 different countries
- PHI—Personal health information, normally associated with the North American health industry
- PCI—Personal credit card information
- PSI—Personal security information, such as account details and access keys
Look for curated, optimized grammars that cannot be modified by the user. These grammars use context and landmarks (phrases, single words, or even just characters) to provide more accurate results. Each match is assigned a confidence score that can later be used to filter out false positives.
The score is built from the landmark's proximity to the candidate entity and from the strength of the context, judged with natural-language-processing (NLP) techniques. Comprehensive lists of certain entities are also key; knowledge of how common each value is in a given country is used to adjust the confidence scores.
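As a rough illustration of context-and-landmark scoring (the landmark list, candidate pattern, window size, and score values below are all invented for this example), a bare format match can receive a low base confidence that is boosted when a landmark appears nearby:

```python
import re

# Hypothetical landmark phrases that raise confidence when found
# near a candidate match.
LANDMARKS = ["ssn", "social security", "tax id"]

# Candidate pattern: a US-style SSN format. On its own it is
# context-free and therefore prone to false positives.
CANDIDATE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def score_candidates(text, window=40):
    """Return (match, confidence) pairs; confidence rises when a
    landmark appears within `window` characters of the match."""
    results = []
    lowered = text.lower()
    for m in CANDIDATE.finditer(text):
        start = max(0, m.start() - window)
        context = lowered[start:m.end() + window]
        confidence = 0.5  # base score for a bare format match
        if any(l in context for l in LANDMARKS):
            confidence = 0.9  # landmark nearby: much stronger signal
        results.append((m.group(), confidence))
    return results
```

A downstream filter can then drop everything below a chosen confidence threshold instead of treating every format match as a hit.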
When such grammars do not completely cover your use case, use bespoke configurations that allow users to create their own. User grammars can be defined using regular expressions (regex) or simple lists, for example. These definitions can then be compiled into a finite state machine (FSM) prior to execution to improve performance.
Multiple grammars can be combined into a single FSM to save resources, but you cannot combine curated and user grammars, due to the optimizations already applied to the curated grammars. Any required modification to the curated grammars must be requested from the vendor, so it can be added to the product and improve the corpus.
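The idea of compiling several user grammars into one combined matcher can be sketched with Python's `re` module standing in for the FSM (the grammar names and patterns here are hypothetical):

```python
import re

# Hypothetical user-defined grammars: one regex format and one
# simple list, expressed as alternation.
USER_GRAMMARS = {
    "employee_id": r"EMP-\d{6}",
    "project_code": r"(?:ALPHA|BRAVO|DELTA)-\d{3}",
}

# Combine all user grammars into one compiled pattern, tagging each
# with a named group so every hit can be attributed to its grammar.
COMBINED = re.compile(
    "|".join(f"(?P<{name}>{pat})" for name, pat in USER_GRAMMARS.items())
)

def scan(text):
    """Yield (grammar_name, matched_text) for every hit in one pass."""
    return [(m.lastgroup, m.group()) for m in COMBINED.finditer(text)]
```

Compiling once and scanning once is the point: the document is traversed a single time no matter how many user grammars are defined.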
Categorization for files
Conceptual classification of the file content as a whole can also be used to classify the document, for example as an HR, Finance, or Travel type. This can be used along with found entities when you apply the business rules to get a more accurate result. With machine learning and guided learning on sample documents, you can help define the classifications to be used.
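A real system would train a machine-learning classifier on sample documents; a toy keyword-scoring stand-in shows how the resulting category label could feed the later business rules (the categories and term lists are purely illustrative):

```python
# Toy stand-in for a trained classifier: real systems use machine
# learning with guided learning on labeled samples, but keyword
# scores illustrate how a category label is produced and consumed.
CATEGORY_TERMS = {
    "HR": {"salary", "employee", "benefits"},
    "Finance": {"invoice", "payment", "ledger"},
    "Travel": {"itinerary", "flight", "hotel"},
}

def classify(text):
    """Return the best-matching category, or None if nothing matches."""
    words = set(text.lower().split())
    scores = {cat: len(words & terms) for cat, terms in CATEGORY_TERMS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

A business rule can then combine the label with found entities, e.g. "an HR document containing a national ID number is high risk."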
Scanned documents and audio recordings
Text-based documents are not the only potential source of PII risk. Scanned documents and, to a lesser extent, recorded conversations, are also common today. These files need to be pre-processed prior to applying the PII discovery techniques discussed above.
Scanned paper documents stored as images, possibly inside a PDF file, should be processed with OCR to extract the text and, ideally, the associated structural information. Many organizations have scanned ID documents such as driver's licenses or passports held on record for employees.
A combination of face detection and OCR can be used to discover these files. Note that face recognition (identifying who is pictured) is not needed at this stage; it is only required when fulfilling data subject access requests (DSARs). Audio recordings require processing by a speech-to-text engine, again to provide a transcript for analysis.
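The preprocessing routing can be sketched as a simple dispatcher (the extensions and step names are illustrative, and the actual OCR and speech-to-text engines are whatever you deploy, such as Tesseract or a cloud transcription service):

```python
from pathlib import Path

# Hypothetical preprocessing dispatch: image and audio files must be
# converted to text before the PII analysis described above.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".tiff"}
AUDIO_EXTS = {".wav", ".mp3", ".flac"}

def preprocessing_step(path):
    """Return which converter a file needs before text analysis."""
    ext = Path(path).suffix.lower()
    if ext in IMAGE_EXTS:
        return "ocr"             # scanned page -> OCR engine
    if ext in AUDIO_EXTS:
        return "speech-to-text"  # recording -> transcription engine
    return "direct"              # already text-bearing
```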
Speed is king
Speed is important, especially for a data loss prevention (DLP) application, but also just to reduce the resource footprint.
Certain entities are simple and therefore fast to search for (e.g., an eight-digit passport number). More complex entities, such as names or addresses, are significantly more demanding in resource utilization. In certain circumstances, you can search for a simpler grammar as an indicator of a more complex entity's presence. This creates candidate windows within your document where you apply the slower, complex search; for example, use ZIP codes to find candidate windows for addresses. This method is ideal for documents with low or even no hits, where it will significantly reduce processing times.
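Here is a minimal sketch of the candidate-window technique, using a cheap ZIP-code pattern to gate a stand-in "expensive" street-address search (both patterns and the window size are simplified assumptions; a real address grammar is far more costly than this regex):

```python
import re

# Fast indicator: a US ZIP code is cheap to find.
ZIP = re.compile(r"\b\d{5}(?:-\d{4})?\b")

# Stand-in for the slow, complex search; a real address grammar
# would be much more expensive than this simple pattern.
STREET = re.compile(r"\b\d+\s+\w+\s+(?:St|Ave|Rd)\b")

def find_addresses(text, window=80):
    """Run the expensive address search only inside windows around
    cheap ZIP-code hits, instead of over the whole document."""
    hits = []
    for z in ZIP.finditer(text):
        start = max(0, z.start() - window)
        chunk = text[start:z.end() + window]
        hits.extend(m.group() for m in STREET.finditer(chunk))
    return hits
```

On a document with no ZIP codes, the expensive pattern never runs at all, which is exactly where the savings come from.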
Another way to reduce resource requirements is to use entity aggregation to trigger an early exit of the analysis. This involves looking at the count and type but not the value of an entity. Once you have found sufficient entities, you can flag the document as sensitive and stop further analysis (for example, one name and one address or two credit card numbers).
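The early-exit aggregation could look like this sketch, where only entity types and counts are tracked (never the values) and the thresholds encode the example rules above:

```python
# Early-exit aggregation: count entity types, not values, and stop
# scanning once any sensitivity threshold is met. The thresholds
# below encode the examples from the text and are illustrative.
THRESHOLDS = [
    {"name": 1, "address": 1},   # one name plus one address
    {"credit_card": 2},          # or two credit card numbers
]

def scan_until_sensitive(entity_stream):
    """Consume (entity_type, value) pairs; return True and stop as
    soon as any threshold combination is satisfied."""
    counts = {}
    for etype, _value in entity_stream:
        counts[etype] = counts.get(etype, 0) + 1
        for rule in THRESHOLDS:
            if all(counts.get(t, 0) >= n for t, n in rule.items()):
                return True  # sensitive: no need to scan further
    return False
```

Because the loop returns the moment a rule is satisfied, a long document full of PII is flagged after its first few entities rather than being analyzed to the end.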
What are the gotchas?
You must fully understand the business outcome you need to achieve in order to configure PII detection optimally. Evaluate the cost of false positives and false negatives in your results, and use this information to make trade-offs between time/resources and accuracy. Regardless of how well you set up a system, you will get false positive results. This is often because certain entities of interest have a very common format, with no checksum or context to distinguish them.
Another problem is with names of people, since almost any set of letters is a potentially valid name somewhere. The ability to narrow down the region of interest and then set some rules on common and uncommon names will prevent significant time being wasted on false positives. This comes at a cost, though, since one day that weird letter combination may well be a name and you will have missed it.
A particular type of bank code with a very simple format can potentially lead to a large volume of false positives. The entity includes a country code that can potentially be restricted to only probable bank locations, allowing a much tighter format check to be applied.
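A sketch of that tightening, using a BIC/SWIFT-style bank code whose fifth and sixth characters are a country code (the set of expected countries below is an assumption you would derive from your actual business footprint):

```python
import re

# A BIC/SWIFT-style code is just 8 or 11 characters with a loose
# format, so bare matches are noisy. Restricting the embedded
# country code to locations you actually expect tightens the check.
BIC = re.compile(r"\b[A-Z]{4}([A-Z]{2})[A-Z0-9]{2}(?:[A-Z0-9]{3})?\b")
EXPECTED_COUNTRIES = {"DE", "FR", "GB", "TR"}  # illustrative

def find_bank_codes(text):
    """Return BIC-format matches whose country code is plausible."""
    return [
        m.group()
        for m in BIC.finditer(text)
        if m.group(1) in EXPECTED_COUNTRIES
    ]
```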
Tables are another source of entity-detection problems. A table may have well-labeled column headers that match the names of defined entities, but the header landmark can sit far from the actual entries in the table. You can solve this with careful use of entity-name detection alongside entity detection.
Finally, on occasion entities are split across columns or lines in a document. Where a forename/surname combination may provide a strong confidence factor, we now get a weaker match for each separate part. Again, the clever use of landmarks and context with post-filtering can help by reassembling the parent entity.
I found it. Now what can I do with it?
You now need to decide what action, if any, to take on a file identified as having PII content at risk. The following common options are available, driven by corporate governance:
- Delete the data. If there is no need to keep the file, remove it. Is it too old? Has the customer requested that his or her data be destroyed? The important thing here is to maintain an audit trail of both what you did and why you did it.
- Secure the data. If you need to keep the data, then secure it. This may be in the form of changing access controls or encrypting the data. A further option is to move it to a secure location such as a records management tool.
- Redact the data. You may need to keep some of the data but not the PII. Redaction can be used to create a clean copy of the original file that has no PII content left to read. The original file is then deleted or secured as described above.
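Redaction of previously located spans can be sketched in a few lines (the mask text is arbitrary, and the span offsets are assumed to come from the analysis phase described earlier):

```python
# Minimal redaction sketch: overwrite previously located PII spans
# with a fixed mask, producing a clean copy of the text.
MASK = "[REDACTED]"

def redact(text, spans):
    """Return a copy of `text` with each (start, end) span masked.
    Spans are replaced right to left so earlier offsets stay valid."""
    out = text
    for start, end in sorted(spans, reverse=True):
        out = out[:start] + MASK + out[end:]
    return out
```

The original file is then deleted or secured, leaving only the clean copy in general circulation.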
Take action now
PII is pervasive; it can be found everywhere, and if you don't find it first, a hacker might. Take action now to reduce the risk to your organization with AI-powered unstructured data processing and analysis.
Keep learning
- Understand the newest privacy laws with this webcast: California's own GDPR? It's not alone.
- Take a deep dive into the new privacy laws with TechBeacon's Guide to GDPR and CCPA.
- Get up to speed on cloud security and privacy, and on selecting the right encryption and key management, with TechBeacon's Guide.