Use data masking to ensure secure—and compliant—software delivery

Manuel Pais IT Organizational Consultant, Independent
Matthew Skelton Founder and Head of Consulting, Conflux
 

Let the software development organization that has never had a production issue it couldn't reproduce in any other environment cast the first stone.

With ever more complex and interconnected systems, and more digital channels and stacks than ever, controlling all the variables that can contribute to an issue—even a minor one—is simply impossible. Adding data storage and replication to the mix makes matters even worse.

One way to take data out of the system-complexity equation is to capture a dataset from your production environment and load it into your testing or development environment. People do this not only to reproduce production-only issues, but also to run performance tests, and sometimes even functional or exploratory tests. The catch is that production data usually contains personal or otherwise sensitive information that should not be exposed outside production. That's where data masking comes to the rescue.

How data masking protects your data

Data masking (or obfuscation) is the process of preventing unauthorized use of protected data by replacing, anonymizing, or removing sensitive pieces of information.
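As a rough illustration of the three options in that definition, the sketch below replaces one field, tokenizes another, and removes a third. The function names, the `tok_` prefix, and the key are assumptions made up for this example. The keyed hash stands in for "anonymizing" and is strictly a form of pseudonymization, so treat it as one possible technique rather than a recommended one.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; a real key would come from a secrets store.
MASKING_KEY = b"example-key-not-for-production"


def tokenize(value: str) -> str:
    """Turn a value into a stable token that does not contain the original text."""
    digest = hmac.new(MASKING_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "tok_" + digest[:12]


def mask_record(record, substitutes, tokenize_fields, drop_fields):
    """Return a masked copy of record; the original dict is left untouched."""
    masked = dict(record)
    for field, value in substitutes.items():   # replacing
        masked[field] = value
    for field in tokenize_fields:              # anonymizing (keyed token)
        masked[field] = tokenize(str(record[field]))
    for field in drop_fields:                  # removing
        masked.pop(field, None)
    return masked


record = {
    "DisplayName": "Kevin Dente",
    "Location": "Oakland, CA",
    "AboutMe": "Independent software engineer",
    "DownVotes": 4,
}

masked = mask_record(
    record,
    substitutes={"DisplayName": "Crystal Dixon"},
    tokenize_fields=["Location"],
    drop_fields=["AboutMe"],
)
print(masked)
```

Fields that are not listed, such as DownVotes here, pass through unchanged, so non-sensitive data stays usable.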

Data protection requirements come from multiple sources, including state laws, national and international regulations from industry agencies, and internal corporate policies (e.g., on commercially sensitive data such as payment information), to name just a few. The format and content of these regulations differ by region. For example, the Fair and Accurate Credit Transactions Act (FACTA) in the US differs from the comprehensive European Data Protection Directive 95/46/EC, and health insurance and healthcare providers in the US must comply with the Health Insurance Portability and Accountability Act (HIPAA).

These regulations also evolve over time, so regulated organizations must keep track of changes and ensure that they remain compliant. Automating the data-masking process speeds up not only the software delivery in these organizations but also the adoption of and compliance with new or extended regulations.

Data protection gone wrong

What happens when companies overlook data protection regulations and controls? In the best case, they can be fined and obliged to show proof of having changed their ways. But such penalties pale in comparison (and cost) with the reputational damage when data breaches are exposed to the public.

Unfortunately, many companies are still putting their reputation on the line, either by taking on security debt (e.g., repeatedly overlooking or postponing security concerns) or, worse, by being unaware of their data protection obligations (and the implications across the entire application delivery lifecycle and its environments). In today’s world, even startups just beginning to gain traction are facing data breach attempts.

Recent high-visibility data breaches have raised public (and governmental) awareness of these issues. User authentication and financial information are often the focal points for security teams and officers in most organizations, but they are still confronted with successful attacks on a regular basis. A prime example is the Equifax security breach disclosed in September 2017, in which at least 143 million Social Security and demographic records (including addresses and, in some cases, driver’s license numbers) were compromised, along with more than 200,000 credit card numbers.

But other kinds of personal information can be equally destructive, depending on the industry. The Ashley Madison data breach in 2015 not only caused the fall of the popular dating business (with more than 30 million users), but also damaged many personal lives and marriages, as leaked information on height, weight, and even erotic preferences was sufficient to identify members of the site.

Regulatory changes are a-coming

It’s no surprise then that governmental bodies are stepping up data protection regulations beyond financial and health data.

In particular, the European Union's new General Data Protection Regulation (GDPR), which goes into effect in May 2018, will require companies to adopt a comprehensive data governance approach, including data profiling, data quality, data lineage, data masking, test data management, data analytics, and data archival.

The data protection reach will extend to such things as genetic data, email addresses, IP addresses, and more. Also, users will have finer-grained rights on protected data kept by companies. Explicit consent will be required for an organization to use individual pieces of information such as email addresses and phone numbers, and their combined use will also require explicit consent. This alone will require a novel approach to the data access layer, possibly integrating with not-yet-available tools for dynamic data access control.

Another novelty relates to breach reporting. Companies must report breaches to the supervisory authority within 72 hours of discovery. They must also inform affected users without undue delay, unless they can show that the data was encrypted and/or masked in such a way that an attacker cannot reverse engineer it.

Integrating with development workflow

GDPR penalties can go as high as 4% of global annual revenues or 20 million euros, whichever is higher (although actual amounts will vary according to the type of breach and follow the pattern set by initial court decisions). The bottom line: Your company could see its existence compromised if it mishandles customer data.

Given the scale of the data breaches occurring today, it’s fair to say that nearly all organizations face a long road to compliance. You must step up security and data management skills, keep customer data out of nonproduction environments, and tighten controls on access to production data. You must do so not just to stop security breaches right now, but also to build internal capabilities in this area and to expand them as needed for GDPR compliance. Organizations able to demonstrate repeatable compliance procedures will benefit from a lighter regulatory hand if faced with data breaches or mishandling.

In terms of data management, integrating existing methods (such as data masking) and tools (such as SQL Data Generator) into the application delivery lifecycle in an automated fashion is crucial. Continuous delivery already helps improve traceability and auditability of changes to software. There is no reason why data management shouldn’t follow the same path.

To illustrate, consider the example of a publicly available data dump containing Stack Overflow posts. Once you duplicate the original database, you can alter the copy by replacing each field that needs to be masked with new values of the same type, generated by SQL Data Generator.

The Users table includes several user data fields:

SELECT AboutMe, Age, DisplayName, DownVotes, Location, Reputation, UpVotes, Views, WebsiteUrl
FROM StackOverflow.dbo.Users
WHERE Id=9

AboutMe | Age | DisplayName | DownVotes | Location | Reputation/UpVotes/Views
Independent software engineer | 45 | Kevin Dente | 4 | Oakland, CA | 10272443603


By obfuscating the fields AboutMe, Age, DisplayName, and Location (replacing them with new values of the same data type), we get, for example:

SELECT AboutMe, Age, DisplayName, DownVotes, Location, Reputation, UpVotes, Views, WebsiteUrl
FROM [StackOverflow-Obfuscated].dbo.Users
WHERE Id=9

AboutMe | Age | DisplayName | DownVotes | Location | Reputation/UpVotes/Views
dolor culpa sit tempor | 26 | Crystal Dixon | 4 | Des Moines, IA | 10272443603


Note how it's now impossible to identify the user, but you still keep all the data that the algorithms in Stack Overflow's application logic use (DownVotes, Reputation, UpVotes, Views).
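The same idea can be sketched in a few lines of Python against an in-memory SQLite database that stands in for the duplicated copy. This is only an approximation of what a tool such as SQL Data Generator does. The substitute value lists, the second sample row, and the function names are invented for this example, and the table keeps only a subset of the columns shown above.

```python
import random
import sqlite3

# Hypothetical substitute values; a data generator tool would supply far richer ones.
FAKE_ABOUT = ["dolor culpa sit tempor", "lorem ipsum dolor", "amet consectetur"]
FAKE_NAMES = ["Crystal Dixon", "Alex Rivera", "Sam Patel"]
FAKE_LOCATIONS = ["Des Moines, IA", "Austin, TX", "Leeds, UK"]


def build_sample_copy() -> sqlite3.Connection:
    """Create a tiny stand-in for the duplicated Users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Users (Id INTEGER PRIMARY KEY, AboutMe TEXT, Age INTEGER, "
        "DisplayName TEXT, DownVotes INTEGER, Location TEXT)"
    )
    conn.executemany(
        "INSERT INTO Users VALUES (?, ?, ?, ?, ?, ?)",
        [
            (9, "Independent software engineer", 45, "Kevin Dente", 4, "Oakland, CA"),
            (10, "Made-up row for the example", 31, "Jane Example", 0, "Nowhere, XX"),
        ],
    )
    return conn


def mask_users(conn: sqlite3.Connection, seed: int = 0) -> None:
    """Overwrite the sensitive columns with same-typed substitutes, row by row."""
    rng = random.Random(seed)
    ids = [row[0] for row in conn.execute("SELECT Id FROM Users ORDER BY Id")]
    for user_id in ids:
        conn.execute(
            "UPDATE Users SET AboutMe = ?, Age = ?, DisplayName = ?, Location = ? "
            "WHERE Id = ?",
            (
                rng.choice(FAKE_ABOUT),
                rng.randint(18, 80),
                rng.choice(FAKE_NAMES),
                rng.choice(FAKE_LOCATIONS),
                user_id,
            ),
        )
    conn.commit()


conn = build_sample_copy()
mask_users(conn)
print(conn.execute("SELECT * FROM Users WHERE Id = 9").fetchone())
```

Only the four sensitive columns appear in the UPDATE statement, so DownVotes and any other column used by application logic keep their production values.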

You now have anonymized data for functional testing (imagine that the algorithm was not expecting that someone could reach more than 999 UpVotes) and for performance and data testing (imagine a schema migration with data type changes: will all your production data transform successfully?).
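One way to picture such a data test is a query that lists the rows whose values fall outside the range the code or the new column type assumes. The sketch below uses a hypothetical helper, `values_outside_range`, and made-up UpVotes values. The 0 to 999 limit mirrors the assumption mentioned above.

```python
import sqlite3


def values_outside_range(conn, table, column, low, high):
    """Return (Id, value) pairs whose value falls outside [low, high]."""
    # Table and column names are interpolated, so they must come from trusted
    # code, never from user input. The bounds are passed as bound parameters.
    query = (
        f"SELECT Id, {column} FROM {table} "
        f"WHERE {column} NOT BETWEEN ? AND ? ORDER BY Id"
    )
    return conn.execute(query, (low, high)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Users (Id INTEGER PRIMARY KEY, UpVotes INTEGER)")
# Made-up values for illustration; they are not taken from the real data dump.
conn.executemany("INSERT INTO Users VALUES (?, ?)", [(1, 12), (2, 2443), (3, 999)])

offenders = values_outside_range(conn, "Users", "UpVotes", 0, 999)
print(offenders)  # [(2, 2443)]
```

Run against a masked copy of production, a check like this surfaces the rows that a curated test dataset might never contain.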

In the example above, you still want to have a set of functional tests in place for CRUD operations on the user data fields AboutMe, DisplayName, etc., but these are better suited for a curated test dataset where you aim to cover all possible classes of inputs, including multiple alphabets, special characters, and so on.

You can now safely use the resulting obfuscated table for testing or other purposes, in compliance with data protection regulations.

By automating and running this procedure on a regular basis, perhaps by using PowerShell scripts to clone, obfuscate, and distribute the masked database to development and testing environments daily, you can integrate data masking into your development workflow and detect data-related issues in updates to your software. And with adequate logging in place, you'll benefit from traceability information that you might need in order to demonstrate compliance.
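The clone, obfuscate, and distribute steps might look roughly like the following. This is a Python sketch rather than PowerShell, with SQLite files standing in for the real databases. All file names, directory names, and function names are assumptions for the example, and the masking rule is deliberately simplistic.

```python
import logging
import shutil
import sqlite3
import tempfile
from pathlib import Path

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("masking_pipeline")


def clone(source: Path, work_dir: Path) -> Path:
    """Copy the source database so the original is never modified."""
    copy_path = work_dir / "masked.db"
    shutil.copyfile(source, copy_path)
    log.info("cloned %s to %s", source, copy_path)
    return copy_path


def obfuscate(db_path: Path) -> int:
    """Overwrite sensitive columns in the copy; returns the number of rows changed."""
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute(
            "UPDATE Users SET DisplayName = 'user_' || Id, Location = 'Unknown'"
        )
        conn.commit()
        changed = cursor.rowcount
    finally:
        conn.close()
    log.info("obfuscated %d rows in %s", changed, db_path)
    return changed


def distribute(db_path: Path, targets) -> None:
    """Copy the masked database into each target environment directory."""
    for target in targets:
        target.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(db_path, target / db_path.name)
        log.info("distributed %s to %s", db_path.name, target)


def run_pipeline(source: Path, work_dir: Path, targets) -> int:
    work_dir.mkdir(parents=True, exist_ok=True)
    masked = clone(source, work_dir)
    changed = obfuscate(masked)
    distribute(masked, targets)
    return changed


def demo():
    """Build a throwaway source database and run the pipeline once."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        source = root / "production_copy.db"
        conn = sqlite3.connect(source)
        conn.execute(
            "CREATE TABLE Users (Id INTEGER PRIMARY KEY, DisplayName TEXT, Location TEXT)"
        )
        conn.execute("INSERT INTO Users VALUES (9, 'Kevin Dente', 'Oakland, CA')")
        conn.commit()
        conn.close()
        changed = run_pipeline(source, root / "work", [root / "dev", root / "test"])
        check = sqlite3.connect(root / "dev" / "masked.db")
        row = check.execute(
            "SELECT DisplayName, Location FROM Users WHERE Id = 9"
        ).fetchone()
        check.close()
        return changed, row
```

Calling `demo()` runs one full cycle in a temporary directory. In practice a scheduler would invoke `run_pipeline` daily, and the log lines would be retained as the traceability record.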

You can also lay the foundation for new data compliance procedures, for instance with regard to data usage consent. To ensure compliance, you can expect heavy investment in testing a myriad of combinations of data usage, especially when aggregated results need to be calculated for a set of users. Having a fresh version of the production database, with real consent configurations available in the continuous delivery pipeline (with the data itself obfuscated), will greatly facilitate the difficult task of managing fine-grained data usage consents.

Data masking: A structured approach to compliance

Compliance challenges such as the GDPR might seem daunting at first, but a structured approach and sound data management techniques can help greatly.

  • Start by researching which regulations apply to your business, based on location and industry—and look out for potential conflicts between them.
  • Then map the implications of those regulations in terms of checks and procedures you need to have in place before any changes hit production.

Adapt your software delivery process to automate those checks where possible. Finally, leverage the power of data masking to make testing with non-personally identifiable production data a permanent part of your deployment pipeline.
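One automated check of this kind could compare the masked copy with the original and fail the pipeline if any sensitive field came through unchanged. The sketch below assumes rows are available as dictionaries keyed by Id. The function name `find_unmasked` and the sample data are illustrative only.

```python
def find_unmasked(original, masked, sensitive_fields):
    """Return (row_id, field) pairs where a sensitive value survived masking."""
    leaks = []
    for row_id, original_row in original.items():
        masked_row = masked.get(row_id, {})
        for field in sensitive_fields:
            value = original_row.get(field)
            # A missing or NULL original value cannot leak anything.
            if value is not None and masked_row.get(field) == value:
                leaks.append((row_id, field))
    return leaks


original = {9: {"DisplayName": "Kevin Dente", "Location": "Oakland, CA", "DownVotes": 4}}
# Location was deliberately left unmasked here to show a failing case.
masked = {9: {"DisplayName": "Crystal Dixon", "Location": "Oakland, CA", "DownVotes": 4}}

leaks = find_unmasked(original, masked, ["DisplayName", "Location"])
print(leaks)  # [(9, 'Location')]
```

A pipeline step could then stop the deployment whenever the returned list is not empty. An equality check like this does not prove the data is anonymous, since a random substitute can coincide with the original or a combination of fields can still identify someone. It only catches the case where masking was skipped.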

That's data masking in a nutshell. Post your questions and comments below, and I'll do my best to answer them.
