Data Masking is a well established approach to protecting sensitive data in a database yet allowing the data to still be usable. There are a number of reasons why organisations need to employ data masking, with two of the most common being:
- To obtain a recent copy of data from the production system for use in a test environment, thus enabling tests to be conducted on realistic data which better represents the business, but with the most sensitive parts redacted. Invariably, there is an increased level of reliability in the results, when tests are applied on real data, instead of using synthetically generated data.
- To provide an API for consumers to access live production data, but where the consumer’s role or security level means they are not permitted to see values of all the fields in each record. Instead the subset of protected fields are represented by obfuscated values that still carry meaning (e.g the fact that a field existed or that a date field’s value has an approximate rather than random date).
In this blog post I will show how MongoDB’s powerful aggregation pipelines can be used to efficiently mask data belonging to a MongoDB collection, with various examples of obfuscating/redacting fields. The power of MongoDB’s aggregation pipelines doesn’t necessarily come from any inherent ease of use, compared to say SQL. Indeed its learning curve is not insignificant (although it is far easier to learn and use than say building raw map-reduce code to run against a large data set in an Hadoop cluster, for example). The real power of aggregation pipelines comes from the fact that once a pipeline is defined, a single command can be issued to a MongoDB cluster to then be applied to massive data sets. The cluster might contain a database of billions or more of records, and the aggregation pipeline will be automatically optimised and executed, including transparently processing subsets of the data in parallel across each shard, to reduce the completion time.
To accompany this blog post, I have provided a GitHub project which shows the Mongo Shell commands used for defining the example aggregation pipeline and masking actions that I will outline here. Please keep referring to the project’s README as you read through this blog post, at:
The approach that I will describe provides examples of applying irreversible and non-deterministic data masking actions on fields. That is to say, as a result of applying the outlined data masking techniques, it would be extremely difficult, and in most cases impossible, for a user to reverse engineer or derive the original values of the masked fields.
[EDIT: 27-Feb-2021: For reversible data masking, see the part 2 blog post on this topic]
Sample Payments Data & Data Masking Examples
For these examples I will simulate a ‘payments system’ database containing a collection of ‘card payments’ records. Such a database might be used by a bank, a payment provider, a retailer or an eCommerce vendor, for example, to accumulate payments history. The example data structures here are intentionally over-simplified for clarity. In the screenshot below you can see an example of 2 sample payments records, in their original state, shown on the left hand side. On the right hand side of the screenshot you can see the result of applying the data masking transformation actions from the GitHub project, and which I will describe in more detail further below.
For the example project, 10 different fields have data masking actions applied. Below is a summary of each masking action, the field it is applied to and the type of change applied. You can view and cross reference these to the 10 corresponding aggregation pipeline operations provided in the companion GitHub project:
- Replace the card’s security code with a different random set of 3 digits (e.g. 133 → 472)
- For the card number, obfuscate the first 12 digits (e.g. 1234567890123456 → XXXXXXXXXXXX3456)
- For the card’s listed owner name, obfuscate the first part of the name, resulting in only the last name being retained (e.g. 'Mrs. Jane A. Doe' → 'Mx. Xxx Doe')
- For the payment transaction’s recorded date-time, obfuscate the value by adding or subtracting a random amount of time to the current value, by up to one hour maximum (e.g. 00:23:45 on 01-Sep-2019 → 23:59:33 on 31-Aug-2019)
- Replace the payment settlement's recorded date-time by taking an arbitrary fixed time (e.g. 29-Apr-2020) and adding a random amount of time to that, up to one year maximum (e.g. 07:48:55 on 15-Dec-2018 → 21:07:34 on 08-Jun-2020)
- Replace the payment card expiry date with the current date (e.g. 10-Feb-2021) + a random amount of days of up to one year maximum (e.g. 31-Mar-2020 → 31-Nov-2021)
- For the transaction’s payment amount, obfuscate the value by adding or subtracting a random percent to its current value, by up to 10% maximum (e.g. 49.99 → 45.51)
- Replace the payment’s ‘reported’ status boolean value with a new randomly generated true or false value where there is a 50:50 chance of the result being either value (e.g. false → true, false → false)
- Replace the payment transaction’s ID (which is composed of 16 hexadecimal digits), with a new 16 hexadecimal digit value based on an ‘MD5 hash’ of the original ID (note, do not regard this as 'cryptographically safe')
- For the extra customer info sub-document composed of 3 fields, only retain this sub-document and its fields if the value of its ‘category’ field is not equal to the text ‘SENSITIVE’ (i.e. redact out a sub-section of the document where the customer is marked as ‘sensitive’)
Exposing Masked Data To Consumers
Once this pipeline has been built, it can be used in one of four ways in MongoDB depending on your specific needs (again the companion GitHub project provides more specific details on how each of these four ways are achieved in MongoDB):
- DATA MASKED AGGREGATION ON DEMAND. Enable the aggregation pipeline to be executed on demand (by calling db.aggregate(pipeline) from a trusted mid-tier application, where the application would have rights to see the non-obfuscated data too. As a result the trusted mid-tier would need to be relied on to return just the result of the ‘data masking aggregation’ to consumers, and to never expose the underlying unmasked data, in the database, in any other way.
- DATA MASKED READ-ONLY VIEW. Create a view (e.g. payments_redacted_view) based on the aggregation pipeline and then, using MongoDB’s Role Based Access Control (RBAC) capabilities, only allow consumers to have access to the view, with no permissions to access the underlying source collection, thus ensuring consumers can only access the ‘data masked’ results. Consuming applications can even use a ‘find’ operation, with a ‘filter’ and ‘projection’, when querying the view, to reduce down further the fields and records they want to see.
- DATA MASKED COPY OF ORIGINAL DATA. Execute the aggregation pipeline using an additional $merge pipeline stage, to produce a brand new collection (e.g. payments_redacted), which contains only the data masked version of the original collection. The original collection can then either be locked down as not visible to the consumer, using RBAC, or can even just be deleted, if it is no longer required. The latter option is often applicable if you are producing a test environment with a masked version of recent production data, where the original raw data is no longer needed.
- DATA MASKED OVERWRITTEN ORIGINAL DATA. Execute the aggregation pipeline using an additional $merge pipeline stage, to overwrite records in the existing collection with modified versions of the corresponding records. The resulting updated collection will now only contain the masked version of the data. Again, this may be applicable in the case where you are producing a test environment with a masked version of recent production data, where the original collection is no longer needed.
In this post and the companion GitHub project, I’ve shown how common irreversible obfuscation patterns can be effectively defined and then applied to mask sensitive data in a MongoDB collection. The example shows only a couple of records being redacted. However, the real power comes from being able to apply the same aggregation pipeline, unchanged, to a collection containing a massive data set, often running across multiple shards, which will automatically be parallelised, to reduce turnaround times for provisioning masked data sets.
Song for today: The Black Crow by Songs: Ohia (Jason Molina)