Saturday, May 15, 2021

New MongoDB Aggregations book is out

My book, Practical MongoDB Aggregations, was published this week.

The book is available electronically for free for anyone to use at: https://www.practical-mongodb-aggregations.com


This book is intended for developers, architects, data analysts, data engineers, and data scientists. It aims to improve your productivity and effectiveness when building aggregation pipelines, and to help you understand how to optimise your pipelines.

The book is split into two key parts:

  1. A set of tips and principles to help you get the most out of aggregations.
  2. A bunch of example aggregation pipelines for solving common data manipulation challenges, which you can easily copy and try for yourself.

I hope readers get some good value from it!


Song for today: Earthmover by Have a Nice Life


Saturday, February 27, 2021

MongoDB Reversible Data Masking Example Pattern

Introduction

In a previous blog post I explored how to apply one-way non-reversible data masking on a data-set in MongoDB. Here I will explore why, in some cases, there can be a need for reversible data masking, where, with appropriate privileges, each original record's data can be deduced from the masked version of the record. I will also explore how this can be implemented in MongoDB, using something I call the idempotent masked-id generator pattern.

To accompany this blog post, I have provided a GitHub project which shows the Mongo Shell commands used for implementing the reversible data masking pattern outlined here. Please keep referring to the project’s README as you read through this blog post, at:


Why Reversible Data Masks?

In some situations a department in an organisation that masters and owns a set of data may need to provide copies of all or a subset of the data to a different department, or even a different partner organisation. If the data-set contains sensitive data, like personally identifiable information (PII) for example, the 'data-owning' organisation will first need to perform data masking on the values of the sensitive fields of each record in the data-set. This redaction of fields will often be one-way (irreversible), preventing the other department or partner organisation from being able to reverse engineer the content of the masked data-set to retrieve the original sensitive fields.

Now consider the example where the main data-owning organisation is collecting results of 'tests'. The results could be related to medical tests, where the data-owning organisation is a hospital, for example. Or the results could be related to academic tests, where the data-owning organisation is a school. Let's assume that the main organisation needs to provide data to a partner organisation for specialist analysis, to identify individuals with concerning test result patterns. However, there needs to be assurance that each individual's sensitive details and real identity are not shared with the partner organisation.

How can the partner organisation report back to the main organisation flagging individuals for concern and recommended follow-up without actually having access to those real identities? 

One solution is for the redacted data-set provided to the partner organisation to carry an obfuscated but reversible unique identity field as a substitute for the real identity, in addition to containing other irreversibly redacted sensitive fields. With this solution, it would not be possible for the partner organisation to reverse engineer the substituted unique identity to a real social security number, national insurance number or national student identifier, for example. However, it would be possible for the main data-owning organisation to convert the substituted unique ID back to the real identity, if subsequently required.

The diagram below outlines the relationship between the data-owning organisation, the partner organisations and the masked data-sets shared between them.

A partner organisation can flag an individual of concern back to the main organisation, without ever being able to deduce the real life person who the substituted unique ID maps to.


How To Achieve Reversibility With MongoDB?

To enable a substituted unique ID to be correlated back to the original real ID of a person, the main data-owning organisation needs to be able to generate the substitute unique IDs in the first place, and maintain a list of mappings between the two, as shown in the diagram below.

The stored mappings list needs to be protected in such a way that only specific staff in the data-owning organisation, with specific approved roles, have access to it. This prevents the rest of the organisation and partner organisations from accessing the mappings to be able to reverse engineer masked identifiers back to real identities.

Essentially, the overall process of masking and 'unmasking' data with MongoDB, as shown in the GitHub project accompanying this blog post, is composed of three different key aggregation pipelines:

  1. Generation of Unique ID Mappings. A pipeline for the data-owning organisation to generate the new unique anonymised substitute IDs for each person appearing in a test result, into a new mappings collection, using the idempotent masked-id generator pattern.
  2. Creation of the Reversible Masked Data-Set. A pipeline for the data-owning organisation to generate a masked version of the test results, where each person's ID has been replaced with the substitute ID (an anonymous but reversible ID); additionally, some other fields will be filtered out (e.g. national id, last name) or obfuscated with partly randomised values (e.g. date of birth).
  3. Reverse Engineer Masked Data-Set Back To Real Identities. An optional pipeline, if/when required, for the data-owning organisation to take the potentially modified partial masked data-set back from the partner organisation and, using the mappings collection, reverse engineer the original identities and other sensitive fields.

The screenshot below captures an example of the outcome of steps 1 and 2 of the process outlined above.


Here, each person's ID has been replaced with a reversible substitute unique ID. Additionally, the date of birth field ('dob') has been obfuscated (shown with the red underline) and some other sensitive fields have been filtered out.

I will now explore how each of the three outlined process steps is achieved in MongoDB, in the following three sub-sections.


1. Generation of Unique ID Mappings

As per the companion GitHub project, the list of original ID to substitute ID mappings is stored in a MongoDB collection with very strict RBAC controls applied. An example record in this collection might look like the one shown in the screenshot below.


Here the collection is called masked_id_mappings, where each record's '_id' field contains a newly generated substitute ID, based on a generated universally unique identifier (UUID). The field 'original_id' contains the real identifier of the person or entity, in the same format as in the original data-set. For convenience, two date-related attributes are included in each record. The 'date_generated' field is generally useful for tracking when the mapping was created (e.g. for reporting), and the 'date_expired' field is associated with a time-to-live (TTL) index to enable the mapping to be automatically deleted by the database after a period of time (3 years out, in this example).
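As a concrete illustration, a mapping record with the fields described above might be built like this (the UUID string, identifier and purpose label are invented values, not ones from the project):

```javascript
// Hypothetical example of one record destined for the
// 'masked_id_mappings' collection (all values invented):
const THREE_YEARS_MS = 3 * 365 * 24 * 60 * 60 * 1000;

const mappingRecord = {
  _id: "9b2c41e0-7f3a-4c8e-bd15-2a6f0c9d8e11", // substitute ID (a generated UUID)
  original_id: "QQ123456C",                    // real identifier, kept in its original format
  data_purpose_id: "PARTNER-X-ANALYSIS",       // label for the consuming purpose
  date_generated: new Date(),                  // when the mapping was created
  date_expired: new Date(Date.now() + THREE_YEARS_MS), // target for the TTL index (~3 years out)
};
```

A TTL index on 'date_expired' (created with expireAfterSeconds: 0) would then let the database remove the mapping automatically once that date passes.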

The remaining field, 'data_purpose_id', is worthy of a little more detailed discussion. Let's say the same data-set needs to be provided to multiple 3rd parties, for different purposes. It makes sense to mask each copy of the data differently, with different unique IDs for the same original IDs. This helps prevent the risk of any potential future correlation of records between unrelated parties or consumers. Essentially, when a mapping record is created, in addition to providing the original ID, a data purpose 'label' must be provided. A unique substitute ID is generated for a given source identity, per data use/purpose. For one specific data consumer purpose, the same substituted unique ID will be re-used for the same recurring original ID. However, a different substituted unique ID will be generated and used for an original ID when the purpose and consumer requesting a masked data-set is different.

To populate the masked_id_mappings collection, an aggregation pipeline (called 'maskedIdGeneratorPipeline') is run against each source collection (e.g. against both the 'persons' collection and the 'tests' collection). This aggregation pipeline implements the idempotent masked-id generator pattern. Essentially, this pattern involves taking each source collection, picking out the original unique ID and then placing a record of this, with the newly generated UUID it is mapped to (plus the other metadata such as data_purpose_id, date_generated, date_expired), into the masked_id_mappings collection. The approach is idempotent in that the creation of each new mapping record is only fulfilled if a mapping doesn't already exist for the combination of the original unique ID and data purpose. When further collections in the data-set are run through the same aggregation pipeline, if some records from these other collections have the same original ID and data purpose as one of the records that already exists in the masked_id_mappings collection, a new record with a new UUID will not be inserted. This ensures that, per data purpose, the same original unique ID is always mapped to the same substitute UUID, regardless of how often it appears in various collections. This idempotent masked-id generator process is illustrated in the diagram below.



The same aggregation pipeline is run multiple times, once against each source collection which belongs to the source data-set. Even if the source data-set is ever added to in the future, the aggregation can be re-run against the same data-sets, over and over again, without any duplicates or other negative consequences, and with only the additions being acted upon. The pipeline is so generic that it can also be run against other previously unseen collections which have completely different shapes but where each contains an original unique ID in one of its fields.
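In aggregation terms, the idempotency can be achieved with a $merge stage that matches on the original ID plus the data purpose and keeps any existing mapping untouched. Below is a minimal sketch of what such a pipeline could look like, assuming the source collection's unique ID lives in a field called 'id' and using a $function stage to fabricate a random hex ID (the project's real pipeline generates proper UUIDs and may differ in detail):

```javascript
// Hedged sketch of the idempotent masked-id generator pattern
// ('id', the purpose label and the random-hex ID generation are
// illustrative assumptions, not the project's exact code).
const DATA_PURPOSE = "PARTNER-X-ANALYSIS";

const maskedIdGeneratorPipeline = [
  // 1. Keep only the original unique ID from each source record
  { $project: { _id: 0, original_id: "$id" } },

  // 2. Attach a freshly generated substitute ID plus mapping metadata
  { $set: {
      _id: { $function: {
          lang: "js", args: [],
          body: `function() {
            // pseudo-random 32-char hex value standing in for a real UUID
            let s = "";
            for (let i = 0; i < 32; i++) s += Math.floor(Math.random() * 16).toString(16);
            return s;
          }`,
      } },
      data_purpose_id: DATA_PURPOSE,
      date_generated: "$$NOW",
      date_expired: { $dateAdd: { startDate: "$$NOW", unit: "year", amount: 3 } },
  } },

  // 3. Idempotent insert: if a mapping already exists for this
  //    (original_id, data_purpose_id) pair, keep the existing one
  { $merge: {
      into: "masked_id_mappings",
      on: ["original_id", "data_purpose_id"], // needs a unique compound index on these fields
      whenMatched: "keepExisting",
      whenNotMatched: "insert",
  } },
];
// e.g. db.persons.aggregate(maskedIdGeneratorPipeline);
//      db.tests.aggregate(maskedIdGeneratorPipeline);
```

The 'whenMatched: "keepExisting"' option is what makes repeated runs safe: re-running the pipeline only inserts mappings for IDs not seen before.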


2. Creation of the Reversible Masked Data-Set

Once the mappings have been generated for a source data-set, it is time to actually generate a new masked set of records from the original data-set. Again, this is achieved by running an aggregation pipeline, once per different source collection in the source data-set. The diagram below illustrates how this process works. The aggregation pipeline takes the original ID fields from the source collection, then performs a lookup on the mappings collection (including the specific data purpose) to grab the previously generated substitute unique ID. The pipeline then replaces the original IDs with the substitute IDs, in the outputted data masked collection.


The remaining part of the aggregation pipeline is less generic and must contain rules specific to the source data-set it operates on. This latter part of the pipeline contains the data masking actions to apply to the sensitive fields of that particular data-set.
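Putting the generic lookup half and the data-set-specific half together, a masking pipeline for a hypothetical 'tests' collection might be sketched as follows (field names such as 'person_id', 'national_id' and 'dob' are assumptions for illustration; the project's actual pipeline will differ in detail):

```javascript
// Hedged sketch of step 2 for one source collection.
const DATA_PURPOSE = "PARTNER-X-ANALYSIS";

const maskTestsPipeline = [
  // 1. Generic part: find the substitute ID for this person + purpose
  { $lookup: {
      from: "masked_id_mappings",
      let: { personId: "$person_id" },
      pipeline: [
        { $match: { $expr: { $and: [
            { $eq: ["$original_id", "$$personId"] },
            { $eq: ["$data_purpose_id", DATA_PURPOSE] },
        ] } } },
      ],
      as: "mapping",
  } },

  // 2. Swap the real ID for the substitute ID
  { $set: { person_id: { $first: "$mapping._id" } } },

  // 3. Data-set-specific part: remove some sensitive fields outright...
  { $unset: ["mapping", "national_id", "last_name"] },
  // ...and jitter the date of birth by up to a year
  { $set: { dob: { $dateAdd: { startDate: "$dob", unit: "day",
      amount: { $floor: { $multiply: [{ $rand: {} }, 365] } } } } } },

  // 4. Output the masked records to a new collection for export
  { $merge: { into: "tests_masked" } },
];
```

Stages 1 and 2 are reusable across collections; only stages 3 onwards need changing per data-set.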

The generated masked data-set collections can then be exported ready to be shipped to the consuming business unit or 3rd party organisation, who can then import the masked data-set into their own local database infrastructure, ready to perform their own analysis on.


3. Reverse Engineer Masked Data-Set Back To Real Identities

In the example 'test results' scenario, the partner organisation may need to subsequently report back to the main organisation flagging individuals for concern and recommended follow-up. They can achieve this by providing the substituted identities back to the owning organisation, with the additional information outlining why the specific individuals have been flagged. The GitHub project accompanying this blog post shows an example of performing this reversal, where some of the 'tests' collection records in the masked data-set have been marked with the following new attribute by the 3rd party organisation:

'flag' : 'INTERVENTION-REQUIRED'

The GitHub project then shows how the masked and now flagged data-set, if passed back to the original data-owning organisation, can then have a 'reverse' aggregation pipeline executed on it by the original organisation. This 'reverse' aggregation pipeline looks up the mappings collection again, but this time to retrieve the original ID using the substitute unique IDs provided in the input (plus the data purpose). This results in a reverse engineered view of the data with real identities flagged, thus enabling the original data-owning organisation to schedule follow-ups with the identified people.
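A minimal sketch of such a 'reverse' pipeline is shown below (it assumes the masked records hold the substitute ID in a 'person_id' field; the project's actual pipeline may differ):

```javascript
// Hedged sketch of the reverse-engineering pipeline, run only by the
// data-owning organisation with access to 'masked_id_mappings'.
const unmaskFlaggedPipeline = [
  // 1. Only the records the partner flagged are of interest
  { $match: { flag: "INTERVENTION-REQUIRED" } },

  // 2. Look the substitute ID back up in the mappings collection
  { $lookup: {
      from: "masked_id_mappings",
      localField: "person_id",  // currently holds the substitute ID
      foreignField: "_id",
      as: "mapping",
  } },

  // 3. Restore the real identifier and drop the lookup scaffolding
  { $set: { person_id: { $first: "$mapping.original_id" } } },
  { $unset: "mapping" },
];
```

Because each substitute ID is unique in the mappings collection, a simple equality lookup on '_id' is enough to recover the original identity.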


Summary

In this blog post I have explored why it is sometimes necessary to apply data masking to a data-set that will subsequently be distributed to another business unit or organisation, but where the original identities can be safely reverse engineered. This reversal is only possible with strong data access controls and privileges in place, restricted to specific users in the original data-owning organisation, if the need arises. As a result, no sensitive data is ever exposed to the less trusted parties.


Song for today: Lullaby by Low

Wednesday, February 10, 2021

MongoDB Data Masking Examples

Introduction

Data Masking is a well-established approach to protecting sensitive data in a database while still allowing the data to be usable. There are a number of reasons why organisations need to employ data masking, with two of the most common being:

  1. To obtain a recent copy of data from the production system for use in a test environment, thus enabling tests to be conducted on realistic data which better represents the business, but with the most sensitive parts redacted. Invariably, there is an increased level of reliability in the results when tests are applied on real data instead of synthetically generated data.
  2. To provide an API for consumers to access live production data, but where the consumer’s role or security level means they are not permitted to see the values of all the fields in each record. Instead, the subset of protected fields are represented by obfuscated values that still carry meaning (e.g. the fact that a field existed, or that a date field’s value has an approximate rather than random date).

In this blog post I will show how MongoDB’s powerful aggregation pipelines can be used to efficiently mask data belonging to a MongoDB collection, with various examples of obfuscating/redacting fields. The power of MongoDB’s aggregation pipelines doesn’t necessarily come from any inherent ease of use, compared to, say, SQL. Indeed its learning curve is not insignificant (although it is far easier to learn and use than, say, building raw map-reduce code to run against a large data set in a Hadoop cluster). The real power of aggregation pipelines comes from the fact that once a pipeline is defined, a single command can be issued to a MongoDB cluster to apply it to massive data sets. The cluster might contain a database of billions of records or more, and the aggregation pipeline will be automatically optimised and executed, including transparently processing subsets of the data in parallel across each shard, to reduce the completion time.

To accompany this blog post, I have provided a GitHub project which shows the Mongo Shell commands used for defining the example aggregation pipeline and masking actions that I will outline here. Please keep referring to the project’s README as you read through this blog post, at:

The approach that I will describe provides examples of applying irreversible and non-deterministic data masking actions on fields. That is to say, as a result of applying the outlined data masking techniques, it would be extremely difficult, and in most cases impossible, for a user to reverse engineer or derive the original values of the masked fields.

[EDIT: 27-Feb-2021: For reversible data masking, see the part 2 blog post on this topic]


Sample Payments Data & Data Masking Examples

For these examples I will simulate a ‘payments system’ database containing a collection of ‘card payments’ records. Such a database might be used by a bank, a payment provider, a retailer or an eCommerce vendor, for example, to accumulate payments history. The example data structures here are intentionally over-simplified for clarity. In the screenshot below you can see an example of 2 sample payments records, in their original state, shown on the left hand side. On the right hand side of the screenshot you can see the result of applying the data masking transformation actions from the GitHub project, and which I will describe in more detail further below.


For the example project, 10 different fields have data masking actions applied. Below is a summary of each masking action, the field it is applied to and the type of change applied. You can view and cross reference these to the 10 corresponding aggregation pipeline operations provided in the companion GitHub project:

  1. Replace the card’s security code with a different random set of 3 digits  (e.g. 133 → 472)
  2. For the card number, obfuscate the first 12 digits  (e.g. 1234567890123456 → XXXXXXXXXXXX3456)
  3. For the card’s listed owner name, obfuscate the first part of the name, resulting in only the last name being retained  (e.g. 'Mrs. Jane A. Doe' → 'Mx. Xxx Doe')
  4. For the payment transaction’s recorded date-time, obfuscate the value by adding or subtracting a random amount of time, up to one hour maximum  (e.g. 00:23:45 on 01-Sep-2019 → 23:59:33 on 31-Aug-2019)
  5. Replace the payment settlement's recorded date-time by taking an arbitrary fixed time (e.g. 29-Apr-2020) and adding a random amount of time to that, up to one year maximum  (e.g. 07:48:55 on 15-Dec-2018 → 21:07:34 on 08-Jun-2020)
  6. Replace the payment card expiry date with the current date (e.g. 10-Feb-2021) + a random number of days of up to one year maximum  (e.g. 31-Mar-2020 → 30-Nov-2021)
  7. For the transaction’s payment amount, obfuscate the value by adding or subtracting a random percentage of its current value, up to 10% maximum  (e.g. 49.99 → 45.51)
  8. Replace the payment’s ‘reported’ status boolean value with a new randomly generated true or false value where there is a 50:50 chance of the result being either value  (e.g. false → true, false → false)
  9. Replace the payment transaction’s ID (which is composed of 16 hexadecimal digits), with a new 16 hexadecimal digit value based on an ‘MD5 hash’ of the original ID  (note, do not regard this as 'cryptographically safe')
  10. For the extra customer info sub-document composed of 3 fields, only retain this sub-document and its fields if the value of its ‘category’ field is not equal to the text ‘SENSITIVE’  (i.e. redact out a sub-section of the document where the customer is marked as ‘sensitive’)
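To give a flavour of how such actions look in pipeline form, here is a hedged sketch of actions 1, 2 and 7 from the list above as a single $set stage (field names like 'card_sec_code' are assumptions; the project's stages may be structured differently):

```javascript
// Hedged sketch of a few masking actions as aggregation expressions.
const maskPaymentsStages = [
  { $set: {
      // (1) new random 3-digit security code (100-999)
      card_sec_code: { $toString:
        { $add: [100, { $floor: { $multiply: [{ $rand: {} }, 900] } }] } },

      // (2) obfuscate the first 12 digits of the 16-digit card number
      card_number: { $concat: ["XXXXXXXXXXXX",
        { $substrCP: ["$card_number", 12, 4] }] },

      // (7) jitter the amount by up to +/-10% of its current value
      amount: { $multiply: ["$amount",
        { $add: [0.9, { $multiply: [{ $rand: {} }, 0.2] }] }] },
  } },
];
```

Because $rand produces a fresh value per document, every record gets its own independent obfuscation, which is what makes the masking non-deterministic.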

Exposing Masked Data To Consumers

Once this pipeline has been built, it can be used in one of four ways in MongoDB depending on your specific needs (again the companion GitHub project provides more specific details on how each of these four ways are achieved in MongoDB):

  1. DATA MASKED AGGREGATION ON DEMAND. Enable the aggregation pipeline to be executed on demand (by calling db.aggregate(pipeline)) from a trusted mid-tier application, where the application would have rights to see the non-obfuscated data too. As a result, the trusted mid-tier would need to be relied on to return just the result of the ‘data masking aggregation’ to consumers, and to never expose the underlying unmasked data in the database in any other way.
  2. DATA MASKED READ-ONLY VIEW. Create a view (e.g. payments_redacted_view) based on the aggregation pipeline and then, using MongoDB’s Role Based Access Control (RBAC) capabilities, only allow consumers to have access to the view, with no permissions to access the underlying source collection, thus ensuring consumers can only access the ‘data masked’ results. Consuming applications can even use a ‘find’ operation, with a ‘filter’ and ‘projection’, when querying the view, to reduce down further the fields and records they want to see.
  3. DATA MASKED COPY OF ORIGINAL DATA. Execute the aggregation pipeline using an additional $merge pipeline stage, to produce a brand new collection (e.g. payments_redacted), which contains only the data masked version of the original collection. The original collection can then either be locked down as not visible to the consumer, using RBAC, or can even just be deleted, if it is no longer required. The latter option is often applicable if you are producing a test environment with a masked version of recent production data, where the original raw data is no longer needed.
  4. DATA MASKED OVERWRITTEN ORIGINAL DATA. Execute the aggregation pipeline using an additional $merge pipeline stage, to overwrite records in the existing collection with modified versions of the corresponding records. The resulting updated collection will now only contain the masked version of the data. Again, this may be applicable in the case where you are producing a test environment with a masked version of recent production data, where the original collection is no longer needed.
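For example, ways 2 and 3 above might be wired up roughly like this, assuming the masking pipeline has been built into a variable called 'maskPipeline' (the db.* calls are shown commented out since they need a live deployment):

```javascript
// Hedged sketch of exposing masked data via a view (way 2) or a
// materialised masked copy (way 3); the masking stages are elided.
const maskPipeline = [/* ...the data masking stages... */];

// Way 2: consumers are granted RBAC access to the view only
// db.createView("payments_redacted_view", "payments", maskPipeline);

// Way 3: write a masked copy out to a brand new collection
const maskedCopyPipeline = maskPipeline.concat([
  { $merge: { into: "payments_redacted" } },
]);
// db.payments.aggregate(maskedCopyPipeline);
```

Way 4 is the same shape as way 3, except the $merge stage targets the source collection itself, so the masked records overwrite the originals.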

Summary

In this post and the companion GitHub project, I’ve shown how common irreversible obfuscation patterns can be effectively defined and then applied to mask sensitive data in a MongoDB collection. The example shows only a couple of records being redacted. However, the real power comes from being able to apply the same aggregation pipeline, unchanged, to a collection containing a massive data set, often running across multiple shards, which will automatically be parallelised, to reduce turnaround times for provisioning masked data sets.


Song for today: The Black Crow by Songs: Ohia (Jason Molina)