Saturday, February 27, 2021

MongoDB Reversible Data Masking Example Pattern


In a previous blog post I explored how to apply one-way non-reversible data masking on a data-set in MongoDB. Here I will explore why, in some cases, there can be a need for reversible data masking, where, with appropriate privileges, each original record's data can be deduced from the masked version of the record. I will also explore how this can be implemented in MongoDB, using something I call the idempotent masked-id generator pattern.

To accompany this blog post, I have provided a GitHub project which shows the Mongo Shell commands used for implementing the reversible data masking pattern outlined here. Please keep referring to the project’s README as you read through this blog post, at:

Why Reversible Data Masks?

In some situations, a department in an organisation that masters and owns a set of data may need to provide copies of all, or a subset, of the data to a different department or even a different partner organisation. If the data-set contains sensitive data, like personally identifiable information (PII) for example, the 'data-owning' organisation will first need to perform data masking on the values of the sensitive fields of each record in the data-set. This redaction of fields will often be one-way (irreversible), preventing the other department or partner organisation from reverse engineering the content of the masked data-set to retrieve the original sensitive fields.

Now consider the example where the main data-owning organisation is collecting results of 'tests'. The results could be related to medical tests where the data-owning organisation is a hospital, for example. Or the results could be related to academic tests where the data-owning organisation is a school, for example. Let's assume that the main organisation needs to provide data to a partner organisation for specialist analysis, to identify individuals with concerning test result patterns. However, there needs to be assurance that each individual's sensitive details and real identity are not shared with the partner organisation.

How can the partner organisation report back to the main organisation flagging individuals for concern and recommended follow-up without actually having access to those real identities? 

One solution is for the redacted data-set that is provided to the partner organisation, to carry an obfuscated but reversible unique identity field as a substitute for the real identity, in addition to containing other irreversibly redacted sensitive fields. With this solution, it would not be possible for the partner organisation to reverse engineer the substituted unique identity, to a real social security number, national insurance number or national student identifier, for example. However, it would be possible for the main data-owning organisation to convert the substituted unique id back to the real identity, if subsequently required. 

The diagram below outlines the relationship between the data-owning organisation, the partner organisations and the data masked data-sets shared between them. 

A partner organisation can flag an individual of concern back to the main organisation, without ever being able to deduce the real life person who the substituted unique ID maps to.

How To Achieve Reversibility With MongoDB?

To enable a substituted unique ID to be correlated back to the original real ID of a person, the main data-owning organisation needs to be able to generate the substitute unique IDs in the first place, and maintain a list of mappings between the two, as shown in the diagram below.

The stored mappings list needs to be protected in such a way that only specific staff in the data-owning organisation, with specific approved roles, have access to it. This prevents the rest of the organisation and partner organisations from accessing the mappings to be able to reverse engineer masked identities back to real identities.
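As a rough illustration of what such a restriction could look like, the sketch below builds a custom role definition that grants access to the mappings collection and nothing else. The database name 'testdata' and the role name 'maskedIdMappingsAdmin' are hypothetical and are not taken from the GitHub project; the exact set of actions needed will depend on how the mappings are generated and read.

```javascript
// Hypothetical names: the "testdata" database and the "maskedIdMappingsAdmin"
// role are illustrative only and do not come from the GitHub project.
function buildMappingsRole(dbName) {
  return {
    role: "maskedIdMappingsAdmin",
    privileges: [
      {
        // Scope the privilege to the mappings collection only
        resource: { db: dbName, collection: "masked_id_mappings" },
        // Assumed to be enough to read mappings and to write new ones
        actions: ["find", "insert", "update"],
      },
    ],
    roles: [], // inherit from no other role, so access stays narrow
  };
}

const mappingsRole = buildMappingsRole("testdata");

// In the Mongo Shell, connected as a user administrator, the role could
// then be created with:
//   db.getSiblingDB("testdata").createRole(mappingsRole);
```

Only the few staff who are allowed to reverse the masking would be granted this role; all other users would have no privileges on the masked_id_mappings collection.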

Essentially, the overall process of masking and 'unmasking' data with MongoDB, as shown in the GitHub project accompanying this blog post, is composed of three different key aggregation pipelines:

  1. Generation of Unique ID Mappings. A pipeline for the data-owning organisation to generate the new unique anonymised substitute IDs for each person appearing in a test result, into a new mappings collection using the idempotent masked-id generator pattern
  2. Creation of the Reversible Masked Data-Set. A pipeline for the data-owning organisation to generate a masked version of the test results, where each person's id has been replaced with the substitute ID (an anonymous but reversible ID); additionally some other fields will be filtered out (e.g. national id, last name) or obfuscated with partly randomised values (e.g. date of birth).
  3. Reverse Engineer Masked Data-Set Back To Real Identities. An optional pipeline, if/when required, for the data-owning organisation to be able to take the potentially modified partial masked data-set back from the partner organisation, and, using the mappings collection, reverse engineer the original identities and other sensitive fields.

The screenshot below captures an example of the outcome of steps 1 and 2 of the process outlined above.

Here, each person's ID has been replaced with a reversible substitute unique ID. Additionally, the date of birth field ('dob') has been obfuscated (shown with the red underline) and some other sensitive fields have been filtered out.

I will now explore how each of the three outlined process steps is achieved in MongoDB, in the following three sub-sections.

1. Generation of Unique ID Mappings

As per the companion GitHub project, the list of original ID to substitute ID mappings is stored in a MongoDB collection with very strict RBAC controls applied. An example record in this collection might look like the one shown in the screenshot below.

Here the collection is called masked_id_mappings, where each record's field '_id' contains a newly generated substitute ID, based on a generated universally unique identifier (UUID). The field 'original_id' contains the real identifier of the person or entity, in the same format it had in the original data-set. For convenience, two date related attributes are included in each record. The 'date_generated' field is generally useful for tracking when the mapping was created (e.g. for reporting), and the 'date_expired' field is associated with a time-to-live (TTL) index to enable the mapping to be automatically deleted by the database after a period of time (3 years out, in this example).
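To make the record shape concrete, here is a minimal sketch of one mapping record together with the two indexes the description implies. The field names come from the text above; the helper name buildMapping, the placeholder ID values and the 'partnerAnalysis' purpose label are hypothetical. The unique compound index is my assumption about how one mapping per original ID and data purpose could be enforced.

```javascript
// Index specifications for the masked_id_mappings collection.
const mappingIndexes = [
  // Assumption: at most one mapping per (original ID, data purpose) pair
  { keys: { original_id: 1, data_purpose_id: 1 }, options: { unique: true } },
  // TTL index: with expireAfterSeconds set to 0, a record becomes eligible
  // for deletion once the time held in date_expired has passed
  { keys: { date_expired: 1 }, options: { expireAfterSeconds: 0 } },
];

// Build an example mapping record that expires 3 years after generation.
function buildMapping(substituteId, originalId, dataPurposeId, now) {
  const expiry = new Date(now.getTime());
  expiry.setUTCFullYear(expiry.getUTCFullYear() + 3);
  return {
    _id: substituteId,            // the generated substitute ID (a UUID)
    original_id: originalId,      // the real ID, in its original format
    data_purpose_id: dataPurposeId,
    date_generated: now,
    date_expired: expiry,
  };
}

// Placeholder values, for illustration only
const exampleMapping = buildMapping(
  "<substitute-uuid>", "<original-id>", "partnerAnalysis",
  new Date("2021-02-27T00:00:00Z")
);

// In the Mongo Shell the indexes could be created with:
//   mappingIndexes.forEach(ix =>
//     db.masked_id_mappings.createIndex(ix.keys, ix.options));
```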

The remaining field, 'data_purpose_id', is worthy of a little more detailed discussion. Let's say the same data-set needs to be provided to multiple 3rd parties, for different purposes. It makes sense to mask each copy of the data differently, with different unique IDs for the same original IDs. This can help reduce the risk of any potential future correlation of records between unrelated parties or consumers. Essentially, when a mapping record is created, in addition to providing the original ID, a data purpose 'label' must be provided. A unique substitute ID is generated for a given source identity, per data use/purpose. For one specific data consumer purpose, the same substituted unique ID will be re-used for the same reoccurring original ID. However, a different substituted unique ID will be generated and used for an original ID when the purpose and consumer requesting a masked data-set is different.

To populate the masked_id_mappings collection, an aggregation pipeline (called 'maskedIdGeneratorPipeline') is run against each source collection (e.g. against both the 'persons' collection and the 'tests' collection). This aggregation pipeline implements the idempotent masked-id generator pattern. Essentially, this pattern involves taking each source collection, picking out the original unique id and then placing a record of this, with the newly generated UUID it is mapped to (plus the other metadata such as data_purpose_id, date_generated, date_expired), into the masked_id_mappings collection. The approach is idempotent in that a new mapping record is only created if a mapping doesn't already exist for the combination of the original unique id and data purpose. When further collections in the data-set are run through the same aggregation pipeline, if some records from these other collections have the same original id and data purpose as one of the records that already exists in the masked_id_mappings collection, a new record with a new UUID will not be inserted. This ensures that, per data purpose, the same original unique id is always mapped to the same substitute UUID, regardless of how often it appears in various collections. This idempotent masked-id generator process is illustrated in the diagram below.

The same aggregation pipeline is run multiple times, once against each source collection which belongs to the source data-set. Even if the source data-set is added to in the future, the aggregation can be re-run against the same data-sets, over and over again, without any duplicates or other negative consequences, and with only the additions being acted upon. The pipeline is so generic that it can also be run against other previously unseen collections which have completely different shapes but where each contains an original unique ID in one of its fields.
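The pattern described above can be sketched roughly as follows. This is my own minimal approximation and not the pipeline from the GitHub project: the function name, the 'person_id' field and the 'partnerAnalysis' purpose label are hypothetical. It assumes MongoDB 4.4 or later with server-side JavaScript enabled (for $function), that UUID() is callable from server-side JavaScript, and that a unique index exists on original_id plus data_purpose_id, which $merge requires for its 'on' fields.

```javascript
const THREE_YEARS_MS = 3 * 365 * 24 * 60 * 60 * 1000;

// Build a generator pipeline for any source collection that holds the
// original ID in the field named idField (hypothetical helper).
function buildMaskedIdGeneratorPipeline(idField, dataPurposeId) {
  return [
    // Ignore records where the original ID is missing or null
    { $match: { [idField]: { $ne: null } } },
    // Reduce the source collection to one record per distinct original ID
    { $group: { _id: "$" + idField } },
    // Shape a candidate mapping record with a freshly generated UUID
    { $set: {
        original_id: "$_id",
        data_purpose_id: { $literal: dataPurposeId },
        // Assumption: UUID() is available inside server-side JavaScript
        _id: { $function: { lang: "js", args: [], body: "function() { return UUID(); }" } },
        date_generated: "$$NOW",
        date_expired: { $add: ["$$NOW", THREE_YEARS_MS] },
    } },
    // The idempotent step: insert the candidate only if no mapping exists
    // yet for this original ID and data purpose, otherwise keep the old one
    { $merge: {
        into: "masked_id_mappings",
        on: ["original_id", "data_purpose_id"],
        whenMatched: "keepExisting",
        whenNotMatched: "insert",
    } },
  ];
}

const generatorPipeline = buildMaskedIdGeneratorPipeline("person_id", "partnerAnalysis");

// In the Mongo Shell, the same pipeline could be run once per source collection:
//   db.persons.aggregate(generatorPipeline);
//   db.tests.aggregate(generatorPipeline);
```

The 'keepExisting' option is what makes re-runs harmless in this sketch: a candidate record whose original ID and data purpose already have a mapping is simply discarded, so the previously issued substitute ID survives.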

2. Creation of the Reversible Masked Data-Set

Once the mappings have been generated for a source data-set, it is time to actually generate a new masked set of records from the original data-set. Again, this is achieved by running an aggregation pipeline, once per source collection in the source data-set. The diagram below illustrates how this process works. The aggregation pipeline takes the original ID fields from the source collection, then performs a lookup on the mappings collection (including the specific data purpose) to grab the previously generated substitute unique ID. The pipeline then replaces the original IDs with the substitute IDs in the output masked collection.

The remaining part of the aggregation pipeline is less generic and must contain rules distinct to the source data-set it operates on. The latter part of the pipeline contains specific data masking actions to apply to specific sensitive fields in the specific data-set.
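A rough sketch of such a masking pipeline is shown below, with the generic lookup part first and the data-set specific part after it. It is an approximation under stated assumptions rather than the project's actual pipeline: the function name, the field names 'person_id', 'national_id' and 'last_name', the 'partnerAnalysis' label, the output collection name and the 30 day shift window are all hypothetical, 'dob' is assumed to be a date, and $rand needs MongoDB 4.4.2 or later.

```javascript
const MAX_DOB_SHIFT_MS = 30 * 24 * 60 * 60 * 1000; // hypothetical +/- 30 day window

function buildMaskingPipeline(idField, dataPurposeId, outputCollection) {
  return [
    // Generic part: find the substitute ID previously generated for this
    // original ID and this data purpose
    { $lookup: {
        from: "masked_id_mappings",
        let: { originalId: "$" + idField },
        pipeline: [
          { $match: { $expr: { $and: [
              { $eq: ["$original_id", "$$originalId"] },
              { $eq: ["$data_purpose_id", { $literal: dataPurposeId }] },
          ] } } },
          { $project: { _id: 1 } },
        ],
        as: "mapping",
    } },
    // Replace the original ID with the substitute ID
    { $set: { [idField]: { $arrayElemAt: ["$mapping._id", 0] } } },
    // Data-set specific part: shift the date of birth by a random offset
    { $set: { dob: { $add: ["$dob", { $toLong: { $multiply: [
        { $subtract: [{ $rand: {} }, 0.5] },
        2 * MAX_DOB_SHIFT_MS,
    ] } }] } } },
    // Filter out the lookup scratch field and other sensitive fields
    { $unset: ["mapping", "national_id", "last_name"] },
    // Write the masked records to their own collection, ready for export
    { $out: outputCollection },
  ];
}

const maskingPipeline = buildMaskingPipeline("person_id", "partnerAnalysis", "tests_masked");

// In the Mongo Shell:
//   db.tests.aggregate(maskingPipeline);
```

Only the first two stages would be reusable across collections in this sketch; the date shifting and the list of removed fields would need to be rewritten for each source collection.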

The generated masked data-set collections can then be exported ready to be shipped to the consuming business unit or 3rd party organisation, who can then import the masked data-set into their own local database infrastructure, ready to perform their own analysis on.

3. Reverse Engineer Masked Data-Set Back To Real Identities

In the example 'test results' scenario, the partner organisation may need to subsequently report back to the main organisation flagging individuals for concern and recommended follow-up. They can achieve this by providing the substituted identities back to the owning organisation, with the additional information outlining why the specific individuals have been flagged. The GitHub project accompanying this blog post shows an example of performing this reversal, where some of the 'tests' collection records in the masked data-set have been marked with the following new attribute by the 3rd party organisation:


The GitHub project then shows how the masked and now flagged data-set, if passed back to the original data-owning organisation, can then have a 'reverse' aggregation pipeline executed on it by the original organisation. This 'reverse' aggregation pipeline performs a lookup on the mappings collection again, but this time to retrieve the original ID using the substitute unique IDs provided in the input (plus the data purpose). This results in a reverse engineered view of the data with real identities flagged, thus enabling the original data-owning organisation to schedule follow-ups with the identified people.
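As a minimal sketch of what such a 'reverse' pipeline could look like (again an approximation, not the project's own code), the lookup below matches on the mapping's '_id' rather than on 'original_id'. The function name, the 'person_id' field, the 'partnerAnalysis' label and the 'tests_masked' collection name are hypothetical.

```javascript
function buildReversePipeline(idField, dataPurposeId) {
  return [
    // Find the mapping whose substitute ID (_id) matches the masked record,
    // restricted to the data purpose the masked data-set was produced for
    { $lookup: {
        from: "masked_id_mappings",
        let: { substituteId: "$" + idField },
        pipeline: [
          { $match: { $expr: { $and: [
              { $eq: ["$_id", "$$substituteId"] },
              { $eq: ["$data_purpose_id", { $literal: dataPurposeId }] },
          ] } } },
          { $project: { _id: 0, original_id: 1 } },
        ],
        as: "mapping",
    } },
    // Put the real ID back in place of the substitute ID
    { $set: { [idField]: { $arrayElemAt: ["$mapping.original_id", 0] } } },
    // Drop the lookup scratch field
    { $unset: "mapping" },
  ];
}

const reversePipeline = buildReversePipeline("person_id", "partnerAnalysis");

// In the Mongo Shell, run by a user whose role can read the mappings:
//   db.tests_masked.aggregate(reversePipeline);
```

Because the pipeline has no output stage, it returns the reversed records as a query result only, so the real identities are not written back into the masked collection.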


In this blog post I have explored why it is sometimes necessary to apply data masking to a data-set, for the masked data-set to be subsequently distributed to another business unit or organisation, but where the original identities can be safely reverse engineered. This is achieved with the appropriate strong data access controls and privileges in place, for access by specific users in the original data-owning organisation only, if the need arises. As a result, no sensitive data is ever exposed to the lesser trusted parties.

Song for today: Lullaby by Low
