Saturday, May 15, 2021

New MongoDB Aggregations book is out

My book, Practical MongoDB Aggregations, was published this week.

The book is available electronically for free for anyone to use at:

This book is intended for developers, architects, data analysts, data engineers, and data scientists. It aims to improve your productivity and effectiveness when building aggregation pipelines and help you understand how to optimise their pipelines.

The book is split into two key parts:

  1. A set of tips and principles to help you get the most out of aggregations.
  2. A bunch of example aggregation pipelines for solving common data manipulation challenges, which you can easily copy and try for yourself.

I hope readers get some good value from it!

Song for today: Earthmover by Have a Nice Life

Saturday, February 27, 2021

MongoDB Reversible Data Masking Example Pattern


In a previous blog post I explored how to apply one-way non-reversible data masking on a data-set in MongoDB. Here I will explore why, in some cases, there can be a need for reversible data masking, where, with appropriate privileges, each original record's data can be deduced from the masked version of the record. I will also explore how this can be implemented in MongoDB, using something I call the idempotent masked-id generator pattern.

To accompany this blog post, I have provided a GitHub project which shows the Mongo Shell commands used for implementing the reversible data masking pattern outlined here. Please keep referring to the project’s README as you read through this blog post, at:

Why Reversible Data Masks?

In some situations a department in an organisation, that masters and owns a set of data, may need to provide copies of the whole or subset of the data to a different department or even a different partner organisation. If the data-set contains sensitive data, like personally identifiable information (PII) for example, the 'data-owning' organisation will first need to perform data masking on the values of the sensitive fields of each record in the data-set. This redaction of fields will often be one-way (irreversible) preventing the other department or partner organisation from being able to reverse engineer the content of the masked data-set, to retrieve the original sensitive fields. 

Now consider the example where the main data-owning organisation is collecting results of 'tests'. The results could be related to medical tests where the data-owning organisation is a hospital for example. Or the results could be related to academic tests where the data-owning organisation is a school for example. Let's assume that the main organisation needs to provide data to a partner organisation for specialist analysis, to identify individuals with concerning test results patterns. However there needs to be assurance that each individual's sensitive details or real identity is not shared with the partner organisation.

How can the partner organisation report back to the main organisation flagging individuals for concern and recommended follow-up without actually having access to those real identities? 

One solution is for the redacted data-set that is provided to the partner organisation, to carry an obfuscated but reversible unique identity field as a substitute for the real identity, in addition to containing other irreversibly redacted sensitive fields. With this solution, it would not be possible for the partner organisation to reverse engineer the substituted unique identity, to a real social security number, national insurance number or national student identifier, for example. However, it would be possible for the main data-owning organisation to convert the substituted unique id back to the real identity, if subsequently required. 

The diagram below outlines the relationship between the data-owning organisation, the partner organisations and the data masked data-sets shared between them. 

A partner organisation can flag an individual of concern back to the main organisation, without ever being able to deduce the real life person who the substituted unique ID maps to.

How To Achieve Reversibility With MongoDB?

To enable a substituted unique ID to be correlated back to the original real ID of a person, the main data-owning organisation needs to be able to generate the substitute unique IDs in the first place, and maintain a list of mappings between the two, as shown in the diagram below.

The stored mappings list needs to be protected in such a way that only specific staff in the data-owning organisation, with specific approved roles, have access to it. This prevents the rest of the organisation and partner organisations from accessing the mappings to be able to reverse engineer masked identifies back to real identities.

Essentially, the overall process of masking and 'unmasking' data with MongoDB, as shown in the GitHub project accompanying this blog post, is composed of three different key aggregation pipelines:

  1. Generation of Unique ID Mappings. A pipeline for the data-owning organisation to generate the new unique anonymised substitute IDs for each person appearing in a test result, into a new mappings collection using the idempotent masked-id generator pattern
  2. Creation of the Reversible Masked Data-Set. A pipeline for the data-owning organisation to generate a masked version of the test results, where each person's id has been replaced with the substitute ID (an anonymous but reversible ID); additionally some other fields will be filtered out (e.g. national id, last name) or obfuscated with partly randomised values (e.g data of birth).
  3. Reverse Engineer Masked Data-Set Back To Real Identities. An optional pipeline, if./when required, for the data-owning organisation to be able to take the potentially modified partial masked data-set back from the partner organisation, and, using the mappings collection, reverse engineer the original identities and other sensitive fields. 

The screenshot below captures an example of the outcome of steps 1 and 2 of the process outlined above.

Here, each person's ID has been replaced with a reversible substitute unique ID. Additionally, the date of birth field ('dob') has been obfuscated (shown with the red underline) and some other sensitive fields have been filtered out.

I will now explore how each of the three outlined process steps is achieved in MongoDB, in the following three sub-sections.

1. Generation of Unique ID Mappings

As per the companion GitHub project, the list of original ID to substitute ID mappings is stored in a MongoDB collection with very strict RBAC controls applied. An example record in this collection might look like the one shown in the screenshot below.

Here the collection is called masked_id_mappings, where each record's field '_id' contains a newly generated substitute ID, based on a generated universally unique identifiers (UUIDs). The field 'original_id' contains the real identifier of the person or entity in the same format it was in, in the original data-set. For convenience, two date related attributes are included in each record. The 'date_generated' field is generally useful for tracking when the mapping was created (e.g. for reporting), and the 'date_expired' is associated with a time-to-live (TTL) index to enable the mapping to be automatically deleted by the database, after a period of time (3 years out, in this example).

The remaining field, 'data_purpose_id' is worthy of a little more detailed discussion. Let's say the same data-set needs to be provided to multiple 3rd parties, for different purposes. It makes sense to mask each copy of the data differently, with different unique IDs for the same original IDs. This can help prevent the risk of any potential future correlation of records between unrelated parties or consumers. Essentially when a mapping record is created, in addition to providing the original ID, a data purpose 'label' must be provided. A unique substitute ID is generated for a given source identity, per data use/purpose. For one specific data consumer purpose, the same substituted unique ID will be re-used for the same reoccurring original ID, However, a different substituted unique ID will be generated and used for an original ID when the purpose and consumer requesting a masked data-set is different.

To populate the masked_id_mappings collection, an aggregation pipeline (called 'maskedIdGeneratprPipeline') is run against each source collection (e.g. against both the 'persons' collection and the 'tests' collection). This aggregation pipeline implements he idempotent masked-id generator pattern. Essentially, this pattern involves taking each source collection, picking out the original unique id and then placing a record of this, with a newly generated UUID it is mapped to (plus the other metadata such as data_purpose_id, date_generated, date_expired), into the masked_id_mappings collection. The approach is idempotent in that the creation of each new mapping record is only fulfilled if a mapping doesn't already exist for the combination of the original unique id and data purpose. When further collections in the data-set are run through the same aggregation pipeline, if some records from these other collections have the same original id and data purpose as one of the records that already exists in the masked_id_mappings collection, a new record with a new UUID will not be inserted. This ensures that, per data purpose, the same original unique id is always mapped to the same substitute UUID, regardless of how often it appears in various collections. This idempotent masked-id generator process is illustrated in the diagram below. 

The same aggregation pipeline is run multiple times, once against each source collection which belongs to the source data-set. Even if the source data-set is ever added to in the future, the aggregation can be re-run against the same data-sets, over and over again, without any duplicates or other negative consequences, and with only the additions being acted upon. The pipeline is so generic that it can also be run against other previously unseen collections which have completely different shapes but where each contains an original unique ID in one of its field.

2. Creation of the Reversible Masked Data-Set

Once the mappings have been generated for a source data-set, it is time to actually generate a new masked set of records from the original data-set. Again, this is achieved by running an aggregation pipeline, once per different source collection in the source data-set. The diagram below illustrates how this process works. The aggregation pipeline takes the original ID fields from the source collection, then performs a lookup on the mappings collection (including the specific data purpose) to grab the previously generated substitute unique ID. The pipeline then replaces the original IDs with the substitute IDs, in the outputted data masked collection

The remaining part of the aggregation pipeline is less generic and must contain rules distinct to the source data-set it operations on. The latter part of the pipeline contains specific data masking actions to apply to specific sensitive fields in the specific data-set.

The generated masked data-set collections can then be exported ready to be shipped to the consuming business unit or 3rd party organisation, who can then import the masked data-set into their own local database infrastructure, ready to perform their own analysis on.

3. Reverse Engineer Masked Data-Set Back To Real Identities

In the example 'test results' scenario, the partner organisation may need to subsequently report back to the main organisation flagging individuals for concern and recommended follow-up. They can achieve this by providing the substituted identities back to the owning organisation, with the additional information outlining why the specific individuals have been flagged. The GitHub project accompanying this blog post shows an example of performing this reversal, where some of the 'tests' collection records in the masked data-set have been marked with the following new attribute by the 3rd party organisation:


The GitHub project then shows how the masked and now flagged data-set, if passed back to the original data-owning organisation, can then have a 'reverse' aggregation pipeline executed on it by the original organisation. This 'reverse' aggregation pipeline looks up the mappings collection again, but this time to retrieve the original ID using the substitute unique IDs provided in the input (plus the data purpose). This results in a reverse engineered view of the data with real identities flagged, thus enabling the original data-owning organisation to schedule follow-ups with the identified people.


In this blog post I have explored why it is sometimes necessary to apply data masking to a data-set, for the masked data-set to be subsequently distributed to another business unit or organisation, but where the original identities can be safely reversed engineered. This is achieved with the appropriate strong data access controls and privileges in place, for access by specific users in the original data-owning organisation only, if the need arises. As a result, no sensitive data is ever exposed to the lesser trusted parties. 

Song for today: Lullaby by Low

Wednesday, February 10, 2021

MongoDB Data Masking Examples


Data Masking is a well established approach to protecting sensitive data in a database yet allowing the data to still be usable. There are a number of reasons why organisations need to employ data masking, with two of the most common being:

  1. To obtain a recent copy of data from the production system for use in a test environment, thus enabling tests to be conducted on realistic data which better represents the business, but with the most sensitive parts redacted. Invariably, there is an increased level of reliability in the results, when tests are applied on real data, instead of using synthetically generated data.
  2. To provide an API for consumers to access live production data, but where the consumer’s role or security level means they are not permitted to see values of all the fields in each record. Instead the subset of protected fields are represented by obfuscated values that still carry meaning (e.g the fact that a field existed or that a date field’s value has an approximate rather than random date).

In this blog post I will show how MongoDB’s powerful aggregation pipelines can be used to efficiently mask data belonging to a MongoDB collection, with various examples of obfuscating/redacting fields. The power of MongoDB’s aggregation pipelines doesn’t necessarily come from any inherent ease of use, compared to say SQL. Indeed its learning curve is not insignificant (although it is far easier to learn and use than say building raw map-reduce code to run against a large data set in an Hadoop cluster, for example). The real power of aggregation pipelines comes from the fact that once a pipeline is defined, a single command can be issued to a MongoDB cluster to then be applied to massive data sets. The cluster might contain a database of billions or more of records, and the aggregation pipeline will be automatically optimised and executed, including transparently processing subsets of the data in parallel across each shard, to reduce the completion time.

To accompany this blog post, I have provided a GitHub project which shows the Mongo Shell commands used for defining the example aggregation pipeline and masking actions that I will outline here. Please keep referring to the project’s README as you read through this blog post, at:

The approach that I will describe provides examples of applying irreversible and non-deterministic data masking actions on fields. That is to say, as a result of applying the outlined data masking techniques, it would be extremely difficult, and in most cases impossible, for a user to reverse engineer or derive the original values of the masked fields.

[EDIT: 27-Feb-2021: For reversible data masking, see the part 2 blog post on this topic]

Sample Payments Data & Data Masking Examples

For these examples I will simulate a ‘payments system’ database containing a collection of ‘card payments’ records. Such a database might be used by a bank, a payment provider, a retailer or an eCommerce vendor, for example, to accumulate payments history. The example data structures here are intentionally over-simplified for clarity. In the screenshot below you can see an example of 2 sample payments records, in their original state, shown on the left hand side. On the right hand side of the screenshot you can see the result of applying the data masking transformation actions from the GitHub project, and which I will describe in more detail further below.

For the example project, 10 different fields have data masking actions applied. Below is a summary of each masking action, the field it is applied to and the type of change applied. You can view and cross reference these to the 10 corresponding aggregation pipeline operations provided in the companion GitHub project:

  1. Replace the card’s security code with a different random set of 3 digits  (e.g. 133 → 472)
  2. For the card number, obfuscate the first 12 digits  (e.g. 1234567890123456 → XXXXXXXXXXXX3456)
  3. For the card’s listed owner name, obfuscate the first part of the name, resulting in only the last name being retained  (e.g. 'Mrs. Jane A. Doe' → 'Mx. Xxx Doe')
  4. For the payment transaction’s recorded date-time, obfuscate the value by adding or subtracting a random amount of time to the current value, by up to one hour maximum  (e.g. 00:23:45  on 01-Sep-2019 → 23:59:33 on 31-Aug-2019)
  5. Replace the payment settlement's recorded date-time by taking an arbitrary fixed time (e.g. 29-Apr-2020) and adding a random amount of time to that, up to one year maximum  (e.g. 07:48:55 on 15-Dec-2018 → 21:07:34 on 08-Jun-2020)
  6. Replace the payment card expiry date with the current date (e.g. 10-Feb-2021) + a random amount of days of up to one year maximum  (e.g. 31-Mar-2020 → 31-Nov-2021)
  7. For the transaction’s payment amount, obfuscate the value by adding or subtracting a random percent to its current value, by up to 10% maximum  (e.g. 49.99 → 45.51)
  8. Replace the payment’s ‘reported’ status boolean value with a new randomly generated true or false value where there is a 50:50 chance of the result being either value  (e.g. false → true, false → false)
  9. Replace the payment transaction’s ID (which is composed of 16 hexadecimal digits), with a new 16 hexadecimal digit value based on an ‘MD5 hash’ of the original ID  (note, do not regard this as 'cryptographically safe')
  10. For the extra customer info sub-document composed of 3 fields, only retain this sub-document and its fields if the value of its ‘category’ field is not equal to the text ‘SENSITIVE’  (i.e. redact out a sub-section of the document where the customer is marked as ‘sensitive’)

Exposing Masked Data To Consumers

Once this pipeline has been built, it can be used in one of four ways in MongoDB depending on your specific needs (again the companion GitHub project provides more specific details on how each of these four ways are achieved in MongoDB):

  1. DATA MASKED AGGREGATION ON DEMAND. Enable the aggregation pipeline to be executed on demand (by calling db.aggregate(pipeline) from a trusted mid-tier application, where the application would have rights to see the non-obfuscated data too. As a result the trusted mid-tier would need to be relied on to return just the result of the ‘data masking aggregation’ to consumers, and to never expose the underlying unmasked data, in the database, in any other way.
  2. DATA MASKED READ-ONLY VIEW. Create a view (e.g. payments_redacted_view) based on the aggregation pipeline and then, using MongoDB’s Role Based Access Control (RBAC) capabilities, only allow consumers to have access to the view, with no permissions to access the underlying source collection, thus ensuring consumers can only access the ‘data masked’ results. Consuming applications can even use a ‘find’ operation, with a ‘filter’ and ‘projection’, when querying the view, to reduce down further the fields and records they want to see.
  3. DATA MASKED COPY OF ORIGINAL DATA. Execute the aggregation pipeline using an additional $merge pipeline stage, to produce a brand new collection (e.g. payments_redacted), which contains only the data masked version of the original collection. The original collection can then either be locked down as not visible to the consumer, using RBAC, or can even just be deleted, if it is no longer required. The latter option is often applicable if you are producing a test environment with a masked version of recent production data, where the original raw data is no longer needed.
  4. DATA MASKED OVERWRITTEN ORIGINAL DATA. Execute the aggregation pipeline using an additional $merge pipeline stage, to overwrite records in the existing collection with modified versions of the corresponding records. The resulting updated collection will now only contain the masked version of the data. Again, this may be applicable in the case where you are producing a test environment with a masked version of recent production data, where the original collection is no longer needed.


In this post and the companion GitHub project, I’ve shown how common irreversible obfuscation patterns can be effectively defined and then applied to mask sensitive data in a MongoDB collection. The example shows only a couple of records being redacted. However, the real power comes from being able to apply the same aggregation pipeline, unchanged, to a collection containing a massive data set, often running across multiple shards, which will automatically be parallelised, to reduce turnaround times for provisioning masked data sets.

Song for today: The Black Crow by Songs: Ohia (Jason Molina)

Tuesday, November 24, 2020

Is Querying A MongoDB View Optimised?

Views in MongoDB appear to database users like read-only collections, ready to be queried in the same way normal collections are. A View is defined by an Aggregation pipeline and when a query is issued on a View, using find(), there is the potential for the execution of the View to be optimised by MongoDB in the same way as MongoDB would optimise any aggregation pipeline that is executed.

In reality, most applications will not issue a find() without specifying a query filter as an argument. This begs the question: When issuing a find() with a query filter against a View (backed by an aggregation pipeline), how is the combination optimised, and can indexes be leveraged effectively?

In the rest of this post, I will explore this further and answer this question.

Source Collection Data

The data I am using for the investigation is a music based data-set sourced from the Discogs website, imported from Discog's XML data dump, using an XML MongoDB import utility.

The resulting releases collection, representing the albums and singles released by all artists, has over 1.5 million documents in it. I've defined various obvious indexes for the collection in anticipation of wanting to run finds and aggregations efficiently against it. Below is a screenshot showing some of the data in this collection, illustrating each document's typical shape...

As you can see, the releases collection contains fields for the artist, the title of the release, the year of the release and the music genres & styles associated with the release.

Let's now look at using two different Views, with different degrees of complexity, against this same collection, to see if and how these Views are optimised at runtime, when a find() is issued...

Using A View Which Filters Out Some Records & Fields

So let's create a View which only shows music released since the start of the year 2000, concatenates the array of one or more styles into a new 'style' string field and then excludes the 'styles' and '_id' fields from the result.

 var pipeline = [

    {$match: {'year': {'$gte': 2000}}},

    {$set: {'style': {

        $reduce: {

            input: '$styles',

            initialValue: '',

            in: {$concat: ['$$value', '$$this', '. ']}



    {$unset: ['styles']},

    {$unset: ['_id']},


 db.createView('millennium_releases_view', 'releases', pipeline);

Below is an example of the shape of result documents, when the View is queried for a specific artist:

If I ask MongoDB to provide the explain plan for an 'empty' query on the View, using the following command...


...the resulting explain plan shows the database runs the following steps in the order shown:

  1. MATCH using INDEX SCAN hitting an index for the Year field
  2. SET new Style string field to concatenate values from existing Styles array field
  3. UNSET Styles array field
  4. UNSET _id field

It's good to see here that the 'year greater than or equal' clause in the aggregation pipeline defined for the View is being run as the first step and is targeting an index to avoid a 'full table scan'. However, what happens when I include a query filter when issuing a find() against the View, to only show releases for a specific artist?

 db.millennium_releases_view.find({'artist': 'Fugazi'}).explain();

This time the resulting explain plan shows the following steps executed:

  1. MATCH using INDEX SCAN hitting a compound index composed of both the Artist & Year fields
  2. SET new Style string field to concatenate values from existing Styles array field
  3. UNSET Styles array field
  4. UNSET _id field

This is great news, because when I am specifying a query filter for the find() on this View, the optimiser is converting the regular find() filter syntax into an aggregation match expression and pushing it to the existing $match stage at the start of the pipeline. As a result, the optimum compound index of (artist, year) is being used, to entirely satisfy the find's 'artist=Fugazi' expression combined with the View's 'year>=2000' expression,.

Does this mean a find() with a query filter will always be pushed to the top of the View's aggregation pipeline, at runtime?

Well actually, no. Let's see why, in this second example...

Using A View Which Rolls Up Some Data

This time let's create a View which groups releases (albums & singles) for each artist by the style associated with the release. For example, if an artist has five albums categorised with the style 'Stoner Rock' and 7 albums categorised by 'Post Rock', the resulting View will contain 2 documents for the artist, one for each of the two styles. This is the command for creating this View:

 var pipeline = [

    {$unwind: {path: '$styles'}}, 

    {$group: {

        _id: {artist: '$artist', style: '$styles'}, 

        titles: {'$push': '$title'},


    {$set: {'artist': '$_id.artist'}},

    {$set: {'style': '$'}}, 

    {$unset: ['_id']},


 db.createView('styles_view', 'releases', pipeline);

Below is an example of the shape of result documents from querying this new second View, for a specific artist:

If I ask MongoDB to provide the explain plan for an 'empty' query on this View, using the following command...


    ...the resulting explain plan shows the database runs the following steps in the order shown:

    1. COLLECTION_SCAN with PROJECTION of Artist, Styles & Title fields only
    2. UNWIND of Styles array field producing a record for each array element
    3. GROUP on Artist + Style fields, adding each associated release title to a new Titles array field
    4. SET Artist string field to the first of element of the group's id
    5. SET Style string field to the second of element of the group's id
    6. UNSET _id field which was created by the group stage

    As expected here, because the aggregation pipeline defined for the View does not contain a $match, the first step will result in a 'full table scan', where all the documents in the collection are inspected, and then the required fields only, are projected out.

    What happens this time when I include a query filter for the find() run against the View, to only show results for a specific artist, using the following command to explain?

     db.styles_view.find({'artist': 'Fugazi'}).explain();

      This time, the resulting explain plan shows the following ordered steps executed:

      1. COLLECTION_SCAN with PROJECTION of ArtistStyles & Title fields only
      2. UNWIND of Styles array field producing a record for each array element
      3. GROUP on Artist + Style fields, adding each associated release title to a new Titles array field
      4. SET Artist string field to the first of element of the group's id
      5. MATCH on Artist filed (no index used)
      6. SET Style string field to the second of element of the group's id
      7. UNSET _id field which was created by the group stage

      Here the new $match generated by MongoDB to capture the find() expression run against the View, is included in the executed aggregation pipeline, but the $match cannot be pushed all the way up to the first step of the pipeline. This is to be expected...

      Essentially what happens when a find() with filter is run on a View is as follows. The filter expression is initially placed in a new $match stage appended to the end of the aggregation pipeline. Then the normal aggregation pipeline runtime optimiser kicks in and attempts to move the newly added $match step as near to the top of the pipeline as possible. However, the $group stage (and related $set on artist, in this case), acts as a barrier. The $group operator stage completely changes the shape of documents and effectively drops any existing fields that preceded it. The optimiser has no way of knowing that a filter on an artist field, being applied to the outcome of a View, is definitively referring to a field called artist that existed in the original source collection used by the View. Instead, for all it knows, the expression on artist could be referring to some other intermediate aggregation pipeline field of similar name. In the example above, even if we don't use $set in the View's pipeline to set a new field called artist, the new $match expression is still blocked by $group and so is only executed straight after $group (a scenario which I also tested).

      So even though I only want to see the results for one artist which relates to only a few 10s of documents in the database, the find() which applies a filter on the View, will result in the total data set of 1.5 million documents being 'full table scanned', adding considerable latency to the response.

      If I wasn't querying the View and instead running my own hand-crafted aggregation pipeline directly against the source collection, to achieve the same functional outcome, my pipeline could be composed of the following stages where I explicitly include the match on artist as the first stage:

       var pipeline = [

          {$match: {'artist': 'Fugazi'}},

          {$unwind: {path: '$styles'}}, 

          {$group: {

              _id: {artist: '$artist', style: '$styles'}, 

              titles: {'$push': '$title'},


          {$set: {'artist': '$_id.artist'}},

          {$set: {'style': '$'}}, 

          {$unset: ['_id']},



      Then when I ask for the explain plan...


        ...I see that the following steps are executed:

        1. MATCH using INDEX_SCAN on the Artist field with PROJECTION of ArtistStyles & Title fields only
        2. UNWIND of Styles array field producing a record for each array element
        3. GROUP on Artist + Style fields, adding each release title to a new Titles array field
        4. SET Artist string field from the first of element in the group id
        5. SET Style string field from the second of element in the group id
        6. UNSET _id field which was created by the group stage

        This time an index will be leveraged so that only the few 10s of records, corresponding to the desired artist, are retrieved, ready for unwinding and grouping. The aggregation does not attempt to grab 1.5 million records. This is only possible because, as the developer of the aggregation pipeline logic, I have extra knowledge which the MongoDB runtime does not have. Specifically, I know that the $match on the artist field should actually be applied to the field named artist in the View's source collection and not to the result of the $group stage.

        Wrapping Up

        What these findings show for Views is that at runtime, when MongoDB receives a find() containing query filter expressions, these expressions are dynamically appended to the end of the View's aggregation pipeline, before the resulting composite pipeline is executed. Then, as is the case if you are just issuing a regular aggregation against a normal collection, MongoDB's aggregation pipeline runtime optimiser attempts to re-order the pipeline on the fly, without changing its functional behaviour, to be more efficient. These runtime optimisations include attempting to push any $match stages as near to the start of the pipeline as possible, to help promote maximum use of indexes when executed. However, stages like the $group stage, which completely transform the shape of documents, mean that the optimiser cannot move a $match ahead of such stages, without risking changing the functional behaviour and ultimately the resulting output. 

        In practice, where Views are used to filter a subset of records and/or a subset of fields. the system should be able to fully optimise the find() run against the collection, pushing query filter expressions to the first step of the executed aggregation pipeline, to best leverage indexes. Only in places where there is a loss of fidelity (e.g when using a $group stage), will it be the case that the find() query filter cannot be placed earlier in the pipeline being executed against the View. 

        Song for today: Runaway Return by Fugazi

        Sunday, October 4, 2020

        Rust & MongoDB - Perfect Bedfellows

        I've been learning Rust over the last month or so and I'm really enjoying it. It's a really elegant and flexible programming language despite being the most strongly typed and compile-time strict programming language I've ever used (bearing in mind I used to be a professional C & C++ developer way back in the day). 

        I'd recently read the really good and commonly referenced blog post Creating a REST API in Rust with warp, which shows how to create a simple example Groceries stock management REST API service, and which uses an in-memory HashMap as its backing store. As part of my learning I thought I'd have a go at porting this to use MongoDB as its data store instead, using the fairly new MongoDB Rust Driver.

        It turns out that this was really easy to do, also due to how well engineered the new MongoDB Rust Driver turned out to be, with its rich yet easy to use API. 

        You can see my resulting MongoDB version of this sample Groceries application, in the Github project rust-groceries-mongo-api I created. Check out that project link to view the source code showing how MongoDB was integrated with for the Groceries REST API and how to test the application using a REST client.

        What was even more surprising was how easy it was to integrate MongoDB's flexible data model with a programming language as strict as Rust, and I encountered no friction between the two at all. In fact, this was even easier to achieve by leveraging the option of using the driver team's additional contribution of BSON translation to the open source Rust Serde framework, which makes it easy to serialize/deserialize Rust data structures to/from other formats (e.g. JSON, Avro and now BSON).

        I plan to blog again in the future, in more detail, about how to combine Rust's strict typing and MongoDB's flexible schema, especially when the data model and consuming microservices inevitably change over time. [UPDATE 09-Dec-2020: I have now blogged on this at MongoDB DevHub, see: The Six Principles for Building Robust Yet Flexible Shared Data Applications]

        Song for today: Dissolution by Cloud Nothings

        Sunday, May 3, 2020

        Converting Gnarly Date Strings to Proper Date Types Using a MongoDB Aggregation Pipeline

        I recently received some example bank payments data in a CSV file which had been exported from a relational database with that database's default export settings. After using mongoimport to import this data 'as-is', into a MongoDB database, I noticed that there was a particularly gnarly date string field in each record. For example:
        • 23-NOV-20
        Why do I say gnarly? Well if you lived through Y2K you should be horrified by the 'year' field shown above. How would you know from the data, without any context, what century this applies to? Is it 1920? Is it 2020? Is it 2120? There's no way of knowing from just the exported data alone. Also, there is no indication of which time zone this applies to. Is it British Summer Time? Is it Eastern Daylight Time? Who knows? Also the month element appears to be an abbreviation of a month expressed in a specific spoken language. Which spoken language?

        I needed to get this into a proper Date type in MongoDB so I could then easily index it, perform date range queries natively, perform sort by date natively, etc.. My usual tool of choice for this is MongoDB's Aggregation pipeline to generate a new collection from the existing collection with the 'date' string fields converted to proper date type fields. To perform the string to date conversion, the usual operator of choice to use is $dateFromString (introduced in MongoDB 3.6). 

        However, $dateFromString [rightly] expects an input string which isn't missing crucial date related text, indicating things like the century or timezone. Also, the $dateFromString operator contains no format specifiers to indicate that the text 'NOV' maps to the 11th month of a year in a specific spoken language.

        Therefore, armed with the extra context of knowing this exported data refers to dates in the 21st century (the '2000s') with a UTC 'time zone' and in the English language (only inferred by asking the owner of the data), I had to perform some additional string manipulation in the aggregation pipeline before using $dateFromString to generate a true and accurate date type. The rest of this blog post shows how I achieved this for date strings like '23-NOV-20'.

        Converting Incomplete Date Strings to Date Types Example

        In the Mongo Shell targeting a running MongoDB test database, run the following code to insert 12 sample 'payment' records, with example 'bad date string' fields for testing each month of a sample year.

        use test;
          {'account_id': '010101', 'pymntdate': '01-JAN-20', 'amount': 1.01},
          {'account_id': '020202', 'pymntdate': '02-FEB-20', 'amount': 2.02},
          {'account_id': '030303', 'pymntdate': '03-MAR-20', 'amount': 3.03},
          {'account_id': '040404', 'pymntdate': '04-APR-20', 'amount': 4.04},
          {'account_id': '050505', 'pymntdate': '05-MAY-20', 'amount': 5.05},
          {'account_id': '060606', 'pymntdate': '06-JUN-20', 'amount': 6.06},
          {'account_id': '070707', 'pymntdate': '07-JUL-20', 'amount': 7.07},
          {'account_id': '080808', 'pymntdate': '08-AUG-20', 'amount': 8.08},
          {'account_id': '090909', 'pymntdate': '09-SEP-20', 'amount': 9.09},
          {'account_id': '101010', 'pymntdate': '10-OCT-20', 'amount': 10.10},
          {'account_id': '111111', 'pymntdate': '11-NOV-20', 'amount': 11.11},
          {'account_id': '121212', 'pymntdate': '12-DEC-20', 'amount': 12.12}

        Then execute the following Aggregation pipeline to copy the contents of the 'rawpayments' collection, populated above, into a new collection named 'payments', but with the 'pymntdate' field values converted from string types to date types.

          {$set: {
            pymntdate: {
              $dateFromString: {format: '%d-%m-%Y %H.%M.%S.%L', dateString:
                {$concat: [
                  {$substrCP: ['$pymntdate', 0, 3]},  // USE FIRST 3 CHARS IN DATE STRING
                  {$switch: {branches: [  // REPLACE MONTH 3 CHARS IN DATE STRING WITH 2 DIGIT MONTH
                    {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'JAN']}, then: '01'},
                    {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'FEB']}, then: '02'},
                    {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'MAR']}, then: '03'},
                    {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'APR']}, then: '04'},
                    {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'MAY']}, then: '05'},
                    {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'JUN']}, then: '06'},
                    {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'JUL']}, then: '07'},
                    {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'AUG']}, then: '08'},
                    {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'SEP']}, then: '09'},
                    {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'OCT']}, then: '10'},
                    {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'NOV']}, then: '11'},
                    {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'DEC']}, then: '12'},
                   ], default: 'ERROR'}},
                  '-20',  // ADD HYPEHN + HARDCODED CENTURY 2 DIGITS
          {$out: 'payments'}

        In this pipeline, the string '23-NOV-20' will be converted to 'ISODate("2020-11-23T22:57:36.827Z")' by concatenating the following four elements of text together before passing it to the $dateFromString operator to convert to a date:
        1. '23-' (from the input string)
        2. '11' (replacing 'NOV')
        3. '-20' (hard-coded hyphen + century)
        4. '20' (the rest of input string apart from last 6 nanosecond digits)
        Note: A $set stage is used in this pipeline, which is a type of stage first introduced in MongoDB 4.2. $set is an alias for $addFields, so if using an earlier version of MongoDB, replace $set with $addFields in the pipeline.

        To see what the converted records look like, containing new date types, query the new collection:

        db.payments.find({}, {_id:0});

        Which will show the following results:

        { "account_id" : "010101", "pymntdate" : ISODate("2020-01-01T01:01:01.123Z"), "amount" : 1.01 }
        { "account_id" : "020202", "pymntdate" : ISODate("2020-02-02T02:02:02.456Z"), "amount" : 2.02 }
        { "account_id" : "030303", "pymntdate" : ISODate("2020-03-03T03:03:03.789Z"), "amount" : 3.03 }
        { "account_id" : "040404", "pymntdate" : ISODate("2020-04-04T04:04:04.012Z"), "amount" : 4.04 }
        { "account_id" : "050505", "pymntdate" : ISODate("2020-05-05T05:05:05.345Z"), "amount" : 5.05 }
        { "account_id" : "060606", "pymntdate" : ISODate("2020-06-06T06:06:06.678Z"), "amount" : 6.06 }
        { "account_id" : "070707", "pymntdate" : ISODate("2020-07-07T07:07:07.901Z"), "amount" : 7.07 }
        { "account_id" : "080808", "pymntdate" : ISODate("2020-08-08T08:08:08.234Z"), "amount" : 8.08 }
        { "account_id" : "090909", "pymntdate" : ISODate("2020-09-09T09:09:09.567Z"), "amount" : 9.09 }
        { "account_id" : "101010", "pymntdate" : ISODate("2020-10-10T10:10:10.890Z"), "amount" : 10.1 }
        { "account_id" : "111111", "pymntdate" : ISODate("2020-11-11T11:11:11.111Z"), "amount" : 11.11 }
        { "account_id" : "121212", "pymntdate" : ISODate("2020-12-12T12:12:12.999Z"), "amount" : 12.12 }

        Song for today: For Everything by The Murder Capital

        Sunday, December 29, 2019

        Running MongoDB on ChromeOS (via Crostini)

        In my previous post I explored Linux application support in ChromeOS and Chromebooks (a.k.a. Crostini). Of course I was bound to try running MongoDB in this environment, which I found to work really well (for development purposes). Here's my notes on running a MongoDB database and tools on a Chromebook with Linux (beta) enabled:
        • In ChromeOS, launch the Terminal app (which opens a Shell inside the 'Penguin' Linux container inside the 'Termina' Linux VM)
        • Run the following commands which are documented in the MongoDB Manual page on installing MongoDB Enterprise on Debian (following the manual's tab instructions titled “Debian 9 "Stretch”):
        wget -qO - | sudo apt-key add -
        echo "deb stretch/mongodb-enterprise/4.2 main" | sudo tee /etc/apt/sources.list.d/mongodb-enterprise.list
        sudo apt-get update
        sudo apt-get install -y mongodb-enterprise
        • Start a MongoDB database instance running:
        mkdir ~/data
        mongod --dbpath ~/data
        • Launch a second Terminal window and then run the Mongo Shell against this database and perform a quick database insert and query test:

        • Install Python 3 and the PIP Python package manager (using Anaconda) and then install the MongoDB Python driver (PyMongo):
        bash Anaconda3-*
        source ~/.bashrc
        python --version
        pip --version
        pip install --user pymongo
        • Test PyMongo by running a small ‘payments data generator’ Python script pulled down from a GitHub repository (this should insert records into the MongoDB local database’s “fs.payments” collection; after letting it run for a minute, continuously inserting new records, press Ctrl-C to stop it):
        git clone
        cd PaymentsWriteReadConcerns/
        ./ -p 1
        • Download MongoDB Compass (use the Ubuntu 64-bit 14.04+ version), install and run it against the 'localhost' MongoDB database and inspect the contents of the “fs.payments” collection:
        sudo apt install ./mongodb-compass_*_amd64.deb

        Song for today: Sun. Tears. Red by Jambinai

        My Notes on Linux Application Support in ChromeOS (a.k.a. Crostini)

        These are my own rough notes from spending a few days studying Chrome OS and its Linux app support on a HP Chromebook 14* I got for free (retails for about £150) when I recently purchased a Google Pixel 4 Android mobile phone. I thought I’d share the notes in case they are of use to others. I’m sure there needs to be some corrections, so feedback is welcome.
         * released: 2019, model: db0003na, codename: careena, board: grunt

        Some references to other articles that I used to bootstrap my knowledge:

        Below are some screenshots showing the ChromeOS Settings section where “Linux (beta)” (a.k.a. Crostini) can be enabled and the Linux apps that are then installed by default when (essentially just the GNOME Help application and the Terminal application, from which many other Linux apps can subsequently be installed):

        Here is a diagram I put together to attempt to capture the architecture of Crostini in ChromeOS as I understand it (the rest of this document digs into the details behind some of these layers):

        ChromeOS & Crostini

        • Under the covers, ChromeOS is based on Gentoo and the Portage package manager
        • crosh (ChromeOS Developer Shell) is the pluggable command line shell/terminal for ChromeOS (in the Chrome browser, enter Ctrl-Alt-T to launch crosh inside a browser tab)
        • Crostini is the term for Linux application support in ChromeOS which manages the specific Linux VM and then the specific Linux container inside it, managing the lifecycle of when to launch them, mounting the filesystem to show the container’s files in the ChromeOS Files app, etc.. Crostini provides easy to use Linux application support integrated directly into the running ChromeOS desktop, rather than, for example, needing to dual boot or having to run a separate Linux VM and needing to explicitly switch, via the desktop, between ChromeOS and the Linux VM.
        • ChromeOS also has a Developer mode (verification is disabled when the OS boots) which is a special mode built into all Chromebooks to allow users and developers to access the code behind the Chrome Operating System and load their own builds of ChromeOS. This mode also allows users to install and run another Linux system like Ubuntu instead of ChromeOS (i.e. dual boot), but still have ChromeOS available to boot into too
        • As an alternative to Crostini, in addition to the dual-boot option, developer mode can also be used for Crouton which is a set of scripts that bundle up a chroot generator/environment to run both ChromeOS and Ubuntu at the same time. Here a Linux OS runs alongside ChromeOS, so users can switch between the ChromeOS desktop and Linux desktops via a keyboard shortcut. This gives users the ability to take advantage of both environments without needing to reboot. Unlike with virtualisation, a second OS is not being booted and instead the guest OS is running using the Chromium OS system. As a result any performance penalty is reduced because everything is run natively, and RAM is not being wasted to boot two OSes at the same time. Note, Crostini is different than this Crouton capability, as it enables the Linux shell and apps to be brought into the platform in verified (non-developer) mode with seamless user interface desktop integration and multi-layered security, in a supported way.
        • To use Crostini, from the ChromeOS Settings select ‘Linux (Beta)’ and choose to enable it, which, behind the scenes, will download and configure a specific Linux VM containing a specific Linux Container (see the next sections for more details) and it adds a launcher group to the ChromeOS desktop called ‘Linux Apps’. This launcher group includes a launcher to run a Linux shell/terminal application, called Terminal, which is displayed in the ChromeOS desktop but is connected directly inside the container

        Crostini Linux VM Layer

        • crosvm (ChromeOS Virtual Machine Monitor) is a custom virtual machine manager written in Rust that runs guest VMs via Linux's KVM hypervisor virtualisation layer and manages the low-level virtual I/O device communication (Amazon’s Firecracker is a fork of crosvm)
        • A specific VM is used to run a container rather than ChromeOS running a container directly, for security reasons because containers do not provide sufficient security isolation on their own. With the two layers, an adversary has to exploit crosvm via its limited interactions with the guest, in addition to the container, and the VM itself is heavily sandboxed.
        • The VM (and its container) are tied to a ChromeOS login session and as soon as a user logs out, all programs are shut down/killed by design (all user data lives in the user’s encrypted home to ensure nothing is leaked when a user log out). The VM, container and their data are persisted across user sessions and are kept in the same per-user encrypted storage as the rest of the browser's data.
        • KVM generally (rather than Crostini specifically) can execute multiple virtual machines running unmodified Linux or Windows images. Each virtual machine has private virtualised hardware: a network card, disk, graphics adapter, etc. The kernel component of KVM is included in mainline Linux codebase and the userspace component of KVM is included in mainline QEMU codebase
        • Termina is the VM launched by crosvm and is based on a ChromeOS (CrOS) image with a stripped-down ChromeOS Linux kernel and userland tools. The main goal is to just boot up Termina as quickly as possible, as a secure sandbox, and start running containers.
        • Currently, other custom VMs (other Linux variants, Windows, etc) cannot be run and only instances of the Termina VM image can be booted, although multiple VM instances can be run simultaneously based on the Termina image
        • vmc is the crosh command line utility to manually manage custom VM instances via Concierge (the ChromeOS daemon that manages VM/container life cycles)
        • To view the registered VM(s) from crosh (Ctrl-Alt-T), which may or may not be running, run:
        vmc list
        • To launch the Termina VM as a VM instance called ‘termina’ and open a shell directly in the VM, run:
        vmc start termina
        • With the above command, the default container in the VM will not be started automatically. However, instead, if from the ChromeOS desktop, a Linux Shell (Terminal) or other Linux App is launched (or the ‘Linux files’ app, Files , is launched) the Termina VM is automatically launched and the default container it owns is also automatically started
        • If the Termina VM is already running, to connect to it via a shell, run:
        vsh termina
        • If the ‘vmc start’ command is run with a different VM name, a new VM of that name will be created, launched and its shell entered from the existing terminal command line. This will use the same Termina image, and when running, ‘vmc list’ with list both VMs (the new instance doesn’t have any containers defined in it by default, ready to run, unlike the main Termina VM)
        • To stop the main Termina VM, run:
        vmc stop termina

        Crostini Container Layer

        • The Termina VM only supports running containers using the “Linux Containers” (LXC) technology at the moment and doesn’t support Docker or other container technologies
        • The default container instance launched via Termina is called Penguin and is based on Debian 9 with some custom packages
        • Containers are run inside a VM rather than programs running directly in the VM to help keep VM startup times low, to help improve security sandboxing by providing a stateless immutable VM image and to allow the container, its applications and their dependencies to be maintained independently from the VM, which otherwise may have contradicting dependecy requirements
        • LXC, generally, works in the vanilla Linux kernel requiring no additional patches to be applied to the kernel source and uses various kernel features to contain processes including kernel namespaces (ipc, uts, mount, pid, network and user), Apparmor and SELinux profiles, Seccomp policies, chroots (using pivot_root), CGroups (control groups). LXCFS provides the userspace (FUSE) filesystem providing overlay files for cpuinfo, meminfo, stat and uptime plus a cgroupfs compatible tree allowing unprivileged writes.
        • LXD is a higher level container framework, which Crostini uses and LXD uses its own specific image formats and also provides the ability to manage containers remotely. Although LXD uses LXC under the covers, it is based on more than just LXC. The Termina VM is configured to run the LXD daemon. Confusingly, the command line tool for controlling LXD is called ‘lxc’ (the ‘LXD Client). If users are using LXD commands to manage containers, they should avoid using any commands that start with ‘lxc-’ as these are lower level LXC commands. Users should avoid mixing and matching the use of both sets of commands in the same system. Crostini uses LXD to launch the Penguin container and LXD is configured to only allow unprivileged containers to be run, for added security. Therefore with Crostini, users should not use the lower level ‘lxc-’ commands because these can’t manage the LXD derived containers that Crostini uses. By default, LXD comes with 3 remote repositories providing images: 1) ubuntu: (for stable Ubuntu images), 2) ubuntu-daily: (for daily Ubuntu images), and 3) images: (for other distros)
        • In the Termina VM, the full LXC/LXD capabilities are provided, and remote images for many types of distros can be used to spawn multiple containers, in addition to the main Penguin container (these are not tested or certified though so may or may not work correctly)
        • Sommelier (a Wayland proxy compositor provides seamless X forwarding integration for content, input events, clipboard data, etc... between Linux apps and the ChromeOS desktop) and Garcon (a daemon for passing requests between the container and ChromeOS) binaries are bind-mounted into the main Penguin container. The Penguin container’s systemd is automatically configured to start these daemons. The libraries for these daemons are already present in the Penguin container LXD image used for Penguin (‘google:debian/stretch’). Other LXD containers launched in the VM don't seem to be enabled for their X based GUI apps to be displayed in the ChromeOS desktop, even if they use the special ‘google:debian/stretch’ LXD container image as it seems Crostini won’t attempt to integrate with this at runtime. Note: Some online articles imply it may be possible to get X-forwarding working from multiple containers.
        • In the Penguin container (which users can access directly, via the Terminal app launcher in the ChomeOS desktop), users can query the IP address of the container which is accessible from ChromeOS and can then run crosh (Ctrl-Alt-T) in ChromeOS and ping the IP address of the container directly. Users can also SSH from the ChromeOS desktop to the Penguin container using Google’s official SSH client that can be installed in Chrome via Chrome Web Store
        • If other containers are launched and then Google’s official SSH client is installed in ChromeOS (install ‘Secure Shell Extension’ via the Chrome Web Store), users can then define SFTP mount-points to other non-Penguin containers and the files in these containers will automatically appear in the Files app too 
        • From the Termina VM, users can use the standard LXD lxc command line tool to list containers and then to see if the Penguin container is running, by running:
        lxc list
        lxc info penguin | grep "Status: "

        • To check the logs for the Penguin container, run:
        lxc info --show-log penguin
        • To open a command line shell as root in the running container (note, the Terminal app has a different identity for connecting to the Penguin container, which is a non-root user), run
        lxc exec penguin -- /bin/bash
        • Within the Penguin container you can run GUI apps which automatically display in the main ChromeOS user interface. For example to install the GEdit text editor Linux application run the following (which also adds a launcher for GEdit in the ChromeOS desktop ‘Linux Apps’ launcher group):
        sudo apt install gedit

        • It is even possible to install and run a new Google Chrome browser installation from the Linux container, by running the following (which also adds a launcher for this Linux version of Chrome in the ChromeOS desktop ‘Linux Apps’ launcher group):
        sudo apt install ./google-chrome-stable_current_amd64.deb

        • From crosh (Ctrl-Alt-T), it is also possible to start the main container in the main VM (if not already started) and then connect a shell directly to the main container in the main VM, by running
        vmc container termina penguin
        vsh termina penguin

        Playing with Custom Containers

        • First of all launch crosh (Ctrl-Alt-T), and connect a shell to the Termina VM:
        vsh termina
        • Import Google’s own image repository into LXD to include the special Debian image used by Penguin:
        lxc remote list
        lxc remote add google --protocol=simplestreams
        lxc remote list
        lxc image list google:
        lxc image info google:debian/stretch

        • Launch and test a container using Google’s special Debian 9 image:
        lxc launch google:debian/stretch mycrosdebiancontainer
        lxc list
        lxc exec mycrosdebiancontainer -- /bin/bash
        cat /etc/*elease*
        apt update && apt upgrade -y

        • Launch and test a container using a standard Ubuntu 18.04 image:
        lxc launch ubuntu:18.04 myubuntucontainer
        lxc list
        lxc exec myubuntucontainer -- /bin/bash
        cat /etc/*elease*
        apt update && apt upgrade -y

        • Launch and test a container using a standard Centos 7 image:
        lxc launch images:centos/7 mycentoscontainer
        lxc list
        lxc exec mycentoscontainer -- /bin/bash
        cat /etc/*elease*
        yum -y update

        • If the Chromebook is rebooted and the Termina VM restarted, these 3 containers still exist as they are persisted, but they will be in a stopped state. When the containers are then manually restarted they will still have the same settings, files and modifications that were made before they were stopped. To start a stopped container run (example shown for one of the containers):
        lxc start myubuntucontainer
        • None of the containers launched above seem to enable GUI apps (e.g. GEdit) to be forwarded automatically to the ChromeOS desktop. Even though the ‘google:debian/stretch’ based container has the relevant X forwarding libraries bundled, it doesn't seem to be automatically integrated with at runtime by the Crostini framework to enable X forwarding
        • Another way to launch a new container is to use one of the following commands, although, again, neither seem to automatically configure X-forwarding, even though they use the ‘google:debian/stretch’ image. It seems that only the Penguin container specifically is beiung managed by Crostini and has X forwarding configured (the first command below should be launched from ChromeOS crosh, the second command which is deprecated performs the same action but should be run from inside the Termina VM:
        vmc container termina mycontainer --container_name=mycontainer --user=jdoe --shell

        • Note, this may throw a timeout error similar to below, but the containers do seem to be created ok:
        Error: routine at frontends/ `container_create(vm_name,user_id_hash,container_name,image_server,image_alias)` failed: timeout while waiting for signal

        Song for today: The Desert Song, No.2 - live by Sophia

        Thursday, December 19, 2019

        Some Tips for Diagnosing Client Connection Issues for MongoDB Atlas


           [UPDATE 07-Sep-2020: I've now written an executable binary tool you can run which performs the equivalent of the checks in this blog post to diagnose connectivity issues to Atas or any other type of MongoDB deployment, downloadable from here]

        By default, for recent MongoDB drivers and client tools, MongoDB Atlas advertises the exposed URL for a deployed database cluster using a service name which maps to a set of DNS SRV records to provide an initial connection seed list. This results in a much more 'human digestible' URL, but more importantly, increases deployment flexibility and the ability for underlying database server hosts to migrate over time, without needing to subsequently reconfigure clients.

        For example, an Atlas Cluster may be referenced in a connection string by:
 an alternative to the full connection endpoint list:

        It is worth noting though, whichever approach is used (explicitly defining all endpoints in the connection string or having it discovered via the DNS SRV service name), the connection URL seed list is only ever used for bootstrapping a client application to the database cluster, when the client first starts or when it later needs to restart. On start-up, the client uses the connection seed list to attempt to attach to any member of the cluster, and in fact, all but one of the endpoints could be incorrect and a successful cluster connection will still be achieved. Once the initial connection is made, the true cluster member endpoint list is dynamically and continuously shared between the cluster and the client at runtime. This enables the client to continue operating against the database even if the members of the database cluster change locations or identities over time. For example, after a year of a database cluster and application continuously running, there could be the need to increase database capacity by dynamically rotating the database hosts to new higher processing capacity machines. This all happens dynamically and the already running client application automatically becomes aware and leverages the new hosts without downtime and without needing to consult the connection string again. If the client application restarts though, it will need to read the updated connection string to be able to bootstrap a connection back up to the database cluster.

        In the rest of this post we will explore some of the ways initial client connectivity issues can be diagnosed and resolved when using DNS SRV based connection URLs. For reference, Joe Drumgoole provides a great explanation about how DNS SRV records work more generally, and how MongoDB drivers and tools can leverage these.

        Naive Connectivity Diagnosis

        If you are having connection problems with Atlas when using the SRV service name based URL, be weary of drawing the wrong conclusions regarding the cause of the connection problem...

        For example, lets say you can't connect an application to a cluster with the Atlas advertised URL of 'mongodb+srv://' from your laptop. You may be tempted to try to debug the connection problem by running some of the following commands from your laptop:

        $ ping
        ping: Name or service not known

        $ nc -zv -w 5 27017
        nc: getaddrinfo for host "" port 27017: Name or service not known

        Neither of these work even if you actually do have Atlas connectivity configured correctly. This is because "" is not the DNS name of a specific host endpoint. It is actually used by the MongoDB drivers and tools to dynamically lookup the DNS SRV records which have been populated for a service called ''.

        Useful Connectivity Diagnosis

        As documented in the MongoDB Drivers specification document and the MongoDB Manual, a DNS SRV query is performed by the drivers/tools by prepending the text '_mongodb._tcp.' to the service name. Therefore, to lookup the list of real endpoints for the Atlas cluster from your laptop using the DNS nslookup tool, you should run:

        $ nslookup -q=SRV

        Non-authoritative answer: service = 0 0 27017 service = 0 0 27017 service = 0 0 27017

        You can see that in this case that the database service name maps to 3 endpoints (i.e. the hosts of the 3 replica set members). You can then lookup the actual IP address of any one of these endpoints if you desire:

        $ nslookup

        Non-authoritative answer: canonical name =

        So to now debug your connectivity issue further you can use ping but this time by specifying one of the underlying host server endpoints for the database cluster:

        $ ping -c 3
        PING ( 56(84) bytes of data.
        64 bytes from ( icmp_seq=1 ttl=51 time=10.2 ms
        64 bytes from ( icmp_seq=2 ttl=51 time=9.73 ms
        64 bytes from ( icmp_seq=3 ttl=51 time=11.7 ms

        --- ping statistics ---
        3 packets transmitted, 3 received, 0% packet loss, time 2002ms
        rtt min/avg/max/mdev = 9.739/10.586/11.735/0.850 ms

        If this is successful it still doesn't necessarily mean that you can connect to the database service. The next thing to try is to see if you can actually open a socket connection to the mongod (or mongos) daemon process running on one of the endpoints, which you can achieve from your laptop using the netcat utility:

        $ nc -zv -w 5 27017
        nc: connect to port 27017 (tcp) timed out: Operation now in progress

        If this doesn't connect but you are able to ping the endpoint host (as is the case in this example), it probably indicates that the IP address of your client laptop has not been added to the Atlas project's access list, which is easy to remedy via the Atlas Console:

        Once your laptop has been added to the access list, running netcat again should demonstrate that a socket connection can now be successfully made:

        $ nc -zv -w 5 27017
        Connection to 27017 port [tcp/*] succeeded!

        If this connects, then it is advisable to move on to trying to connect to the database via the Mongo Shell.

        In this example screenshot, the Atlas console suggests the following Mongo Shell command line to use to connect:

         mongo "mongodb+srv://" --username main_user

        With this connection string, some of you may be thinking how does the Shell know to connect to Atlas over SSL/TLS, what replica-set name it should request and what authentication source database it should specify to locate the user's credentials?

        Well, in addition to querying the DNS SRV records for the service, when dynamically constructing the initial bootstrap URL for the cluster, the MongoDB drivers/tools also lookup a DNS TXT record for the service which Atlas also populates for the deployed cluster. This TXT record contains the set of connection options, to be added as parameters to the dynamically constructed connecting string (e.g. 'ssl=true&replicaSet=TestCluster-shard-0&authSource=admin'). You can view what these parameter settings are for a particular Atlas cluster, yourself, by running the following DNS query:

        $ nslookup -q=TXT

        Non-authoritative answer:  text = "authSource=admin&replicaSet=TestCluster-shard-0"

        Note, the default behaviour for MongoDB drivers/tools using a 'mongodb+srv' based URL is defined as to enable SSL/TLS for the connection. As a result, 'ssl=true' doesn't have to be included in the DNS TXT record, as shown in the example above, because the drivers/tools will automatically add this parameter to the connection string on the fly.


        There's other potential causes of MongoDB Atlas connectivity issues that aren't covered in this post, but hopefully the tips highlighted here will help some of you, especially if you are diagnosing problems when using DNS SRV based service names in the connection URLs you use.

        Song for today: Lose the Baby by Tropical Fuck Storm