Paul Done's Technical Blog
A blog about enterprise software development and deployment

# New Paper Version Of Practical MongoDB Aggregations Book Now Available

*17 September 2023*

Just over 2 years ago, I [self-published](https://pauldone.blogspot.com/2021/05/mongodb-aggregations-book.html) the [Practical MongoDB Aggregations eBook](https://www.practical-mongodb-aggregations.com/), which is also referenced by parts of the [MongoDB Manual](https://www.mongodb.com/docs/manual/).

Now there is a paper and electronic version of the book, officially endorsed by MongoDB Inc. and published by [Packt](https://www.packtpub.com/). The Packt version of the book includes extra information on some topics and two additional example chapters.

![Practical MongoDB Aggregations book front cover](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiay037prx2t1jn4Vnh7NGtURaVs-lvvo7WYjwgC8WOGCA1RS6-nR9h8v3Az0WEsRqmsPptJfHc_Sv1_na4l3AyDccBphbydzT9NGkJvFSqlEvyVxOcesIJpsuZvgV2FiJVIz_V_AHLpfMbzFUkxPUb-KCYd1HkUuWaxW5ma8pQq5Rdw26G9uW52hWhyMw/s633/packt_book_front_cover.png)

You can purchase the book from the [Packt](https://www.packtpub.com/product/practical-mongodb-aggregations/9781835080641) website, [Amazon](https://www.amazon.com/Practical-MongoDB-Aggregations-developing-aggregation-ebook/dp/B0CGVKYGPT), or other book retailers.

Practical MongoDB Aggregations helps you unlock the full potential of the MongoDB aggregation framework, including the latest features of MongoDB 7.0. It arms you with practical, easy-to-digest principles and approaches for becoming more effective at developing aggregation pipelines, supported by examples of building pipelines to solve complex data manipulation and analytical tasks.

The book is tailored to developers, architects, data analysts, data engineers, and data scientists who already have some familiarity with the aggregation framework (it's not for aggregation beginners). It starts by explaining the framework's architecture and then shows you how to build pipelines optimized for productivity and scale. Given the critical role arrays play in MongoDB's document model, the book delves into best practices for optimally manipulating arrays. The latter part of the book equips you with examples for solving common data processing challenges, so you can apply the lessons you've learned to practical situations.

**What You Will Learn**

- Develop dynamic aggregation pipelines tailored to changing business requirements
- Eliminate the performance penalties of processing data externally by filtering, grouping, and calculating aggregated values directly within the database
- Master essential techniques to optimize aggregation pipelines for rapid data processing
- Achieve optimal efficiency for applying aggregations to vast datasets with effective sharding strategies
- Employ MongoDB expressions to transform data and arrays for deeper insights
- Secure your data access and distribution with the help of aggregation pipelines

I hope you enjoy the book!

# Achieving At Least An Order Of Magnitude Aggregation Performance Improvement By Scaling & Parallelism

*6 December 2021*

## Introduction

When I started my MongoDB Inc. career 8 years ago, the 'bootcamp' project topic I chose for self-learning was to investigate how to speed up aggregations via parallelism. Specifically, I investigated the benefits of splitting an aggregation into parts, each running against a subset of the data concurrently. At the time, my study yielded a positive outcome, reducing the response time of a "full collection scan" style aggregation (see my original [unsharded](https://pauldone.blogspot.com/2014/03/mongoparallelaggregation.html) and [sharded](https://pauldone.blogspot.com/2014/04/mongoparallelaggsharded.html) results and write-ups). However, in hindsight, running everything inside a single laptop dulled the impact. In reality, I was probably hitting host machine resource contention (e.g. CPU, RAM, and storage IOPS limits) in a way that wouldn't occur in a real distributed environment.

I thought I'd take the opportunity to revisit this topic:

> **"Can I improve the performance of a 'full-table-scan' type of aggregation workload by splitting the aggregation into parallel jobs, each operating on a subset of the data?"**

This time, I chose to test against a database containing all the movies ever catalogued, calculating the average movie rating across all of them.
I decided to use a remote MongoDB cluster deployment for the test environment, with separate host machines for each replica/shard. Because it is so easy to rapidly create a production-like environment, I used [MongoDB Atlas](https://www.mongodb.com/atlas/) to provision and host the database cluster. I'd expect similar results from running the tests on equivalent hardware with any self-managed deployment of MongoDB.

The tests analyse how the aggregation workload's completion time changes when adding more hardware (e.g. CPUs), more shards, and more parallelisation of the aggregation's pipeline, in various combinations.

## Data

I created a collection of 100 million 'movie' documents by duplicating records from the smaller `sample_mflix` database sourced from the Atlas [sample data set](https://docs.atlas.mongodb.com/sample-data/). Approximately ⅓ of the documents in the `movies` collection have a field called `metacritic`, which provides an aggregated movie rating across many reviews collated by the [Metacritic website](https://en.wikipedia.org/wiki/Metacritic).

Below is an example of a movie document from the collection, for one of my favourite films, [Drive](https://en.wikipedia.org/wiki/Drive_(2011_film)):

![Example movie document for the film Drive](https://blogger.googleusercontent.com/img/a/AVvXsEhs2oImq0rSkO6nIPRUMQC4zU-s0JZUuAONvu3krT-VajgAr068gGLkrNdk2P0QuvBX_9W6vjVuT8kmkdiHGa0pMe0yOYdl0F0wPeYdtqA5RmUQ8Qd1WmALji3__ofF8vtPc9wmBOe745xJntNuGZiHl7YYUiKGErUEFebglzCiIoIr-HBjk3zGV-_3=s736)
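The load script itself isn't shown in this post, so purely as a rough illustration (not the exact script I used), duplicating the sample movies up to the 100 million mark can be done with a small PyMongo loop like the one below; the `MONGODB_URL` environment variable and the batching approach are assumptions:

```python
# Illustrative sketch: inflate sample_mflix.movies to ~100 million documents
# by repeatedly re-inserting copies of the original sample records.
import os
from pymongo import MongoClient

TARGET_SIZE = 100_000_000

client = MongoClient(os.environ["MONGODB_URL"])  # assumed connection variable
coll = client["sample_mflix"]["movies"]

# Strip each _id so every inserted copy gets a freshly generated one
originals = [{k: v for k, v in doc.items() if k != "_id"}
             for doc in coll.find({})]

while coll.estimated_document_count() < TARGET_SIZE:
    # insert_many assigns new _id values to the copies it is given
    coll.insert_many([dict(d) for d in originals], ordered=False)
```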
## Aggregation

I wanted to compute the average `metacritic` score across every movie in the database collection. Using the MongoDB Shell ([mongosh](https://docs.mongodb.com/mongodb-shell/)), I am able to execute the following aggregation pipeline to calculate the average rating across all movies:

```
use sample_mflix;

pipeline = [
  {"$group": {
    "_id": "",
    "average_rating": {"$avg": "$metacritic"},
  }},
];

db.movies.aggregate(pipeline);

[{_id: '', average_rating: 59.387109125717934}]
```
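,</span></p>">
Note that `$avg` ignores documents where `metacritic` is missing or non-numeric, so this is really the average over the roughly ⅓ of movies that carry the field. For reference, here is a sketch of the same pipeline issued from Python with PyMongo (the connection URI is a placeholder):

```python
# The same average-rating aggregation issued via PyMongo
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI

pipeline = [
    {"$group": {
        "_id": "",
        "average_rating": {"$avg": "$metacritic"},
    }},
]

print(list(client["sample_mflix"]["movies"].aggregate(pipeline)))
# e.g. [{'_id': '', 'average_rating': 59.387109125717934}]
```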
## Adding Parallelism

I then wrote a small Python test utility called [Mongo Parallel Agg](https://github.com/pkdone/mongo-parallel-agg) to achieve parallelism and test the outcome quickly. This utility analyses the movie data set, divides it into sections (e.g. 8 parts), and then spawns multiple sub-processes. Each sub-process runs in parallel, targeting one of the subsections of data.
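To show the overall shape of this approach, here is a simplified sketch (not the utility's actual code) of dispatching one aggregation per subsection with Python's `multiprocessing` module and recombining the partial results; the URI and the example title boundaries are placeholders:

```python
# Simplified sketch of parallel dispatch: one worker process per data
# subsection, each running the same aggregation over its own range of
# movie titles, with the partial results recombined at the end.
import multiprocessing
from pymongo import MongoClient

URI = "mongodb://localhost:27017"  # placeholder URI

def aggregate_range(bounds):
    """Sum ratings and count rated movies for one [lower, upper) title range."""
    lower, upper = bounds
    title_filter = {}
    if lower is not None:
        title_filter["$gte"] = lower
    if upper is not None:
        title_filter["$lt"] = upper
    coll = MongoClient(URI)["sample_mflix"]["movies"]
    pipeline = [
        {"$match": {"title": title_filter}},
        {"$group": {
            "_id": "",
            "total": {"$sum": "$metacritic"},
            # only count documents that actually hold a numeric rating
            "count": {"$sum": {"$cond": [{"$isNumber": "$metacritic"}, 1, 0]}},
        }},
    ]
    return next(coll.aggregate(pipeline))

if __name__ == "__main__":
    # In the real utility, these boundaries come from $bucketAuto (see below);
    # a None bound leaves that side of a range open-ended
    bounds_list = [(None, "Exotica"), ("Exotica", "Salesman"), ("Salesman", None)]
    with multiprocessing.Pool(len(bounds_list)) as pool:
        partials = pool.map(aggregate_range, bounds_list)
    total = sum(p["total"] for p in partials)
    count = sum(p["count"] for p in partials)
    print("average rating:", total / count)
```

Each worker returns a partial sum and count rather than a partial average; summing those and dividing at the end gives the true overall mean, whereas averaging the per-range averages would weight each range equally regardless of how many rated movies it contains.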
MongoDB has a handy tool for working out the approximate "natural split points" for a given data set: the [$bucketAuto](https://docs.mongodb.com/manual/reference/operator/aggregation/bucketAuto/) aggregation stage. The mongo-parallel-agg app uses `$bucketAuto` to perform an operation similar to the following to understand the shape of the movie data set and identify its subsections (in this case asking for 8 approximately balanced subsections of the titles of movies):

```
pipeline = [
  {"$bucketAuto": {
    "groupBy": "$title",
    "buckets": 8
  }},

  {"$group": {
    "_id": "",
    "splitPoints": {
      "$push": "$_id.min",
    },
  }},

  {"$unset": [
    "_id",
  ]},
];

db.movies.aggregate(pipeline);

[{
  splitPoints: [
    '!Women Art Revolution',
    'Boycott',
    'Exotica',
    'Ishqiya',
    'Mr Perfect',
    'Salesman',
    'The Counterfeiters',
    "The Strange History of Don't Ask, Don't Tell"
  ]
}]
```

As you can see from these results, it's important to analyse the spread of values in a collection to determine its natural divisions. A naïve approach would be to manually divide all the movie titles by their initial letter in the English alphabet (A-Z). With 26 letters, if you need 8 subsections, you might naïvely split at every 3rd or 4th letter (26 ÷ 8 = 3.25), coming up with subsections such as "A-C", "S-U", and so on. However, as you can tell from the `$bucketAuto` output above, each subsection would not then cover an even number of documents. Many more movies begin with the letter "T" than other letters, due to titles like "The …". Using uneven subsections of documents for parallel aggregations results in some sub-processes taking far longer than others, prolonging the overall response time of the entire aggregation workload. Turning the split points into actual range boundaries is mechanical, as the short sketch below shows.
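">
Each of the eight `splitPoints` values is the lowest title in its bucket, so (as a sketch of the idea, not the utility's exact code) they convert into eight `[lower, upper)` ranges, with the first range left open at the bottom and the last open at the top so that no title falls through the gaps:

```python
# Convert $bucketAuto split points (the minimum title of each bucket) into
# [lower, upper) range pairs; None marks an open-ended boundary.
split_points = [
    "!Women Art Revolution", "Boycott", "Exotica", "Ishqiya",
    "Mr Perfect", "Salesman", "The Counterfeiters",
    "The Strange History of Don't Ask, Don't Tell",
]

lowers = [None] + split_points[1:]   # open lower bound for the first range
uppers = split_points[1:] + [None]   # open upper bound for the last range
bounds_list = list(zip(lowers, uppers))
# [(None, 'Boycott'), ('Boycott', 'Exotica'), ..., ("The Strange History...", None)]
```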
The other main 'trick' the mongo-parallel-agg app pulls is to use these identified subsection boundaries to generate multiple aggregation pipelines, one per sub-process, each targeting just a subset of the collection. The excerpt below shows an example of a pipeline dynamically generated by mongo-parallel-agg to target one subsection of data when helping to calculate the average rating:

```
pipeline = [
  {"$match": {
    "title": {"$gte": "Boycott", "$lt": "Exotica"}}
  },

  {"$group": {
    "_id": "",
    "total": {"$sum": "$metacritic"},
400; text-decoration: none; vertical-align: baseline; white-space: pre;"> </span><span style="background-color: transparent; color: #38761d; font-family: 'Roboto Mono',monospace; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">"count"</span><span style="background-color: transparent; color: black; font-family: 'Roboto Mono',monospace; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">: {</span><span style="background-color: transparent; color: #38761d; font-family: 'Roboto Mono',monospace; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">"$sum"</span><span style="background-color: transparent; color: black; font-family: 'Roboto Mono',monospace; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">: {</span><span style="background-color: transparent; color: #38761d; font-family: 'Roboto Mono',monospace; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">"$cond"</span><span style="background-color: transparent; color: black; font-family: 'Roboto Mono',monospace; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">: {</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: 'Roboto Mono',monospace; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;"> </span><span style="background-color: transparent; color: #38761d; font-family: 'Roboto Mono',monospace; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">"if"</span><span style="background-color: transparent; color: black; font-family: 'Roboto Mono',monospace; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">: {</span><span style="background-color: transparent; color: #38761d; font-family: 'Roboto Mono',monospace; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">"$eq"</span><span style="background-color: transparent; color: black; font-family: 'Roboto Mono',monospace; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">: [</span><span style="background-color: transparent; color: #38761d; font-family: 'Roboto Mono',monospace; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">"$metacritic"</span><span style="background-color: transparent; color: black; font-family: 'Roboto Mono',monospace; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">, </span><span style="background-color: transparent; color: #bf9000; font-family: 
'Roboto Mono',monospace; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">null</span><span style="background-color: transparent; color: black; font-family: 'Roboto Mono',monospace; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">]},</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: 'Roboto Mono',monospace; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;"> </span><span style="background-color: transparent; color: #38761d; font-family: 'Roboto Mono',monospace; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">"then"</span><span style="background-color: transparent; color: black; font-family: 'Roboto Mono',monospace; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">: </span><span style="background-color: transparent; color: #bf9000; font-family: 'Roboto Mono',monospace; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">0</span><span style="background-color: transparent; color: black; font-family: 'Roboto Mono',monospace; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">,</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: 'Roboto Mono',monospace; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;"> </span><span style="background-color: transparent; color: #38761d; font-family: 'Roboto Mono',monospace; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">"else"</span><span style="background-color: transparent; color: black; font-family: 'Roboto Mono',monospace; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">: </span><span style="background-color: transparent; color: #bf9000; font-family: 'Roboto Mono',monospace; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">1</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: 'Roboto Mono',monospace; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;"> }}},</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: 'Roboto Mono',monospace; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; 
vertical-align: baseline; white-space: pre;"> }},</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: 'Roboto Mono',monospace; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">];</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: 'Roboto Mono',monospace; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;"> </span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt;"><span style="background-color: transparent; color: black; font-family: 'Roboto Mono',monospace; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;">db.movies.aggregate(pipeline);</span></p><p dir="ltr" style="color: #0e101a; font-family: Arial; font-size: 11pt; line-height: 1.38; margin-bottom: 0pt; margin-left: 36pt; margin-top: 0pt; white-space: pre-wrap;"><span id="docs-internal-guid-8b54df4b-7fff-2782-055c-4924b0a4e904"><br /><span style="color: black; font-family: "Roboto Mono", monospace; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;">[{ </span><span style="color: #38761d; font-family: "Roboto Mono", monospace; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;">_id</span><span style="color: black; font-family: "Roboto Mono", monospace; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;">: </span><span style="color: #38761d; font-family: "Roboto Mono", monospace; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;">''</span><span style="color: black; font-family: "Roboto Mono", monospace; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;">, </span><span style="color: #38761d; font-family: "Roboto Mono", monospace; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;">total</span><span style="color: black; font-family: "Roboto Mono", monospace; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;">: </span><span style="color: #bf9000; font-family: "Roboto Mono", monospace; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;">732084631</span><span style="color: black; font-family: "Roboto Mono", monospace; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;">, </span><span style="color: #38761d; font-family: "Roboto Mono", monospace; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;">count</span><span style="color: black; font-family: "Roboto Mono", monospace; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;">: </span><span style="color: #bf9000; font-family: "Roboto Mono", monospace; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: 
normal; vertical-align: baseline;">12497379</span><span style="color: black; font-family: "Roboto Mono", monospace; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;">}]</span></span></p><br /><p dir="ltr" style="color: #0e101a; font-family: Arial; font-size: 11pt; line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; white-space: pre-wrap;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;">As you can see, the pipeline uses a </span><span style="font-family: "Roboto Mono", monospace; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;">$match</span><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;"> stage to target a subsection of the data for analysis. The other main change is that the pipeline no longer uses the </span><span style="font-family: "Roboto Mono", monospace; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;">$avg</span><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;"> operator to calculate an average. Instead, the pipeline includes two computed fields, one for the </span><span style="font-family: "Roboto Mono", monospace; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;">total</span><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;"> and one for the </span><span style="font-family: "Roboto Mono", monospace; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;">count</span><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;">, which each use the </span><span style="font-family: "Roboto Mono", monospace; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;">$sum</span><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;"> operator. This change is required because, mathematically, calculating an average of averages will yield an invalid result. In this solution, the </span><span style="font-family: "Roboto Mono", monospace; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;">mongo-parallel-agg</span><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;"> Python code performs the "last-mile" average computation by summing the totals produced by each sub-process. It then divides this grand total by the sum of all the counts calculated by each sub-process to determine the average. 
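<br /><p dir="ltr">To make that "last-mile" step concrete, here is a minimal Python sketch of the combination arithmetic. It is an illustration rather than the actual <code>mongo-parallel-agg</code> code, and the second partial result document is invented for the example:</p><pre>
# Minimal sketch of the "last-mile" combination: each sub-process's pipeline
# returns one partial {total, count} document, and the client derives the
# overall average from the summed totals and summed counts.
partial_results = [
    {"_id": "", "total": 732084631, "count": 12497379},  # from the pipeline above
    {"_id": "", "total": 695012345, "count": 11876543},  # hypothetical second sub-process
]

grand_total = sum(result["total"] for result in partial_results)
grand_count = sum(result["count"] for result in partial_results)
average = grand_total / grand_count  # real code should guard against a zero count
print(f"Average metacritic rating: {average:.4f}")
</pre>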
<br /><p dir="ltr">To optimise each sub-process pipeline, the <code>mongo-parallel-agg</code> app first ensures a compound index exists on <code>title</code> and <code>metacritic</code>. This enables the <code>$match</code> part of each pipeline to target the index. It also enables the aggregation <a href="https://docs.mongodb.com/manual/core/query-optimization/#covered-query">to be covered for increased efficiency</a> because the only other field analysed (<code>metacritic</code>) belongs to the same index.</p>
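<br /><p dir="ltr">As a sketch of that preparation step, assuming a PyMongo client (the connection URI and database name here are placeholders, not the app's real configuration):</p><pre>
# Ensure the compound index on title & metacritic exists, so each
# sub-process's $match can target it and the aggregation can be covered.
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
collection = client["sample_mflix"]["movies"]      # assumed db/collection names

# create_index() is effectively idempotent: a no-op if the index already exists.
collection.create_index([("title", ASCENDING), ("metacritic", ASCENDING)])
</pre>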
<br /><p dir="ltr">Purely for convenience, the <code>mongo-parallel-agg</code> demo app performs these two actions (the <code>$bucketAuto</code> analysis and ensuring the index exists) every time it executes, before running the main aggregation workload against the collection. In a real production system, both actions would be performed once, or infrequently, rather than every time the aggregation workload runs. For this reason, the app doesn't start its aggregation execution timer until after it completes these two actions.</p>
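<br /><p dir="ltr">For illustration, the <code>$bucketAuto</code> boundary analysis might look something like the following sketch (an assumed shape rather than the app's exact code; <code>num_sub_processes</code> is a hypothetical setting):</p><pre>
# Sketch: derive roughly even title-range boundaries, one per sub-process,
# using $bucketAuto. Each returned bucket's _id holds {"min": ..., "max": ...},
# which becomes the {$gte, $lt} range for one sub-process's $match stage.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
collection = client["sample_mflix"]["movies"]      # assumed db/collection names
num_sub_processes = 8                              # hypothetical setting

buckets = collection.aggregate([
    {"$bucketAuto": {"groupBy": "$title", "buckets": num_sub_processes}},
])

for bucket in buckets:
    print(bucket["_id"]["min"], "->", bucket["_id"]["max"], bucket["count"])
</pre>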
<h1 dir="ltr">Results</h1><p dir="ltr">The following table shows the execution times, in seconds, for an aggregation computing the average movie rating when run against different host environment topologies with varying levels of parallelism:</p><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEg74rlYSQHgYrMXNVekSlhu6qXaZ-WPn-f5uNBDP9gizuuvk2rfuE4dFBN6kbngrnrYlRw_Wf4-URDTo1Y21NQj_VfrhXDNGUcT5J5H9C0Sd16__v6ewB7uV093yDUg0MMH5Q6iIZPnyd5KibJJF1jejhTil_XETnEzS4prOdLne0iUNKGqbMHnF47h=s1373" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="370" data-original-width="1373" height="172" src="https://blogger.googleusercontent.com/img/a/AVvXsEg74rlYSQHgYrMXNVekSlhu6qXaZ-WPn-f5uNBDP9gizuuvk2rfuE4dFBN6kbngrnrYlRw_Wf4-URDTo1Y21NQj_VfrhXDNGUcT5J5H9C0Sd16__v6ewB7uV093yDUg0MMH5Q6iIZPnyd5KibJJF1jejhTil_XETnEzS4prOdLne0iUNKGqbMHnF47h=w640-h172" width="640" /></a></div><div style="text-align: center;"><i>Per replica host specification: Intel Xeon processor with a maximum speed of 3.1 GHz, 512 TB storage with 3000 non-provisioned IOPS</i></div><br /><h1 dir="ltr">Observations</h1>
<ul>
<li><b>Scalability</b>. The results show the solution achieves scalability in multiple individual dimensions:
<ul>
<li><b>Vertical Scaling</b>. The execution time is reduced by adding more vCPUs and making no other changes. Note that the effect of adding more RAM is not evident in the displayed results, although in some cases adding RAM does have an impact. See the later bullet point titled "<i>Increasing RAM For Analytics Can Help</i>" for more detail.</li>
<li><b>Horizontal Scaling</b>. The execution time is reduced by adding more shards and making no other changes.</li>
<li><b>Parallel-Processing Scaling</b>. The execution time is reduced by splitting the aggregation into parallel sub-processes, each acting on a subset of records, and making no other changes.</li>
</ul></li>
<li><b>Combined Scaling</b>. Overall, by combining the benefits of all three scaling dimensions, the solution achieves a two-orders-of-magnitude reduction in execution time, from 908 seconds (over 15 minutes) down to 9 seconds.</li>
<li><b>Non-Linear Scaling</b>. The scaling exhibited isn't linear, but I wouldn't expect it to be, because <a href="https://en.wikipedia.org/wiki/MapReduce">map-reduce style workloads</a>, such as calculating averages, never scale 100% linearly. Such workloads must serialise part of the computation to accumulate the partial results in one place and combine them.</li>
<li><b>Unexpected Degree Of Speed-Up From 1 To 2 Sub-Processes</b>. In each situation where there is a transition from a single aggregation process to two parallel sub-processes, there is at least a 6x speed-up. This is far more significant than the typical "best-case" linear (2x) speed-up that could realistically have been hoped for. It puzzled me for quite a while, but I eventually realised why it occurs. See the later section titled "<i>The Slow Single-Threaded Result Conundrum</i>" for the reason why.</li>
<li><b>Optimal Sub-Process To CPU Mapping</b>. The optimal number of sub-processes to execute for this particular workload appears to be roughly 1 to 2 times the total vCPUs available. For example, in the first result row in the table (<i>M40, 1 shard</i>), with a total of 4 vCPUs, 8 sub-processes yields the quickest result. Another example is visible in the penultimate table row (<i>M60, 2 shards</i>): here, with a total of 32 vCPUs, the solution yields the quickest result when running either 32 or 64 sub-processes. Overall, we can infer that beyond some point between 1 and 2 sub-processes per vCPU, the overhead of coordinating so many processes outweighs the benefits of the extra parallelism.</li>
<li><b>Increasing RAM For Analytics Can Help</b>. The table's results for the M60 two-shard and four-shard configurations, for the single sub-process tests, do not capture the full picture. In both cases, the response time decreases significantly (not shown in the table) when running the same aggregation pipeline against the same configuration a second time. For the <i>M60 two-shard single-process</i> test (with a combined RAM total of 128GB), the response time was reduced from 557 seconds to 438 seconds. For the <i>M60 four-shard single-process</i> test (with a combined RAM total of 256GB), the response time was reduced from 340 seconds to 102 seconds. This latency drop occurs because, for these configurations, the RAM available across the shards approaches the size of the full collection. Consequently, on the second run a significant portion of the data is fetched directly from RAM rather than from disk, because it is already present in memory following the first run. The remaining four smaller test configurations exhibited no noticeable difference in execution times between first and second runs.</li>
<li><b>Increasing Storage IOPS Was Not Tested</b>. I used the tests to analyse the effects of scaling parallel-processing factors such as the number of CPU cores, shards and sub-processes. I expect the results to be even better when increasing the storage IOPS allocated to the host machines, because the increased storage bandwidth should enable an aggregation to pull the scanned collection data from storage more quickly. However, this aspect wasn't under consideration here and wasn't tested, so I've left it as an exercise for the reader.</li>
</ul>
<h1 dir="ltr">The Slow Single-Threaded Result Conundrum</h1><p dir="ltr">So why does the latency drop by a factor of at least 6x when going from a single process to two parallel sub-processes to calculate the average movie rating for all movies?</p><br /><p dir="ltr">Having run all the tests, I investigated deeper and realised why this phenomenon occurred. I'd initially optimised the <code>mongo-parallel-agg</code> app so that, when splitting an aggregation into multiple pipelines, each pipeline performs a <code>$match</code> on a subset of the data. I'd realised that each pipeline was no longer performing a "full collection scan" and would benefit from an index on the <code>title</code> field used by the <code>$match</code> filter. Additionally, I'd realised that the only other field used by these "split aggregations" was the <code>metacritic</code> field, from which the average is calculated. Therefore, rather than using a simple index on <code>title</code>, I'd employed a compound index on <code>title</code> and <code>metacritic</code> to cover the query part of the aggregation. Consequently, each aggregation sub-process was able to fully leverage an index and didn't need to scan each full document (so full documents were not pulled from disk). Also, the index is far smaller than the full collection, and therefore the database can rapidly serve the index's data from RAM.</p>
<br /><p dir="ltr">During the tests, I hadn't considered whether I could apply some of these same benefits when running the entire aggregation pipeline as a single process. It was only later that I realised I could also optimise the original "full-collection" aggregation with a small "hack", inspired by how I'd originally optimised the divided pipelines for sub-processing. Upon testing this hack for just the <i>M60 two-shard</i> configuration (single process), I obtained the following results:</p>
<br /><p dir="ltr" style="text-align: center;"><b>557 seconds</b> (mostly on disk) → <b>448 seconds</b> (significantly in RAM) → <b>92 seconds</b> (with the pipeline hack)</p><br /><p dir="ltr">My hack involved refactoring the aggregation pipeline that the application generates for the single-threaded processing option, as shown below:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgzNePS9DJ4sWTIEL7wqjopspFyCJAGb-eSwkuxIHIZTKAZo3jeE7FZyy4yFq280tNJYgCuZQugRuGWg5iuMwdI838zrlxZQ-1KHsFEUUAvO_Vf2Yv4-7N-2KGV17yRsV6EbiTo5a7-HXH1ppMbzObyRr0sOvIQpf4Kr1xwzabg3IL6ATJGEvA00aJb=s1655" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="398" data-original-width="1655" height="154" src="https://blogger.googleusercontent.com/img/a/AVvXsEgzNePS9DJ4sWTIEL7wqjopspFyCJAGb-eSwkuxIHIZTKAZo3jeE7FZyy4yFq280tNJYgCuZQugRuGWg5iuMwdI838zrlxZQ-1KHsFEUUAvO_Vf2Yv4-7N-2KGV17yRsV6EbiTo5a7-HXH1ppMbzObyRr0sOvIQpf4Kr1xwzabg3IL6ATJGEvA00aJb=w640-h154" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div>
14.6667px;">Using the “filter documents greater than MinKey” hack, I saw that the aggregation performs an index scan rather than a full-collection scan. The runtime invariably pulls this [far smaller] index from RAM rather than disk. The aggregation is covered, locating all the required </span><span style="font-family: "Roboto Mono", monospace; font-size: 14.6667px;">title</span><span style="font-size: 14.6667px;"> and </span><span style="font-family: "Roboto Mono", monospace; font-size: 14.6667px;">metacritic</span><span style="font-size: 14.6667px;"> values from the index with no need to examine each underlying document in the collection.</span></p><p dir="ltr" style="color: #0e101a; font-family: Arial; font-size: 11pt; line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; white-space: pre-wrap;"><br /></p><p dir="ltr" style="color: #0e101a; font-family: Arial; font-size: 11pt; line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; white-space: pre-wrap;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;">Interestingly, before including my hack, I first tried defining just a simple index on </span><span style="font-family: "Roboto Mono", monospace; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;">metacritic</span><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;"> and using a </span><a href="https://docs.mongodb.com/manual/reference/operator/meta/hint/" style="text-decoration-line: none;"><span style="color: #4a6ee0; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline;">hint</span></a><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;"> to force the “full aggregation” to use this index for a covered query. However, this didn’t yield any performance benefits and it was even slower.</span></p><p dir="ltr" style="color: #0e101a; font-family: Arial; font-size: 11pt; line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; white-space: pre-wrap;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;"><br /></span></p><p dir="ltr" style="color: #0e101a; font-family: Arial; font-size: 11pt; line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; white-space: pre-wrap;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;">Unfortunately, I’d run out of time to re-test all the deployment scenarios with this improved single-process pipeline. Therefore, the results table reflects the original results before this final improvement. 
<br /><p dir="ltr">Interestingly, before settling on this hack, I first tried defining just a simple index on <code>metacritic</code> and using a <a href="https://docs.mongodb.com/manual/reference/operator/meta/hint/">hint</a> to force the "full aggregation" to use that index as a covered query. However, this didn't yield any performance benefit and was even slower.</p><br /><p dir="ltr">Unfortunately, I ran out of time to re-test all the deployment scenarios with this improved single-process pipeline, so the results table reflects the original results before this final improvement. I have since folded the code refactoring into the <a href="https://github.com/pkdone/mongo-parallel-agg/">Github codebase</a> for the <code>mongo-parallel-agg</code> project so that others can leverage this optimisation in the future.</p><br /><p dir="ltr">Also, upon reflection, instead of using the "match documents greater than MinKey" filter in the new <code>$match</code> stage, I suspect I could probably use something like the <a href="https://docs.mongodb.com/manual/reference/operator/query/exists/">$exists</a> operator instead. My guess is this would also ensure the aggregation pipeline leverages the compound index, as a covered query, rather than performing a full collection scan. Additionally, it may be the case that a <a href="https://docs.mongodb.com/manual/core/index-sparse/">sparse index</a> or <a href="https://docs.mongodb.com/manual/core/index-partial/">partial index</a> provides further improvement for all the aggregations executed, whether parallelised or not (to be determined). I will leave these two elements for the reader to test.</p>
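<br /><p dir="ltr">For completeness, the <code>$exists</code> variant I have in mind would look something like this (untested speculation, as per the paragraph above):</p><pre>
# Speculative alternative first stage: filter on the leading indexed field
# existing, instead of the $gt-MinKey range, to nudge the planner onto the
# {title, metacritic} compound index. Untested; see the caveat above.
pipeline_first_stage = {"$match": {"title": {"$exists": True}}}
</pre>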
<br /><p dir="ltr"><i><b>EDIT: 7-Dec-2021:</b> Prompted by a colleague, Chris Harris, I tried employing the hint mechanism again since publishing this blog post. This time, using a hint did work, and the single-threaded aggregation leveraged an index as a covered query. On the previous occasion I tried this, I suspect I'd introduced a typo when referencing the index from the hint. In conclusion, there appear to be a few different options for inducing an index scan and covered query. For background on why you have to explicitly force an index scan to be used, rather than a collection scan, when there is no <code>find()</code>/<code>$match</code> filter defined, review <a href="https://jira.mongodb.org/browse/SERVER-20066">server ticket SERVER-20066</a>.</i></p>
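<br /><p dir="ltr">As a sketch, passing a hint to the full, unfiltered aggregation with PyMongo might look like this (the index name is an assumption, matching the default name MongoDB generates for the compound index created earlier):</p><pre>
# Force a covered index scan for the full, unfiltered aggregation via a hint
# (index name assumed: MongoDB generates "title_1_metacritic_1" by default
# for the {title, metacritic} compound index created earlier).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
collection = client["sample_mflix"]["movies"]      # assumed db/collection names

group_only_pipeline = [
    {"$group": {
        "_id": "",
        "total": {"$sum": "$metacritic"},
        "count": {"$sum": {"$cond": {
            "if": {"$eq": ["$metacritic", None]},
            "then": 0,
            "else": 1,
        }}},
    }},
]
print(list(collection.aggregate(group_only_pipeline, hint="title_1_metacritic_1")))
</pre>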
<h1 dir="ltr">Summary</h1><p dir="ltr">I've shown here that, for a specific aggregation, I can reduce the time taken to calculate the average rating across 100 million movies on "mid-range" hardware by two orders of magnitude. I achieved this by scaling vertically (adding more CPUs), scaling horizontally (adding more shards), and increasing the number of processes run in parallel. Some pipeline tweaking and index tuning also helped.</p><br /><p dir="ltr">These findings don't guarantee that every type of aggregation workload will scale to the same degree, or even at all, when applying similar configuration changes. The impact will depend on the nature of the aggregation pipeline.</p><br /><p dir="ltr">I'd be remiss not to recommend that, before trying the scaling optimisations outlined here, you first ensure you have optimised your aggregation pipeline more generally. Scaling comes with an increased infrastructure cost which is often avoidable. Also, running more sub-processes for each aggregation could drain computation power you'd previously "allocated" to other types of workloads running against the database. My book, <a href="https://www.practical-mongodb-aggregations.com/">Practical MongoDB Aggregations</a>, and specifically the following sections, outline some of the techniques you can use to optimise your aggregations without applying scaling changes:</p>
My book, </span><a href="https://www.practical-mongodb-aggregations.com/" style="text-decoration-line: none;"><span style="color: #4a6ee0; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline;">Practical MongoDB Aggregations</span></a><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;">, and specifically the following sections, outlines some of the techniques you can use to optimise your aggregations without applying scaling changes:</span></p><p dir="ltr" style="color: #0e101a; font-family: Arial; font-size: 11pt; line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; white-space: pre-wrap;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;"><br /></span></p><ul style="color: #0e101a; font-family: Arial; font-size: 11pt; margin-bottom: 0px; margin-top: 0px; padding-inline-start: 48px; white-space: pre-wrap;"><li aria-level="1" dir="ltr" style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><a href="https://www.practical-mongodb-aggregations.com/guides/performance.html" style="text-decoration-line: none;"><span style="color: #4a6ee0; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">Pipeline Performance Considerations</span></a></p></li><li aria-level="1" dir="ltr" style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><a href="https://www.practical-mongodb-aggregations.com/guides/sharding.html#performance-tips-for-sharded-aggregations" style="text-decoration-line: none;"><span style="color: #4a6ee0; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; text-decoration-line: underline; text-decoration-skip-ink: none; vertical-align: baseline; white-space: pre-wrap;">Performance Tips For Sharded Aggregations</span></a></p></li></ul><p dir="ltr" style="color: #0e101a; font-family: Arial; font-size: 11pt; line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; white-space: pre-wrap;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;"></span></p><div style="color: #0e101a; font-family: Arial; font-size: 11pt;"><span style="white-space: pre;"><br /></span></div><p dir="ltr" style="color: #0e101a; font-family: Arial; font-size: 11pt; line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; white-space: pre-wrap;"><span style="font-size: 14.6667px;">Lastly, there is the obvious question of whether I believe it is possible for the movie average rating calculation to take <b>less than one second to complete for 100 million movies</b>. The answer is absolutely yes, with sufficiently increased CPUs, RAM, storage IOPS, shards and parallel sub-processes. 
However, I suspect it may be a while before I find out for sure, because the required uplift in hardware is likely to be cost-prohibitive, for me at least.</span></p><div style="color: #0e101a; font-family: Arial; font-size: 11pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;"><br /></span></div><div style="color: #0e101a; font-family: Arial; font-size: 11pt;"><br /></div></span></span></div><div><span style="color: #0e101a; font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;"><br /></span></span></div><div><span style="color: #0e101a; font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;"><i style="background-color: white; color: #444444; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13px; white-space: normal;">Song for today: Carol by </i></span></span><span face="Arial, Tahoma, Helvetica, FreeSans, sans-serif" style="color: #444444;"><span style="font-size: 13px;"><i><a href="https://en.wikipedia.org/wiki/The_Peep_Tempel">The Peep Tempel</a></i></span></span></div></span>Paul Donehttp://www.blogger.com/profile/09556312012162376804noreply@blogger.com1tag:blogger.com,1999:blog-1304066656993695443.post-76834284397061884192021-05-15T11:04:00.002+01:002021-12-06T12:36:18.519+00:00New MongoDB Aggregations book is out<p>My book, <b>Practical MongoDB Aggregations</b>, was published this week.</p><p>The book is available electronically for free for anyone to use at: <a href="https://www.practical-mongodb-aggregations.com">https://www.practical-mongodb-aggregations.com</a></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgAcovevAgXr5U8599BeNNyW5CLOu1iBCjjN0LkpzZoGm-m58P0ZRkE0w91x6xunzHkf8tgTJo_6hlvfZAhEcY2ycszMVL6_FPG3HobhfhV3ezVDUkOsrnQYzBX3Hg426rbwaPRi0cR70/s1500/cover.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1500" data-original-width="1000" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgAcovevAgXr5U8599BeNNyW5CLOu1iBCjjN0LkpzZoGm-m58P0ZRkE0w91x6xunzHkf8tgTJo_6hlvfZAhEcY2ycszMVL6_FPG3HobhfhV3ezVDUkOsrnQYzBX3Hg426rbwaPRi0cR70/w266-h400/cover.png" width="266" /></a></div><br /><p>This book is intended for developers, architects, data analysts, data engineers, and data scientists. 
It aims to improve your productivity and effectiveness when building aggregation pipelines and help you understand how to optimise them.</p><p>The book is split into two key parts:</p><p></p><ol style="text-align: left;"><li>A set of tips and principles to help you get the most out of aggregations.</li><li>A bunch of example aggregation pipelines for solving common data manipulation challenges, which you can easily copy and try for yourself.</li></ol><p></p><p>I hope readers get some good value from it!</p><p><br /></p><p><i>Song for today: Earthmover by <a href="https://en.wikipedia.org/wiki/Have_a_Nice_Life">Have a Nice Life</a></i></p><p><br /></p>Paul Donehttp://www.blogger.com/profile/09556312012162376804noreply@blogger.com0tag:blogger.com,1999:blog-1304066656993695443.post-71393737906931748552021-02-27T10:15:00.007+00:002021-02-27T20:50:34.383+00:00MongoDB Reversible Data Masking Example Pattern<h3 style="text-align: left;">Introduction</h3><p>In a <a href="https://pauldone.blogspot.com/2021/02/mongdb-data-masking.html">previous blog post</a> I explored how to apply one-way non-reversible data masking on a data-set in MongoDB. Here I will explore why, in some cases, there can be a need for reversible data masking, where, with appropriate privileges, each original record's data can be deduced from the masked version of the record. I will also explore how this can be implemented in MongoDB, using something I call the <b>idempotent masked-id generator </b>pattern.</p><p>To accompany this blog post, I have provided a GitHub project which shows the Mongo Shell commands used for implementing the reversible data masking pattern outlined here. Please keep referring to the project’s README as you read through this blog post, at:</p><p></p><ul><li><a href="https://github.com/pkdone/mongo-data-mask-reversible">https://github.com/pkdone/mongo-data-mask-reversible</a></li></ul><h3 style="text-align: left;"><br /></h3><h3 style="text-align: left;">Why Reversible Data Masks?</h3><p>In some situations, a department in an organisation that masters and owns a set of data may need to provide copies of the whole or a subset of the data to a different department or even a different partner organisation. If the data-set contains sensitive data, like personally identifiable information (PII) for example, the 'data-owning' organisation will first need to perform data masking on the values of the sensitive fields of each record in the data-set. This redaction of fields will often be one-way (irreversible), preventing the other department or partner organisation from being able to reverse engineer the content of the masked data-set, to retrieve the original sensitive fields. </p><p>Now consider the example where the main data-owning organisation is collecting results of 'tests'. The results could be related to medical tests, where the data-owning organisation is a hospital, for example. Or the results could be related to academic tests, where the data-owning organisation is a school. Let's assume that the main organisation needs to provide data to a partner organisation for specialist analysis, to identify individuals with concerning test result patterns. However, there needs to be assurance that each individual's sensitive details and real identity are not shared with the partner organisation.</p><p>How can the partner organisation report back to the main organisation flagging individuals for concern and recommended follow-up without actually having access to those real identities? 
</p><p>One solution is for the redacted data-set provided to the partner organisation to carry an obfuscated but reversible unique identity field as a substitute for the real identity, in addition to containing other irreversibly redacted sensitive fields. With this solution, it would not be possible for the partner organisation to reverse engineer the substituted unique identity back to a real social security number, national insurance number or national student identifier, for example. However, it would be possible for the main data-owning organisation to convert the substituted unique id back to the real identity, if subsequently required. </p><p>The diagram below outlines the relationship between the data-owning organisation, the partner organisations and the masked data-sets shared between them. </p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQuEGQC9rSY0AoYEK3dlW6GwVbHy3vDqnKbBN9yY-Yj4WTBtsGBUmokTGKBsBEkEl_mCboh028ogAhgPX-L6QOM5JGHKv7V8f3mc0-Bh4ZX2cOfBk67GkI7mXkUS4m-h0NJiuM9hL1eMc/s701/MongoDB+Reversible+Data+Masking+Examples+-+DIAGRAMS.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="420" data-original-width="701" height="384" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQuEGQC9rSY0AoYEK3dlW6GwVbHy3vDqnKbBN9yY-Yj4WTBtsGBUmokTGKBsBEkEl_mCboh028ogAhgPX-L6QOM5JGHKv7V8f3mc0-Bh4ZX2cOfBk67GkI7mXkUS4m-h0NJiuM9hL1eMc/w640-h384/MongoDB+Reversible+Data+Masking+Examples+-+DIAGRAMS.png" width="640" /></a></div><p>A partner organisation can flag an individual of concern back to the main organisation, without ever being able to deduce the real-life person to whom the substituted unique ID maps.</p><h3 style="text-align: left;"><br /></h3><h3 style="text-align: left;">How To Achieve Reversibility With MongoDB?</h3><p>To enable a substituted unique ID to be correlated back to the original real ID of a person, the main data-owning organisation needs to be able to generate the substitute unique IDs in the first place, and maintain a list of mappings between the two, as shown in the diagram below.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEisRXChTTc4TrNl63M8QJL7DnWngDvFuuLbhMvIiYLz2H8_7LE43G7VOyP5q6KvoimEL-SFa2A4jkuhpa0H2Q_Pn-_xX2LS1hor00W__OTizPUv138qm8pNYxCwZlNKSLaXDs44DUVTviM/s444/do.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="423" data-original-width="444" height="381" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEisRXChTTc4TrNl63M8QJL7DnWngDvFuuLbhMvIiYLz2H8_7LE43G7VOyP5q6KvoimEL-SFa2A4jkuhpa0H2Q_Pn-_xX2LS1hor00W__OTizPUv138qm8pNYxCwZlNKSLaXDs44DUVTviM/w400-h381/do.png" width="400" /></a></div><p>The stored mappings list needs to be protected in such a way that only specific staff in the data-owning organisation, with specific approved roles, have access to it. 
This prevents the rest of the organisation and partner organisations from accessing the mappings to be able to reverse engineer masked identities back to real identities.</p><p>Essentially, the overall process of masking and 'unmasking' data with MongoDB, as shown in the <a href="https://github.com/pkdone/mongo-data-mask-reversible">GitHub project</a> accompanying this blog post, is composed of three different key aggregation pipelines:</p><p></p><ol style="text-align: left;"><li><b>Generation of Unique ID Mappings</b>. A pipeline for the data-owning organisation to generate the new unique anonymised substitute IDs for each person appearing in a test result, into a new mappings collection, using the <i>idempotent masked-id generator</i> pattern.</li><li><b>Creation of the Reversible Masked Data-Set</b>. A pipeline for the data-owning organisation to generate a masked version of the test results, where each person's id has been replaced with the substitute ID (an anonymous but reversible ID); additionally some other fields will be filtered out (e.g. national id, last name) or obfuscated with partly randomised values (e.g. date of birth).</li><li><b>Reverse Engineer Masked Data-Set Back To Real Identities</b>. An optional pipeline, if/when required, for the data-owning organisation to be able to take the potentially modified partial masked data-set back from the partner organisation and, using the mappings collection, reverse engineer the original identities and other sensitive fields. </li></ol><p></p><p></p><p>The screenshot below captures an example of the outcome of steps 1 and 2 of the process outlined above.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLyAzuXYxwiYy8BSCMUp5-Wy84idyB2sAhIVQUG7JIsV-Lps-W4kwBZNq0x5VmTHOPq3l7fODNRGoCUnsrQbfTEhj4jZoCNhixDvKqMsqWquzsth0ltzn2Rb0DynpAi-nwNONyuCUJpx4/s1067/.datapic.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="899" data-original-width="1067" height="540" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLyAzuXYxwiYy8BSCMUp5-Wy84idyB2sAhIVQUG7JIsV-Lps-W4kwBZNq0x5VmTHOPq3l7fODNRGoCUnsrQbfTEhj4jZoCNhixDvKqMsqWquzsth0ltzn2Rb0DynpAi-nwNONyuCUJpx4/w640-h540/.datapic.png" width="640" /></a></div><br /><p>Here, each person's ID has been replaced with a reversible substitute unique ID. Additionally, the date of birth field ('dob') has been obfuscated (shown with the red underline) and some other sensitive fields have been filtered out.</p><p>I will now explore how each of the three outlined process steps is achieved in MongoDB, in the following three sub-sections.</p><p><br /></p><div style="text-align: left;"><b>1. Generation of Unique ID Mappings</b></div><p>As per the companion <a href="https://github.com/pkdone/mongo-data-mask-reversible">GitHub project</a>, the list of original ID to substitute ID mappings is stored in a MongoDB collection with very strict <a href="https://docs.mongodb.com/manual/core/authorization/">RBAC</a> controls applied. 
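For instance, a minimal sketch of such a lock-down might look like the following, where the role name and database name are purely illustrative:<pre>// Illustrative sketch: a custom role granting read access to the mappings
// collection, to be granted only to approved staff members
db.getSiblingDB('admin').createRole({
  role: 'maskedIdMappingsReader',
  privileges: [
    {resource: {db: 'testdata', collection: 'masked_id_mappings'},
     actions: ['find']},
  ],
  roles: [],
});</pre>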
An example record in this collection might look like the one shown in the screenshot below.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBSL0my53k_4m4Q_VlrJCl1Qa4pJEUCEsTzTjNL8cVxAjbJopA7DZOguw2VIjbWoL9oBMt1GnJQdNlJlHsi9k1Y1SOyB8bwPHiYeewNSDsShengUktQt3r-bUYxTLcFJXeVBkF4CBimSk/s568/mapping.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="211" data-original-width="568" height="238" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBSL0my53k_4m4Q_VlrJCl1Qa4pJEUCEsTzTjNL8cVxAjbJopA7DZOguw2VIjbWoL9oBMt1GnJQdNlJlHsi9k1Y1SOyB8bwPHiYeewNSDsShengUktQt3r-bUYxTLcFJXeVBkF4CBimSk/w640-h238/mapping.png" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><p>Here the collection is called <b>masked_id_mappings</b>, where each record's field '<i>_id</i>' contains a newly generated substitute ID, based on a generated <a href="https://en.wikipedia.org/wiki/Universally_unique_identifier">universally unique identifier</a> (UUID). The field '<i>original_id</i>' contains the real identifier of the person or entity, in the same format as in the original data-set. For convenience, two date-related attributes are included in each record. The '<i>date_generated</i>' field is generally useful for tracking when the mapping was created (e.g. for reporting), and the '<i>date_expired</i>' field is associated with a <a href="https://docs.mongodb.com/manual/core/index-ttl/">time-to-live</a> (TTL) index to enable the mapping to be automatically deleted by the database, after a period of time (3 years out, in this example).</p><p>The remaining field, '<i>data_purpose_id</i>', is worthy of a little more detailed discussion. Let's say the same data-set needs to be provided to multiple 3rd parties, for different purposes. It makes sense to mask each copy of the data differently, with different unique IDs for the same original IDs. This can help prevent the risk of any potential future correlation of records between unrelated parties or consumers. Essentially, when a mapping record is created, in addition to providing the original ID, a data purpose 'label' must be provided. A unique substitute ID is generated for a given source identity, per data use/purpose. For one specific data consumer purpose, the same substituted unique ID will be re-used for the same recurring original ID. However, a different substituted unique ID will be generated and used for an original ID when the purpose and consumer requesting a masked data-set are different.</p><p>To populate the <b>masked_id_mappings</b> collection, an aggregation pipeline (called '<i>maskedIdGeneratorPipeline</i>') is run against each source collection (e.g. against both the '<i>persons</i>' collection and the '<i>tests</i>' collection). This aggregation pipeline implements the <b>idempotent masked-id generator</b> pattern. Essentially, this pattern involves taking each source collection, picking out the original unique id and then placing a record of this, with a newly generated UUID it is mapped to (plus the other metadata such as <i>data_purpose_id</i>, <i>date_generated</i>, <i>date_expired</i>), into the <i>masked_id_mappings</i> collection. The approach is idempotent in that the creation of each new mapping record is only fulfilled if a mapping doesn't already exist for the combination of the original unique id and data purpose. 
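Purely to illustrate this idempotency idea (the project's actual aggregation pipeline is in the GitHub repository), a simplified shell-based equivalent might look like the following, where the source field '<i>person_id</i>' and the purpose label '<i>RESEARCH</i>' are invented examples:<pre>// Simplified illustrative equivalent of the pattern (not the project's
// actual pipeline): upsert one mapping per original ID + purpose
// combination, only generating a new UUID where no mapping exists yet
db.persons.aggregate([{$group: {_id: '$person_id'}}]).forEach(function(doc) {
  db.masked_id_mappings.updateOne(
    {original_id: doc._id, data_purpose_id: 'RESEARCH'},
    {$setOnInsert: {
      _id: UUID(),  // new substitute ID, generated on first sight only
      date_generated: new Date(),
      date_expired: new Date(Date.now() + 3*365*24*60*60*1000),  // ~3 years
    }},
    {upsert: true}
  );
});</pre>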
When further collections in the data-set are run through the same aggregation pipeline, if some records from these other collections have the same original id and data purpose as one of the records that already exists in the <i>masked_id_mappings</i> collection, a new record with a new UUID will not be inserted. This ensures that, per data purpose, the same original unique id is always mapped to the same substitute UUID, regardless of how often it appears in various collections. This <i>idempotent masked-id generator</i> process is illustrated in the diagram below. </p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiToXOgZcfXEd4CVXMFzqRE8yEbxf3WCdJ51tV9aI-7s7TwNJWqGhBa-5HeRG4HCnJEmm-RfrHEbhpood7ZbdQ9oodLG-prrb6ACcRhXBlfrURtLPsuVmhbvVnksxMAfcYwseB7szrSHUY/s951/rev.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="190" data-original-width="951" height="128" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiToXOgZcfXEd4CVXMFzqRE8yEbxf3WCdJ51tV9aI-7s7TwNJWqGhBa-5HeRG4HCnJEmm-RfrHEbhpood7ZbdQ9oodLG-prrb6ACcRhXBlfrURtLPsuVmhbvVnksxMAfcYwseB7szrSHUY/w640-h128/rev.png" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: center;"><br /></div>The same aggregation pipeline is run multiple times, once against each source collection which belongs to the source data-set. Even if the source data-set is ever added to in the future, the aggregation can be re-run against the same data-sets, over and over again, without any duplicates or other negative consequences, and with only the additions being acted upon. The pipeline is so generic that it can also be run against other previously unseen collections which have completely different shapes but where each contains an original unique ID in one of its fields.<div><br /><div><br /><div><div style="text-align: left;"><b>2. Creation of the Reversible Masked Data-Set</b></div><p>Once the mappings have been generated for a source data-set, it is time to actually generate a new masked set of records from the original data-set. Again, this is achieved by running an aggregation pipeline, once per different source collection in the source data-set. The diagram below illustrates how this process works. The aggregation pipeline takes the original ID fields from the source collection, then performs a lookup on the mappings collection (including the specific data purpose) to grab the previously generated substitute unique ID. 
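A minimal sketch of the lookup-and-replace portion of such a pipeline might look like the following (again, '<i>person_id</i>' and '<i>RESEARCH</i>' are invented examples; the project's full version is in the GitHub repository):<pre>// Illustrative sketch: swap each record's real ID for its substitute ID
var maskIdStages = [
  {$lookup: {
    from: 'masked_id_mappings',
    let: {origId: '$person_id'},
    pipeline: [
      {$match: {$expr: {$and: [
        {$eq: ['$original_id', '$$origId']},
        {$eq: ['$data_purpose_id', 'RESEARCH']},
      ]}}},
    ],
    as: 'id_mapping',
  }},
  {$set: {person_id: {$arrayElemAt: ['$id_mapping._id', 0]}}},
  {$unset: ['id_mapping']},
  // ...followed by the data-set-specific masking stages for other fields
];</pre>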
The pipeline then replaces the original IDs with the substitute IDs in the outputted masked collection.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKEsrmjV7mU1MDeC7n4uE71lJ6trrGczwHXCBlLCPijXjyxsAzgAVK5yIG95aJAqkzGxJjckdFqLfltaTB6aNjBYs2NQjhZWDpMaX5bjFWnbNt-FqWcU4o8zkslueAEI4U3Rv3mDUzNeU/s902/phase2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="323" data-original-width="902" height="230" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKEsrmjV7mU1MDeC7n4uE71lJ6trrGczwHXCBlLCPijXjyxsAzgAVK5yIG95aJAqkzGxJjckdFqLfltaTB6aNjBYs2NQjhZWDpMaX5bjFWnbNt-FqWcU4o8zkslueAEI4U3Rv3mDUzNeU/w640-h230/phase2.png" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><p>The remaining part of the aggregation pipeline is less generic and must contain rules specific to the source data-set it operates on. The latter part of the pipeline contains the data masking actions to apply to particular sensitive fields in that specific data-set.</p><p>The generated masked data-set collections can then be <a href="https://docs.mongodb.com/database-tools/mongoexport/">exported</a> ready to be shipped to the consuming business unit or 3rd party organisation, who can then <a href="https://docs.mongodb.com/database-tools/mongoimport/">import</a> the masked data-set into their own local database infrastructure, ready to perform their own analysis on.</p><p><br /></p><div style="text-align: left;"><b>3. Reverse Engineer Masked Data-Set Back To Real Identities</b></div><p>In the example 'test results' scenario, the partner organisation may need to subsequently report back to the main organisation flagging individuals for concern and recommended follow-up. They can achieve this by providing the substituted identities back to the owning organisation, with the additional information outlining why the specific individuals have been flagged. The <a href="https://github.com/pkdone/mongo-data-mask-reversible">GitHub project</a> accompanying this blog post shows an example of performing this reversal, where some of the '<i>tests</i>' collection records in the masked data-set have been marked with the following new attribute by the 3rd party organisation:</p></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><div><pre style="border-radius: 6px; box-sizing: border-box; color: #24292e; font-size: 13.6px; line-height: 1.45; margin-bottom: 0px; margin-top: 0px; overflow-wrap: normal; overflow: auto; padding: 16px; text-align: left; word-break: normal;"><span style="font-family: Roboto Mono;"><b><span class="pl-s" color="var(--color-prettylights-syntax-string)" style="box-sizing: border-box;">'flag'</span> : <span class="pl-s" color="var(--color-prettylights-syntax-string)" style="box-sizing: border-box;">'INTERVENTION-REQUIRED'</span></b></span></pre></div></blockquote><div><p>The GitHub project then shows how the masked and now flagged data-set, if passed back to the original data-owning organisation, can then have a '<i>reverse</i>' aggregation pipeline executed on it by the original organisation. This '<i>reverse</i>' aggregation pipeline looks up the mappings collection again, but this time to retrieve the original ID using the substitute unique IDs provided in the input (plus the data purpose). 
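A minimal sketch of such a reversal might look like the following (assuming the flagged masked data has been imported into a hypothetical '<i>flagged_tests</i>' collection, with '<i>person_id</i>' holding the substitute UUID):<pre>// Illustrative sketch: resolve flagged substitute IDs back to real IDs
// (only executable by privileged users with access to the mappings)
var reversePipeline = [
  {$match: {flag: 'INTERVENTION-REQUIRED'}},
  {$lookup: {
    from: 'masked_id_mappings',
    localField: 'person_id',   // the substitute UUID in the masked data
    foreignField: '_id',
    as: 'id_mapping',
  }},
  {$set: {real_person_id: {$arrayElemAt: ['$id_mapping.original_id', 0]}}},
  {$unset: ['id_mapping']},
];
db.flagged_tests.aggregate(reversePipeline);</pre>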
This results in a reverse engineered view of the data with real identities flagged, thus enabling the original data-owning organisation to schedule follow-ups with the identified people.</p><h3 style="text-align: left;"><br /></h3><h3 style="text-align: left;">Summary</h3><p>In this blog post I have explored why it is sometimes necessary to apply data masking to a data-set, for the masked data-set to be subsequently distributed to another business unit or organisation, but where the original identities can be safely reverse engineered. This is achieved with the appropriate strong data access controls and privileges in place, for access by specific users in the original data-owning organisation only, if the need arises. As a result, no sensitive data is ever exposed to the less trusted parties. </p><p><br /></p><p><i>Song for today: Lullaby by <a href="https://en.wikipedia.org/wiki/Low_(band)">Low</a></i></p></div></div></div>Paul Donehttp://www.blogger.com/profile/09556312012162376804noreply@blogger.com0tag:blogger.com,1999:blog-1304066656993695443.post-24115291707612112022021-02-10T12:40:00.027+00:002021-02-27T20:55:44.682+00:00MongoDB Data Masking Examples<h3 style="text-align: left;">Introduction</h3><p><a href="https://en.wikipedia.org/wiki/Data_masking">Data Masking</a> is a well-established approach to protecting sensitive data in a database yet allowing the data to still be usable. There are a number of reasons why organisations need to employ data masking, with two of the most common being:</p><p></p><ol style="text-align: left;"><li>To obtain a recent copy of data from the production system for use in a test environment, thus enabling tests to be conducted on realistic data which better represents the business, but with the most sensitive parts redacted. Invariably, there is an increased level of reliability in the results, when tests are applied on real data, instead of using synthetically generated data.</li><li>To provide an API for consumers to access live production data, but where the consumer’s role or security level means they are not permitted to see values of all the fields in each record. Instead, the subset of protected fields is represented by obfuscated values that still carry meaning (e.g. the fact that a field existed, or that a date field’s value is approximate rather than random).</li></ol><p></p><p>In this blog post I will show how MongoDB’s powerful <a href="https://docs.mongodb.com/manual/aggregation/">aggregation pipelines</a> can be used to efficiently mask data belonging to a MongoDB collection, with various examples of obfuscating/redacting fields. The power of MongoDB’s aggregation pipelines doesn’t necessarily come from any inherent ease of use, compared to, say, SQL. Indeed its learning curve is not insignificant (although it is far easier to learn and use than, say, building raw map-reduce code to run against a large data set in a Hadoop cluster, for example). The real power of aggregation pipelines comes from the fact that once a pipeline is defined, a single command can be issued to a MongoDB cluster to then be applied to massive data sets. 
The cluster might contain a database of billions of records or more, and the aggregation pipeline will be automatically optimised and executed, including transparently processing subsets of the data in parallel across each shard, to reduce the completion time.</p><p>To accompany this blog post, I have provided a GitHub project which shows the Mongo Shell commands used for defining the example aggregation pipeline and masking actions that I will outline here. Please keep referring to the project’s README as you read through this blog post, at:</p><p></p><ul style="text-align: left;"><li><a href="https://github.com/pkdone/mongo-data-masking">https://github.com/pkdone/mongo-data-masking</a></li></ul><p></p><p>The approach that I will describe provides examples of applying <b>irreversible</b> and <b>non-deterministic</b> data masking actions on fields. That is to say, as a result of applying the outlined data masking techniques, it would be extremely difficult, and in most cases impossible, for a user to reverse engineer or derive the original values of the masked fields.</p><p><i><span style="font-size: x-small;">[EDIT: 27-Feb-2021: For reversible data masking, see the <a href="https://pauldone.blogspot.com/2021/02/mongdb-reversible-data-masking.html">part 2 blog post on this topic</a>]</span></i></p><p><i><br /></i></p><h3 style="text-align: left;">Sample Payments Data & Data Masking Examples</h3><p>For these examples I will simulate a ‘payments system’ database containing a collection of ‘card payments’ records. Such a database might be used by a bank, a payment provider, a retailer or an eCommerce vendor, for example, to accumulate payments history. The example data structures here are intentionally over-simplified for clarity. In the screenshot below you can see an example of two sample payments records, in their original state, shown on the left-hand side. On the right-hand side of the screenshot you can see the result of applying the data masking transformation actions from the <a href="https://github.com/pkdone/mongo-data-masking">GitHub project</a>, which I will describe in more detail further below.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0XmsGgueH7X4jLTUGOPtMcgbG2vf9tRZfapxqlIUDEwZq56SJ1LIh7CLTXdv-4NLTeLBdjxr1_h7sy6wb_M94xUqdaT2qFtQTWgbAKgwCj55RP-BVfSDVsTbR2FFQkd3C4Iw8o5Z3C2w/s1191/.datapic.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="827" data-original-width="1191" height="445" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0XmsGgueH7X4jLTUGOPtMcgbG2vf9tRZfapxqlIUDEwZq56SJ1LIh7CLTXdv-4NLTeLBdjxr1_h7sy6wb_M94xUqdaT2qFtQTWgbAKgwCj55RP-BVfSDVsTbR2FFQkd3C4Iw8o5Z3C2w/w640-h445/.datapic.png" width="640" /></a></div><br /><p>For the example project, 10 different fields have data masking actions applied. Below is a summary of each masking action, the field it is applied to and the type of change applied. You can view and cross reference these to the 10 corresponding aggregation pipeline operations provided in the <a href="https://github.com/pkdone/mongo-data-masking">companion GitHub project</a>:</p><p></p><ol style="text-align: left;"><li>Replace the <b>card’s security code</b> with a different random set of 3 digits <b><span style="color: red;">(e.g. 133 → 472)</span></b></li><li>For the <b>card number</b>, obfuscate the first 12 digits <span style="color: red;"> <b>(e.g. 
1234567890123456 → XXXXXXXXXXXX3456)</b></span></li><li>For the <b>card’s listed owner name</b>, obfuscate the first part of the name, resulting in only the last name being retained <b><span style="color: red;">(e.g. 'Mrs. Jane A. Doe' → 'Mx. Xxx Doe')</span></b></li><li>For the <b>payment transaction’s recorded date-time</b>, obfuscate the value by adding or subtracting a random amount of time to the current value, by up to one hour maximum <b><span style="color: red;">(e.g. 00:23:45 on 01-Sep-2019 → 23:59:33 on 31-Aug-2019)</span></b></li><li>Replace the <b>payment settlement's recorded date-time</b> by taking an arbitrary fixed time (e.g. 29-Apr-2020) and adding a random amount of time to that, up to one year maximum <b><span style="color: red;">(e.g. 07:48:55 on 15-Dec-2018 → 21:07:34 on 08-Jun-2020)</span></b></li><li>Replace the <b>payment card expiry date</b> with the current date (e.g. 10-Feb-2021) + a random number of days of up to one year maximum <b><span style="color: red;">(e.g. 31-Mar-2020 → 30-Nov-2021)</span></b></li><li>For the <b>transaction’s payment amount</b>, obfuscate the value by adding or subtracting a random percentage of its current value, by up to 10% maximum <b><span style="color: red;">(e.g. 49.99 → 45.51)</span></b></li><li>Replace the <b>payment’s ‘reported’ status</b> boolean value with a new randomly generated true or false value where there is a 50:50 chance of the result being either value <b><span style="color: red;">(e.g. false → true, false → false)</span></b></li><li>Replace the <b>payment transaction’s ID</b> (which is composed of 16 hexadecimal digits) with a new 16 hexadecimal digit value based on an ‘MD5 hash’ of the original ID <b><span style="color: red;">(note, do not regard this as 'cryptographically safe')</span></b></li><li>For the <b>extra customer info sub-document</b> composed of 3 fields, only retain this sub-document and its fields if the value of its ‘category’ field is not equal to the text ‘SENSITIVE’ <b><span style="color: red;">(i.e. redact out a sub-section of the document where the customer is marked as ‘sensitive’)</span></b></li></ol><p></p><h3 style="text-align: left;">Exposing Masked Data To Consumers</h3><p>Once this pipeline has been built, it can be used in one of four ways in MongoDB depending on your specific needs (again the <a href="https://github.com/pkdone/mongo-data-masking">companion GitHub project</a> provides more specific details on how each of these four ways is achieved in MongoDB):</p><p></p><ol style="text-align: left;"><li><b>DATA MASKED AGGREGATION ON DEMAND</b>. Enable the aggregation pipeline to be executed on demand, by calling <span style="font-family: Roboto Mono; font-size: x-small;">db.aggregate(pipeline)</span> from a trusted mid-tier application, where the application would have rights to see the non-obfuscated data too. As a result, the trusted mid-tier would need to be relied on to return just the result of the ‘data masking aggregation’ to consumers, and to never expose the underlying unmasked data, in the database, in any other way.</li><li><b>DATA MASKED READ-ONLY VIEW</b>. Create a view (e.g. 
<span style="font-family: Roboto Mono; font-size: x-small;">payments_redacted_view</span>) based on the aggregation pipeline and then, using <a href="https://docs.mongodb.com/manual/core/authorization/">MongoDB’s Role Based Access Control (RBAC) capabilities</a>, only allow consumers to have access to the view, with no permissions to access the underlying source collection, thus ensuring consumers can only access the ‘data masked’ results. Consuming applications can even use a ‘find’ operation, with a ‘filter’ and ‘projection’, when querying the view, to reduce down further the fields and records they want to see.</li><li><b>DATA MASKED COPY OF ORIGINAL DATA</b>. Execute the aggregation pipeline using an additional <span style="font-family: Roboto Mono; font-size: x-small;">$merge</span> pipeline stage, to produce a brand new collection (e.g. <span style="font-family: Roboto Mono; font-size: x-small;">payments_redacted</span>), which contains only the data masked version of the original collection. The original collection can then either be locked down as not visible to the consumer, using <a href="https://docs.mongodb.com/manual/core/authorization/">RBAC</a>, or can even just be deleted, if it is no longer required. The latter option is often applicable if you are producing a test environment with a masked version of recent production data, where the original raw data is no longer needed.</li><li><b>DATA MASKED OVERWRITTEN ORIGINAL DATA</b>. Execute the aggregation pipeline using an additional <span style="font-family: Roboto Mono; font-size: x-small;">$merge</span> pipeline stage, to overwrite records in the existing collection with modified versions of the corresponding records. The resulting updated collection will now only contain the masked version of the data. Again, this may be applicable in the case where you are producing a test environment with a masked version of recent production data, where the original collection is no longer needed.</li></ol><p></p><h3 style="text-align: left;">Summary</h3><p>In this post and the <a href="https://github.com/pkdone/mongo-data-masking">companion GitHub project</a>, I’ve shown how common irreversible obfuscation patterns can be effectively defined and then applied to mask sensitive data in a MongoDB collection. The example shows only a couple of records being redacted. However, the real power comes from being able to apply the same aggregation pipeline, unchanged, to a collection containing a massive data set, often running across multiple shards, which will automatically be parallelised, to reduce turnaround times for provisioning masked data sets.</p><p><br /></p><p><i>Song for today: The Black Crow by <a href="https://en.wikipedia.org/wiki/Jason_Molina">Songs: Ohia (Jason Molina)</a></i></p>Paul Donehttp://www.blogger.com/profile/09556312012162376804noreply@blogger.com2tag:blogger.com,1999:blog-1304066656993695443.post-34165576260813740712020-11-24T14:49:00.011+00:002020-11-25T09:25:45.971+00:00Is Querying A MongoDB View Optimised?<p><a href="https://docs.mongodb.com/manual/core/views/">Views in MongoDB</a> appear to database users like read-only collections, ready to be queried in the same way normal collections are. 
A View is defined by an <a href="https://docs.mongodb.com/manual/core/aggregation-pipeline/">Aggregation pipeline</a> and when a query is issued on a View, using <a href="https://docs.mongodb.com/manual/reference/method/db.collection.find/"><i>find()</i></a>, there is the potential for the execution of the View to be optimised by MongoDB in the same way as MongoDB would <a href="https://docs.mongodb.com/manual/core/aggregation-pipeline-optimization/">optimise any aggregation pipeline</a> that is executed.</p><p>In reality, most applications will not issue a <i>find()</i> without specifying a query filter as an argument. This raises the question: <b>When issuing a <i>find()</i> with a query filter against a View (backed by an aggregation pipeline), how is the combination optimised, and can indexes be leveraged effectively?</b></p><p>In the rest of this post, I will explore this further and answer this question.</p><p><br /></p><h1 style="text-align: left;"><span style="font-weight: normal;">Source Collection Data</span></h1><p>The data I am using for the investigation is a <i>music</i>-based data-set sourced from the <a href="https://www.discogs.com/">Discogs website</a>, imported from Discogs' XML data dump, using an <a href="https://github.com/pkdone/import-large-xml-into-mdb">XML MongoDB import utility</a>.</p><p>The resulting <i>releases</i> collection, representing the albums and singles released by all artists, has over 1.5 million documents in it. I've defined various obvious indexes for the collection in anticipation of wanting to run finds and aggregations efficiently against it. Below is a screenshot showing some of the data in this collection, illustrating each document's typical shape...</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIJGaq0UdIYXBowbs8W95ljtzZFKDLoxml8HJytxZmHpuxGzH9FZsIVqP9XkVCpa_aMmraTqRoUraweREKqVQ5o1yPD6QKS1m-cHp7TbHNk9g9gQ1vT1pzVVPzwgXRgBZX69y6QvRXxVA/s731/data.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="706" data-original-width="731" height="618" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIJGaq0UdIYXBowbs8W95ljtzZFKDLoxml8HJytxZmHpuxGzH9FZsIVqP9XkVCpa_aMmraTqRoUraweREKqVQ5o1yPD6QKS1m-cHp7TbHNk9g9gQ1vT1pzVVPzwgXRgBZX69y6QvRXxVA/w640-h618/data.png" width="640" /></a></div><div><br /></div>As you can see, the <i>releases</i> collection contains fields for the artist, the title of the release, the year of the release and the music genres & styles associated with the release.<div><br /></div><div>Let's now look at using two different Views, with different degrees of complexity, against this same collection, to see if and how these Views are optimised at runtime, when a <i>find()</i> is issued...</div><div><br /></div><div><br /></div><div><h1 style="text-align: left;"><span style="font-weight: 400;">Using A View Which Filters Out Some Records & Fields</span></h1><p>So let's create a View which only shows music released since the start of the year 2000, concatenates the array of one or more styles into a new 'style' string field and then excludes the 'styles' and '_id' fields from the result.</p><p><span style="font-family: Roboto Mono;"> var pipeline = [</span></p><p><span style="font-family: Roboto Mono;"> {$match: {'year': {'$gte': 2000}}},</span></p><p><span style="font-family: Roboto Mono;"> {$set: {'style': {</span></p><p><span style="font-family: Roboto Mono;"> $reduce: {</span></p><p><span 
style="font-family: Roboto Mono;"> input: '$styles',</span></p><p><span style="font-family: Roboto Mono;"> initialValue: '',</span></p><p><span style="font-family: Roboto Mono;"> in: {$concat: ['$$value', '$$this', '. ']}</span></p><p><span style="font-family: Roboto Mono;"> }</span></p><p><span style="font-family: Roboto Mono;"> }}},</span></p><p><span style="font-family: Roboto Mono;"> {$unset: ['styles']},</span></p><p><span style="font-family: Roboto Mono;"> {$unset: ['_id']},</span></p><p><span style="font-family: Roboto Mono;"> ];</span></p><p><span style="font-family: "Roboto Mono";"><span> db.createView('<b>millennium_releases_view</b>', 'releases', pipeline);</span></span></p><p>Below is an example of the shape of result documents, when the View is queried for a specific artist:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiww0UtiBAQLhoUENFvU-p1bIQXN3ckKxFseiXdivTdemyOSvGV-uFS8kFzVk1ypbvgMzZAy3_1UgHdM35p8gxTr8fJLSxMDlT6kwWLWY7PrUQ1SOF8yTgK5fyzokckxj1CaPnxohOwVFY/s763/v1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="497" data-original-width="763" height="416" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiww0UtiBAQLhoUENFvU-p1bIQXN3ckKxFseiXdivTdemyOSvGV-uFS8kFzVk1ypbvgMzZAy3_1UgHdM35p8gxTr8fJLSxMDlT6kwWLWY7PrUQ1SOF8yTgK5fyzokckxj1CaPnxohOwVFY/w640-h416/v1.png" width="640" /></a></div><p>If I ask MongoDB to provide the explain plan for an 'empty' query on the View, using the following command...</p><p><span style="font-family: Roboto Mono;"> db.millennium_releases_view.find().explain();</span></p><p>...the resulting explain plan shows the database runs the following steps in the order shown:</p><p></p><ol style="text-align: left;"><li><b>MATCH</b> using <b>INDEX SCAN</b> hitting an index for the <i>Year</i> field</li><li><b>SET</b> new <i>Style</i> string field to concatenate values from existing <i>Styles</i> array field</li><li><b>UNSET</b> <i>Styles</i> array field</li><li><b>UNSET</b> <i>_id</i> field</li></ol><p></p><p>It's good to see here that the '<i>year greater than or equal</i>' clause in the aggregation pipeline defined for the View is being run as the first step and is targeting an index to avoid a 'full table scan'. However, what happens when I include a query filter when issuing a find() against the View, to only show releases for a specific artist?</p><p><span style="font-family: Roboto Mono;"> db.millennium_releases_view.find({<span style="background-color: #fcff01;">'artist': 'Fugazi'</span>}).explain();</span></p><p>This time the resulting explain plan shows the following steps executed:</p><p></p><ol style="text-align: left;"><li><b>MATCH</b> using <b>INDEX SCAN</b> hitting a compound index composed of both the <i><b style="background-color: #fcff01;">Artist</b></i> & <i>Year</i> fields</li><li><b>SET</b> new <i>Style</i> string field to concatenate values from existing <i>Styles</i> array field</li><li><b>UNSET</b> <i>Styles</i> array field</li><li><b>UNSET</b> <i>_id</i> field</li></ol><p></p><p>This is great news, because when I am specifying a query filter for the <i>find()</i> on this View, the optimiser is converting the regular <i>find()</i> filter syntax into an aggregation <i>match</i> expression and pushing it to the existing <i>$match</i> stage at the <u>start</u> of the pipeline. 
As a result, the optimum compound index of (<i>artist, year</i>) is being used, to entirely satisfy the <i>find's</i> 'artist=Fugazi' expression combined with the View's 'year>=2000' expression.</p><p><b>Does this mean a <i>find()</i> with a query filter will always be pushed to the top of the View's aggregation pipeline, at runtime?</b></p><p>Well actually, <b>no</b>. Let's see why, in this second example...</p><p><br /></p><div><h1 style="text-align: left;"><span style="font-weight: 400;">Using A View Which Rolls Up Some Data</span></h1><p>This time let's create a View which groups releases (albums & singles) for each artist by the style associated with the release. For example, if an artist has five albums categorised with the style 'Stoner Rock' and 7 albums categorised by 'Post Rock', the resulting View will contain 2 documents for the artist, one for each of the two styles. This is the command for creating this View:</p><p><span style="font-family: Roboto Mono;"> var pipeline = [</span></p><p><span style="font-family: Roboto Mono;"> {$unwind: {path: '$styles'}}, </span></p><p><span style="font-family: Roboto Mono;"> {$group: {</span></p><p><span style="font-family: Roboto Mono;"> _id: {artist: '$artist', style: '$styles'}, </span></p><p><span style="font-family: Roboto Mono;"> titles: {'$push': '$title'},</span></p><p><span style="font-family: Roboto Mono;"> }}, </span></p><p><span style="font-family: Roboto Mono;"> {$set: {'artist': '$_id.artist'}},</span></p><p><span style="font-family: Roboto Mono;"> {$set: {'style': '$_id.style'}}, </span></p><p><span style="font-family: Roboto Mono;"> {$unset: ['_id']},</span></p><p><span style="font-family: Roboto Mono;"> ];</span></p><p><span style="font-family: Roboto Mono;"> db.createView('<b>styles_view</b>', 'releases', pipeline);</span></p><p>Below is an example of the shape of result documents from querying this new second View, for a specific artist:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCJ6La6683thLbXD_MqIcJaa5xAFRXxdHVaIsQf765t7pKtEJ9D4xLwIJUyU2yJIbiI2uTomuxIVKA89fIDn1gmM0xYnhT500z6xwSVQ0Yv9N5siChJkwPJwbLRF30rPSTTUwDpk2AJSE/s832/v2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="832" data-original-width="724" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCJ6La6683thLbXD_MqIcJaa5xAFRXxdHVaIsQf765t7pKtEJ9D4xLwIJUyU2yJIbiI2uTomuxIVKA89fIDn1gmM0xYnhT500z6xwSVQ0Yv9N5siChJkwPJwbLRF30rPSTTUwDpk2AJSE/w556-h640/v2.png" width="556" /></a></div><p>If I ask MongoDB to provide the explain plan for an 'empty' query on this View, using the following command...</p><p><span style="font-family: Roboto Mono;"> db.styles_view.find().explain();</span></p><p></p><p></p><ol></ol><p></p><p style="-webkit-text-stroke-width: 0px; color: black; font-family: "Times New Roman"; font-size: medium; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration-color: initial; text-decoration-style: initial; text-decoration-thickness: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">...the resulting explain plan shows the database runs the following steps in the order shown:</p></div><p></p><ol style="text-align: left;"><li><b>COLLECTION_SCAN</b> with <b>PROJECTION</b> of <i>Artist</i>, <i>Styles</i> & <i>Title</i> fields 
only</li><li><b>UNWIND</b> of <i>Styles</i> array field producing a record for each array element</li><li><b>GROUP</b> on <i>Artist + Style</i> fields, adding each associated release title to a new <i>Titles</i> array field</li><li><b>SET</b> <i>Artist</i> string field to the first element of the group's id</li><li><b>SET</b> <i>Style</i> string field to the second element of the group's id</li><li><b>UNSET</b> <i>_id</i> field which was created by the group stage</li></ol><p></p><p>As expected here, because the aggregation pipeline defined for the View does not contain a <i>$match</i>, the first step will result in a 'full table scan', where all the documents in the collection are inspected, and then only the required fields are projected out.</p><p>What happens this time when I include a query filter for the <i>find()</i> run against the View, to only show results for a specific artist, using the following command to explain?</p><p><span><span style="font-family: Roboto Mono;"> db.</span><span style="font-family: "Roboto Mono";">styles_view</span><span style="font-family: Roboto Mono;">.find({<span style="background-color: #fcff01;">'artist': 'Fugazi'</span>}).explain();</span></span></p><p></p><p></p><ol></ol><p></p><p style="-webkit-text-stroke-width: 0px; color: black; font-family: "Times New Roman"; font-size: medium; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-decoration-color: initial; text-decoration-style: initial; text-decoration-thickness: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">This time, the resulting explain plan shows the following ordered steps executed:</p><p></p><ol style="text-align: left;"><li><b>COLLECTION_SCAN</b> with <b>PROJECTION</b> of <i>Artist</i>, <i>Styles</i> & <i>Title</i> fields only</li><li><b>UNWIND</b> of <i>Styles</i> array field producing a record for each array element</li><li><b>GROUP</b> on <i>Artist + Style</i> fields, adding each associated release title to a new <i>Titles</i> array field</li><li><b>SET</b> <i>Artist</i> string field to the first element of the group's id</li><li><span style="background-color: #fcff01;"><b>MATCH</b> on <i>Artist</i> field (no index used)</span></li><li><b>SET</b> <i>Style</i> string field to the second element of the group's id</li><li><b>UNSET</b> <i>_id</i> field which was created by the group stage</li></ol><p></p><p>Here the new <i>$match</i> generated by MongoDB to capture the <i>find()</i> expression run against the View is included in the executed aggregation pipeline, but the <i>$match</i> cannot be pushed all the way up to the first step of the pipeline. This is to be expected...</p><p>Essentially, what happens when a <i>find()</i> with filter is run on a View is as follows. The filter expression is initially placed in a new <i>$match</i> stage appended to the <u>end</u> of the aggregation pipeline. Then the <a href="https://docs.mongodb.com/manual/core/aggregation-pipeline-optimization/">normal aggregation pipeline runtime optimiser</a> kicks in and attempts to move the newly added <i>$match</i> step as near to the top of the pipeline as possible. However, the <i>$group</i> stage (and the related <i>$set</i> on <i>artist</i>, in this case) acts as a barrier. The <i>$group</i> operator stage completely changes the shape of documents and effectively drops any existing fields that preceded it. 
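To make this concrete, here is a conceptual sketch (reusing the 'pipeline' variable the View was created from) of what effectively gets executed:<pre>// Conceptually, find({artist: 'Fugazi'}) on the View behaves as if the
// filter were appended to the View's pipeline as a final $match stage,
// which the optimiser then tries, and in this case fails, to move
// ahead of the $group stage
db.releases.aggregate(pipeline.concat([
  {$match: {artist: 'Fugazi'}},
]));</pre>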
The optimiser has no way of knowing that a filter on an <i>artist</i> field, being applied to the outcome of a View, is definitively referring to a field called <i>artist</i> that existed in the original source collection used by the View. Instead, for all it knows, the expression on <i>artist</i> could be referring to some other intermediate aggregation pipeline field of similar name. In the example above, even if we don't use <i>$set</i> in the View's pipeline to set a new field called <i>artist</i>, the new $match expression is still blocked by <i>$group</i> and so is only executed straight after <i>$group</i> (a scenario which I also tested).</p><p>So even though I only want to see the results for one artist, which relates to only a few 10s of documents in the database, the <i>find()</i> which applies a filter on the View will result in the total data set of 1.5 million documents being 'full table scanned', adding considerable latency to the response.</p><p>If I wasn't querying the View and instead running my own hand-crafted aggregation pipeline directly against the source collection, to achieve the same functional outcome, my pipeline could be composed of the following stages where I explicitly include the match on <i>artist</i> as the first stage:</p><p><span style="font-family: Roboto Mono;"> var pipeline = [</span></p><p><span style="background-color: #fcff01; font-family: Roboto Mono;"> {$match: {'artist': 'Fugazi'}},</span></p><p><span style="font-family: Roboto Mono;"> {$unwind: {path: '$styles'}}, </span></p><p><span style="font-family: Roboto Mono;"> {$group: {</span></p><p><span style="font-family: Roboto Mono;"> _id: {artist: '$artist', style: '$styles'}, </span></p><p><span style="font-family: Roboto Mono;"> titles: {'$push': '$title'},</span></p><p><span style="font-family: Roboto Mono;"> }}, </span></p><p><span style="font-family: Roboto Mono;"> {$set: {'artist': '$_id.artist'}},</span></p><p><span style="font-family: Roboto Mono;"> {$set: {'style': '$_id.style'}}, </span></p><p><span style="font-family: Roboto Mono;"> {$unset: ['_id']},</span></p><p><span style="font-family: Roboto Mono;"> ];</span></p><p><span style="font-family: Roboto Mono;"> db.releases.aggregate(pipeline);</span></p><p>Then when I ask for the explain plan...</p><p><span style="font-family: Roboto Mono;"> db.releases.explain().aggregate(pipeline);</span></p><p></p><p></p><ol></ol><p></p><p>...I see that the following steps are executed:</p><p></p><ol style="text-align: left;"><li><span style="background-color: #fcff01;"><b>MATCH</b> using <b>INDEX_SCAN</b> on the <i>Artist</i> field</span> with <b>PROJECTION</b> of <i>Artist</i>, <i>Styles</i> & <i>Title</i> fields only</li><li><b>UNWIND</b> of <i>Styles</i> array field producing a record for each array element</li><li><b>GROUP</b> on <i>Artist + Style</i> fields, adding each release title to a new <i>Titles</i> array field</li><li><b>SET</b> <i>Artist</i> string field from the first element in the group id</li><li><b>SET</b> <i>Style</i> string field from the second element in the group id</li><li><b>UNSET</b> <i>_id</i> field which was created by the group stage</li></ol><p></p><p>This time an index will be leveraged so that only the few 10s of records, corresponding to the desired artist, are retrieved, ready for unwinding and grouping. The aggregation does not attempt to grab 1.5 million records. 
This is only possible because, as the developer of the aggregation pipeline logic, I have extra knowledge which the MongoDB runtime does not have. Specifically, I know that the <i>$match</i> on the <i>artist</i> field should actually be applied to the field named <i>artist</i> in the View's source collection and not to the result of the <i>$group</i> stage.</p><p><br /></p><h1 style="text-align: left;"><span style="font-weight: 400;">Wrapping Up</span></h1><p>What these findings show for Views is that at runtime, when MongoDB receives a <i>find()</i> containing query filter expressions, these expressions are dynamically appended to the end of the View's aggregation pipeline, before the resulting composite pipeline is executed. Then, as is the case if you are just issuing a regular aggregation against a normal collection, MongoDB's <a href="https://docs.mongodb.com/manual/core/aggregation-pipeline-optimization/">aggregation pipeline runtime optimiser</a> attempts to re-order the pipeline on the fly, without changing its functional behaviour, to be more efficient. These runtime optimisations include attempting to push any <i>$match</i> stages as near to the start of the pipeline as possible, to help promote maximum use of indexes when executed. However, stages like the <i>$group</i> stage, which completely transform the shape of documents, mean that the optimiser cannot move a <i>$match</i> ahead of such stages, without risking changing the functional behaviour and ultimately the resulting output. </p><p>In practice, where Views are used to filter a subset of records and/or a subset of fields, the system should be able to fully optimise the <i>find()</i> run against the View, pushing query filter expressions to the first step of the executed aggregation pipeline, to best leverage indexes. Only in places where there is a loss of fidelity (e.g. when using a <i>$group</i> stage), will it be the case that the <i>find()</i> query filter cannot be placed earlier in the pipeline being executed against the View. </p><p><br /></p><p><br /></p><p><i>Song for today: Runaway Return by <a href="https://en.wikipedia.org/wiki/Fugazi">Fugazi</a></i></p></div>Paul Donehttp://www.blogger.com/profile/09556312012162376804noreply@blogger.com0tag:blogger.com,1999:blog-1304066656993695443.post-84020965052429613622020-10-04T15:31:00.009+01:002020-12-24T10:28:45.690+00:00Rust & MongoDB - Perfect Bedfellows<p>I've been learning <b>Rust</b> over the last month or so and I'm really enjoying it. It's a really elegant and flexible programming language despite being the most strongly typed and compile-time strict programming language I've ever used (bearing in mind I used to be a professional C & C++ developer way back in the day). </p><p>I'd recently read the really good and commonly referenced blog post <a href="https://blog.logrocket.com/creating-a-rest-api-in-rust-with-warp/"><b>Creating a REST API in Rust with warp</b></a>, which shows how to create a simple example <b>Groceries stock management REST API service</b>, and which uses an in-memory <i>HashMap</i> as its backing store. As part of my learning I thought I'd have a go at <i>porting</i> this to use MongoDB as its data store instead, using the fairly new <a href="https://docs.rs/mongodb/"><b>MongoDB Rust Driver</b></a>.</p><p>It turns out that this was really easy to do, in no small part due to how well engineered the new MongoDB Rust Driver turned out to be, with its rich yet easy-to-use API. 
</p><p>You can see my resulting MongoDB version of this sample Groceries application in the Github project <a href="https://github.com/pkdone/rust-groceries-mongo-api"><b>rust-groceries-mongo-api</b></a> I created. Check out that project link to view the source code showing how MongoDB was integrated with the Groceries REST API, and how to test the application using a REST client.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjvHjTcTSMwQgMYBQN1e6KpbI9RBf_ao5pcM5IhrjGjKzDfbLr155DyWM5Ihiw3hOxwtgrZY4ZwCmIPevnR7wwMeYI5M6VRQeXrjV8yHQacMM0CXcoadp0ttuZOvSBxdS_lixvUNf0uhRQ/s930/groceries.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="779" data-original-width="930" height="536" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjvHjTcTSMwQgMYBQN1e6KpbI9RBf_ao5pcM5IhrjGjKzDfbLr155DyWM5Ihiw3hOxwtgrZY4ZwCmIPevnR7wwMeYI5M6VRQeXrjV8yHQacMM0CXcoadp0ttuZOvSBxdS_lixvUNf0uhRQ/w640-h536/groceries.png" width="640" /></a></div><p>What was even more surprising was how easy it was to integrate MongoDB's flexible data model with a programming language as strict as Rust, and I encountered no friction between the two at all. In fact, this was made even easier by leveraging the driver team's additional contribution of <b>BSON</b> translation for the open source <a href="https://serde.rs/"><b>Rust Serde</b></a> framework, which makes it easy to serialize/deserialize Rust data structures to/from other formats (e.g. JSON, Avro and now BSON).</p><p>I plan to blog again in the future, in more detail, about how to combine Rust's strict typing and MongoDB's flexible schema, especially when the data model and consuming microservices inevitably change over time. <b><i>[UPDATE 09-Dec-2020: I have now blogged on this at MongoDB DevHub, see: <a href="https://developer.mongodb.com/article/six-principles-building-robust-flexible-shared-data-applications">The Six Principles for Building Robust Yet Flexible Shared Data Applications</a>]</i></b></p><p><br /></p><p><i style="background-color: white; color: #444444; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13px;">Song for today: </i><span face="Arial, Tahoma, Helvetica, FreeSans, sans-serif" style="color: #444444;"><span style="font-size: 13px;"><i>Dissolution</i></span></span><i style="background-color: white; color: #444444; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13px;"> by <a href="https://en.wikipedia.org/wiki/Cloud_Nothings" style="color: #4d469c; text-decoration-line: none;">Cloud Nothings</a></i></p>Paul Donehttp://www.blogger.com/profile/09556312012162376804noreply@blogger.com0London, UK51.5073509 -0.127758323.197117063821153 -35.284008299999996 79.817584736178844 35.028491700000004tag:blogger.com,1999:blog-1304066656993695443.post-54762499146417739752020-05-03T10:30:00.004+01:002020-09-07T11:42:13.380+01:00Converting Gnarly Date Strings to Proper Date Types Using a MongoDB Aggregation Pipeline<b><span style="font-size: medium;"><br /></span></b>
<b><span style="font-size: large;">
Introduction</span></b><br />
I recently received some example bank payments data in a CSV file which had been exported from a relational database with that database's default export settings. After using <a href="https://docs.mongodb.com/manual/reference/program/mongoimport/">mongoimport</a> to import this data 'as-is' into a MongoDB database, I noticed that there was a particularly gnarly date string field in each record. For example:<br />
<ul>
<li><b>23-NOV-20 22.57.36.827000000</b></li>
</ul>
<div>
Why do I say gnarly? Well, if you lived through <a href="https://en.wikipedia.org/wiki/Year_2000_problem"><b>Y2K</b></a>, you should be horrified by the 'year' field shown above. How would you know from the data, without any context, <b>what century this applies to?</b> Is it 1920? Is it 2020? Is it 2120? There's no way of knowing from just the exported data alone. Also, there is no indication of <b>which time zone this applies to</b>. Is it British Summer Time? Is it Eastern Daylight Time? Who knows? Also, the month element appears to be an abbreviation of a month expressed in a specific spoken language. <b>Which spoken language?</b></div>
<div>
<br /></div>
<div>
I needed to get this into a proper Date type in MongoDB so I could then easily index it, perform date range queries natively, perform sorts by date natively, etc. My usual tool of choice for this is MongoDB's Aggregation pipeline, used here to generate a new collection from the existing collection with the 'date' string fields converted to proper date type fields. To perform the string to date conversion, the usual operator of choice is <a href="https://docs.mongodb.com/manual/reference/operator/aggregation/dateFromString/">$dateFromString</a> (introduced in MongoDB 3.6). </div>
<div>
<br /></div>
<div>
However, $dateFromString [rightly] expects an input string which isn't missing crucial date-related text indicating things like the century or timezone. Also, the $dateFromString operator has no format specifier to indicate that the text 'NOV' maps to the 11<i>th</i> month of a year in a specific spoken language.</div>
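<div>
<br /></div>
<div>
To illustrate the difference, here is a minimal sketch (assuming a hypothetical scratch collection named 'test' containing at least one document) showing that $dateFromString copes fine when handed a complete, fully numeric date string, using the same format specifiers employed later in this post:</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">// PARSES OK: NUMERIC MONTH, 4-DIGIT YEAR, 3 MILLISECOND DIGITS</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">db.test.aggregate([{$project: {converted: {$dateFromString: {</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">  dateString: '23-11-2020 22.57.36.827', format: '%d-%m-%Y %H.%M.%S.%L'</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">}}}}]);</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">// BUT NO FORMAT SPECIFIER EXISTS FOR AN ALPHABETIC MONTH LIKE 'NOV', SO</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">// '23-NOV-20 22.57.36.827000000' CANNOT BE PARSED DIRECTLY</span></div>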
<div>
<br /></div>
<div>
Therefore, armed with the extra context of knowing this exported data refers to dates in the 21st century (the '2000s') with a <a href="https://en.wikipedia.org/wiki/Coordinated_Universal_Time">UTC</a> 'time zone' and in the English language (context that could only be established by asking the owner of the data), I had to perform some additional string manipulation in the aggregation pipeline before using $dateFromString to generate a true and accurate date type. The rest of this blog post shows how I achieved this for date strings like '23-NOV-20 22.57.36.827000000'.</div>
<div>
<br /></div>
<b><br /></b>
<b><span style="font-size: large;">
Converting Incomplete Date Strings to Date Types Example</span></b><br />
<div>
<br />
In the Mongo Shell targeting a running MongoDB test database, run the following code to insert 12 sample 'payment' records, with example 'bad date string' fields for testing each month of a sample year.</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">use test;</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">db.rawpayments.insert([</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> {'account_id': '010101', 'pymntdate': '01-JAN-20 01.01.01.123000000', 'amount': 1.01},</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> {'account_id': '020202', 'pymntdate': '02-FEB-20 02.02.02.456000000', 'amount': 2.02},</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> {'account_id': '030303', 'pymntdate': '03-MAR-20 03.03.03.789000000', 'amount': 3.03},</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> {'account_id': '040404', 'pymntdate': '04-APR-20 04.04.04.012000000', 'amount': 4.04},</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> {'account_id': '050505', 'pymntdate': '05-MAY-20 05.05.05.345000000', 'amount': 5.05},</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> {'account_id': '060606', 'pymntdate': '06-JUN-20 06.06.06.678000000', 'amount': 6.06},</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> {'account_id': '070707', 'pymntdate': '07-JUL-20 07.07.07.901000000', 'amount': 7.07},</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> {'account_id': '080808', 'pymntdate': '08-AUG-20 08.08.08.234000000', 'amount': 8.08},</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> {'account_id': '090909', 'pymntdate': '09-SEP-20 09.09.09.567000000', 'amount': 9.09},</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> {'account_id': '101010', 'pymntdate': '10-OCT-20 10.10.10.890000000', 'amount': 10.10},</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> {'account_id': '111111', 'pymntdate': '11-NOV-20 11.11.11.111000000', 'amount': 11.11},</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> {'account_id': '121212', 'pymntdate': '12-DEC-20 12.12.12.999000000', 'amount': 12.12}</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">]);</span></div>
</div>
<div>
<br /></div>
<div>
Then execute the following Aggregation pipeline to copy the contents of the 'rawpayments' collection, populated above, into a new collection named 'payments', but with the 'pymntdate' field values converted from string types to date types.</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">db.rawpayments.aggregate([</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> {$set: {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> pymntdate: {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> $dateFromString: {format: '%d-%m-%Y %H.%M.%S.%L', dateString:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> {$concat: [</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> {$substrCP: ['$pymntdate', 0, 3]}, // USE FIRST 3 CHARS IN DATE STRING</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> {$switch: {branches: [ // REPLACE MONTH 3 CHARS IN DATE STRING WITH 2 DIGIT MONTH</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'JAN']}, then: '01'},</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'FEB']}, then: '02'},</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'MAR']}, then: '03'},</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'APR']}, then: '04'},</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'MAY']}, then: '05'},</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'JUN']}, then: '06'},</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'JUL']}, then: '07'},</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'AUG']}, then: '08'},</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'SEP']}, then: '09'},</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'OCT']}, then: '10'},</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'NOV']}, then: '11'},</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'DEC']}, then: '12'},</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> ], default: 'ERROR'}},</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> '-20', // ADD HYPHEN + HARDCODED CENTURY 2 DIGITS</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> {$substrCP: ['$pymntdate', 7, 15]} // USE REMAINING PART OF DATE STRING UP UNTIL THE 3 MILLISECOND DIGITS (IGNORE REMAINING 6 NANOSECOND CHARS)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> ]</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> }}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> }, </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> }},</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> {$out: 'payments'}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">]);</span><span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<br /></div>
<div>
In this pipeline, the string <b>'23-NOV-20 22.57.36.827000000'</b> will be converted to <b>'ISODate("2020-11-23T22:57:36.827Z")' </b>by concatenating the following four elements of text together before passing it to the $dateFromString operator to convert to a date:</div>
<div>
<ol>
<li><b>'23-'</b> <i>(from the input string)</i></li>
<li><b>'11'</b> <i>(replacing 'NOV')</i></li>
<li><b>'-20'</b> <i>(hard-coded hyphen + century)</i></li>
<li><b>'20 22.57.36.827'</b> <i>(the rest of input string apart from last 6 nanosecond digits)</i></li>
</ol>
</div>
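<div>
<br /></div>
<div>
As an aside, the twelve-branch $switch could be collapsed into a single array lookup instead. The following is just an untested sketch of that idea, combining <a href="https://docs.mongodb.com/manual/reference/operator/aggregation/indexOfArray/">$indexOfArray</a> and $arrayElemAt (both available since MongoDB 3.4). Note that, unlike the $switch with its explicit 'ERROR' default, an unrecognised month abbreviation here would produce index -1, which $arrayElemAt silently resolves to the last array element, so the $switch version fails more visibly:</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">// ALTERNATIVE MONTH MAPPING: LOOK UP THE 3-CHAR MONTH'S POSITION IN ONE ARRAY</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">// AND PICK THE 2-DIGIT MONTH FROM THE SAME POSITION IN ANOTHER ARRAY</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">{$arrayElemAt: [</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">  ['01','02','03','04','05','06','07','08','09','10','11','12'],</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">  {$indexOfArray: [</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">    ['JAN','FEB','MAR','APR','MAY','JUN','JUL','AUG','SEP','OCT','NOV','DEC'],</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">    {$substrCP: ['$pymntdate', 3, 3]}</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">  ]}</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">]}</span></div>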
<div>
<i>Note</i>: A $set stage is used in this pipeline, which is a type of stage first introduced in MongoDB 4.2. <b>$set</b> is an <b>alias</b> for <b>$addFields</b>, so if using an earlier version of MongoDB, replace $set with $addFields in the pipeline.<br />
<br />
To see what the converted records look like, containing new date types, query the new collection:</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">db.payments.find({}, {_id:0});</span></div>
<div>
<br /></div>
<div>
Which will show the following results:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">{ "account_id" : "010101", "pymntdate" : ISODate("2020-01-01T01:01:01.123Z"), "amount" : 1.01 }</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">{ "account_id" : "020202", "pymntdate" : ISODate("2020-02-02T02:02:02.456Z"), "amount" : 2.02 }</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">{ "account_id" : "030303", "pymntdate" : ISODate("2020-03-03T03:03:03.789Z"), "amount" : 3.03 }</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">{ "account_id" : "040404", "pymntdate" : ISODate("2020-04-04T04:04:04.012Z"), "amount" : 4.04 }</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">{ "account_id" : "050505", "pymntdate" : ISODate("2020-05-05T05:05:05.345Z"), "amount" : 5.05 }</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">{ "account_id" : "060606", "pymntdate" : ISODate("2020-06-06T06:06:06.678Z"), "amount" : 6.06 }</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">{ "account_id" : "070707", "pymntdate" : ISODate("2020-07-07T07:07:07.901Z"), "amount" : 7.07 }</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">{ "account_id" : "080808", "pymntdate" : ISODate("2020-08-08T08:08:08.234Z"), "amount" : 8.08 }</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">{ "account_id" : "090909", "pymntdate" : ISODate("2020-09-09T09:09:09.567Z"), "amount" : 9.09 }</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">{ "account_id" : "101010", "pymntdate" : ISODate("2020-10-10T10:10:10.890Z"), "amount" : 10.1 }</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">{ "account_id" : "111111", "pymntdate" : ISODate("2020-11-11T11:11:11.111Z"), "amount" : 11.11 }</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">{ "account_id" : "121212", "pymntdate" : ISODate("2020-12-12T12:12:12.999Z"), "amount" : 12.12 }</span></div>
</div>
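<div>
<br /></div>
<div>
With proper date types now in place, the goals stated in the introduction become straightforward: the field can be indexed and queried by date range natively. For example, a quick sketch against the newly generated collection:</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">db.payments.createIndex({pymntdate: 1});</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">// FIND PAYMENTS MADE IN THE SECOND HALF OF 2020 (CAN NOW LEVERAGE THE INDEX)</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">db.payments.find({pymntdate: {$gte: ISODate('2020-07-01'), $lt: ISODate('2021-01-01')}}, {_id: 0});</span></div>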
<div>
<br />
<br /></div>
<div>
<br /></div>
<div>
<i>Song for today: For Everything by <a href="https://en.wikipedia.org/wiki/The_Murder_Capital">The Murder Capital</a></i></div>
Paul Donehttp://www.blogger.com/profile/09556312012162376804noreply@blogger.com0tag:blogger.com,1999:blog-1304066656993695443.post-34993725657342468522019-12-29T23:14:00.003+00:002020-11-10T15:56:50.322+00:00Running MongoDB on ChromeOS (via Crostini)In my <a href="http://pauldone.blogspot.com/2019/12/my-notes-chromeos-crostini.html">previous post</a> I explored Linux application support in ChromeOS and Chromebooks (a.k.a. Crostini). Of course I was bound to try running MongoDB in this environment, which I found to work really well (for development purposes). Here are my notes on running a MongoDB database and tools on a Chromebook with Linux (beta) enabled:<br />
<ul>
<li>In ChromeOS, launch the Terminal app (which opens a Shell inside the 'Penguin' Linux container inside the 'Termina' Linux VM)</li>
<li>Run the following commands, which are documented in the <a href="https://docs.mongodb.com/manual/tutorial/install-mongodb-enterprise-on-debian/">MongoDB Manual page</a> on installing MongoDB Enterprise on Debian (following the manual's tab instructions titled “Debian 9 (Stretch)”):</li>
</ul>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">wget -qO - https://www.mongodb.org/static/pgp/server-4.2.asc | sudo apt-key add -</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">echo "deb http://repo.mongodb.com/apt/debian stretch/mongodb-enterprise/4.2 main" | sudo tee /etc/apt/sources.list.d/mongodb-enterprise.list</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">sudo apt-get update</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">sudo apt-get install -y mongodb-enterprise</span><br />
<ul>
<li>Start a MongoDB database instance running:</li>
</ul>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">mkdir ~/data</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">mongod --dbpath ~/data</span><br />
<ul>
<li>Launch a second Terminal window and then run the Mongo Shell against this database and perform a quick database insert and query test:</li>
</ul>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">mongo</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">db.mycoll.insert({a:1})</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">db.mycoll.find()</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">db.mycoll.drop()</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">exit</span><br />
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGIesQVnjFuIvGmrdFvUtPx3K9kxwVXcSQCi7g1urtVft8D1gTTu5izAS8ADbRf67U2dhfQmMNPHFuIN4a8MUWH2ecUrFkXxdz8zZfAIdx181G1BmslEDDvrb51wdxvPuI3UxOfRXD3vk/s1600/Screenshot+2019-12-29+at+4.26.30+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="768" data-original-width="1366" height="358" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGIesQVnjFuIvGmrdFvUtPx3K9kxwVXcSQCi7g1urtVft8D1gTTu5izAS8ADbRf67U2dhfQmMNPHFuIN4a8MUWH2ecUrFkXxdz8zZfAIdx181G1BmslEDDvrb51wdxvPuI3UxOfRXD3vk/s640/Screenshot+2019-12-29+at+4.26.30+PM.png" width="640" /></a></div>
<div>
<ul>
<li>Install Python 3 and the PIP Python package manager (using Anaconda) and then install the MongoDB Python driver (PyMongo):</li>
</ul>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">wget https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">bash Anaconda3-*-Linux-x86_64.sh</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">source ~/.bashrc</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">python --version</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">pip --version</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">pip install --user pymongo</span><br />
<ul>
<li>Test PyMongo by running a small ‘payments data generator’ Python script pulled down from a GitHub repository (this should insert records into the “fs.payments” collection of the locally running MongoDB database; after letting it run for a minute, continuously inserting new records, press Ctrl-C to stop it):</li>
</ul>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">git clone https://github.com/pkdone/PaymentsWriteReadConcerns.git</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">cd PaymentsWriteReadConcerns/</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">./payments-records-loader.py -p 1</span><br />
<ul>
<li>Download MongoDB Compass (use the Ubuntu 64-bit 14.04+ version), install and run it against the 'localhost' MongoDB database and inspect the contents of the “fs.payments” collection:</li>
</ul>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">wget https://downloads.mongodb.com/compass/mongodb-compass_1.20.4_amd64.deb</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">sudo apt install ./mongodb-compass_*_amd64.deb</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">mongodb-compass</span></div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipf_xoHorWOfcg6765Dx-l4oE6Gy_f2qJpcm16yVbiE7ao3AGGLaRJwrzM90-yJH2KmhvK7e4mS6pUt5qvpdhd9G69miOYp_ZUBTi50I6d_H8x9KnX7Oa8V7pXOA7EKYwxIXGNQKs59lc/s1600/Screenshot+2019-12-29+at+4.10.49+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="768" data-original-width="1366" height="358" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipf_xoHorWOfcg6765Dx-l4oE6Gy_f2qJpcm16yVbiE7ao3AGGLaRJwrzM90-yJH2KmhvK7e4mS6pUt5qvpdhd9G69miOYp_ZUBTi50I6d_H8x9KnX7Oa8V7pXOA7EKYwxIXGNQKs59lc/s640/Screenshot+2019-12-29+at+4.10.49+PM.png" width="640" /></a></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<i>Song for today: Sun. Tears. Red by <a href="https://en.wikipedia.org/wiki/Jambinai">Jambinai</a></i></div>
<div>
<br /></div>
Paul Donehttp://www.blogger.com/profile/09556312012162376804noreply@blogger.com1tag:blogger.com,1999:blog-1304066656993695443.post-2422369463652932552019-12-29T21:56:00.003+00:002020-11-10T15:57:40.049+00:00My Notes on Linux Application Support in ChromeOS (a.k.a. Crostini)These are my own rough notes from spending a few days studying Chrome OS and its Linux app support on an HP Chromebook 14* I got for free (retails for about £150) when I recently purchased a Google Pixel 4 Android mobile phone. I thought I'd share the notes in case they are of use to others. I'm sure some corrections are needed, so feedback is welcome.<br />
<div style="text-align: right;">
<span style="font-size: xx-small;">* released: 2019, model: db0003na, codename: careena, board: grunt</span></div>
<br />
Some references to other articles that I used to bootstrap my knowledge:<br />
<ul>
<li><a href="https://chromium.googlesource.com/chromiumos/docs/+/master/containers_and_vms.md">Running Custom Containers Under Chrome OS</a> from the Chromium OS Docs</li>
<li>A <a href="https://support.google.com/chromebook/thread/1319823?hl=en">useful set of answers provided to a query</a> on the Chromebook Community Help site</li>
<li>An article called <a href="https://blog.simos.info/a-closer-look-at-chrome-os-using-lxd-to-run-linux-gui-apps-project-crostini/">A closer look at Chrome OS using LXD to run Linux GUI apps (Project Crostini</a>)</li>
</ul>
<br />
Below are some screenshots showing the ChromeOS Settings section where “Linux (beta)” (a.k.a. Crostini) can be enabled, and the Linux apps that are then installed by default (essentially just the GNOME Help application and the Terminal application, from which many other Linux apps can subsequently be installed):<br />
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjEOTUTxsYs3MXZP5d9pbOTDDPM3lFoK0qH0fM3LKurfc9CH9Gf6yyQthGqb_2WqA4pvArIJ-K7shCAq3aa9y6n1GjN96F7YC0wBGiaW0Ex4lfF2lCtnntQZ7FgwWL9IgY12jXk6fbCE7U/s1600/Screenshot+2019-12-29+at+5.37.33+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="768" data-original-width="1366" height="358" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjEOTUTxsYs3MXZP5d9pbOTDDPM3lFoK0qH0fM3LKurfc9CH9Gf6yyQthGqb_2WqA4pvArIJ-K7shCAq3aa9y6n1GjN96F7YC0wBGiaW0Ex4lfF2lCtnntQZ7FgwWL9IgY12jXk6fbCE7U/s640/Screenshot+2019-12-29+at+5.37.33+PM.png" width="640" /></a></div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhcZa8siUhYBJ610yPPwF5_y4YwVg2qQxO0uI6JCgSk0w7ccG2AAcEiLQjlsEdwBJaC3MMYX4IcfFC4XU9V9W-htFoYpxYVFVWbHGHtP7QbWU5uaVx3Fgpf4Ngu1zCrlJ1c3iZ4aljgzVg/s1600/Screenshot+2019-12-29+at+5.06.37+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="768" data-original-width="1366" height="358" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhcZa8siUhYBJ610yPPwF5_y4YwVg2qQxO0uI6JCgSk0w7ccG2AAcEiLQjlsEdwBJaC3MMYX4IcfFC4XU9V9W-htFoYpxYVFVWbHGHtP7QbWU5uaVx3Fgpf4Ngu1zCrlJ1c3iZ4aljgzVg/s640/Screenshot+2019-12-29+at+5.06.37+PM.png" width="640" /></a></div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrZc3pYAxTqYjPEu2NxZ0f8wAuMCcRBNjdMA5xn3eY5vqS4tNxALrZmgeAuRYu0ayDRD1OP61ZimKcWeegBWzuRyeOJhid_-toTf5-8VmzaanVMU63xwuJmZCUT7-vG0g9h2WbtOpvFGI/s1600/Screenshot+2019-12-29+at+5.13.17+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="768" data-original-width="1366" height="358" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrZc3pYAxTqYjPEu2NxZ0f8wAuMCcRBNjdMA5xn3eY5vqS4tNxALrZmgeAuRYu0ayDRD1OP61ZimKcWeegBWzuRyeOJhid_-toTf5-8VmzaanVMU63xwuJmZCUT7-vG0g9h2WbtOpvFGI/s640/Screenshot+2019-12-29+at+5.13.17+PM.png" width="640" /></a></div>
<div>
<br /></div>
<div>
Here is a diagram I put together to attempt to capture the architecture of Crostini in ChromeOS as I understand it (the rest of this document digs into the details behind some of these layers):</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5vljikDXJcQh0aGRlWN0PBMRRUM0qUzxEoM4sPLErNAy12cxpOFZ8PvlSdRwUNPbu1AVJ0mhuArrzEgZhIPv3RT2BILHUQkl2dYyE7b1vklHqReR4PZbHT3OkgvZEGwgbQAcDNPf5XiM/s1600/Crostini+Arch.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="540" data-original-width="960" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5vljikDXJcQh0aGRlWN0PBMRRUM0qUzxEoM4sPLErNAy12cxpOFZ8PvlSdRwUNPbu1AVJ0mhuArrzEgZhIPv3RT2BILHUQkl2dYyE7b1vklHqReR4PZbHT3OkgvZEGwgbQAcDNPf5XiM/s640/Crostini+Arch.png" width="640" /></a></div>
<div>
<h3>
ChromeOS & Crostini</h3>
<div>
<ul>
<li>Under the covers, ChromeOS is based on Gentoo and the Portage package manager</li>
<li>crosh (ChromeOS Developer Shell) is the pluggable command line shell/terminal for ChromeOS (in the Chrome browser, enter Ctrl-Alt-T to launch crosh inside a browser tab)</li>
<li>Crostini is the term for Linux application support in ChromeOS; it manages the specific Linux VM and the specific Linux container inside it, handling the lifecycle of when to launch them, mounting the filesystem to show the container’s files in the ChromeOS Files app, etc. Crostini provides easy-to-use Linux application support integrated directly into the running ChromeOS desktop, rather than, for example, needing to dual boot or having to run a separate Linux VM and explicitly switch, via the desktop, between ChromeOS and the Linux VM.</li>
<li>ChromeOS also has a Developer mode (verification is disabled when the OS boots) which is a special mode built into all Chromebooks to allow users and developers to access the code behind the Chrome Operating System and load their own builds of ChromeOS. This mode also allows users to install and run another Linux system like Ubuntu instead of ChromeOS (i.e. dual boot), but still have ChromeOS available to boot into too</li>
<li>As an alternative to Crostini, in addition to the dual-boot option, developer mode can also be used for Crouton, which is a set of scripts that bundle up a chroot generator/environment to run both ChromeOS and Ubuntu at the same time. Here a Linux OS runs alongside ChromeOS, so users can switch between the ChromeOS desktop and Linux desktops via a keyboard shortcut. This gives users the ability to take advantage of both environments without needing to reboot. Unlike with virtualisation, a second OS is not being booted; instead the guest OS is running using the Chromium OS system. As a result, any performance penalty is reduced because everything is run natively, and RAM is not being wasted to boot two OSes at the same time. Note, Crostini is different from this Crouton capability, as it enables the Linux shell and apps to be brought into the platform in verified (non-developer) mode with seamless user interface desktop integration and multi-layered security, in a supported way.</li>
<li>To use Crostini, from the ChromeOS Settings select ‘Linux (Beta)’ and choose to enable it, which, behind the scenes, will download and configure a specific Linux VM containing a specific Linux Container (see the next sections for more details) and it adds a launcher group to the ChromeOS desktop called ‘Linux Apps’. This launcher group includes a launcher to run a Linux shell/terminal application, called Terminal, which is displayed in the ChromeOS desktop but is connected directly inside the container</li>
</ul>
<br />
<ul>
</ul>
</div>
</div>
<div>
<h3>
Crostini Linux VM Layer</h3>
<div>
<ul>
<li>crosvm (ChromeOS Virtual Machine Monitor) is a custom virtual machine manager written in Rust that runs guest VMs via Linux's KVM hypervisor virtualisation layer and manages the low-level virtual I/O device communication (Amazon’s Firecracker is a fork of crosvm)</li>
<li>A specific VM is used to run a container rather than ChromeOS running a container directly, for security reasons because containers do not provide sufficient security isolation on their own. With the two layers, an adversary has to exploit crosvm via its limited interactions with the guest, in addition to the container, and the VM itself is heavily sandboxed.</li>
<li>The VM (and its container) is tied to a ChromeOS login session, and as soon as a user logs out, all programs are shut down/killed by design (all user data lives in the user’s encrypted home to ensure nothing is leaked when a user logs out). The VM, container and their data are persisted across user sessions and are kept in the same per-user encrypted storage as the rest of the browser's data.</li>
<li>KVM generally (rather than Crostini specifically) can execute multiple virtual machines running unmodified Linux or Windows images. Each virtual machine has private virtualised hardware: a network card, disk, graphics adapter, etc. The kernel component of KVM is included in the mainline Linux codebase and the userspace component of KVM is included in the mainline QEMU codebase</li>
<li>Termina is the VM launched by crosvm and is based on a ChromeOS (CrOS) image with a stripped-down ChromeOS Linux kernel and userland tools. The main goal is to just boot up Termina as quickly as possible, as a secure sandbox, and start running containers.</li>
<li>Currently, other custom VMs (other Linux variants, Windows, etc.) cannot be run and only instances of the Termina VM image can be booted, although multiple VM instances can be run simultaneously based on the Termina image</li>
<li>vmc is the crosh command line utility to manually manage custom VM instances via Concierge (the ChromeOS daemon that manages VM/container life cycles)</li>
<li>To view the registered VM(s) (which may or may not be running) from crosh (Ctrl-Alt-T), run:</li>
</ul>
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">vmc list</span><br />
<ul>
<li>To launch the Termina VM as a VM instance called ‘termina’ and open a shell directly in the VM, run:</li>
</ul>
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">vmc start termina</span><br />
<ul>
<li>With the above command, the default container in the VM will not be started automatically. However, if, from the ChromeOS desktop, a Linux Shell (Terminal) or other Linux app is launched (or the ‘Linux files’ app, Files, is launched), the Termina VM is automatically launched and the default container it owns is also automatically started</li>
<li>If the Termina VM is already running, to connect to it via a shell, run:</li>
</ul>
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">vsh termina</span><br />
<ul>
<li>If the ‘vmc start’ command is run with a different VM name, a new VM of that name will be created, launched and its shell entered from the existing terminal command line. This will use the same Termina image, and when running, ‘vmc list’ will list both VMs (the new instance doesn’t have any containers defined in it by default, ready to run, unlike the main Termina VM)</li>
<li>To stop the main Termina VM, run:</li>
</ul>
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">vmc stop termina</span></div>
</div>
<div>
<br />
<br /></div>
<div>
<h3>
Crostini Container Layer</h3>
<ul>
<li>The Termina VM only supports running containers using the “Linux Containers” (LXC) technology at the moment and doesn’t support Docker or other container technologies</li>
<li>The default container instance launched via Termina is called Penguin and is based on Debian 9 with some custom packages</li>
<li>Containers are run inside a VM, rather than programs running directly in the VM, to help keep VM startup times low, to help improve security sandboxing by providing a stateless immutable VM image, and to allow the container, its applications and their dependencies to be maintained independently from the VM, which otherwise might have conflicting dependency requirements</li>
<li>LXC, generally, works in the vanilla Linux kernel requiring no additional patches to be applied to the kernel source and uses various kernel features to contain processes including kernel namespaces (ipc, uts, mount, pid, network and user), Apparmor and SELinux profiles, Seccomp policies, chroots (using pivot_root), CGroups (control groups). LXCFS provides the userspace (FUSE) filesystem providing overlay files for cpuinfo, meminfo, stat and uptime plus a cgroupfs compatible tree allowing unprivileged writes.</li>
<li>LXD is a higher-level container framework which Crostini uses; LXD has its own specific image formats and also provides the ability to manage containers remotely. Although LXD uses LXC under the covers, it is based on more than just LXC. The Termina VM is configured to run the LXD daemon. Confusingly, the command line tool for controlling LXD is called ‘lxc’ (the ‘LXD client’). If users are using LXD commands to manage containers, they should avoid using any commands that start with ‘lxc-’, as these are lower-level LXC commands, and should avoid mixing and matching the use of both sets of commands in the same system. Crostini uses LXD to launch the Penguin container, and LXD is configured to only allow unprivileged containers to be run, for added security. Therefore, with Crostini, users should not use the lower-level ‘lxc-’ commands because these can’t manage the LXD-derived containers that Crostini uses. By default, LXD comes with 3 remote repositories providing images: 1) ubuntu: (for stable Ubuntu images), 2) ubuntu-daily: (for daily Ubuntu images), and 3) images: (for other distros)</li>
<li>In the Termina VM, the full LXC/LXD capabilities are provided, and remote images for many types of distros can be used to spawn multiple containers, in addition to the main Penguin container (these are not tested or certified though so may or may not work correctly)</li>
<li>Sommelier (a Wayland proxy compositor providing seamless X forwarding integration for content, input events, clipboard data, etc. between Linux apps and the ChromeOS desktop) and Garcon (a daemon for passing requests between the container and ChromeOS) binaries are bind-mounted into the main Penguin container. The Penguin container’s systemd is automatically configured to start these daemons. The libraries for these daemons are already present in the LXD image used for Penguin (‘google:debian/stretch’). Other LXD containers launched in the VM don't seem to be enabled for their X-based GUI apps to be displayed in the ChromeOS desktop, even if they use the special ‘google:debian/stretch’ LXD container image, as it seems Crostini won’t attempt to integrate with them at runtime. Note: Some online articles imply it may be possible to get X-forwarding working from multiple containers.</li>
<li>In the Penguin container (which users can access directly, via the Terminal app launcher in the ChromeOS desktop), users can query the IP address of the container, which is accessible from ChromeOS, and can then run crosh (Ctrl-Alt-T) in ChromeOS and ping the IP address of the container directly. Users can also SSH from the ChromeOS desktop to the Penguin container using Google’s official SSH client that can be installed in Chrome via the Chrome Web Store</li>
<li>If other containers are launched and then Google’s official SSH client is installed in ChromeOS (install ‘Secure Shell Extension’ via the Chrome Web Store), users can then define SFTP mount-points to other non-Penguin containers and the files in these containers will automatically appear in the Files app too </li>
<li>From the Termina VM, users can use the standard LXD lxc command line tool to list containers and then to see if the Penguin container is running, by running:</li>
</ul>
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">lxc list<br />lxc info penguin | grep "Status: "</span><br />
<ul>
<li>To check the logs for the Penguin container, run:</li>
</ul>
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">lxc info --show-log penguin</span><br />
<ul>
<li>To open a command line shell as root in the running container (note, the Terminal app has a different identity for connecting to the Penguin container, which is a non-root user), run:</li>
</ul>
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">lxc exec penguin -- /bin/bash</span><br />
<ul>
<li>Within the Penguin container you can run GUI apps, which automatically display in the main ChromeOS user interface. For example, to install the GEdit text editor Linux application, run the following (which also adds a launcher for GEdit in the ChromeOS desktop ‘Linux Apps’ launcher group):</li>
</ul>
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">sudo apt install gedit</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><br /></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirwVgC37VNBHOGNY5RrwjpeM0pv_nzOTHremG0BFp8i3Wj5dqUjBjDQ3xga1jFG4YIucDxLQrzF-VNsnlqj4KoGjDmiyT-V94vqUwtgcMaFw7vCLP6l6nCSjskhkpIbsVSZ4y45RLVwoo/s1600/Screenshot+2019-12-29+at+5.03.39+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="768" data-original-width="1366" height="358" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirwVgC37VNBHOGNY5RrwjpeM0pv_nzOTHremG0BFp8i3Wj5dqUjBjDQ3xga1jFG4YIucDxLQrzF-VNsnlqj4KoGjDmiyT-V94vqUwtgcMaFw7vCLP6l6nCSjskhkpIbsVSZ4y45RLVwoo/s640/Screenshot+2019-12-29+at+5.03.39+PM.png" width="640" /></a></div>
<div>
<ul>
<li>It is even possible to install and run a new Google Chrome browser installation from the Linux container, by running the following (which also adds a launcher for this Linux version of Chrome in the ChromeOS desktop ‘Linux Apps’ launcher group):</li>
</ul>
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">sudo apt install ./google-chrome-stable_current_amd64.deb</span></div>
<div>
<br /></div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcG1bZPGPI9bK2qWqI3Lzq2Ctk701SGxdR2dNaIOscfUkvOfy8r6TyiPyuOXROEGml-CJSmErit-NyIGaHn44RQFjegOuRF6q5MNGb8pYzRWUXwzBAyTHksXQY0M0kNxyJHJSw3aAHbWs/s1600/Screenshot+2019-12-29+at+5.01.09+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="768" data-original-width="1366" height="358" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcG1bZPGPI9bK2qWqI3Lzq2Ctk701SGxdR2dNaIOscfUkvOfy8r6TyiPyuOXROEGml-CJSmErit-NyIGaHn44RQFjegOuRF6q5MNGb8pYzRWUXwzBAyTHksXQY0M0kNxyJHJSw3aAHbWs/s640/Screenshot+2019-12-29+at+5.01.09+PM.png" width="640" /></a></div>
</div>
<div>
<ul>
<li>From crosh (Ctrl-Alt-T), it is also possible to start the main container in the main VM (if not already started) and then connect a shell directly to that container, by running:</li>
</ul>
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">vmc container termina penguin</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">vsh termina penguin</span><br />
<br />
<br />
<h3>
Playing with Custom Containers</h3>
<ul>
<li>First of all launch crosh (Ctrl-Alt-T), and connect a shell to the Termina VM:</li>
</ul>
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">vsh termina</span><br />
<ul>
<li>Import Google’s own image repository into LXD to include the special Debian image used by Penguin:</li>
</ul>
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">lxc remote list<br />lxc remote add google https://storage.googleapis.com/cros-containers --protocol=simplestreams<br />lxc remote list<br />lxc image list google:<br />lxc image info google:debian/stretch</span><br />
<ul>
<li>Launch and test a container using Google’s special Debian 9 image:</li>
</ul>
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">lxc launch google:debian/stretch mycrosdebiancontainer<br />lxc list<br />lxc exec mycrosdebiancontainer -- /bin/bash<br />cat /etc/*elease*<br />apt update && apt upgrade -y<br />exit</span><br />
<ul>
<li>Launch and test a container using a standard Ubuntu 18.04 image:</li>
</ul>
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">lxc launch ubuntu:18.04 myubuntucontainer<br />lxc list<br />lxc exec myubuntucontainer -- /bin/bash<br />cat /etc/*elease*<br />apt update && apt upgrade -y<br />exit</span><br />
<ul>
<li>Launch and test a container using a standard Centos 7 image:</li>
</ul>
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">lxc launch images:centos/7 mycentoscontainer<br />lxc list<br />lxc exec mycentoscontainer -- /bin/bash<br />cat /etc/*elease*<br />yum -y update<br />exit</span><br />
<ul>
<li>If the Chromebook is rebooted and the Termina VM restarted, these 3 containers still exist as they are persisted, but they will be in a stopped state. When the containers are then manually restarted, they will still have the same settings, files and modifications that were made before they were stopped. To start a stopped container, run (example shown for one of the containers):</li>
</ul>
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">lxc start myubuntucontainer</span><br />
<ul>
<li>None of the containers launched above seem to enable GUI apps (e.g. GEdit) to be forwarded automatically to the ChromeOS desktop. Even though the ‘google:debian/stretch’ based container has the relevant X forwarding libraries bundled, it doesn't seem to be automatically integrated with at runtime by the Crostini framework to enable X forwarding</li>
<li>Another way to launch a new container is to use one of the following commands, although, again, neither seems to automatically configure X-forwarding, even though they use the ‘google:debian/stretch’ image. It seems that only the Penguin container specifically is being managed by Crostini and has X forwarding configured (the first command below should be launched from ChromeOS crosh; the second command, which is deprecated, performs the same action but should be run from inside the Termina VM):</li>
</ul>
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">vmc container termina mycontainer<br />run_container.sh --container_name=mycontainer --user=jdoe --shell</span><br />
<ul>
<li>Note, this may throw a timeout error similar to the one below, but the containers do seem to be created OK:</li>
</ul>
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;">Error: routine at frontends/vmc.rs:397 `container_create(vm_name,user_id_hash,container_name,image_server,image_alias)` failed: timeout while waiting for signal</span><br />
<br />
<br />
<i>Song for today: The Desert Song, No.2 - live by <a href="https://en.wikipedia.org/wiki/Sophia_(British_band)">Sophia</a></i></div>
Paul Donehttp://www.blogger.com/profile/09556312012162376804noreply@blogger.com0tag:blogger.com,1999:blog-1304066656993695443.post-74796940077025730762019-12-19T14:50:00.007+00:002020-12-20T09:59:23.323+00:00Some Tips for Diagnosing Client Connection Issues for MongoDB Atlas
<h3>
Introduction</h3>
<br />
<b><i> [UPDATE 07-Sep-2020: I've now written an executable binary tool you can run which performs the equivalent of the checks in this blog post to diagnose connectivity issues to Atlas or any other type of MongoDB deployment, downloadable from <a href="https://github.com/pkdone/mongo-connection-check">here</a>]</i></b>
<br />
<br />
By default, for recent MongoDB drivers and client tools, <a href="https://www.mongodb.com/cloud/atlas">MongoDB Atlas</a> advertises the exposed URL for a deployed database cluster using a <i>service name</i> which maps to a set of <a href="https://en.wikipedia.org/wiki/SRV_record">DNS SRV records</a> to provide an initial connection <b>seed list</b>. This results in a much more 'human digestible' URL, but more importantly, increases deployment flexibility and the ability for underlying database server hosts to migrate over time, without needing to subsequently reconfigure clients.<br />
<br />
For example, an Atlas Cluster may be referenced in a connection string by:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;"> testcluster-abcd.mongodb.net</span><br />
<br />
...as an alternative to the full connection endpoint list:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;"> testcluster-shard-00-00-abcd.mongodb.net:27017,testcluster-shard-00-01-abcd.mongodb.net:27017,testcluster-shard-00-02-abcd.mongodb.net:27017/test?replicaSet=TestCluster-shard-0</span><br />
<br />
It is worth noting though, whichever approach is used (explicitly defining all endpoints in the connection string or having it discovered via the DNS SRV service name), the connection URL seed list is only ever used for bootstrapping a client application to the database cluster, when the client first starts or when it later needs to restart. On start-up, the client uses the connection seed list to attempt to attach to any member of the cluster, and in fact, all but one of the endpoints could be incorrect and a successful cluster connection will still be achieved. Once the initial connection is made, the true cluster member endpoint list is dynamically and continuously shared between the cluster and the client at runtime. This enables the client to continue operating against the database even if the members of the database cluster change locations or identities over time. For example, after a year of a database cluster and application continuously running, there could be the need to increase database capacity by dynamically rotating the database hosts to new higher processing capacity machines. This all happens dynamically and the already running client application automatically becomes aware and leverages the new hosts without downtime and without needing to consult the connection string again. If the client application restarts though, it will need to read the updated connection string to be able to bootstrap a connection back up to the database cluster.<br />
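<br />
To illustrate this bootstrap-then-discover behaviour from code, below is a minimal Node.js driver sketch (assuming a recent version of the official 'mongodb' npm package; the cluster URL, username and password are just placeholders for this example). Only the SRV service name is supplied, and the driver resolves the actual cluster member endpoints itself:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">const {MongoClient} = require('mongodb');  // OFFICIAL MONGODB NODE.JS DRIVER</span><br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">(async () => {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">  // THE DRIVER RESOLVES THE SRV NAME TO THE CURRENT MEMBER ENDPOINTS AT</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">  // START-UP, THEN TRACKS ANY TOPOLOGY CHANGES DYNAMICALLY THEREAFTER</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">  const client = new MongoClient('mongodb+srv://main_user:PASSWORD@testcluster-abcd.mongodb.net/test');</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">  await client.connect();</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">  console.log(await client.db('test').command({ping: 1}));  // QUICK CONNECTIVITY CHECK</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">  await client.close();</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">})();</span><br />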
<br />
In the rest of this post we will explore some of the ways initial client connectivity issues can be diagnosed and resolved when using DNS SRV based connection URLs. For reference, <a href="https://twitter.com/jdrumgoole">Joe Drumgoole</a> provides a <a href="https://www.mongodb.com/blog/post/mongodb-3-6-here-to-SRV-you-with-easier-replica-set-connections">great explanation</a> about how DNS SRV records work more generally, and how MongoDB drivers and tools can leverage these.<br />
<br />
<h3>
Naive Connectivity Diagnosis</h3>
<br />
If you are having connection problems with Atlas when using the SRV service name based URL, be wary of drawing the wrong conclusions regarding the cause of the connection problem...<br />
<br />
For example, let's say you can't connect an application to a cluster with the Atlas advertised URL of '<i>mongodb+srv://testcluster-abcd.mongodb.net</i>' from your laptop. You may be tempted to try to debug the connection problem by running some of the following commands:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ ping testcluster-abcd.mongodb.net</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">ping: testcluster-abcd.mongodb.net: Name or service not known</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">$ nc -zv -w 5 testcluster-abcd.mongodb.net 27017</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">nc: getaddrinfo for host "testcluster-abcd.mongodb.net" port 27017: Name or service not known</span><br />
<br />
Neither of these works, even if you actually do have Atlas connectivity configured correctly. This is because "<i>testcluster-abcd.mongodb.net</i>" is not the DNS name of a specific host endpoint. It is actually used by the MongoDB drivers and tools to dynamically look up the DNS SRV records which have been populated for a <i>service</i> called '<i>testcluster-abcd.mongodb.net</i>'.<br />
<br />
<h3>
Useful Connectivity Diagnosis</h3>
<br />
As documented in the <a href="https://github.com/mongodb/specifications/blob/master/source/initial-dns-seedlist-discovery/initial-dns-seedlist-discovery.rst">MongoDB Drivers specification document</a> and the <a href="https://docs.mongodb.com/manual/reference/connection-string/#dns-seedlist-connection-format">MongoDB Manual</a>, a DNS SRV query is performed by the drivers/tools by prepending the text '<i><b>_mongodb._tcp.</b></i>' to the service name. Therefore, to look up the list of real endpoints for the Atlas cluster from your laptop using the DNS <i>nslookup</i> tool, you should run:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ nslookup -q=SRV _mongodb._tcp.testcluster-abcd.mongodb.net</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Server:<span style="white-space: pre;"> </span>127.0.0.53</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Address:<span style="white-space: pre;"> </span>127.0.0.53#53</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">Non-authoritative answer:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">_mongodb._tcp.testcluster-abcd.mongodb.net<span style="white-space: pre;"> </span>service = 0 0 27017 testcluster-shard-00-02-abcd.mongodb.net.</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">_mongodb._tcp.testcluster-abcd.mongodb.net<span style="white-space: pre;"> </span>service = 0 0 27017 testcluster-shard-00-01-abcd.mongodb.net.</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">_mongodb._tcp.testcluster-abcd.mongodb.net<span style="white-space: pre;"> </span>service = 0 0 27017 testcluster-shard-00-00-abcd.mongodb.net.</span><br />
<br />
You can see that in this case the database service name maps to 3 endpoints (i.e. the hosts of the 3 replica set members). You can then look up the actual IP address of any one of these endpoints if you desire:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ nslookup testcluster-shard-00-00-abcd.mongodb.net</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Server:<span style="white-space: pre;"> </span>127.0.0.53</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Address:<span style="white-space: pre;"> </span>127.0.0.53#53</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">Non-authoritative answer:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">testcluster-shard-00-00-abcd.mongodb.net<span style="white-space: pre;"> </span>canonical name = ec2-35-178-15-240.eu-west-2.compute.amazonaws.com.</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Name:<span style="white-space: pre;"> </span>ec2-35-178-15-240.eu-west-2.compute.amazonaws.com</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Address: 35.178.14.238</span><br />
<br />
So, to debug your connectivity issue further, you can use <i>ping</i>, but this time specifying one of the underlying host server endpoints for the database cluster:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ ping -c 3 testcluster-shard-00-00-abcd.mongodb.net</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">PING ec2-35-178-15-240.eu-west-2.compute.amazonaws.com (35.178.14.238) 56(84) bytes of data.</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">64 bytes from ec2-35-178-15-240.eu-west-2.compute.amazonaws.com (35.178.14.238): icmp_seq=1 ttl=51 time=10.2 ms</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">64 bytes from ec2-35-178-15-240.eu-west-2.compute.amazonaws.com (35.178.14.238): icmp_seq=2 ttl=51 time=9.73 ms</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">64 bytes from ec2-35-178-15-240.eu-west-2.compute.amazonaws.com (35.178.14.238): icmp_seq=3 ttl=51 time=11.7 ms</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">--- ec2-35-178-15-240.eu-west-2.compute.amazonaws.com ping statistics ---</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">3 packets transmitted, 3 received, 0% packet loss, time 2002ms</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">rtt min/avg/max/mdev = 9.739/10.586/11.735/0.850 ms</span><br />
<br />
Even if this is successful, it doesn't necessarily mean that you can connect to the database service. The next thing to try is to see whether you can actually open a socket connection to the <i>mongod</i> (or <i>mongos</i>) daemon process running on one of the endpoints, which you can do from your laptop using the <i>netcat</i> utility:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ nc -zv -w 5 testcluster-shard-00-00-abcd.mongodb.net 27017</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">nc: connect to testcluster-shard-00-00-abcd.mongodb.net port 27017 (tcp) timed out: Operation now in progress</span><br />
<br />
If this doesn't connect but you are able to ping the endpoint host (as is the case in this example), it probably indicates that the IP address of your client laptop has not been added to the <a href="https://docs.atlas.mongodb.com/security-whitelist/">Atlas project's access list</a>, which is easy to remedy via the Atlas Console:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUTAuo7TIEeG3uQ0suJxrggHYBkUtHf_jtUQvPxX79SMfZ02QK3SvoTb2FD2UiL93uenxkL0eedyu2QYfRJ5oKFj3Er1GQ_hx2kFGGDcReiHAp5A215ieuuYHxbB1yPAStb5A9TBmDL30/s1600/wl.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="833" data-original-width="1438" height="370" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUTAuo7TIEeG3uQ0suJxrggHYBkUtHf_jtUQvPxX79SMfZ02QK3SvoTb2FD2UiL93uenxkL0eedyu2QYfRJ5oKFj3Er1GQ_hx2kFGGDcReiHAp5A215ieuuYHxbB1yPAStb5A9TBmDL30/s640/wl.png" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
Once your laptop has been added to the access list, running <i>netcat</i> again should demonstrate that a socket connection can now be successfully made:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ nc -zv -w 5 testcluster-shard-00-00-abcd.mongodb.net 27017</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Connection to testcluster-shard-00-00-abcd.mongodb.net 27017 port [tcp/*] succeeded!</span><br />
<br />
If this connects, the next step is to try connecting to the database via the Mongo Shell.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnw6j3bS0BdZ3651jzgWEfdLCqql34SNFQJgaMz8HSTVUI3Iv321aMeYbF6VkcUd_5fuRS8DkNjjfhyK5AFPasGzrH1n4trbuzzOJXB7SBjdkQXsXzG0OaKqcJW2wf5M2M1I26YmESk6U/s1600/shell.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1274" data-original-width="1600" height="505" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnw6j3bS0BdZ3651jzgWEfdLCqql34SNFQJgaMz8HSTVUI3Iv321aMeYbF6VkcUd_5fuRS8DkNjjfhyK5AFPasGzrH1n4trbuzzOJXB7SBjdkQXsXzG0OaKqcJW2wf5M2M1I26YmESk6U/s640/shell.png" width="640" /></a></div>
<br />
In this example screenshot, the Atlas console suggests the following Mongo Shell command line for connecting:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;"> mongo "mongodb+srv://testcluster-abcd.mongodb.net/test" --username main_user</span><br />
<br />
Looking at this connection string, some of you may be wondering how the Shell knows to connect to Atlas over SSL/TLS, which replica set name it should request, and which authentication source database it should use to locate the user's credentials.<br />
<br />
Well, when dynamically constructing the initial bootstrap URL for the cluster, in addition to querying the DNS SRV records for the service, the MongoDB drivers/tools also look up a <a href="https://en.wikipedia.org/wiki/TXT_record">DNS TXT record</a> for the service, which Atlas also populates for the deployed cluster. This TXT record contains the set of connection options to be added as parameters to the dynamically constructed connection string (e.g. '<i>ssl=true&replicaSet=TestCluster-shard-0&authSource=admin</i>'). You can view these parameter settings for a particular Atlas cluster yourself, by running the following DNS query:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ nslookup -q=TXT testcluster-abcd.mongodb.net</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Server:<span style="white-space: pre;"> </span>127.0.0.53</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Address:<span style="white-space: pre;"> </span>127.0.0.53#53</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">Non-authoritative answer:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">testcluster-abcd.mongodb.net text = "authSource=admin&replicaSet=TestCluster-shard-0"</span><br />
<br />
Note, the default behaviour for MongoDB drivers/tools using a '<i>mongodb+srv</i>' based URL is to enable SSL/TLS for the connection. As a result, '<i>ssl=true</i>' doesn't have to be included in the DNS TXT record, as shown in the example above, because the drivers/tools will automatically add this parameter to the connection string on the fly.<br />
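<br />
To make this concrete, pulling together the example SRV and TXT lookup results shown above (plus the implicit '<i>ssl=true</i>'), the equivalent "long-form" connection string that the drivers/tools would dynamically construct for this example cluster would look something like the following (hand-assembled here purely for illustration, rather than being the output of any tool):<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">mongodb://testcluster-shard-00-00-abcd.mongodb.net:27017,testcluster-shard-00-01-abcd.mongodb.net:27017,testcluster-shard-00-02-abcd.mongodb.net:27017/test?ssl=true&replicaSet=TestCluster-shard-0&authSource=admin</span><br />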
<br />
<h3>
Summary</h3>
<br />
There are other potential causes of MongoDB Atlas connectivity issues that aren't covered in this post, but hopefully the tips highlighted here will help some of you, especially if you are diagnosing problems when using DNS SRV based service names in your connection URLs.<br />
<br />
<br />
<i>Song for today: Lose the Baby by <a href="https://en.wikipedia.org/wiki/Tropical_Fuck_Storm">Tropical Fuck Storm</a></i><br />
<br />Paul Donehttp://www.blogger.com/profile/09556312012162376804noreply@blogger.com5tag:blogger.com,1999:blog-1304066656993695443.post-7539807974899631322019-05-11T21:30:00.003+01:002023-05-09T19:11:24.086+01:00Running a Mongo Shell Script From Within A Larger Bash Script<div><i>[EDIT May 2023: The post below was written for the legacy 'mongo' shell but has since been tested with the modern 'mongosh' shell, which behaves the same with no issues.]</i></div><div><br /></div><div>
If you have a Bash script that, amongst other things, needs to execute a set of Mongo Shell commands together, there are a number of approaches that can be taken. This blog post contains nothing revelatory, but hopefully at least captures examples of these approaches in one single place for easy future reference. There are many situations where this is required, for example:</div>
<div>
<div>
<ul>
<li>From within a Docker container image’s <a href="https://docs.docker.com/engine/reference/builder/#entrypoint">Entrypoint</a>, running a Bash script which includes a section of Mongo Shell JavaScript code to configure a MongoDB replica-set, using <a href="https://docs.mongodb.com/manual/reference/method/rs.initiate/">rs.initiate()</a> and associated commands.</li>
<li>From within a Continuous Integration process, running a Bash script which installs a MongoDB environment in a host Operating System (OS) and then populates the new MongoDB database with some sample data, using a set of Mongo Shell <a href="https://docs.mongodb.com/manual/crud/">CRUD</a> commands.</li>
<li>From within a host system’s monitoring Bash script, which, in addition to gathering some host OS metrics, invokes a set of MongoDB’s server <a href="https://docs.mongodb.com/manual/reference/method/db.serverStatus/">status</a> and <a href="https://docs.mongodb.com/manual/reference/method/db.stats/">statistics</a> commands to also capture database metrics.</li>
</ul>
</div>
<div>
The rest of this blog post shows some of the different approaches that can be taken to execute a block of Mongo Shell JavaScript code from within a larger Bash script. In these specific examples a trivial block of JavaScript code will insert 2 records into a ‘persons’ database collection, then query and print both the records belonging to the collection and then remove the 2 records from the collection.</div>
<div>
<br /></div>
<div>
It is worth noting that there is a <a href="https://docs.mongodb.com/manual/tutorial/write-scripts-for-the-mongo-shell/">difference in some of Mongo Shell’s behaviour</a> when running a block of JavaScript code in the Mongo Shell’s <b>Scripted mode</b> rather than its <b>Interactive mode</b>, including the inability to run Shell Helper commands (e.g. <span style="font-family: "courier new" , "courier" , monospace;">use db</span>, <span style="font-family: "courier new" , "courier" , monospace;">show collections</span>, etc.).<br />
<br /></div>
<br /></div>
</div>
<div>
<br /></div>
<div>
<h3>
1. EXTERNAL SCRIPT FILE</h3>
<div>
<br />
This option requires executing a separate file which contains the block of JavaScript code. First create a new JavaScript file called <span style="font-family: "courier new" , "courier" , monospace;">test.js</span> with the following content:</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">db = db.getSiblingDB('testdb');</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">db.persons.insertOne({'firstname': 'Sarah', 'lastname': 'Smith'});</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">db.persons.insertOne({'firstname': 'John', 'lastname': 'Jones'});</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">db.persons.find({}, {'_id': 0, 'firstname': 1}).forEach(printjson);</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">print(db.persons.remove({}));</span></div>
<div>
<br /></div>
<div>
Then create, make executable, and run a new Bash <span style="font-family: "courier new" , "courier" , monospace;">.sh</span> script file with the following content (this will run the Mongo Shell in <b>Scripted</b> mode):</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">#!/bin/bash</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">echo "Doing some Bash script work first"</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">mongo --quiet ./test.js</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">echo "Doing some more Bash script work afterwards"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
</div>
<div>
<br /></div>
<div>
<h3>
2. SINGLE-LINE EVAL SCRIPT</h3>
<div>
<br />
This option involves executing the Mongo Shell with its <span style="font-family: "courier new" , "courier" , monospace;">eval</span> option, passing in a single line containing each of the JavaScript commands separated by a semicolon. Create, make executable, and run a new Bash <span style="font-family: "courier new" , "courier" , monospace;">.sh</span> script file with the following content (this will run the Mongo Shell in <b>Scripted</b> mode):</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">#!/bin/bash</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">echo "Doing some Bash script work first"</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">mongo --quiet --eval "db = db.getSiblingDB('testdb'); db.persons.insertOne({'firstname': 'Sarah', 'lastname': 'Smith'}); db.persons.insertOne({'firstname': 'John', 'lastname': 'Jones'}); db.persons.find({}, {'_id': 0, 'firstname': 1}).forEach(printjson); print(db.persons.remove({}));"</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">echo "Doing some more Bash script work afterwards"</span></div>
<div>
<br /></div>
<div>
<i>Note</i>: Depending on your desktop resolution, your browser may show the Mongo Shell command wrapping onto multiple lines. However, it is actually just a single line, which can be proved by copying the line into a text editor which has its ‘text wrapping’ feature disabled.<br />
<br /></div>
</div>
<div>
<br /></div>
<div>
<h3>
3. MULTI-LINE EVAL SCRIPT</h3>
<div>
<br />
This option involves executing the Mongo Shell with its <span style="font-family: "courier new" , "courier" , monospace;">eval</span> option, passing in a block of multiple lines of JavaScript code, where the start and end of the code block are delimited by single or double quotes. Create, make executable, and run a new Bash <span style="font-family: "courier new" , "courier" , monospace;">.sh</span> script file with the following content (this will run the Mongo Shell in <b>Scripted</b> mode):</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">#!/bin/bash</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">echo "Doing some Bash script work first"</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">mongo --quiet --eval "</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> db = db.getSiblingDB('testdb');</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> db.persons.insertOne({'firstname': 'Sarah', 'lastname': 'Smith'});</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> db.persons.insertOne({'firstname': 'John', 'lastname': 'Jones'});</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> db.persons.find({}, {'_id': 0, 'firstname': 1}).forEach(printjson);</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> print(db.persons.remove({}));</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">"</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">echo "Doing some more Bash script work afterwards"</span></div>
<div>
<br /></div>
<div>
<i>Note</i>: Care has to be taken to ensure that any quotes used within the JavaScript code block are single-quotes, if the Mongo Shell’s <span style="font-family: "courier new" , "courier" , monospace;">eval</span> delimiters are double-quotes, or vice versa.<br />
<br /></div>
</div>
<div>
<br /></div>
<div>
<h3>
4. MULTI-LINE SCRIPT WITH HERE-DOC</h3>
<div>
<br />
This option involves redirecting the content of a block of JavaScript multi-line code into the standard input (‘stdin’) stream of the Mongo Shell program, using a <a href="http://tldp.org/LDP/abs/html/here-docs.html">Bash Here-Document</a>. Create, make executable, and run a new Bash <span style="font-family: "courier new" , "courier" , monospace;">.sh</span> script file with the following content (unlike the other approaches this will run the Mongo Shell in <b>Interactive</b> mode):</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">#!/bin/bash</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">echo "Doing some Bash script work first"</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">mongo --quiet <<EOF</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> show dbs;</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> db = db.getSiblingDB("testdb");</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> db.persons.insertOne({'firstname': 'Sarah', 'lastname': 'Smith'});</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> db.persons.insertOne({'firstname': 'John', 'lastname': 'Jones'});</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> db.persons.find({}, {'_id': 0, 'firstname': 1}).forEach(printjson);</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> print(db.persons.remove({}));</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">EOF</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">echo "Doing some more Bash script work afterwards"</span></div>
<div>
<br /></div>
<div>
In this case, because the Mongo Shell is run in Interactive mode, the output of the script will be more verbose. Also, by virtue of running in Interactive mode, <a href="https://docs.mongodb.com/manual/tutorial/write-scripts-for-the-mongo-shell/#differences-between-interactive-and-scripted-mongo">Shell Helper</a> commands can now be used within the JavaScript code. To illustrate this, the block of code above contains the additional line <span style="font-family: "courier new" , "courier" , monospace;">show dbs;</span> as its first line. However, don’t take this example as a recommendation to use Shell Helpers in your scripts. Generally, you should avoid using Shell Helpers in any of your Mongo Shell scripts, regardless of which approach you use.</div>
<div>
<br /></div>
<div>
Also, because the Mongo Shell <span style="font-family: "courier new" , "courier" , monospace;">eval</span> option is not being used, the JavaScript code can contain a mix of both single and double quotes, as illustrated by the modified line of code <span style="font-family: "courier new" , "courier" , monospace;">db = db.getSiblingDB("testdb");</span> shown above, which utilises double-quotes.</div>
</div>
<div>
<br />
<div style="-webkit-text-stroke-width: 0px; color: black; font-family: "Times New Roman"; font-size: medium; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
</div>
<br />
<h3 style="-webkit-text-stroke-width: 0px; color: black; font-family: "Times New Roman"; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; letter-spacing: normal; orphans: 2; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
Another Observation</h3>
</div>
<div>
<br />
It is worth noting that, for all of these four methods apart from the <i>External Script File</i> method, you can reference <a href="http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO-5.html">Bash environment variables</a> inline within the Mongo Shell JavaScript code (as long as double-quotes delimit the code for the <i>eval</i> methods, rather than single-quotes). For example, from a Bash terminal, if you have set a variable with the name of the database to write to...</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">export DBNAME=testdb</span></div>
</div>
<div>
<br /></div>
<div>
... you can then use the value of this environment variable from within the inline Mongo Shell JavaScript...</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">db = db.getSiblingDB('${DBNAME}');</span><br />
<br /></div>
<div>
...to factor out the database name. At face value this may not seem particularly powerful until you realise that many <i>build</i> frameworks (e.g. Docker Compose, Ansible, etc.) allow you to declare environment variables within configuration settings before invoking Bash scripts, to factor out environment specific settings.<br />
<br />
One bit of caution though: MongoDB's query operators include a dollar sign in their syntax (e.g. '$gt', '$exists'), which will need to be escaped in these scripts (e.g. '\$gt', '\$exists'). Otherwise, Bash will treat each dollar sign as the start of a variable reference which, in this case, will likely result in the operator being replaced with some empty text.<br />
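<br />
For example, the following minimal Bash sketch (assuming a hypothetical 'age' field on the 'persons' documents) escapes the operator's dollar sign so that Bash passes it through to the Mongo Shell untouched, while the <span style="font-family: "courier new" , "courier" , monospace;">${DBNAME}</span> variable reference is deliberately left unescaped so that Bash expands it:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">#!/bin/bash</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">export DBNAME=testdb</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"># \$gt is escaped for Bash; ${DBNAME} is left for Bash to expand</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">mongo --quiet --eval "</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">  db = db.getSiblingDB('${DBNAME}');</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">  db.persons.find({'age': {\$gt: 30}}).forEach(printjson);</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">"</span><br />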
<br />
<br />
<h3>
Summary</h3>
<div>
<br />
The following table summarises the main differences between the four approaches to running a JavaScript block of code with the Mongo Shell, from within a larger Bash script:</div>
</div>
<div>
<br /></div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpB1aEzAmGKGq0NpPYqeDwwyPLKN5D9b2WafEHT3ZIU2ssEA50yXFhyBcjh5UyIrlgCGhodTUyV8Mozf1su9kJn3oAAF9VqJ2N0M6N_sLEsHsb0-O8aOKw-HVmRu8xikBFte4CupvwJ-0/s1600/tab.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="753" data-original-width="1600" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpB1aEzAmGKGq0NpPYqeDwwyPLKN5D9b2WafEHT3ZIU2ssEA50yXFhyBcjh5UyIrlgCGhodTUyV8Mozf1su9kJn3oAAF9VqJ2N0M6N_sLEsHsb0-O8aOKw-HVmRu8xikBFte4CupvwJ-0/s640/tab.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
</div>
<div>
<br />
<br />
<i>Song for today: D. Feathers by <a href="https://en.wikipedia.org/wiki/Bettie_Serveert">Bettie Serveert</a></i><br />
<br /></div>
Paul Donehttp://www.blogger.com/profile/09556312012162376804noreply@blogger.com2tag:blogger.com,1999:blog-1304066656993695443.post-652083507729201322018-04-13T12:15:00.002+01:002020-04-19T21:55:19.439+01:00MongoDB Graph Query Example, Inspired by Designing Data-Intensive Applications Book<h3>
<span style="font-size: large;">
Introduction</span></h3>
<br />
People who have worked with me recently are probably bored by me raving about how good this book is: <a href="https://dataintensive.net/">Designing Data-Intensive Applications</a> by <a href="https://twitter.com/martinkl">Martin Kleppmann</a> (O'Reilly, 2016). Suffice to say, if you are in IT and have any sort of interest in databases and/or data-driven applications, you should read this book. You will be richly rewarded for the effort.<br />
<br />
In the second chapter of the book ('Data Models and Query Languages'), Martin has a section called 'Graph Like Data Models' which explores 'graph use cases' where many-to-many relationships are typically modelled with tree-like structures, with indeterminate numbers of inter-connections. The book section shows how a specific 'graph problem' can be solved by using a dedicated graph database technology with associated query language (<a href="https://www.opencypher.org/">Cypher</a>) and by using an ordinary relational database with associated query language (<a href="https://en.wikipedia.org/wiki/SQL">SQL</a>). One thing that quickly becomes evident, when reading this section of the book, is how difficult it is in a relational database to model complex many-to-many relationships. This may come as a surprise to some people. However, this is consistent with something I've subconsciously learnt over 20 years of using relational databases, which is, relationships ≠ relations, in the world of <a href="https://en.wikipedia.org/wiki/Relational_database_management_system">RDBMS</a>.<br />
<br />
The graph scenario illustrated in the book shows an example of two people, Lucy and Alain, who are married to each other, were born in different places, and now live together in a third place. For clarity, I've included the diagram from the book, below, to best illustrate the scenario (annotated with the book's details, in red, for reference).<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMyiG7BFXdJQ6loAmX-7T1FItE6D-UwiCG5lD7AOVBD236YARwhvkBP-SYK1YbKmaq2bOBc2TGzpD1cDhRGrbfFQzQu-kiwLXRUpOfSsaDO5_dv-Ec-90JwpYYdltDAjW0FG1K2MsF-Ys/s1600/ddia_book.png" imageanchor="1"><img border="0" data-original-height="605" data-original-width="1074" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMyiG7BFXdJQ6loAmX-7T1FItE6D-UwiCG5lD7AOVBD236YARwhvkBP-SYK1YbKmaq2bOBc2TGzpD1cDhRGrbfFQzQu-kiwLXRUpOfSsaDO5_dv-Ec-90JwpYYdltDAjW0FG1K2MsF-Ys/s640/ddia_book.png" width="640" /></a></div>
<br />
<br />
Throughout the book, numerous types of databases and data-stores are illustrated, compared and contrasted, including MongoDB in many places. However, the book's section on graph models doesn't show how MongoDB can be used to solve the example graph scenario, so I thought I'd take this task on myself. Essentially, the premise is that there is a data-set of many people, with data on the place each person was born in and the place each person now lives in. Of course, any given place may be within a larger named place, which may in turn be within a larger named place, and so on, as illustrated in the diagram above. In the rest of this blog post I show one way that such data structures and relationships can be modelled in MongoDB and then leveraged by MongoDB's graph query capabilities (specifically using the <a href="https://docs.mongodb.com/manual/reference/operator/aggregation/graphLookup/">graph lookup</a> feature of MongoDB's <a href="https://docs.mongodb.com/manual/core/aggregation-pipeline/">Aggregation Framework</a>). What will be demonstrated is how to efficiently answer the exam question posed by the book, namely: '<b><i>Find People Who Emigrated From US To Europe</i></b>'.<br />
<br />
<br />
<h3>
<span style="font-size: large;">
Solving The Book's Graph Challenge With MongoDB</span></h3>
<br />
To demonstrate the use of MongoDB's Aggregation 'graph lookup' capability to answer the question '<i>Find People Who Emigrated From US To Europe</i>', I've created the following two MongoDB collections, populated with data:<br />
<ol>
<li><b>'persons' collection</b>. Contains around one million randomly generated person records, where each person has 'born_in' and 'lives_in' attributes, which each reference a 'starting' place record in the places collection.</li>
<li><b>'places' collection</b>. Contains hierarchical geographical places data, with the graph structure of: <i>SUBDIVISIONS-->COUNTRIES-->SUBREGIONS-->CONTINENTS</i>. Note: The granularity and hierarchy of the data-set is slightly different from that illustrated in the book, due to the sources of geographical data I had available to cobble together.</li>
</ol>
Similar to the book's example, amongst the many 'persons' records stored in the MongoDB data-set are the following two records, relating to 'Lucy' and 'Alain'.<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">{fullname: '<b>Lucy Smith</b>', born_in: 'Idaho', lives_in: 'England'}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">{fullname: '<b>Alain Chirac</b>', born_in: 'Bourgogne-Franche-Comte', lives_in: 'England'}</span><br />
<div>
<br /></div>
Below is an excerpt of some of the records from the 'places' collection, which illustrates how a place record may refer to another place record, via its 'part_of' attribute.<br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">{name: '<b>England</b>', type: 'subdivision', part_of: 'United Kingdom of Great Britain and Northern Ireland'}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">..</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">{name: '<b>United Kingdom of Great Britain and Northern Ireland</b>', type: 'country', part_of: 'Northern Europe'}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">..</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">{name: '<b>Northern Europe</b>', type: 'subregion', part_of: 'Europe'}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">..</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">{name: '<b>Europe</b>', type: 'continent', part_of: ''}</span><br />
<div>
<br /></div>
If you want to access this data yourself and load it into the two MongoDB database collections, I've created JSON exports of both collections and made these available in a <a href="https://github.com/pkdone/GraphPersonsAndPlaces">GitHub project</a> (see the project's README for more details on how to load the data into MongoDB and then how to actually run the example's 'graph lookup' aggregation pipeline).<br />
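<br />
As a rough sketch of the kind of commands involved (the database name and export file names here are assumptions for illustration; defer to the project's README for the real instructions), loading the two JSON exports with the <i>mongoimport</i> tool might look something like this:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ mongoimport --db testdb --collection persons --drop --file persons.json</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ mongoimport --db testdb --collection places --drop --file places.json</span><br />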
<br />
The MongoDB aggregation pipeline I created, to process the data across these two collections and to answer the question '<b><i>Find People Who Emigrated From US To Europe'</i></b>, has the following stages:<br />
<ol>
<li><b>$graphLookup:</b> For every record in the 'persons' collection, using the person's '<b>born_in</b>' attribute, locate the matching record in the 'places' collection and then walk the chain of ancestor place records building up a hierarchy of 'born in' place names.</li>
<li><b>$match:</b> Only keep 'persons' records, where the 'born in' hierarchy of discovered place names includes '<b>United States of America</b>'.</li>
<li><b>$graphLookup:</b> For each of these remaining 'persons' records, using each person's '<b>lives_in</b>' attribute, locate the matching record in the 'places' collection and then walk the chain of ancestor place records building up a hierarchy of 'lives in' place names.</li>
<li><b>$match:</b> Only keep the remaining 'persons' records where the 'lives in' hierarchy of discovered place names includes '<b>Europe</b>'.</li>
<li><b>$project:</b> For the resulting records to be returned, just show the attributes 'fullname', 'born_in' and 'lives_in'.</li>
</ol>
<br />
The actual MongoDB Aggregation Pipeline for this is:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">db.persons.aggregate([</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> {$graphLookup: {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> from: 'places',</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> startWith: '$born_in',</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> connectFromField: 'part_of',</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> connectToField: 'name',</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> as: 'born_hierarchy'</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> }},</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> {$match: {'born_hierarchy.name': <b><span style="color: red;">born</span></b>}},</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> {$graphLookup: {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> from: 'places',</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> startWith: '$lives_in',</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> connectFromField: 'part_of',</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> connectToField: 'name',</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> as: 'lives_hierarchy'</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> }},</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> {$match: {'lives_hierarchy.name': <b><span style="color: red;">lives</span></b>}},</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> {$project: {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> _id: 0,</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> fullname: 1, </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> born_in: 1, </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> lives_in: 1, </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> }}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">])</span><br />
<br />
When this aggregation is executed, after first declaring values for the variables highlighted in red...<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">var <b><span style="color: red;">born</span></b> = 'United States of America', <b><span style="color: red;">lives</span></b> = 'Europe'</span><br />
<br />
...the following is an excerpt of the output that is returned by the aggregation:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">{fullname: '<b>Lucy Smith</b>', born_in: 'Idaho', lives_in: 'England'}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">{fullname: 'Bobby Mc470', born_in: 'Illinois', lives_in: 'La Massana'}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">{fullname: 'Sandy Mc1529', born_in: 'Mississippi', lives_in: 'Karbinci'}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">{fullname: 'Mandy Mc2131', born_in: 'Tennessee', lives_in: 'Budapest'}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">{fullname: 'Gordon Mc2472', born_in: 'Texas', lives_in: 'Tyumenskaya oblast'}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">{fullname: 'Gertrude Mc2869', born_in: 'United States of America', lives_in: 'Planken'}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">{fullname: 'Simon Mc3087', born_in: 'Indiana', lives_in: 'Ribnica'}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">..</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">..</span><br />
<br />
On my laptop, using the data-set of a million person records, the aggregation takes about 45 seconds to complete. However, if I first define the index...<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">db.places.createIndex({name: 1})</span><br />
<br />
...and then run the aggregation, it only takes around 2 seconds to execute. This shows just how efficiently the 'graphLookup' capability is able to walk a graph of relationships, by leveraging an appropriate index.<br />
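<br />
If you want to double-check that an aggregation is actually able to leverage the index, you can ask the database for the query plan rather than the results, by passing the 'explain' option to the aggregate command. A quick sketch (shown here against just the first 'graphLookup' stage, for brevity):<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">db.persons.aggregate([</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">  {$graphLookup: {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    from: 'places',</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    startWith: '$born_in',</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    connectFromField: 'part_of',</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    connectToField: 'name',</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    as: 'born_hierarchy'</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">  }}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">], {explain: true})</span><br />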
<br />
<br />
<h3>
<span style="font-size: large;">
Summary</span></h3>
<br />
I've shown the expressiveness and power of MongoDB's aggregation framework, combined with 'graphLookup' pipeline stages, to perform a query of a graph of relationships across many records. A 'graphLookup' stage is efficient as it avoids the need to develop client application logic to programmatically navigate each hop of a graph of relationships, and thus avoids the network round trip latency that a client, traversing each hop, would otherwise incur. The 'graphLookup' stage can and should leverage an index, to enable the 'tree-walk' process to be even more efficient.<br />
<br />
Although MongoDB may not be as rich in terms of the number of graph processing primitives it provides, compared with 'dedicated' graph databases, it possesses some key advantages for 'graph' use cases:<br />
<ol>
<li><i><b>Business Critical Applications</b></i>. MongoDB is designed for, and invariably deployed as, a realtime operational database, with built-in high availability and enterprise security capabilities to support realtime business critical uses. Dedicated graph databases tend to be built for 'back-office' and 'offline' analytical uses, with less focus on high availability and security. If there is a need to leverage a database to respond to graph queries in realtime for applications sensitive to latency, availability and security, MongoDB is likely to be a great fit.</li>
<li><b><i>Cost of Ownership & Timeliness of Insight.</i></b> Often, there may be requirements to satisfy <a href="https://en.wikipedia.org/wiki/Create,_read,_update_and_delete">CRUD</a> random realtime operations on individual data records and satisfy graph-related analysis of the data-set as a whole. Traditionally, this would require an ecosystem containing two types of database, an operational database and a graph analytical database. A set of <a href="https://en.wikipedia.org/wiki/Extract,_transform,_load">ETL</a> processes would then need to be developed to keep the duplicated data synchronised between the two databases. By combining both roles in a single MongoDB distributed database, with appropriate <a href="https://docs.mongodb.com/manual/core/workload-isolation/">workload isolation</a>, the financial cost of this complexity can be greatly reduced, due to a far simpler deployment. Additionally, and as a consequence, there is none of the lag that arises from keeping one copy of the data in one system up to date with the other copy in a second system. Rather than operating on stale data, the graph analytical workloads operate on current data to provide more accurate business insight.</li>
</ol>
<br />
<br />
<i>Song for today: Cosmonauts by <a href="https://en.wikipedia.org/wiki/Quicksand_(American_band)">Quicksand</a></i>Paul Donehttp://www.blogger.com/profile/09556312012162376804noreply@blogger.com3tag:blogger.com,1999:blog-1304066656993695443.post-50693007086414504392018-02-04T09:33:00.000+00:002020-07-04T13:54:54.858+01:00Run MongoDB Aggregation Facets In Parallel For Faster Insight<h2>
Introduction</h2>
MongoDB version 3.4 introduced a new Aggregation stage, <a href="https://docs.mongodb.com/manual/reference/operator/aggregation/facet/">$facet</a>, to enable developers to "<i>create multi-faceted aggregations which characterize data across multiple dimensions, or facets, within a single aggregation stage</i>". For example, you may run a clothes retail website and use this aggregation capability to characterise the choices across a set of filtered products, by the following facets, simultaneously:<br />
<ol>
<li>Size (e.g. S, M, L)</li>
<li>Full-price vs On-offer</li>
<li>Brand (e.g. Nike, Adidas)</li>
<li>Average Rating (e.g. 1 - 5 stars)</li>
</ol>
In this blog post, I explore a way in which the response times for faceted aggregation workloads can be reduced, by leveraging parallel processing.<br />
<br />
<h2>
Parallelising Aggregated Facets</h2>
If an aggregation pipeline declares the use of the <i>$facet</i> stage, it defines multiple facets, where each facet is a "sub-pipeline" containing a series of actions specific to its facet. When a faceted aggregation is executed, the result of the aggregation will contain the combined output of all the facets' sub-pipelines. Below is an example of the structure of a "faceted" aggregation pipeline.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifyc-yn2wpJy2FTcqKk2hYALg6mAZrITRPBuCYruhFo9uVH_K-lT4nB8Ab056xi7ft8i8qF0exhuryI21YBneeKjvoc5VZmqnU35XXAP7IR9yRPvn_yK05AnzwzKVevLHgz9oQyxYOrC8/s1600/bg.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="500" data-original-width="929" height="172" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifyc-yn2wpJy2FTcqKk2hYALg6mAZrITRPBuCYruhFo9uVH_K-lT4nB8Ab056xi7ft8i8qF0exhuryI21YBneeKjvoc5VZmqnU35XXAP7IR9yRPvn_yK05AnzwzKVevLHgz9oQyxYOrC8/s320/bg.png" width="320" /></a></div>
In this example, there are two facets or dimensions, each containing a sub-pipeline. Each sub-pipeline is essentially a regular aggregation pipeline, with just a <a href="https://docs.mongodb.com/manual/reference/operator/aggregation/facet/#behavior">small handful of restrictions</a> on what it can contain. Notable amongst these restrictions is the fact that the sub-pipeline cannot contain a <i>$facet</i> stage. Therefore you can't use this to go infinite levels deep!<br />
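<br />
To make the structure concrete, here is a minimal sketch of a faceted pipeline for the clothes retail example from the introduction (assuming a hypothetical 'products' collection with 'size' and 'brand' fields), where each facet simply counts the occurrences of each distinct value:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">db.products.aggregate([</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">  {$facet: {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    'sizes': [</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">      {$sortByCount: '$size'}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    ],</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    'brands': [</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">      {$sortByCount: '$brand'}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    ]</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">  }}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">])</span><br />
<br />
The result is a single document containing one array of results per facet ('sizes' and 'brands' in this sketch).<br />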
<br />
The ability to define an aggregation containing different facets is not just useful for responding to online user interactions, in realtime. It is also useful for activities such as running a business's "internal reporting" workloads, where a report may need to analyse a full data set, and then summarise the data in different dimensions.<br />
<div>
<br /></div>
A data set that I've been playing around with recently is the publicly available "MOT UK Annual Vehicle Test Result Data". An <a href="https://en.wikipedia.org/wiki/MOT_test">MOT</a> is a UK annual safety check on a vehicle, and is mandatory for all cars over 3 years old. The UK government makes the <a href="https://data.gov.uk/dataset/anonymised_mot_test">data available to download</a>, in anonymised form, for anyone to consume, through its <a href="http://data.gov.uk/">data.gov.uk</a> platform. It's a rich data set, providing a lot of insight into the characteristics of cars that UK residents have been driving over the last ten years or so. As a result, it's a good data set for me to use to explore faceted aggregations.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgP0n8oVoBQlZfi0gAKSiTj5_4AP5vGSOuyAn_Zil3ixxhODTKSdMnMHN0Y3zSBYp_ZrZg_UbdUcyfuoNQPBAsy_DPsDmL7L9yvDP2DhDbgnRghI6OzKG6KtRP0bIi0knNCpOFr67Uixy4/s1600/mot.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1068" data-original-width="1600" height="425" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgP0n8oVoBQlZfi0gAKSiTj5_4AP5vGSOuyAn_Zil3ixxhODTKSdMnMHN0Y3zSBYp_ZrZg_UbdUcyfuoNQPBAsy_DPsDmL7L9yvDP2DhDbgnRghI6OzKG6KtRP0bIi0knNCpOFr67Uixy4/s640/mot.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">2014-2016 MOT car data loaded into MongoDB - displayed in MongoDB Compass</td></tr>
</tbody></table>
<br />
To analyse the car data, I created a <a href="https://github.com/">GitHub</a> project at <a href="https://github.com/pkdone/mongo-uk-car-data">mongo-uk-car-data</a>. This contains some Python scripts to load the data from the MOT data CSV files into MongoDB, and to perform various analytics on the data set using MongoDB's <a href="https://docs.mongodb.com/manual/aggregation/">Aggregation Framework</a>. One of the Python scripts I created, <a href="https://github.com/pkdone/mongo-uk-car-data/blob/master/mdb-mot-agg-cars-facets.py">mdb-mot-agg-cars-facets.py</a>, uses a <i>$facet</i> stage to aggregate together summary information, in the following three different dimensions:<br />
<ol>
<li>Analyse the different car makes/brands (e.g. Ford, Vauxhall) and categorise them into a range of "buckets", based on how many different unique models each car make has.</li>
<li>Summarise the amount of tested cars that fall into each fuel type category (e.g. Petrol, Diesel, Electric). <i>Note</i>: "Petrol" is equivalent to "Gas" for my American friends, I believe.</li>
<li>List the top 5 car makes/brands from the car tests, showing how many cars there are for each car make, plus each car make's most popular and least popular models.</li>
</ol>
The following shows the result of the aggregation when run against the data set for years 2014-16 (a data set of approximately 113 million records).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5ZI_ph8gbPiK8BiZf3ec-fa1-LdOQg-RNg2g9tmYKRUCwGcxbJYuCIk7JaXMPbeDVozi_j_L4WVcd2UXaA_mjubrblbyQsKaW1ldXvJ_XJqH35wNfPA3BmggNBMHyOqwkqWPx8YdvZN8/s1600/1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1600" data-original-width="948" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5ZI_ph8gbPiK8BiZf3ec-fa1-LdOQg-RNg2g9tmYKRUCwGcxbJYuCIk7JaXMPbeDVozi_j_L4WVcd2UXaA_mjubrblbyQsKaW1ldXvJ_XJqH35wNfPA3BmggNBMHyOqwkqWPx8YdvZN8/s640/1.png" width="377" /></a></div>
<br />
When I ran this test on my Linux laptop (hosting both a <i>mongod</i> server and the test Python script), the aggregation <b>completed in about 5:20 minutes</b>. In my test Python client code, a faceted pipeline is constructed and the <a href="https://api.mongodb.com/python/current/">PyMongo Driver</a> is used to send the <a href="https://docs.mongodb.com/manual/reference/method/db.collection.aggregate/">aggregation command</a> and pipeline payload to the MongoDB database. Significantly, the database's Aggregation framework processes each facet's sub-pipeline serially. Therefore, for example, if the first facet takes 5 minutes to process, the second takes 2 minutes and the third takes 10 minutes, the client application will only receive a full response in just over 17 minutes.<br />
<br />
It occurred to me that there was a way to potentially speed up the execution time of this analytics job. At the point of invoking <span style="background-color: white; color: #24292e; font-family: "courier new" , "courier" , monospace; white-space: pre;">collection.aggregate(pipeline)</span><i> </i>in the <a href="https://github.com/pkdone/mongo-uk-car-data/blob/master/mdb-mot-agg-cars-facets.py">mdb-mot-agg-cars-facets.py</a> script, a custom function could be invoked instead, that internally breaks the pipeline up into separate pipelines, one for each facet. The function could then send each facet as a separate aggregation command, in parallel, to the MongoDB database to be processed, before merging the results into one, and returning it. Functionally, the behaviour of this code and the content of the response would be identical, but I hoped the response time would be significantly less. So I replaced the line of code that directly invoked the MongoDB aggregation command, <span style="background-color: white; color: #24292e; font-family: "courier new" , "courier" , monospace; white-space: pre;">collection.aggregate(pipeline)</span>, with a call to my new function <span style="font-family: "courier new" , "courier" , monospace;"><span style="background-color: white; color: #24292e; white-space: pre;">aggregate_facets_in_parallel(collection, pipeline)</span></span>, instead. The implementation of this function can be seen in the Python file <a href="https://github.com/pkdone/mongo-uk-car-data/blob/master/parallel_facets_agg.py">parallel_facets_agg.py</a>. The function uses Python's <i>multiprocessing.pool.ThreadPool</i> library to send each facet's sub-pipeline in a separate client thread and waits for all parallel aggregations to complete before returning the combined result.<br />
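<br />
The real implementation is in Python (see <a href="https://github.com/pkdone/mongo-uk-car-data/blob/master/parallel_facets_agg.py">parallel_facets_agg.py</a>), but purely to illustrate the principle of client-side parallelism, the following rough Bash analogue fires off each facet's sub-pipeline as a separate backgrounded Mongo Shell process and then waits for all of them to finish (the collection name and sub-pipeline contents here are just placeholders):<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">#!/bin/bash</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"># run each facet's sub-pipeline as a separate background process</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">mongo --quiet --eval "printjson(db.vehicles.aggregate([/* facet 1 sub-pipeline */]).toArray())" > facet1.out &</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">mongo --quiet --eval "printjson(db.vehicles.aggregate([/* facet 2 sub-pipeline */]).toArray())" > facet2.out &</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">mongo --quiet --eval "printjson(db.vehicles.aggregate([/* facet 3 sub-pipeline */]).toArray())" > facet3.out &</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">wait  # block until all three background aggregations have completed</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">cat facet1.out facet2.out facet3.out  # naive merge of the three facets' results</span><br />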
<br />
This time, when I ran the test Python script against the same data set, I received the exact same result, but in a <b>time of just 3:30 minutes</b> (versus the original time of 5:20 minutes). This is not a bad speed up! :-D<br />
<br />
<h2>
Some Observations</h2>
Some people may look at this and ask why, given that there were 3 facets, the aggregation didn't respond in just one third of the original time (i.e. in around 1:47 minutes). Well, there are many reasons, including:<br />
<ol>
<li>This would assume each separate facet sub-pipeline takes the same amount of time to execute, which is highly unlikely. The different facet sub-pipelines will each have different complexities and processing requirements. The overall response time cannot be any faster than the slowest of the 3 facet sub-pipelines.</li>
<li>Just because the client code spawns 3 "concurrent" threads, it doesn't mean that these 3 threads are actually running completely in parallel. For example, my laptop has 2 CPU cores, which would be a cause of some resource contention. There will of course be many other potential causes of resource contention, such as multiple threads competing to retrieve different data from the same hard drive.</li>
<li>For my simple tests, the client test Python script is running on the same machine as the MongoDB database, and thus will consume some of the compute capacity (albeit, for these tests, it will mostly just be blocking and waiting).</li>
<li>In most real world cases (but not for my simple tests here), there may also be other workloads being processed by the MongoDB database simultaneously, consuming significant portions of the shared compute resources.</li>
</ol>
The other question people may ask is, if this is so simple, why doesn't the MongoDB server implement such parallelism itself for processing the different sections of an aggregation <i>$facet</i> stage. There are at least two reasons why this is not the case in MongoDB:<br />
<ol>
<li>My test scenario places some restrictions on the aggregation pipeline as a whole. Specifically, the top level pipeline must only contain one stage (the <i>$facet</i> stage) and my custom function throws an exception if this is not the case. This is fine and quite common where the workload is an analytical workload that needs to "full table scan" most or all of a data set. However, in the original retail example at the top of the post, the likelihood is that there would need to be a <a href="https://docs.mongodb.com/manual/reference/operator/aggregation/match/">$match</a> stage, before the <i>$facet</i> stage, to first restrict the multi-faceted clothes classifications based on a filter that the user has entered (e.g. product name contains "Black Trainers"). Thus, it may well be more efficient to perform the <i>$match</i> just once, to reduce the set of data to work with, before having this data passed on to a <i>$facet</i> stage. The workaround would be to duplicate the <i>$match</i> as the first stage of each of the <i>$facet</i> sub-pipelines, which could well turn out to be slower, as the same work would be repeated.</li>
<li>For the most part, MongoDB's runtime architecture does not attempt to divide and process an individual client request's <a href="https://en.wikipedia.org/wiki/Create,_read,_update_and_delete">CRUD</a> operations into parallel chunks, and instead processes the elements of an individual request serially. One reason why this is a good thing, is that typically a MongoDB database will be processing many requests in parallel from many clients. If one particular request was allowed to dominate the system's resources, by being parallelised "server-side" for a "burst of time", this may adversely affect other requests and cause the database to exhibit inconsistent performance as a whole. MongoDB's architecture generally encourages a fairer share of resources, spread across all clients' requests. For this reason, you may want to carefully consider how much you use the "client-side parallelism" tip in this blog post, in order to avoid abusing this "fair share" trust.</li>
</ol>
<div>
<br /></div>
<h2>
Summary</h2>
I've shown an example in this blog post of how multi-faceted MongoDB aggregations can be sped up by encouraging parallelism, from the client application's perspective. The choice of Python to implement this was fairly arbitrary. I could have implemented it in any programming language that there is a <a href="https://docs.mongodb.com/ecosystem/drivers/">MongoDB Driver</a> for, using the appropriate multi-threading libraries for that language. The parallelism benefits discussed in this post are really aimed at analytics type workloads that need to process a whole data-set to produce multi-faceted insight. This is in contrast to the sorts of workloads that would first match a far smaller subset of records, against an index, before then aggregating on the small data subset.<br />
<div>
<br />
<br /></div>
<br />
<i>Song for today: Bella Muerte by <a href="https://en.wikipedia.org/wiki/This_Patch_of_Sky">This Patch of Sky</a></i><br />
<br />Paul Donehttp://www.blogger.com/profile/09556312012162376804noreply@blogger.com1tag:blogger.com,1999:blog-1304066656993695443.post-15893046143996173872017-07-13T10:23:00.000+01:002020-06-24T10:39:24.283+01:00Deploying a MongoDB Sharded Cluster using Kubernetes StatefulSets on GKE<i>[Part 4 in a series of posts about running MongoDB on Kubernetes, with the Google </i><i>Kubernetes</i><i> Engine (GKE). For this post, a <b>newer</b> GitHub project <a href="https://github.com/pkdone/gke-mongodb-shards-demo">gke-mongodb-shards-demo</a> has been created to provide an example of a scripted deployment for a Sharded cluster specifically. This gke-mongodb-shards-demo project also incorporates the conclusions from the earlier posts in the series. Also see: <a href="http://k8smongodb.net/">http://k8smongodb.net/</a>]</i><br />
<br />
<br />
<h2>
<span style="font-size: x-large;">Introduction</span></h2>
In the previous posts of my blog series (<a href="http://pauldone.blogspot.co.uk/2017/06/deploying-mongodb-on-kubernetes-gke25.html">1</a>, <a href="http://pauldone.blogspot.co.uk/2017/06/mongodb-kubernetes-production-settings.html">2</a>, <a href="http://pauldone.blogspot.co.uk/2017/06/enterprise-mongodb-on-kubernetes.html">3</a>), I focused on deploying a MongoDB Replica Set in GKE's Kubernetes environment. A MongoDB <a href="https://docs.mongodb.com/manual/replication/">Replica Set</a> provides data redundancy and high availability, and is the basic building block for any mission critical deployment of MongoDB. In this post, I now focus on the deployment of a MongoDB Sharded Cluster, within the GKE Kubernetes environment. A <a href="https://docs.mongodb.com/manual/sharding/">Sharded Cluster</a> enables the database to be scaled out over time, to meet increasing throughput and data volume demands. Even for a Sharded Cluster, the recommendations from my previous posts are still applicable. This is because each Shard is a Replica Set, to ensure that the deployment exhibits high availability, in addition to scalability.<br />
<br />
<br />
<h2>
<span style="font-size: x-large;">Deployment Process</span></h2>
My <a href="http://pauldone.blogspot.co.uk/2017/06/deploying-mongodb-on-kubernetes-gke25.html">first blog post</a> on MongoDB & Kubernetes observed that, although Kubernetes is a powerful tool for provisioning and orchestrating sets of related containers (both stateless and now stateful), it is not a solution that caters for every required type of orchestration task. These tasks are invariably technology specific and need to operate below or above the “containers layer”. The example I gave in my <a href="http://pauldone.blogspot.co.uk/2017/06/deploying-mongodb-on-kubernetes-gke25.html">first post</a>, concerning the correct management of a MongoDB Replica Set's configuration, clearly demonstrates this point.<br />
<br />
In the modern <a href="https://en.wikipedia.org/wiki/Infrastructure_as_Code">Infrastructure as Code</a> paradigm, before orchestrating containers using something like Kubernetes, other tools are first required to provision infrastructure/<a href="https://en.wikipedia.org/wiki/Cloud_computing#Infrastructure_as_a_service_.28IaaS.29">IaaS</a> artefacts such as Compute, Storage and Networking. You can see a clear example of this in the <a href="https://github.com/pkdone/gke-mongodb-demo/blob/master/scripts/generate.sh">provisioning script used in my first post</a>, showing non-Kubernetes commands ("<i>gcloud</i>"), which are specific to the Google's Compute Platform (GCP), being used first, to provision storage disks. Once containers have been provisioned by a tool like Kubernetes, higher level configuration tasks, such as data loading, system user identity provisioning, secure network modification for service exposure and many other "final bootstrap" tasks, will also often need to be scripted.<br />
<br />
With the requirement here to deploy a MongoDB Sharded Cluster, the distinction between container orchestration tasks, lower level infrastructure/IaaS provisioning tasks and higher level technology-specific orchestration tasks, becomes even more apparent...<br />
<br />
For a MongoDB Sharded Cluster on GKE, the following categories of tasks must be implemented:<br />
<br />
<b>Infrastructure Level </b>(using Google's "<b>gcloud</b>" tool)<br />
<ol>
<li>Create 3 VM instances</li>
<li>Create storage disks of various sizes, for containers to attach to (see the example command after this list)</li>
</ol>
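<br />
As an indicative example of this infrastructure level (the disk name and size here are placeholders, not the project's actual values):<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ gcloud compute disks create pd-ssd-disk-1 --size=10GB --type=pd-ssd</span><br />
<br />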
<b>Container Level </b> (using Kubernetes' "<b>kubectl</b>" tool)<br />
<ol>
<li>Provision 3 "<a href="https://github.com/kubernetes/contrib/tree/master/startup-script">startup-script</a>" containers using a Kubernetes <a href="https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/">DaemonSet</a>, to enable the XFS filesystem to be used and to disable <a href="https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/">Huge Pages</a></li>
<li>Provision 3 "mongod" containers using a Kubernetes <a href="https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/">StatefulSet</a>, ready to be used as members of the <a href="https://docs.mongodb.com/manual/core/sharded-cluster-config-servers/">Config Server Replica Set</a> to host the <a href="https://docs.mongodb.com/manual/reference/config-database/">ConfigDB</a></li>
<li>Provision 3 separate Kubernetes <a href="https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/">StatefulSets</a>, one per Shard, where each StatefulSet is composed of 3 "mongod" containers ready to be used as members of the Shard's Replica Set</li>
<li>Provision 2 "mongos" containers using a Kubernetes <a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/">Deployment</a>, ready to be used for managing and routing client access to the Sharded database</li>
</ol>
<b>Database Level</b> (using MongoDB's "<b>mongo shell</b>" tool)<br />
<ol>
<li>For the Config Servers, run the <a href="https://docs.mongodb.com/manual/reference/method/rs.initiate/">initialisation command to form a Replica Set</a></li>
<li>For each of the 3 Shards (composed of 3 mongod processes), run the <a href="https://docs.mongodb.com/manual/reference/method/rs.initiate/">initialisation command to form a Replica Set</a></li>
<li>Connecting to one of the Mongos instances, run the <a href="https://docs.mongodb.com/manual/reference/method/sh.addShard/">addShard command</a> three times, once for each of the Shards, to fully assemble the Sharded Cluster</li>
<li>Connecting to one of the Mongos instances, under <a href="https://docs.mongodb.com/manual/core/security-users/#localhost-exception">localhost exception</a> conditions, create a database administrator user to apply to the cluster as a whole (these database-level steps are sketched just after this list)</li>
</ol>
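To make the database-level steps more concrete, below is a rough sketch of equivalent shell commands. The config server hostnames match the "--configdb" URL shown later in this post, but the shard pod and service names, the "$MONGOS_POD" variable, and the admin credentials are illustrative assumptions, not necessarily the exact values used in the example project:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;"># 1. Form the Config Server Replica Set, via the first config server container</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ kubectl exec mongod-configdb-0 -- mongo --eval 'rs.initiate({_id: "ConfigDBRepSet", version: 1, configsvr: true, members: [{_id: 0, host: "mongod-configdb-0.mongodb-configdb-service.default.svc.cluster.local:27017"}, {_id: 1, host: "mongod-configdb-1.mongodb-configdb-service.default.svc.cluster.local:27017"}, {_id: 2, host: "mongod-configdb-2.mongodb-configdb-service.default.svc.cluster.local:27017"}]});'</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"># 2. Form each Shard's Replica Set (shown for Shard 1 only; repeat for Shards 2 & 3)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ kubectl exec mongod-shard1-0 -- mongo --eval 'rs.initiate({_id: "Shard1RepSet", version: 1, members: [{_id: 0, host: "mongod-shard1-0.mongodb-shard1-service.default.svc.cluster.local:27017"}, {_id: 1, host: "mongod-shard1-1.mongodb-shard1-service.default.svc.cluster.local:27017"}, {_id: 2, host: "mongod-shard1-2.mongodb-shard1-service.default.svc.cluster.local:27017"}]});'</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"># 3. Register each Shard with the cluster, via one of the mongos routers (repeat for Shards 2 & 3)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ kubectl exec $MONGOS_POD -- mongo --eval 'sh.addShard("Shard1RepSet/mongod-shard1-0.mongodb-shard1-service.default.svc.cluster.local:27017");'</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"># 4. Create the cluster-wide admin user (possible without authenticating first, due to the localhost exception)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ kubectl exec $MONGOS_POD -- mongo admin --eval 'db.createUser({user: "main_admin", pwd: "abc123", roles: [{role: "root", db: "admin"}]});'</span><br />
<br />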
<br />
The quantities of resources highlighted above are specific to my example deployment, but the types and order of provisioning steps apply regardless of deployment size.<br />
<br />
In the accompanying <a href="https://github.com/pkdone/gke-mongodb-shards-demo">example project</a>, for brevity and clarity, I use a <a href="https://github.com/pkdone/gke-mongodb-shards-demo/blob/master/scripts/generate.sh">simple Bash shell script</a> to wire together these three different categories of tasks. In reality, most organisations would use more specialised automation software, such as <a href="https://en.wikipedia.org/wiki/Ansible_(software)">Ansible</a>, <a href="https://en.wikipedia.org/wiki/Puppet_(software)">Puppet</a>, or <a href="https://en.wikipedia.org/wiki/Chef_(software)">Chef</a>, to glue all such steps together.<br />
<br />
<br />
<h2>
<span style="font-size: x-large;">Kubernetes Controllers & Pods</span></h2>
The Kubernetes StatefulSet definitions for the MongoDB Config Server "mongod" containers and the Shard member "mongod" containers hardly differ from those described in my <a href="http://pauldone.blogspot.co.uk/2017/06/deploying-mongodb-on-kubernetes-gke25.html">first blog post</a>.<br />
<br />
Below is an excerpt of the StatefulSet definition for each <b>Config Server mongod container</b> (the Config Server-specific addition is highlighted in bold):<br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">containers:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - name: mongod-configdb-container</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> image: mongo</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> command:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "mongod"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "--port"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "27017"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "--bind_ip"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "0.0.0.0"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "--wiredTigerCacheSizeGB"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "0.25"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b> - "--configsvr"</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "--replSet"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "ConfigDBRepSet"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "--auth"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "--clusterAuthMode"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "keyFile"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "--keyFile"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "/etc/secrets-volume/internal-auth-mongodb-keyfile"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "--setParameter"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "authenticationMechanisms=SCRAM-SHA-1"</span><br />
<br />
Below is an excerpt of the StatefulSet definition for each <b>Shard member mongod container</b> (the Shard-specific addition is highlighted in bold):<br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">containers:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - name: mongod-shard1-container</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> image: mongo</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> command:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "mongod"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "--port"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "27017"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "--bind_ip"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "0.0.0.0"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "--wiredTigerCacheSizeGB"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "0.25"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b> - "--shardsvr"</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "--replSet"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "Shard1RepSet"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "--auth"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "--clusterAuthMode"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "keyFile"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "--keyFile"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "/etc/secrets-volume/internal-auth-mongodb-keyfile"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "--setParameter"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "authenticationMechanisms=SCRAM-SHA-1"</span><br />
<br />
For the Shard's container definition, the name of the specific Shard's Replica Set is declared. This Shard definition will result in 3 mongod replica containers being created for the Shard's Replica Set, called "Shard1RepSet". Two additional, similar StatefulSet resources have to be defined too, to represent the second Shard ("Shard2RepSet") and the third Shard ("Shard3RepSet"); deploying all three is sketched below.<br />
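Assuming each Shard's Service/StatefulSet lives in a similarly named resource file (the filenames here are an illustrative guess, not necessarily those used in the example project), deploying all three is just a short loop:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;"># Deploy one Service/StatefulSet per shard</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ for i in 1 2 3; do kubectl apply -f mongodb-shard${i}-service.yaml; done</span><br />
<br />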
<br />
To provision the Mongos Routers, a StatefulSet is not used. This is because neither persistent storage nor a fixed network hostname is required. Mongos Routers are stateless and, to a degree, ephemeral. Instead, a Kubernetes <a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/">Deployment</a> resource is defined, which is Kubernetes' preferred approach for stateless services. Below is the Deployment definition for the <b>Router mongos container</b>:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">apiVersion: apps/v1beta1</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">kind: Deployment</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">metadata:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> name: mongos</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">spec:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> replicas: <b>2</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> template:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> spec:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> volumes:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - name: secrets-volume</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> secret:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> secretName: shared-bootstrap-data</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> defaultMode: 256</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> containers:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - name: mongos-container</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> image: <b>mongo</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> command:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "numactl"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "--interleave=all"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "<b>mongos</b>"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "--port"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "27017"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "--bind_ip"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "0.0.0.0"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "<b>--configdb</b>"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "<b>ConfigDBRepSet/mongod-configdb-0.mongodb-configdb-service.default.svc.cluster.local:27017,mongod-configdb-1.mongodb-configdb-service.default.svc.cluster.local:27017,mongod-configdb-2.mongodb-configdb-service.default.svc.cluster.local:27017</b>"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "--clusterAuthMode"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "keyFile"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "--keyFile"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "/etc/secrets-volume/internal-auth-mongodb-keyfile"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "--setParameter"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "authenticationMechanisms=SCRAM-SHA-1"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> ports:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - containerPort: 27017</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> volumeMounts:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - name: secrets-volume</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> readOnly: true</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> mountPath: /etc/secrets-volume</span><br />
<div>
<br /></div>
The actual structure of this resource definition is not a radical departure from that of a StatefulSet. A different command has been declared ("mongos", rather than "mongod"), but the same base container image has been referenced (the <a href="https://hub.docker.com/_/mongo/">mongo image from Docker Hub</a>). For the "mongos" container, it is still important to enable authentication and to reference the generated cluster key file. Specific to the "mongos" parameter list is "<i>--configdb</i>", which specifies the URL of the "ConfigDB" (the 3-node Config Server Replica Set). Each instance of the mongos router connects to this URL, to discover the Shards in the cluster and to determine which Shard holds specific ranges of the stored data. Because the "mongod" containers used to host the "ConfigDB" are deployed as a StatefulSet, their hostnames remain constant. As a result, a fixed URL can be defined in the "mongos" container resource definition, as shown above.<br />
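One quick way to convince yourself of this hostname stability is to resolve one of the config server hostnames from inside a running container; for example (where "$MONGOS_POD" is a placeholder for one of the mongos pod names):<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ kubectl exec $MONGOS_POD -- getent hosts mongod-configdb-0.mongodb-configdb-service.default.svc.cluster.local</span><br />
<br />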
<br />
Once the <a href="https://github.com/pkdone/gke-mongodb-shards-demo/blob/master/scripts/generate.sh">"generate" shell script</a> in the <a href="https://github.com/pkdone/gke-mongodb-shards-demo">example project</a> has been executed, and all the Infrastructure, Kubernetes and Database provisioning steps have successfully completed, 17 different <a href="https://kubernetes.io/docs/concepts/workloads/pods/pod/">Pods</a> will be running, each hosting one container. Below is a screenshot showing the results of running the "kubectl" command to list all the running Kubernetes Pods:<br />
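The listing command itself is simply:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ kubectl get pods</span><br />
<br />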
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhgQLxkuvL9wgNMB-YIEubkUioNcPlqv1E3BkyROBBBM73Ag48YVOZIXgf0amn_7DYfo8m6M7H6V110EROkZdBODXaDtqNF2bx6YCBewc2jPLQrHam5ohIvJfSJfJ_VbCoHjnZXT_aGCQE/s1600/k1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="827" data-original-width="1263" height="418" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhgQLxkuvL9wgNMB-YIEubkUioNcPlqv1E3BkyROBBBM73Ag48YVOZIXgf0amn_7DYfo8m6M7H6V110EROkZdBODXaDtqNF2bx6YCBewc2jPLQrHam5ohIvJfSJfJ_VbCoHjnZXT_aGCQE/s640/k1.png" width="640" /></a></div>
This tallies with what Kubernetes was asked to deploy, namely:<br />
<ul>
<li>3x DaemonSet startup-script containers (one per host machine)</li>
<li>3x Config Server mongod containers</li>
<li>3x Shards, each composed of 3 replica mongod containers</li>
<li>2x Router mongos containers</li>
</ul>
Only the "mongod" containers for the Config Servers and the Shard Replicas have fixed and deterministic names, because these were deployed as Kubernetes StatefulSets. The names of the other containers reflect the fact that they are regarded as stateless, disposable and trivial to re-create, on-demand, by Kubernetes.<br />
<br />
With the full deployment generated and running, it is a straightforward process to connect to the sharded cluster to check its status. The screenshot further below shows the use of the "kubectl" command to open a Bash shell connected to the first "mongos" container. From the Bash shell, the "mongo shell" has been opened, connecting to the local "mongos" process running in the same container. Before running the command to view the status of the Sharded cluster, the session first authenticates to the database as the "admin" user.<br />
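In outline, that sequence looks like this (the "$MONGOS_POD" pod name and the admin password are placeholders):<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ kubectl exec -it $MONGOS_POD -- bash</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ mongo</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">> db.getSiblingDB("admin").auth("main_admin", "abc123");</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">> sh.status();</span><br />
<br />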
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlh8v0GlzRgj0PV0yhzsh1x7gPZTp_8DW7HjFTj7ae5Jzepv0O-Y2U2D4wLViltTwL1jp_cLpvpM7GUBHtAYJ6fmIv1Mp5fkn5ijO2XubYEO6KK7L9ErxDozInW8sHBkLBT63rhri6_FU/s1600/k2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="992" data-original-width="1600" height="396" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlh8v0GlzRgj0PV0yhzsh1x7gPZTp_8DW7HjFTj7ae5Jzepv0O-Y2U2D4wLViltTwL1jp_cLpvpM7GUBHtAYJ6fmIv1Mp5fkn5ijO2XubYEO6KK7L9ErxDozInW8sHBkLBT63rhri6_FU/s640/k2.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
The output of the status command shows the URLs of all three Shards that have been defined (each is a MongoDB Replica Set). Again, these URLs remain fixed by virtue of the Kubernetes StatefulSets used for the "mongod" containers that compose each Shard.<br />
<br />
<b style="background-color: white; color: #444444; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13px;">UPDATE 02-Jan-2018: </b><i style="background-color: white;"><span style="color: #444444; font-family: "arial" , "tahoma" , "helvetica" , "freesans" , sans-serif;"><span style="font-size: 13px;">Since writing this blog post, I realised that because the mongos routers ideally require stable hostnames, to be easily referenceable from the app tier, the mongos router containers should also be declared and deployed as a Kubernetes StatefulSet and Service, rather than a Kubernetes Deployment. The </span></span></i><span style="color: #444444; font-family: "arial" , "tahoma" , "helvetica" , "freesans" , sans-serif;"><span style="font-size: 13px;"><i>GitHub project <a href="https://github.com/pkdone/gke-mongodb-shards-demo">gke-mongodb-shards-demo</a>, associated with this blog post, has been changed to reflect this.</i></span></span><br />
<br />
<h2>
<span style="font-size: x-large;">Summary</span></h2>
In this blog post I’ve shown how to deploy a MongoDB Sharded Cluster using Kubernetes with the Google Kubernetes Engine. I've mainly focused on the high-level considerations for such deployments, rather than listing every specific resource and step required (for that level of detail, please view the <a href="https://github.com/pkdone/gke-mongodb-shards-demo">accompanying GitHub project</a>). What this study does reinforce is that a single container orchestration framework (Kubernetes in this case) does not cater for every step required to provision and deploy a highly available and scalable database, such as MongoDB. I believe that this would be true for any non-trivial and mission-critical distributed application or set of services. However, I don't see this as a bad thing. Personally, I value flexibility and choice, and having the ability to use the right tool for the right job. In my opinion, a container orchestration framework that tries to cater for things beyond its obvious remit would be the wrong framework. A framework that attempts to be all things to all people would end up diluting its own value, down to the lowest common denominator. At the very least, it would become far too prescriptive and restrictive. I feel that Kubernetes, in its current state, strikes a suitable and measured balance, and with its StatefulSets capability, provides a good home for MongoDB deployments.<br />
<br />
<br />
<i>Song for today: Three Days by <a href="https://en.wikipedia.org/wiki/Jane%27s_Addiction">Jane's Addiction</a></i><br />
<br />
<h2>
<span style="font-size: x-large;">Using the Enterprise Version of MongoDB on GKE Kubernetes</span></h2>
<i>[Part 3 in a series of posts about running MongoDB on Kubernetes, with the Google Kubernetes Engine (GKE). See the GitHub project <a href="https://github.com/pkdone/gke-mongodb-demo">gke-mongodb-demo</a> for an example scripted deployment of MongoDB to GKE, that you can easily try yourself. The gke-mongodb-demo project combines the conclusions from all the posts in this series so far. Also see: <a href="http://k8smongodb.net/">http://k8smongodb.net/</a>]</i><br />
<br />
<br />
<h2>
<span style="font-size: x-large;">Introduction</span></h2>
In the previous two posts of my blog series (<a href="http://pauldone.blogspot.co.uk/2017/06/deploying-mongodb-on-kubernetes-gke25.html">1</a>, <a href="http://pauldone.blogspot.co.uk/2017/06/mongodb-kubernetes-production-settings.html">2</a>) about running MongoDB on GKE's Kubernetes environment, I showed how to ensure a MongoDB Replica Set is secure by default and resilient to system failures, and how to ensure various best-practice "production" environment settings are in place. In those examples, the community version of the MongoDB binaries was used. In this blog post I show how the enterprise version of MongoDB can be utilised instead.<br />
<br />
<br />
<h2>
<span style="font-size: x-large;">Referencing a Docker Image for Use in Kubernetes </span></h2>
For the earlier two blog post examples, a pre-built "mongo" Docker image was "pulled" by Kubernetes, from Docker Hub's <a href="https://hub.docker.com/_/mongo/">"official" MongoDB repository</a>. Below is an excerpt of the Kubernetes StatefulSet definition that shows how this <a href="https://kubernetes.io/docs/concepts/containers/images/">image was referenced</a> (highlighted in bold):<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ cat mongodb-service.yaml</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">....</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">....</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> containers:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - name: mongod-container</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> <b>image: mongo</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> command:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "mongod"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">....</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">....</span><br />
<br />
By default, if no additional metadata is provided, Kubernetes will look in the <a href="https://hub.docker.com/explore/">Docker Hub repository</a> for the image with the given name. Other repositories, such as Google's Container Registry, AWS's EC2 Container Registry, Azure's Container Registry, or any private repository, can be used instead, as illustrated below.<br />
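As a sketch of that alternative (the GCP project ID here is a placeholder, and the push command reflects the gcloud tooling of the time), an image can be tagged with a registry-specific prefix, pushed to that registry, and then referenced from the Kubernetes resource definition by its full name:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ docker tag mongo:3.4 gcr.io/my-gcp-project/mongo:3.4</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ gcloud docker -- push gcr.io/my-gcp-project/mongo:3.4</span><br />
<br />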
<br />
It is worth clarifying what is meant by the "official" MongoDB repository. This is a set of images that are "official" from Docker Hub's perspective, because Docker Hub manages how the images are composed and built. They are not, however, official releases from MongoDB Inc.'s perspective. When the Docker Hub project builds an image, in addition to sourcing the "mongod" binary from MongoDB Inc's website, other components, like the underlying Debian OS plus various custom scripts, are built into the image too.<br />
<br />
At the time of this blog post, Docker Hub provides images for MongoDB community versions 3.0, 3.2, 3.4 and 3.5 (unstable). The Docker Hub repository only contains images for the <a href="https://www.mongodb.com/download-center#community">community version</a> of MongoDB and not the <a href="https://www.mongodb.com/download-center#enterprise">enterprise version</a>.<br />
<br />
<br />
<h2>
<span style="font-size: x-large;">Building a Docker Image Using the MongoDB Enterprise binaries </span></h2>
You can build a Docker image to run "mongod" in any way you want, using your own custom <a href="https://docs.docker.com/engine/reference/builder/">Dockerfile</a> to define how the image should be generated. The Docker manual even provides a tutorial for creating a custom Docker image specifically for MongoDB, called <a href="https://docs.docker.com/engine/examples/mongodb/">Dockerize MongoDB</a>. That process can be followed as the basis for building an image which pulls down and uses the enterprise version of MongoDB, rather than the community version.<br />
<br />
However, to generate an image containing the enterprise MongoDB version, it isn't actually necessary to create your own custom Dockerfile. This is because, a few weeks ago, I created a pull request to add support for <a href="https://github.com/docker-library/mongo/pull/185">building MongoDB enterprise based images</a>, as part of the normal <a href="https://github.com/docker-library/mongo">"mongo" GitHub project</a> that is used to generate the "official" Docker Hub "mongo" images. The GitHub project owner accepted and merged this enhancement a few days later. My enhancement essentially allows the project's Dockerfile, when used by Docker's build tool, to instruct that the enterprise MongoDB binaries should be downloaded into the image, rather than the community ones. This is achieved by providing some "build arguments" to the Docker build process, which I will detail further below. This doesn't mean that <a href="https://hub.docker.com/_/mongo/">Docker Hub's "official repository" for MongoDB</a> now contains pre-generated "enterprise mongod" images ready to use. It just means you can use the source GitHub project directly, without any changes to the project's Dockerfile, to generate your own image containing the enterprise binary.<br />
<br />
To build the Docker "mongo" image, that pulls in the enterprise version of MongoDB, follow these steps:<br />
<br />
<b>1.</b> <a href="https://docs.docker.com/engine/installation/">Install Docker</a> locally on your own workstation/laptop.<br />
<br class="Apple-interchange-newline" />
<b>2. </b>Download the source files for the <a href="https://github.com/docker-library/mongo">Docker Hub "mongo" project</a> (on the project page, click the "Clone or download" green button, then click the "Download zip" link and once downloaded, unpack into a new local folder).<br />
<br class="Apple-interchange-newline" />
<b>3. </b>From a command line shell, in the project folder, change directory to the sub-folder of the major version you want to build (e.g. "3.4"), and run:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;"> $ docker build -t pkdone/mongo-ent:3.4 --build-arg MONGO_PACKAGE=mongodb-enterprise --build-arg MONGO_REPO=repo.mongodb.com .</span><br />
<br />
This tells Docker to build an image using the Dockerfile in the current directory (".") with the resulting image <a href="https://docs.docker.com/engine/reference/commandline/images/">name</a> being "<i>pkdone/mongo-ent</i>" and <a href="https://docs.docker.com/engine/reference/commandline/tag/">tag</a> being "<i>:3.4</i>". By convention, the image name is prefixed by the author's username, which in my case is "<i>pkdone</i>" (obviously this should be replaced by a different prefix, for whoever follows these steps).<br />
<br />
The two new "<i>--build-arg</i>" parameters, "<i>MONGO_PACKAGE</i>" and "<i>MONGO_REP</i>" are passed to the <a href="https://github.com/docker-library/mongo/blob/master/3.4/Dockerfile">"mongo" Dockerfile</a>. The version of the Dockerfile with my enhancements uses these two parameters to locate where to download the specific type of MongoDB binary from. In this case, the values specified mean that the enterprise binary is pulled down into the generated image. If no "build args" are specified, the community version of MongoDB is used, by default.<br />
<br />
<b>Important note:</b> When running the "docker build" command above, because the enterprise version of MongoDB will be downloaded, it will mean you are implicitly accepting MongoDB Inc's associated commercial licence terms.<br />
<br />
Once the image is generated, you can also quickly test the image by running it in a Docker container on your local machine (as just a single "mongod" instance):<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ docker run -d --name mongo -t pkdone/mongo-ent:3.4</span><br />
<br />
To be sure this is running properly, connect to the container using a shell, check that the "mongod" process is running, and check that the Mongo Shell can connect to the containerised "mongod" process:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ docker exec -it mongo bash</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ ps -aux</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ mongo</span><br />
<br />
The output of the Shell should include the prompt "<i>MongoDB Enterprise ></i>" which shows that the database is using the enterprise version of MongoDB. Exit out of the Mongo Shell, exit out of the container and then from the local machine, run the command to view the "mongod" container's logged output (with example results shown):<br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">$ docker logs mongo | grep enterprise</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">2017-07-01T12:08:42.177+0000 I CONTROL [initandlisten] modules: enterprise</span><br />
<br />
Again, this result should demonstrate that the enterprise version of MongoDB has been used.<br />
<br />
<br />
<h2>
<span style="font-size: x-large;">Using the Generated Enterprise Mongod Image from the Kubernetes Project</span></h2>
The easiest way to use the newly created "mongod" container image from GKE Kubernetes is first to register it with Docker Hub, using the following steps:<br />
<br />
<b>1.</b> Create a new free account on <a href="https://hub.docker.com/">Docker Hub</a>.<br />
<br class="Apple-interchange-newline" />
<b>2. </b>Run the following commands to associate your local workstation environment with your new Docker Hub account, to list the built image registered on your local machine, and to push this newly generated image to your remote Docker Hub account:<br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;"> $ docker login</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> $ docker images</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> $ docker push pkdone/mongo-ent:3.4 </span><br />
<br />
<b>3.</b> Once the image has finished uploading, return to the <a href="https://hub.docker.com/">Docker Hub</a> site in a browser and log in to see the list of your registered repository instances. The newly pushed image should now be listed there.<br />
<br />
Now back in the GKE Kubernetes project, for the "mongod" Service/StatefulSet resource definition, change the image reference to be the newly uploaded Docker Hub image, as shown below (highlighted in bold):<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ cat mongodb-service.yaml</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">....</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">....</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> containers:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - name: mongod-container</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> <b>image: </b></span><span style="font-family: "courier new" , "courier" , monospace;"><b>pkdone/mongo-ent:3.4</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> command:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "mongod"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">....</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">....</span><br />
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
Now re-perform all the normal steps to deploy the Kubernetes cluster and resources as outlined in the <a href="http://pauldone.blogspot.co.uk/2017/06/deploying-mongodb-on-kubernetes-gke25.html">first blog post in the series</a>. Once the MongoDB Replica Set is up and running, you can check the output logs of the first "mongod" container, to see if the enterprise version of MongoDB is being used:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ kubectl logs mongod-0 | grep enterprise</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">2017-07-01T13:01:42.794+0000 I CONTROL [initandlisten] modules: enterprise</span><br />
<br />
<br />
<h2>
<span style="font-size: x-large;">Summary</span></h2>
In this blog post I’ve shown how to use the enterprise version of MongoDB, when running a MongoDB Replica Set, using Kubernetes StatefulSets, on the Google Kubernetes Engine. This builds on the work done in previous blog posts in the series, around ensuring MongoDB Replica Sets are resilient and better tuned for production workloads, when running on Kubernetes.<br />
<br />
<i>[Next post in series: <a href="http://pauldone.blogspot.co.uk/2017/07/sharded-mongodb-kubernetes.html">Deploying a MongoDB Sharded Cluster using Kubernetes StatefulSets on GKE</a>]</i><br />
<br />
<br />
<i>Song for today: Standing In The Way Of Control by <a href="https://en.wikipedia.org/wiki/Gossip_(band)">Gossip</a></i><br />
<br />
<h2>
<span style="font-size: x-large;">Configuring Some Key Production Settings for MongoDB on GKE Kubernetes</span></h2>
<i>[Part 2 in a series of posts about running MongoDB on Kubernetes, with the Google Kubernetes Engine (GKE). See the GitHub project <a href="https://github.com/pkdone/gke-mongodb-demo">gke-mongodb-demo</a> for an example scripted deployment of MongoDB to GKE, that you can easily try yourself. The gke-mongodb-demo project combines the conclusions from all the posts in this series so far. Also see: <a href="http://k8smongodb.net/">http://k8smongodb.net/</a>]</i><br />
<br />
<br />
<h2>
<span style="font-size: x-large;">Introduction</span></h2>
In the <a href="http://pauldone.blogspot.co.uk/2017/06/deploying-mongodb-on-kubernetes-gke25.html">first part of my blog series</a> I showed how to deploy a MongoDB Replica Set to GKE's Kubernetes environment, whilst ensuring that the replica set is secure by default and resilient to various types of system failures. As mentioned in that post, there are a number of other "production" considerations that need to be made when running MongoDB in Kubernetes and Docker environments. These considerations are primarily driven by the best practices documented in MongoDB’s <a href="https://docs.mongodb.com/manual/administration/production-checklist-operations/">Production Operations Checklist</a> and <a href="https://docs.mongodb.com/manual/administration/production-notes/">Production Notes</a>. In this blog post, I will address how to apply some (but not all) of these best practices, on GKE's Kubernetes platform.<br />
<h2>
<span style="font-size: x-large;"><br />Host VM Modifications for Using XFS & Disabling Hugepages</span></h2>
For optimum performance, the <a href="https://docs.mongodb.com/manual/administration/production-notes/">MongoDB Production Notes</a> strongly recommend applying the following configuration settings to the host operating system (OS):<br />
<ol>
<li>Use an XFS based Linux filesystem for WiredTiger data file persistence.</li>
<li>Disable <a href="https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/">Transparent Huge Pages</a>.</li>
</ol>
The challenge here is that neither of these elements can be configured directly within normally deployed pods/containers. Instead, they need to be set in the OS of each machine/VM that is eligible to host one or more pods and their containers. Fortunately, after a little googling I found a solution for incorporating XFS, in the article <a href="https://medium.com/@allanlei/mounting-xfs-on-gke-adcf9bd0f212">Mounting XFS on GKE</a>, which also provided the basis for deriving a solution for disabling Huge Pages too. It turns out that in Kubernetes, it is possible to run a pod (and its container) once per node (host machine), using a facility called a <a href="https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/">DaemonSet</a>. A DaemonSet is used to schedule a "special" container to run on every newly provisioned node, as a one-off, before any "normal" containers are scheduled and run on the node. In addition, for Docker-based containers (the default on GKE Kubernetes), the container can be allowed to run in a <a href="https://developers.redhat.com/blog/2014/11/06/introducing-a-super-privileged-container-concept/">privileged mode</a>, which gives the "privileged" container access to other <a href="https://en.wikipedia.org/wiki/Linux_namespaces">Linux Namespaces</a> running in the same host environment. With heightened security rights, the "privileged" container can then run a utility called <a href="http://man7.org/linux/man-pages/man1/nsenter.1.html">nsenter</a> ("NameSpace ENTER") to spawn a shell using the namespace belonging to the host OS ("/proc/1"). The script that the shell runs can then essentially perform arbitrary root-level actions on the underlying host OS.<br />
<br />
So with this in mind, the challenge is to build a Docker container image that, when run in privileged mode, uses "nsenter" to spawn a shell to run some shell script commands. As luck would have it, such a container has already been created, in a generic way, as part of the Kubernetes "contributions" project, called <a href="https://github.com/kubernetes/contrib/tree/master/startup-script">startup-script</a>. The generated "startup-script" Docker image has been registered and made available in the <a href="https://cloud.google.com/container-registry/">Google Container Registry</a>, ready to be pulled in and used by anyone's Kubernetes projects.<br />
<br />
Therefore on GKE, to create a DaemonSet leveraging the "startup-script" image in privileged mode, we first need to define the DaemonSet's configuration:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ cat hostvm-node-configurer-daemonset.yaml</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">kind: DaemonSet</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">apiVersion: extensions/v1beta1</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">metadata:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> name: hostvm-configurer</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> labels:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> app: startup-script</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">spec:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> template:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> metadata:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> labels:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> app: startup-script</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> spec:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> hostPID: true</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> containers:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - name: hostvm-configurer-container</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> <b>image: gcr.io/google-containers/startup-script:v1</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> securityContext:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> <b>privileged: true</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> env:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - name: STARTUP_SCRIPT</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> value: |</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> #! /bin/bash</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> set -o errexit</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> set -o pipefail</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> set -o nounset</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b> # Disable hugepages</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b> echo 'never' > /sys/kernel/mm/transparent_hugepage/enabled</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b> echo 'never' > /sys/kernel/mm/transparent_hugepage/defrag</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b> </b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b> # Install tool to enable XFS mounting</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b> apt-get update || true</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b> apt-get install -y xfsprogs</b></span><br />
<br />
At the base of the file, shown in bold, are the commands used to disable Huge Pages and to install the XFS tools for mounting and formatting storage using the XFS filesystem. Further up the file, also in bold, is the reference to the third-party "startup-script" image from the Google Container Registry, plus the security context setting stating that the container should be run in privileged mode.<br />
<br />
Next we need to deploy the DaemonSet, with its "startup-script" container, to all the hosts (nodes), before we attempt to create any GCE disks that need to be formatted as XFS:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ kubectl apply -f hostvm-node-configurer-daemonset.yaml</span><br />
<br />
In the GCE disk definitions, described in the <a href="http://pauldone.blogspot.co.uk/2017/06/deploying-mongodb-on-kubernetes-gke25.html">first blog post in this series</a> (i.e. "<i>gce-ssd-persistentvolume?.yaml</i>"), a new parameter needs to be added (shown in bold below) to indicate that the disk's filesystem type should be XFS:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">apiVersion: "v1"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">kind: "PersistentVolume"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">metadata:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> name: data-volume-1</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">spec:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> capacity:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> storage: 30Gi</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> accessModes:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - ReadWriteOnce</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> persistentVolumeReclaimPolicy: Retain</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> storageClassName: fast</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> gcePersistentDisk:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b> fsType: xfs</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> pdName: pd-ssd-disk-1</span><br />
<br />
Now in theory, this should be all that is required to get XFS working. <b>Except on GKE, it isn't!</b><br />
<br />
After deploying the DaemonSet and creating the GCE storage disks, the deployment of the "mongod" Service/StatefulSet will fail. The StatefulSet's pods do not start properly, because the disks can't be formatted and mounted as XFS. It turns out that this is because, by default, GKE uses a variant of <a href="https://cloud.google.com/container-optimized-os/docs/">Chromium OS</a> as the underlying host VM that runs the containers, and this OS flavour doesn't support XFS. However, GKE can also be configured to use a Debian-based Host VM OS instead, which does support XFS.<br />
<br />
To see the list of host VM OSes that GKE supports, the following command can be run:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ gcloud container get-server-config</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Fetching server config for europe-west1-b</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">defaultClusterVersion: 1.6.4</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">defaultImageType: <b>COS</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">validImageTypes:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">- <b>CONTAINER_VM</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">- <b>COS</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">....</span><br />
<div>
<br /></div>
<div>
Here, "COS" is the label for the Chromium OS and "CONTAINER_VM" is the label for the Debian OS. The easiest way to start leveraging the Debian OS image is to clear out all the GCE/GKE resources and Kubernetes cluster from the current project and start deployment all over again. This time, when the initial command is run to create the new Kubernetes cluster, an additional argument (shown in bold) must be provided to define that the Debian OS should be used for each Host VM that is created as a Kubernetes node.</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">$ gcloud container clusters create "gke-mongodb-demo-cluster" <b>--image-type=CONTAINER_VM</b></span></div>
<div>
<br /></div>
<div>
This time, when all the Kubernetes resources are created and deployed, the "mongod" containers correctly utilise XFS formatted persistent volumes. </div>
<div>
<br /></div>
<div>
If this all seems a bit complicated, it is probably helpful to view the full end-to-end deployment flow, provided in my example GitHub project <a href="https://github.com/pkdone/gke-mongodb-demo">gke-mongodb-demo</a>.</div>
<div>
<br /></div>
<div>
There is one final observation to make before finishing the discussion on XFS. In Google's online documentation, it is stated that the <a href="https://cloud.google.com/container-engine/docs/node-image-migration">Debian Host VM OS is deprecated</a> in favour of Chromium OS. I hope that in the future Google will add XFS support directly to its Chromium OS distribution, to make the use of XFS a lot less painful and to ensure XFS can still be used with MongoDB, if the Debian Host VM option is ever completely removed.</div>
<br />
<b>UPDATE 13-Oct-2017: </b><i>Google has recently updated GKE and, in addition to upgrading Kubernetes to version 1.7, has removed the old Debian container VM option and added an Ubuntu container VM option instead. In the new Ubuntu container VM, the XFS tools are already installed, and therefore do not need to be configured by the DaemonSet. The <a href="https://github.com/pkdone/gke-mongodb-demo">gke-mongodb-demo</a> project has been updated accordingly, to use the new Ubuntu container VM and to omit the command to install the XFS tools.</i><br />
<br />
<h2>
<span style="font-size: x-large;">Disabling NUMA</span></h2>
For optimum performance, the <a href="https://docs.mongodb.com/manual/administration/production-notes/">MongoDB Production Notes</a> recommend that "<i>on NUMA hardware, you should configure a memory interleave policy so that the host behaves in a non-NUMA fashion</i>". The <a href="https://github.com/docker-library/mongo">Docker Hub "mongo" container image</a>, which has been used so far with Kubernetes in this blog series, already contains some <a href="https://github.com/docker-library/mongo/blob/master/3.4/docker-entrypoint.sh">bootstrap code</a> to start the "mongod" process with the "numactl --interleave=all" setting. This setting makes the process environment behave in a non-NUMA way.<br />
<br />
However, I believe it is worth specifying the "numactl" settings explicitly in the "mongod" Service/StatefulSet resource definition, anyway, just in case other users choose to use an alternative or self-built Docker image for the "mongod" container. The excerpt below shows the added "numactl" elements (in bold), required to run the containerised "mongod" process in a "non-NUMA" manner.<br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">$ cat mongodb-service.yaml</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">....</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">....</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> containers:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - name: mongod-container</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> image: mongo</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> command:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b> - "numactl"</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b> - "--interleave=all"</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "mongod"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">....</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">....</span><br />
<br />
<div>
<br /></div>
<div>
<h2>
<span style="font-size: x-large;">Controlling CPU & RAM Resource Allocation Plus WiredTiger Cache Size</span></h2>
Of course, when you are running a MongoDB database it is important to size both CPU and RAM resources correctly for the particular database workload, regardless of the type of host environment. In a Kubernetes containerised host environment, the amount of CPU & RAM resource dedicated to a container can be defined in the "resources" section of the container's declaration, as shown in the excerpt of the "mongod" Service/StatefulSet definition below:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ cat mongodb-service.yaml</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">....</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">....</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> containers:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - name: mongod-container</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> image: mongo</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> command:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "mongod"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b> - "--wiredTigerCacheSizeGB"</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b> - "0.25"</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "--bind_ip"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "0.0.0.0"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "--replSet"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "MainRepSet"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "--auth"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "--clusterAuthMode"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "keyFile"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "--keyFile"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "/etc/secrets-volume/internal-auth-mongodb-keyfile"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "--setParameter"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - "authenticationMechanisms=SCRAM-SHA-1"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b> resources:</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b> requests:</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b> cpu: 1</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b> memory: 2Gi</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">....</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">....</span><br />
<br />
In the example (shown in bold), 1x virtual CPU (vCPU) and 2GB of RAM have been requested to run the container. You will also notice that an additional parameter has been defined for "mongod", specifying the <a href="https://docs.mongodb.com/manual/reference/program/mongod/#cmdoption-wiredtigercachesizegb">WiredTiger internal cache size</a> ("<i>--wiredTigerCacheSizeGB</i>"). In a containerised environment, it is absolutely vital to state this value explicitly. If this is not done, and multiple containers end up running on the same host machine (node), MongoDB's WiredTiger storage engine may attempt to take more memory than it should. This is because of the way a container "reports" its memory size to running processes. As per the <a href="https://docs.mongodb.com/manual/administration/production-notes/#allocate-sufficient-ram-and-cpu">MongoDB Production Recommendations</a>, the default cache size guidance is: "<i>50% of RAM minus 1 GB, or 256 MB</i>". Given that the amount of memory requested is 2GB, the WiredTiger cache size here has been set to 256MB.<br />
<br />
If and when you define a different amount of memory for the container process, be sure to also adjust the WiredTiger cache size setting accordingly; otherwise the "mongod" process may not leverage all the memory reserved for it by the container. A sketch of the arithmetic follows.</div>
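As a rough sketch of that arithmetic (my own one-liner, not part of any MongoDB tooling, and following the guidance exactly as it is applied in this post), the appropriate cache size for a given container memory request can be computed as follows:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;"># Apply "50% of RAM minus 1 GB, floored at 256 MB", here for a 2GB memory request</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ MEM_GB=2</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ awk -v m="$MEM_GB" 'BEGIN {c = 0.5 * m - 1; if (c < 0.25) c = 0.25; print c}'</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">0.25</span><br />
<br />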
<div>
<br /></div>
<div>
<br /></div>
<div>
<h2>
<span style="font-size: x-large;">Controlling Anti-Affinity for Mongod Replicas</span></h2>
When running a MongoDB Replica Set, it is important to ensure that none of the "mongod" replicas in the replica set are running on the same host machine as each other, to avoid inadvertently introducing a single point of failure. In a Kubernetes containerised environment, if containers are left to their own devices, different "mongod" containers could end up running on the same nodes. Kubernetes provides a way of specifying pod anti-affinity to prevent this from occurring. Below is an excerpt of a "mongod" Service/StatefulSet resource file which declares an anti-affinity configuration.<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ cat mongodb-service.yaml</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">....</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">....</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> serviceName: mongodb-service</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> replicas: 3</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> template:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> metadata:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> labels:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> <b>replicaset</b>: <b>MainRepSet</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> spec:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> affinity:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> <b>podAntiAffinity</b>:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> preferredDuringSchedulingIgnoredDuringExecution:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - weight: 100</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> podAffinityTerm:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> labelSelector:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> matchExpressions:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - key: <b>replicaset</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> operator: <b>In</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> values:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - <b>MainRepSet</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> <b>topologyKey: kubernetes.io/hostname</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">....</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">....</span><br />
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
Here, a rule has been defined that asks Kubernetes to apply anti-affinity when scheduling pods labelled "replicaset: MainRepSet": the scheduler compares the hostnames of the host VM instances (the topology key) and avoids placing a replica on a node that already hosts one.<br />
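<br />
Note that the "preferred..." form of anti-affinity is only a scheduling hint; if no separate node is available, Kubernetes may still co-locate replicas. If you would rather a replica stayed unscheduled than share a node with another replica, the rule can be declared as a hard requirement instead. A minimal sketch of the stricter variant, using the same labels as above:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">      affinity:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">        podAntiAffinity:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">          requiredDuringSchedulingIgnoredDuringExecution:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">          - labelSelector:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">              matchExpressions:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">              - key: replicaset</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">                operator: In</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">                values:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">                - MainRepSet</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">            topologyKey: kubernetes.io/hostname</span><br />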
<br />
<br />
<h2>
<span style="font-size: x-large;">Setting File Descriptor & User Process Limits</span></h2>
When deploying the MongoDB Replica Set on GKE Kubernetes, as demonstrated in the accompanying GitHub project <a href="https://github.com/pkdone/gke-mongodb-demo">gke-mongodb-demo</a>, you may notice some warnings about "rlimits" in the output of each containerised "mongod"'s logs. These log entries can be viewed by running the following command:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ kubectl logs mongod-0 | grep rlimits</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: xx-small;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">2017-06-27T12:35:22.018+0000 I CONTROL [initandlisten] ** WARNING: soft rlimits too low. rlimits set to 29980 processes, 1000000 files. Number of processes should be at least 500000 : 0.5 times number of files.</span><br />
<div>
<br /></div>
The MongoDB manual provides some recommendations concerning the system settings for the <a href="https://docs.mongodb.com/manual/reference/ulimit/">maximum number of processes and open files</a> when running a "mongod" process.<br />
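<br />
To see the limits actually in force for a running "mongod", you can inspect the process's limits file directly (this assumes "mongod" is PID 1 inside the container, which it is when started directly as the container command, as in this deployment). The manual's guidance is 64000 for both open files and processes/threads:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ kubectl exec mongod-0 -c mongod-container -- cat /proc/1/limits</span><br />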
<br />
Unfortunately, thus far, I've not established an appropriate way to enforce these thresholds using GKE Kubernetes. This topic will possibly be the focus of a blog post for another day. However, I thought it would be informative to highlight the issue here, with the supporting context, to give others the chance to resolve it first.<br />
<br />
<b>UPDATE 13-Oct-2017: </b><i>Google has recently updated GKE and, in addition to upgrading Kubernetes to version 1.7, has removed the old Debian container VM option and added an Ubuntu container VM option instead. The default "rlimits" settings in the Ubuntu container VM are already appropriate for running MongoDB. Therefore, a fix is no longer required to address this issue. </i><br />
<i><br /></i></div>
<h2>
<span style="font-size: x-large;">Summary</span></h2>
In this blog post I’ve provided some methods for addressing certain best practices when deploying a MongoDB Replica Set to the Google Kubernetes Engine. Although this post does not provide an exhaustive list of best practice solutions, I hope it proves useful for others (and myself) to build upon in the future.<br />
<br />
<i>[Next post in series: <a href="http://pauldone.blogspot.co.uk/2017/06/enterprise-mongodb-on-kubernetes.html">Using the Enterprise Version of MongoDB on GKE Kubernetes</a>]</i><br />
<br />
<br />
<i>Song for today: The Mountain by <a href="https://en.wikipedia.org/wiki/Jambinai">Jambinai</a></i><br />
<div>
<br /></div>
Paul Donehttp://www.blogger.com/profile/09556312012162376804noreply@blogger.com8tag:blogger.com,1999:blog-1304066656993695443.post-55486607417919626622017-06-25T20:42:00.003+01:002022-08-15T23:52:41.995+01:00Deploying a MongoDB Replica Set as a GKE Kubernetes StatefulSet<i>[Part 1 in a series of posts about running MongoDB on Kubernetes, with the Google </i><i>Kubernetes</i><i> Engine (GKE). See the GitHub project <a href="https://github.com/pkdone/gke-mongodb-demo">gke-mongodb-demo</a> for an example scripted deployment of MongoDB to GKE, that you can easily try yourself. The gke-mongodb-demo project combines the conclusions from all the posts in this series so far. </i><i>Also see: <a href="http://k8smongodb.net/">http://k8smongodb.net/</a>]</i><br />
<i><br /></i>
<br />
<h2>
<span style="font-size: x-large;">Introduction</span></h2>
A few months ago, Sandeep Dinesh of Google wrote an informative blog post about <a href="http://blog.kubernetes.io/2017/01/running-mongodb-on-kubernetes-with-statefulsets.html">Running MongoDB on Kubernetes with StatefulSets</a> on Google’s Cloud Platform. I found this to be a great resource to bootstrap my knowledge of Kubernetes’ new <a href="https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/">StatefulSets</a> feature, and food for thought on approaches for deploying MongoDB on Kubernetes generally. StatefulSets is Kubernetes’ framework for providing better support for “stateful applications”, such as databases and message queues. StatefulSets provides stable, unique network hostnames and stable, dedicated network storage volume mappings, which are essential for a database cluster to function properly and for its data to outlive the lifetime of inherently ephemeral containers.<br />
<br />
My view of the approach in the Google blog post is that it is a great way for a developer to rapidly spin up a MongoDB Replica Set, to quickly test that their code still works correctly (it should) in a clustered environment. However, the approach cannot be regarded as a best practice for deploying MongoDB in Production, for mission-critical use cases. This assertion is not a criticism, as the blog post is obviously intended to show the art of the possible (which it does very eloquently), and the author makes no claim to be a seasoned MongoDB administration expert.<br />
<br />
So what are the challenges for Production deployments in the approach outlined in the Google blog post? Well, there are two problems, which I will address in this post:<br />
<ol>
<li>Use of a <a href="https://github.com/cvallance/mongo-k8s-sidecar">MongoDB/Kubernetes sidecar</a> per <a href="http://blog.kubernetes.io/2015/06/the-distributed-system-toolkit-patterns.html">Pod</a>, to control Replica Set configuration. Essentially, the sidecar wakes up every 5 seconds, checks which MongoDB pods are running and then reconfigures the replica set on the fly. It adds any MongoDB servers it can see to the replica set configuration, and removes any servers it can no longer see. This is dangerous for many reasons. I’ve highlighted two of the most important reasons why here*:</li>
<ul>
<li>This introduces the real risk of split-brain, in the event of a network partition. For example, normally, if there is a 3-node replica set configured and the primary is somehow separated from the secondaries, the primary will step down as it can’t maintain a majority. Normally, the two secondaries that can now only see each other will form a quorum and one of these two will then become the primary. In the sidecar implementation, during a network split, the sidecar on the primary believes the two secondaries aren’t running and it re-configures the replica set on the fly, to now have just one member. This remaining member believes it can still act as primary (because it has achieved a majority of 1 out of 1 votes). The sidecars still running on the other two members now also reconfigure the replica set to be just those two members. One of these two members automatically becomes a primary (because it has achieved a majority of 2 out of 2 votes). As a result, there are now two primaries in existence for the same replica set, which a normal and properly configured MongoDB cluster would never allow to occur. MongoDB’s strong consistency guarantees are subverted and non-deterministic things will start happening to the data. In a properly deployed MongoDB cluster, if there is a 3-node replica set and 2 nodes appear to be down, it doesn’t mean you now have a 1-node replica set; you don’t. You still have a 3-node replica set, albeit with only one replica currently running (and hence no primary is permitted, to guarantee safety and strong consistency).</li>
<li>Many applications updating data in MongoDB will use “WriteConcerns” set to a value such as “majority”, to provide levels of guarantee for safe data updates across a cluster. The whole notion of a “WriteConcern” would become meaningless in the sidecar-controlled environment, because the constantly re-configured replica set would always reflect a total replica-set size of just those replicas currently active and reachable. For example, performing a database update operation with a “WriteConcern” of “majority” would always be permitted, regardless of whether all 3 replicas are currently available, or just 2, or just 1.</li>
</ul>
<li>Insecure by default, due to authentication not being enabled. In a Production environment, running MongoDB with authentication disabled should never be allowed. Even if the intention is to configure authentication as a later provisioning step, the database is potentially exposed and insecure for seconds, minutes or longer. As a result, the “mongod” process should always be started with authentication enabled (e.g. using “<i>--auth</i>” command line flag), even during any “bootstrap provisioning process”. MongoDB’s <a href="https://docs.mongodb.com/manual/core/security-users/#localhost-exception">localhost exception</a> should be relied upon to securely configure one or more database users.</li>
</ol>
<span style="font-size: x-small;">* If this was such an easy and safe thing to do, MongoDB replicas would be built to automatically perform these re-configurations, themselves, in a separate background thread running inside each “mongod” replica process. The brain controlling what the replica-set configuration should look like, lives outside the cluster for good reason (e.g. inside the head of an administrator, or preferably, inside a configuration file that is used to drive a higher level orchestration tool which operates above the “containers layer”).</span><br />
<br />
Additionally, there are a number of other considerations that aren’t just specific to the approach in the referenced Google blog post, but are applicable to the use of Docker/Kubernetes with MongoDB generally. These considerations can be categorised as ways to ensure that MongoDB’s best practices are followed, as documented in MongoDB’s <a href="https://docs.mongodb.com/manual/administration/production-checklist-operations/">Production Operations Checklist</a> and <a href="https://docs.mongodb.com/manual/administration/production-notes/">Production Notes</a>. I address some of these best practice omissions in the next post in this series: <a href="http://pauldone.blogspot.co.uk/2017/06/mongodb-kubernetes-production-settings.html">Configuring Some Key Production Settings for MongoDB on GKE Kubernetes</a>. It is probably worth being clear here that I am not claiming my blog series will get users 100% of the way to deploying a fully operational, secure and well-performing MongoDB cluster on GKE. Instead, what I hope the series will do is enable users to build on my findings and recommendations, so there are fewer gaps for them to address when planning their own production environment.<br />
<br />
For the rest of this blog post, I will focus on the steps required to deploy a MongoDB Replica Set, on GKE, addressing the replica-set resiliency and security concerns that I've highlighted above.<br />
<br />
<br />
<h2>
<span style="font-size: x-large;">
Steps to Deploy MongoDB to GKE, using StatefulSets</span></h2>
The first thing to do, if you haven’t already, is sign up to use the Google Cloud Platform (GCP). To keep things simple, you can sign up to a <a href="https://cloud.google.com/free/">free trial for GCP</a>. Note: The free trial places some restrictions on account resource quotas, in particular restricting storage to a maximum of 100GB. Therefore, in this series of blog posts and <a href="https://github.com/pkdone/gke-mongodb-demo">my sample GitHub project</a>, I employ modest disk sizes to remain under this threshold.<br />
<br />
Once your GCP account is activated, you should <a href="https://cloud.google.com/sdk/docs/quickstarts">download and install GCP’s client command line tool</a>, called “gcloud”, on your local Linux/Windows/Mac workstation.<br />
<br />
With “gcloud” installed, run the following commands to configure the local environment to use your GCP account, to install the main Kubernetes command tool (“kubectl”), to configure authentication credentials, and to define the default GCP zone to be deployed to:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ gcloud init</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ gcloud components install kubectl</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ gcloud auth application-default login</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ gcloud config set compute/zone europe-west1-b</span><br />
<br />
Note: If you want to specify an alternative zone to deploy to in the above command, you can first view the list of available zones by running the command: <span style="font-family: "courier new" , "courier" , monospace;">$ gcloud compute zones list</span><br />
<br />
You should now be ready to create a brand new Kubernetes cluster on the Google Kubernetes Engine. Run the following command to provision a new Kubernetes cluster called “gke-mongodb-demo-cluster”:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ gcloud container clusters create "gke-mongodb-demo-cluster"</span><br />
<br />
As part of this process, a set of 3 GCE VM instances is automatically provisioned to run Kubernetes cluster nodes, ready to host pods of containers.<br />
<br />
You can view the state of the deployed Kubernetes cluster using the <a href="https://console.cloud.google.com/">Google Cloud Platform Console</a> (look at both the “Kubernetes Engine” and the “Compute Engine” sections of the Console).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1VkyD0iQBTbPj5WZWsknq0V7QEJuwTBmOhNvrBxQHKVr5EyHje1rYJ6NTHPe7cauCjnx5t1OpKuroqSK_b3XN25l3t-EfTRbyT-52cM2qxAjlQSN8WViOYQFVRkjkvHxpX7qXAm6VBqo/s1600/console.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="699" data-original-width="1600" height="278" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1VkyD0iQBTbPj5WZWsknq0V7QEJuwTBmOhNvrBxQHKVr5EyHje1rYJ6NTHPe7cauCjnx5t1OpKuroqSK_b3XN25l3t-EfTRbyT-52cM2qxAjlQSN8WViOYQFVRkjkvHxpX7qXAm6VBqo/s640/console.png" width="640" /></a></div>
<br />
Next, let’s register GCE’s fast SSD persistent disks to be used in the cluster:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ cat gce-ssd-storageclass.yaml</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">kind: StorageClass</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">apiVersion: storage.k8s.io/v1beta1</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">metadata:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> name: fast</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">provisioner: kubernetes.io/gce-pd</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">parameters:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> type: pd-ssd</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">$ kubectl apply -f gce-ssd-storageclass.yaml</span><br />
<br />
Then run the following commands to allocate three 30GB disks of Google Cloud storage, using the fast SSD persistent disk type, followed by a query to show the status of the newly created disks:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ gcloud compute disks create --size 30GB --type pd-ssd pd-ssd-disk-1</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ gcloud compute disks create --size 30GB --type pd-ssd pd-ssd-disk-2</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ gcloud compute disks create --size 30GB --type pd-ssd pd-ssd-disk-3</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ gcloud compute disks list</span><br />
<br />
Now, declare 3 Kubernetes “Persistent Volume” definitions, each referencing one of the storage disks just created:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ cat gce-ssd-persistentvolume1.yaml</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">apiVersion: "v1"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">kind: "PersistentVolume"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">metadata:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> name: data-volume-1</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">spec:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> capacity:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> storage: 30Gi</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> accessModes:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - ReadWriteOnce</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> persistentVolumeReclaimPolicy: Retain</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> storageClassName: fast</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> gcePersistentDisk:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> pdName: pd-ssd-disk-1</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">$ kubectl apply -f gce-ssd-persistentvolume1.yaml</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><i><br /></i></span>
<i>(repeat for Disks 2 and 3, using similar files, “gce-ssd-persistentvolume2.yaml” and “gce-ssd-persistentvolume3.yaml” respectively, with the fields “name: data-volume-?” and “pdName: pd-ssd-disk-?” set in each file)</i><br />
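<br />
<i>(alternatively, rather than maintaining three nearly identical files, the per-disk values can be substituted on the fly; a minimal sketch, assuming the Disk 1 file shown above is used as the template:)</i><br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ for i in 1 2 3; do sed -e "s/data-volume-1/data-volume-$i/" -e "s/pd-ssd-disk-1/pd-ssd-disk-$i/" gce-ssd-persistentvolume1.yaml | kubectl apply -f -; done</span><br />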
<br />
Once the three Persistent Volumes are configured, their status can be viewed with the following command:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ kubectl get persistentvolumes</span><br />
<br />
This will show that the state of each volume is marked as “available” (i.e. no container has staked a claim on them yet).<br />
<br />
A key deviation from the original Google blog post is enabling MongoDB authentication immediately, before any "mongod" processes are started. Enabling authentication for a MongoDB replica set doesn’t just enforce authentication of applications using MongoDB, but also enforces <a href="https://docs.mongodb.com/v3.0/tutorial/enable-internal-authentication/">internal authentication</a> for inter-replica communication. Therefore, let’s generate a keyfile to be used for internal cluster authentication and register it as a Kubernetes <a href="https://kubernetes.io/docs/concepts/configuration/secret/">Secret</a>:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ TMPFILE=$(mktemp)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ /usr/bin/openssl rand -base64 741 > $TMPFILE</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ kubectl create secret generic shared-bootstrap-data –from file=internal-auth-mongodb-keyfile=$TMPFILE</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ rm $TMPFILE</span><br />
<br />
This generates a random key into a temporary file and then uses the Kubernetes API to register it as a Secret, before deleting the file. Subsequently, the Secret will be made accessible to each “mongod” via a volume mounted into each host container.<br />
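<br />
If you want to check that the Secret has been registered, without printing the key material itself, run:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ kubectl describe secret shared-bootstrap-data</span><br />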
<br />
For the final Kubernetes provisioning step, we need to prepare the definition of the Kubernetes <a href="https://kubernetes.io/docs/concepts/services-networking/service/">Service</a> and <a href="https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/">StatefulSet</a> for MongoDB, which, amongst other things, encapsulates the configuration of the “mongod” Docker container to be run.<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ cat mongodb-service.yaml</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">apiVersion: v1</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">kind: Service</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">metadata:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> name: mongodb-service</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> labels:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> name: mongo</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">spec:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> ports:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - port: 27017</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> targetPort: 27017</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> clusterIP: None</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> selector:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> role: mongo</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">---</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">apiVersion: apps/v1beta1</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">kind: StatefulSet</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">metadata:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> name: mongod</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">spec:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> serviceName: mongodb-service</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> replicas: 3</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> template:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> metadata:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> labels:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> role: mongo</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> environment: test</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> replicaset: MainRepSet</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> spec:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> terminationGracePeriodSeconds: 10</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> volumes:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - name: secrets-volume</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> secret:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> secretName: shared-bootstrap-data</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> defaultMode: 256</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> containers:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - name: mongod-container</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> image: mongo</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> command:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> <b> - "mongod"</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b> - "--bind_ip"</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b> - "0.0.0.0"</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b> - "--replSet"</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b> - "MainRepSet"</b></span><br />
<b style="font-family: "courier new", courier, monospace;"> - "--auth"</b><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b> - "--clusterAuthMode"</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b> - "keyFile"</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b> - "--keyFile"</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b> - "/etc/secrets-volume/internal-auth-mongodb-keyfile"</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b> - "--setParameter"</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><b> - "authenticationMechanisms=SCRAM-SHA-1"</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> ports:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - containerPort: 27017</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> volumeMounts:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - name: secrets-volume</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> readOnly: true</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> mountPath: /etc/secrets-volume</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - name: mongodb-persistent-storage-claim</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> mountPath: /data/db</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> volumeClaimTemplates:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> - metadata:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> name: mongodb-persistent-storage-claim</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> annotations:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> volume.beta.kubernetes.io/storage-class: "fast"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> spec:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> accessModes: [ "ReadWriteOnce" ]</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> resources:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> requests:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> storage: 30Gi</span><br />
<br />
You may notice that this Service definition varies in some key areas from the one provided in the original Google blog post. Specifically:<br />
<ol>
<li>A “Volume” called “secrets-volume” is defined, ready to expose the shared keyfile to each of the “mongod” replicas that will run.</li>
<li>Additional command line parameters are specified for “mongod”, to enable authentication (“--auth”) and to provide related security settings, including the path where “mongod” should locate the keyfile on its local filesystem.</li>
<li>In the “VolumeMounts” section, the mount point path is specified for the Volume that holds the key file.</li>
<li>The storage request for the Persistent Volume Claim that the container will make has been reduced from 100GB to 30GB, to avoid exhausting storage quotas when using the free trial of the Google Cloud Platform.</li>
<li>No “sidecar” Container is defined for the same Pod as the “mongod” Container.</li>
</ol>
Now it’s time to deploy the MongoDB Service and StatefulSet. Run:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ kubectl apply -f mongodb-service.yaml</span><br />
<br />
Once this has run, you can view the health of the service and pods:<br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">$ kubectl get all</span><br />
<br />
Keep re-running the command above until you can see that all 3 “mongod” pods and their containers have been successfully started (<i>“Status=Running”</i>).<br />
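<br />
Alternatively, rather than polling, you can ask "kubectl" to watch the pods (selecting on the "role" label from the StatefulSet's pod template) and stream status changes as they happen:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ kubectl get pods -l role=mongo --watch</span><br />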
<br />
You can also check the status of the Persistent Volumes, to ensure they have been properly claimed by the running “mongod” containers:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ kubectl get persistentvolumes</span><br />
<br />
Finally, we need to connect to one of the “mongod” container processes to configure the replica set and specify an administrator user for the database. Run the following command to connect to the first container:<br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">$ kubectl exec -it mongod-0 -c mongod-container bash</span><br />
<br />
This will place you into a command line shell directly in the container. If you fancy it, you can explore the container environment. For example, you may want to run the following commands to see what processes are running in the container and to see the hostname of the container (this hostname should always be the same, because a StatefulSet has been used):<br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">$ ps -aux</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ hostname -f</span><br />
<br />
Connect to the local “mongod” process using the Mongo Shell (it is only possible to connect unauthenticated from the same host that the database process is running on, by virtue of the <a href="https://docs.mongodb.com/manual/core/security-users/#localhost-exception">localhost exception</a>).<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ mongo</span><br />
<br />
In the shell run the following command to initiate the replica set (we can rely on the hostnames always being the same, due to having employed a StatefulSet):<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">> rs.initiate({_id: "MainRepSet", version: 1, members: [</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> { _id: 0, host : "mongod-0.mongodb-service.default.svc.cluster.local:27017" },</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> { _id: 1, host : "mongod-1.mongodb-service.default.svc.cluster.local:27017" },</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> { _id: 2, host : "mongod-2.mongodb-service.default.svc.cluster.local:27017" }</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> ]});</span><br />
<br />
Keep checking the status of the replica set, with the following command, until you see that the replica set is fully initialised and a primary and two secondaries are present:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">> rs.status();</span><br />
<br />
Then run the following command to configure an “admin” user (performing this action results in the “localhost exception” being automatically and permanently disabled):<br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">> db.getSiblingDB("admin").createUser({</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> user : "main_admin",</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> pwd : "abc123",</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> roles: [ { role: "root", db: "admin" } ]</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> });</span><br />
<br />
Of course, in a real deployment, the steps used above to configure a replica set and create an admin user would be scripted, parameterised and driven by an external process, rather than typed in manually.<br />
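<br />
As a flavour of what that scripting might look like, in a fresh deployment the same initiation command could be issued non-interactively from the workstation. This is just a sketch; a real script would also wait for all three pods to become ready first:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ kubectl exec mongod-0 -c mongod-container -- mongo --eval '</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    rs.initiate({_id: "MainRepSet", version: 1, members: [</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">      {_id: 0, host: "mongod-0.mongodb-service.default.svc.cluster.local:27017"},</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">      {_id: 1, host: "mongod-1.mongodb-service.default.svc.cluster.local:27017"},</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">      {_id: 2, host: "mongod-2.mongodb-service.default.svc.cluster.local:27017"}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    ]})'</span><br />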
<br />
That’s it. You should now have a MongoDB Replica Set running on Kubernetes on GKE.<br />
<br />
<br />
<h2>
<span style="font-size: x-large;">
Run Some Quick Tests</span></h2>
Let’s just prove a couple of things before we finish:<br />
<br />
1. Show that data is indeed being replicated between members of the containerised replica set.<br />
2. Show that even if we remove the replica set containers and then re-create them, the same stable hostnames are still used and no data loss occurs when the replica set comes back online. The StatefulSet’s Persistent Volume Claims should result in the same storage, containing the MongoDB data files, being re-attached to the same “mongod” container identities.<br />
<br />
Whilst still in the Mongo Shell from the previous step, authenticate and quickly add some test data:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">> db.getSiblingDB('admin').auth("main_admin", "abc123");</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">> use test;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">> db.testcoll.insert({a:1});</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">> db.testcoll.insert({b:2});</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">> db.testcoll.find();</span><br />
<br />
Exit out of the shell and exit out of the first container (“mongod-0”). Then, using the following commands, connect to the second container (“mongod-1”), run the Mongo Shell again and see if the data we’d entered via the first replica is visible to the second replica:<br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">$ kubectl exec -it mongod-1 -c mongod-container bash</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ mongo</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">> db.getSiblingDB('admin').auth("main_admin", "abc123");</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">> db.setSlaveOk(1);</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">> use test;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">> db.testcoll.find();</span><br />
<br />
You should see that the two records inserted via the first replica are visible to the second replica.<br />
<br />
To see if the Persistent Volume Claims really are working, use the following commands to drop the Service & StatefulSet (thus stopping the pods and their “mongod” containers) and then re-create them (I’ve included some checks in between, so you can track the status):<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ kubectl delete statefulsets mongodb-statefulset</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ kubectl delete services mongodb-service</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ kubectl get all</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ kubectl get persistentvolumes</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ kubectl apply -f mongodb-service.yaml</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ kubectl get all</span><br />
<br />
As before, keep re-running the last command above until you can see that all 3 “mongod” pods and their containers have been successfully started again. Then connect to the first container, run the Mongo Shell and execute a query to see if the data we’d inserted into the old containerised replica set is still present in the re-instantiated replica set:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ kubectl exec -it mongod-0 -c mongod-container bash</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ mongo</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">> db.getSiblingDB('admin').auth("main_admin", "abc123");</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">> use test;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">> db.testcoll.find();</span><br />
<br />
You should see that the two records inserted earlier are still present.<br />
<br />
<br />
<h2>
<span style="font-size: x-large;">
Summary</span></h2>
In this blog post I’ve shown how a MongoDB Replica Set can be deployed, using Kubernetes StatefulSets, to the Google Kubernetes Engine (GKE). Most of the outlined steps (but not all) are actually generic to any type of Kubernetes platform. Critically, I have shown how to ensure the Kubernetes-based MongoDB Replica Set is secure by default, and how to ensure the Replica Set can operate normally and remain resilient to various types of system failure.<br />
<br />
<i>[Next post in series: <a href="http://pauldone.blogspot.co.uk/2017/06/mongodb-kubernetes-production-settings.html">Configuring Some Key Production Settings for MongoDB on GKE Kubernetes</a>]</i><br />
<br />
<br />
<i>Song for today: Sun by <a href="https://en.wikipedia.org/wiki/The_Hotelier">The Hotelier</a></i>Paul Donehttp://www.blogger.com/profile/09556312012162376804noreply@blogger.com21tag:blogger.com,1999:blog-1304066656993695443.post-40579863653967112272016-03-18T15:54:00.003+00:002020-03-01T20:55:42.050+00:00SHOCKER: XA Distributed Transactions are only Eventually Consistent!Apologies for the tabloid-trash style headline. It could have been worse; I could have gone with my working title of "WARNING: XA will eat your first-born"!<br />
<br />
This topic has come up in a few conversations I've had recently. It turns out that most people don't realise what I'd assumed to be widely understood. <a href="https://en.wikipedia.org/wiki/Two-phase_commit_protocol">2-phase-commit</a> (2PC) and <a href="https://en.wikipedia.org/wiki/X/Open_XA">XA</a> (its widely used implementation) are <b>NOT</b> <a href="https://en.wikipedia.org/wiki/ACID">ACID</a> compliant. Specifically, XA/2PC does not provide strong consistency guarantees and is in fact just <a href="https://en.wikipedia.org/wiki/Eventual_consistency">Eventually Consistent</a>. It gets worse in practice. In places where I've seen XA/2PC used, it transpires that Atomicity and Durability are on shaky ground too (more on this later).<br />
<br />
Why have I seen this topic rearing its head recently? Well, some organisations have cases where they want to update data in an Oracle database and in a <b>MongoDB</b> database, as a single transaction. Nothing wrong with that of course; it really just depends on how you choose to implement the transaction and what your definition of a "transaction" really is. All too often, those stating this requirement will then go on to say that MongoDB has a problem because it does not support the XA protocol. They want their database products to magically take care of this complexity, under the covers. They want them to provide the guarantee of "strong" consistency across the multiple systems, without having to deal with this in their application code. If you're one of those people, I'm here to tell you that <i>these are not the <strike>droids</strike> protocols you are looking for</i>.<br />
<br />
Let me have a go at explaining why XA/2PC distributed transactions are not strongly consistent. This is based on what I've seen over the past 10 years or so, especially in the mid-2000s, when working with some UK government agencies and seeing this issue at first hand.<br />
<br />
<br />
<b>First of all, what are some examples of distributed transactions?</b><br />
<ul>
<li>You've written a piece of application code that needs to put the same data into two different databases (eg. Oracle's database and IBM's DB2 database), all or nothing, as a single transaction. You don't want to have a situation where the data appears in one database but not in the other.</li>
<li>You've written a piece of application code that receives a message off a <a href="https://en.wikipedia.org/wiki/Message_queue">message queue</a> (eg. <a href="https://en.wikipedia.org/wiki/IBM_WebSphere_MQ">IBM MQ Series</a>) and inserts the data, contained in the message, into a database. You want these two operations to be part of the same transaction. You want to avoid the situation where the dequeue operation succeeds but the DB insert operation fails, resulting in a lost message and no updated database. You also want to avoid the situation where the database insert succeeds, but the acknowledgement of the dequeue operation subsequently fails, resulting in the message being re-delivered (a duplicate database insert of the same data would then occur).</li>
<li>Another "real-world" example that people [incorrectly] quote, is moving funds between two bank systems, where one system is debited, say £50, and the other is credited by the same amount, as a single "transaction". Of course you wouldn't want the situation to occur where £50 is taken from one system, but due to a transient failure, is not placed in the other system, so money is lost <b>*</b>. The reason I say "incorrectly" is that in reality, banks don't manage and record money transfers this way. <a href="http://highscalability.com/blog/2013/5/1/myth-eric-brewer-on-why-banks-are-base-not-acid-availability.html">Eric Brewer explains</a> why this is the case in a much more eloquent way than I ever could. On a related but more existential note, Gregor Hohpe's classic post is still well worth a read: <a href="http://www.enterpriseintegrationpatterns.com/docs/IEEE_Software_Design_2PC.pdf">Your Coffee Shop Doesn’t Use Two-Phase Commit</a>.</li>
</ul>
<b style="font-size: small;">*</b><span style="font-size: x-small;"> although if you were the receiver of the funds, you might like the other possible outcome, where you receive two lots of £50, due to the occurrence of a failure during the 1st transaction attempt</span><br />
<b><br /></b>
<br />
<b><br /></b>
<b>So what's the problem then?</b><br />
<br />
Back in the mid-2000s, I was involved in building distributed systems with application code running on a <a href="https://en.wikipedia.org/wiki/Java_Platform,_Enterprise_Edition">Java EE</a> <a href="https://en.wikipedia.org/wiki/Application_server">Application Server</a> (<a href="https://en.wikipedia.org/wiki/Oracle_WebLogic_Server">WebLogic</a>). The code would update a record in a database (Oracle) and then place a message on a queue (IBM MQ Series), as part of the same distributed transaction, using XA. At a simplistic level, the transactions performed looked something like this:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjaD3sQwCKXKPKdHKajFVxGuk8I2MQhIodDsF9Br5LhluXIIM37txfZ3jkJwEOK8vjBr3Qci0EOlSTO6rzL1yLJV-ISOajGMwk5_hyphenhyphenB4hWngiXErufnoRywZrDjTQlyMQ36ZYPXx9qTiQg/s1600/1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="198" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjaD3sQwCKXKPKdHKajFVxGuk8I2MQhIodDsF9Br5LhluXIIM37txfZ3jkJwEOK8vjBr3Qci0EOlSTO6rzL1yLJV-ISOajGMwk5_hyphenhyphenB4hWngiXErufnoRywZrDjTQlyMQ36ZYPXx9qTiQg/s400/1.png" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
If the update to the DB failed, the enqueue operation to put the message onto the message queue would be rolled back, and vice versa. However, as with most real world scenarios, the business process logic was more involved than that. The business process actually had two stages, which looked like this:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgp-LfV6oPI75oV7WQ9WrPj_gCKBq1NlYV1c27HFHIf1WsQ4HDDjPXv0UmWg7ljUw17IfCA6kqm_8vORvcIQ1SS0exu3SW7zyFu7UuL1gCKtUvqDX2jLP5UQnn1fnfQH5gkOQY_z-oDdzU/s1600/2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="355" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgp-LfV6oPI75oV7WQ9WrPj_gCKBq1NlYV1c27HFHIf1WsQ4HDDjPXv0UmWg7ljUw17IfCA6kqm_8vORvcIQ1SS0exu3SW7zyFu7UuL1gCKtUvqDX2jLP5UQnn1fnfQH5gkOQY_z-oDdzU/s400/2.png" width="400" /></a></div>
<br />
Basically, in the first stage of the process, the application code would put some data in the database and then put a message on a queue, as part of a single transaction, ready to allow the next stage of the process to be kicked off. The queuing system would already have a piece of application code registered with it to listen for arriving messages (called a "message listener"). Once the message was committed to the queue, a new transaction would be initiated. In this new transaction, the message would be given to the message listener. The listener's application code would receive the message and then read some of the data that was previously inserted into the database.<br />
<br />
However, when we load tested this, before putting the solution into production, the system didn't always work that way. Sporadically, we saw this instead:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh94D9K72zo7hGbtbptBwiuSkBkphDR63OV2_l3Wx-9R_sO8PFdVhxF70SkmxjAMlx74d5gFUnQEEHMnOEyevrtjxb7gFQ-zDF9cETx30Nx94E-8_TUExpZ2RxE-qvTY2rRCE2nTgc3Zvc/s1600/3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="355" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh94D9K72zo7hGbtbptBwiuSkBkphDR63OV2_l3Wx-9R_sO8PFdVhxF70SkmxjAMlx74d5gFUnQEEHMnOEyevrtjxb7gFQ-zDF9cETx30Nx94E-8_TUExpZ2RxE-qvTY2rRCE2nTgc3Zvc/s400/3.png" width="400" /></a></div>
<br />
How could this be? A previous transaction had put the data in the database as part of the same transaction that put the message on the queue. Only when the message was successfully committed to the queue could the second transaction be kicked off. Yet the subsequent listener code couldn't find the row of data in the database that had been inserted by the previous transaction!<br />
<br />
At first we assumed there was a bug somewhere and hunted for it in Oracle, MQ Series, WebLogic and, especially, our own code. Getting nowhere, we eventually started digging around the XA/2PC specification a little more, and we realised that the system was behaving correctly. It was correct behaviour to see such race conditions happen intermittently (even though it definitely wasn't desirable behaviour on our part). This is because, even though XA/2PC guarantees that both resources in a transaction will have their changes either committed or rolled back atomically, it can't enforce exactly when this will happen in each. The final commit action (the 2nd phase of 2PC) performed by each of those resource systems is initiated in parallel and hence cannot be synchronised.<br />
<br />
The asynchronous nature of XA/2PC, for the final commit process, is by necessity. This allows for circumstances where one of the systems may have temporarily gone down between voting "yes" to commit and subsequently being told to actually commit. If it were never possible for any of the systems to go down, there would be little need for transactions in the first place (quid pro quo). The application server controlling the transaction keeps trying to tell the failed system to commit, until it comes back online and executes the commit action. The database or message queue system can never 100% guarantee to commit immediately, and thus only guarantees to commit eventually. Even when there isn't a failure, the two systems are being committed in parallel and will each take different and non-deterministic durations to fulfil the commit action (including the variable time it takes to persist to disk, for durability). There's no way of guaranteeing that they both achieve this in exactly the same instant of time; they never will. In our situation, the missing data would eventually appear in the database, but there was no guarantee that it would always be there when the code in a subsequent transaction tried to read it. Indeed, upon researching a couple of things while preparing this blog post, I discovered that even Oracle now <a href="http://www.oracle.com/technetwork/products/clustering/overview/distributed-transactions-and-xa-163941.pdf">documents this type of race condition</a> (see the section "Avoiding the XA Race Condition").<br />
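<br />
To make the race concrete, here is a simplified timeline of the failure case we kept hitting (a sketch of one possible interleaving, not an exact trace):<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">T1: App txn 1: update DB row + enqueue message (one XA transaction)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">T2: Phase 1 (prepare): DB votes yes; MQ votes yes</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">T3: Phase 2 (commit): commit issued to DB and MQ, in parallel</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">T4: MQ commit lands first; the message listener fires (new transaction)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">T5: Listener queries the DB for the row -> not yet visible</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">T6: DB commit lands; the row appears (too late for the listener)</span><br />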
<br />
Back then, we'd inadvertently created the perfect reproducible test case for <b><span style="color: red;">PROOF THAT XA/2PC IS ONLY EVENTUALLY CONSISTENT</span></b>. To fix the situation, we had to put some convoluted workarounds into our application code. The workarounds weren't pretty and they're not something I have any desire to re-visit here.<br />
<br />
<br />
<b>There's more! When the solution went live, things got even worse...</b><br />
<br />
It wasn't just the "C" in "ACID" that was being butchered. It turned out that there was a contract out to do a hatchet job on the "A" and "D" of "ACID" too.<br />
<br />
In live production environments, temporary failures of sub-systems will inevitably occur. In our high-throughput system, some distributed transactions would always be in flight at the time of a failure. These in-flight transactions would then stick around for a while and some would be visible in the Oracle database (tracked in Oracle's "<a href="https://docs.oracle.com/cd/B28359_01/server.111/b28310/ds_txnman007.htm">DBA_2PC_PENDING</a>" system table). There's nothing wrong with this, except that the application code that created these transactions will have been holding a lock on one or more table rows. These locks are a result of the application code having performed an update operation, as part of the transaction, that has not yet been committed. In our live environment, due to these transactions being in-doubt for a while (minutes or even hours, depending on the type of failure), a cascading set of follow-on issues would occur. Subsequent client requests coming into the application would start backing up as they tried to query the same locked rows of data, and would get blocked or fail. This was due to the code having used the very common "<a href="https://docs.oracle.com/cd/E11882_01/server.112/e40540/consist.htm#CNCPT1331">SELECT ... FOR UPDATE</a>" operation, which attempts to grab a lock on a row, ready for the row to be updated in a later step.<br />
<br />
Pretty soon there would be a deluge of blocking or failing threads and the whole system would appear to lock up. No client requests could be serviced. Of course, the DBA would then receive a rush of calls from irate staff yelling that the mission critical database had ground to a halt. Under such time pressure, all the poor DBA could possibly do was to go to the source of the locks and try to release them. This meant going to Oracle's "pending transactions" system tables and unilaterally <a href="https://docs.oracle.com/cd/B28359_01/server.111/b28310/ds_txnman007.htm">rolling back or committing</a> each of them, to allow the system to recover and service requests again. At this point all bets were off. The DBA's decision to rollback or commit would have been completely arbitrary. Some of the in-flight transactions would have been partly rolled-back in Oracle, but would have been partly committed in MQ Series, and vice versa.<br />
<br />
So in practice, these in-doubt transactions were neither applied Atomically nor Durably. The "theory" of XA guaranteeing Atomicity and Durability was not under attack. However, the practical real-world application of it was. At some point, fallible human intervention was required to quickly rescue a failing mission critical system. Most people I know live in the real world.<br />
<br />
<br />
<b>My conclusions...</b><br />
<br />
You can probably now guess my view on XA/2PC. It is not a panacea. Nowhere near. It gives developers false hope, lulling them into a false sense of security, where, at best, their heads can be gently warmed whilst buried in the sand.<br />
<br />
It is impossible to perform a distributed transaction on two or more different systems in a fully ACID manner. Accept it and deal with it by allowing for this in application code and/or in compensating business processes. This is why I hope MongoDB is never engineered to support XA, as I'd hate to see such a move encourage good developers to do bad things.<br />
<br />
<span style="font-size: x-small;"><br /></span>
<span style="font-size: x-small;"><b>Footnote</b>: Even if your DBA refuses to unilaterally commit or rollback transactions, when the shit is hitting the fan, your database eventually will, thus violating XA/2PC. For example, in Oracle, the database will unilaterally decide to rollback all pending transactions, older than the default value of 24 hours (see "<a href="http://docs.oracle.com/cd/E24329_01/web.1211/e24377/trxman.htm#WLJTA177">Abandoning Transactions</a>" section of Oracle's documentation).</span><br />
<br />
<br />
<i>Song for today: The Greatest by <a href="https://en.wikipedia.org/wiki/Cat_Power">Cat Power</a></i><br />
<i><br /></i>Paul Donehttp://www.blogger.com/profile/09556312012162376804noreply@blogger.com8tag:blogger.com,1999:blog-1304066656993695443.post-1197349769881133322015-12-16T07:53:00.002+00:002020-03-01T20:58:24.262+00:00MongoDB's BI Connector and pushdown of SQL WHERE clauses<i>[<b>EDIT 05-Apr-2018</b>: MongoDB BI Connector version 2+ uses a much richer and more powerful approach for pushing down SQL clauses to the database - for more info see <a href="https://docs.mongodb.com/bi-connector/current/">here</a>]</i><br />
<br />
In previous posts I showed how to use SQL &amp; ODBC to query data from MongoDB via <a href="http://pauldone.blogspot.co.uk/2015/12/mongodbbiconctrwindowsodbc.html">Windows clients</a> and <a href="http://pauldone.blogspot.co.uk/2015/12/mongodbbiconctrlinuxodbc.html">Linux clients</a>. In this post, I want to explore what happens to the SQL statement when it is sent to the BI Connector and on to the MongoDB database. For example, is the SQL WHERE clause pushed down to the database to resolve?<br />
<br />
Again, for these tests, I've used the same <a href="https://data.gov.uk/dataset/anonymised_mot_test">MOT UK car test results data set</a> as my last two posts.<br />
<br />
As I wanted to get a better insight into what MongoDB is doing under the covers to process SQL queries, I used the Mongo Shell to enable profiling for the MongoDB database holding the MOT data.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPGbUWaR-v1lR4S_WPvbSxk8Yt0wu1dl2jxxxL21-utFJDHRxkKUIE-V2Whfi7kO2SoW56GRVSKrvk9DgNpNRLZMjE0OOES3RpB9kSaDuRy9Z4dM98poKYQ5IOAqBLtKrGNARMLzZnZ-s/s1600/1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="174" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPGbUWaR-v1lR4S_WPvbSxk8Yt0wu1dl2jxxxL21-utFJDHRxkKUIE-V2Whfi7kO2SoW56GRVSKrvk9DgNpNRLZMjE0OOES3RpB9kSaDuRy9Z4dM98poKYQ5IOAqBLtKrGNARMLzZnZ-s/s640/1.png" width="640" /></a></div>
<br />
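For anyone wanting to reproduce this, enabling full profiling from the Mongo Shell is a one-liner; level 2 captures all operations, not just slow ones:<br />
<br />
<span style="font-family: courier new, courier, monospace;">> use mot</span><br />
<span style="font-family: courier new, courier, monospace;">> db.setProfilingLevel(2)</span><br />
<br />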
Then on my Linux desktop client, using the <a href="http://pauldone.blogspot.co.uk/2015/12/mongodbbiconctrlinuxodbc.html">ODBC settings</a> I'd configured in my last post, I fired up <i>isql</i> ready to start issuing queries against the MOT data set, via the BI Connector.<br />
<br />
I submitted a SQL statement to query all cars with a recorded mileage of over 500,000 miles, selecting specific columns only.<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">> SELECT make, model, test_mileage FROM testresults WHERE test_mileage > 500000;</span><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiSv66qmAPfwPCNpD4PPharFILYL_Ytv1twvqm-lN20RJ7aIhKi-riTROjCkbn9YNrqxJ_zQ3671MhOqDe5Ps4QiHZFf5FwjeoanJL6z1BW5wYW8QmWY49rUcnU2Chf-i0YR8jkBNaARQ0/s1600/2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="614" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiSv66qmAPfwPCNpD4PPharFILYL_Ytv1twvqm-lN20RJ7aIhKi-riTROjCkbn9YNrqxJ_zQ3671MhOqDe5Ps4QiHZFf5FwjeoanJL6z1BW5wYW8QmWY49rUcnU2Chf-i0YR8jkBNaARQ0/s640/2.png" width="640" /></a></div>
<br />
The results were correctly returned in <i>isql</i>, but I was more interested to see what MongoDB was asked to do on the server side to fulfil this request. So using the Mongo Shell I queried the system profile collection for the last recorded entry, displaying the exact request that MongoDB had received as a result of the translated SQL query.<br />
<br />
<span style="font-family: courier new, courier, monospace;">> db.system.profile.find({ns: "mot.testresults"}).sort({$natural: -1}).limit(1).pretty()</span><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVG60McEpuQAv4WnJLrzgRyBsJhZwJrTwes189lGfvdY5bZhWJb5yUCgzN_nNnnlOJNnS3IQ_uWtbFsNhRCpzTbOOnaMY_8eeJC1Z6IRsPdssy6L2LLy5eWfsA4IOKf1XzHafrtc3loZg/s1600/3b.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="354" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVG60McEpuQAv4WnJLrzgRyBsJhZwJrTwes189lGfvdY5bZhWJb5yUCgzN_nNnnlOJNnS3IQ_uWtbFsNhRCpzTbOOnaMY_8eeJC1Z6IRsPdssy6L2LLy5eWfsA4IOKf1XzHafrtc3loZg/s640/3b.png" width="640" /></a></div>
<br />
As you can see in the output, the BI Connector has indeed pushed the WHERE clause down to the database, plus a projection to return only the specific fields requested. The profiler output shows that the BI Connector achieved this by assembling an <a href="https://docs.mongodb.org/v3.0/core/aggregation-pipeline/">Aggregation Pipeline</a>.<br />
<div>
<br /></div>
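In other words, the connector's generated request was roughly equivalent to running the following pipeline by hand (my approximation; the connector's exact field list and stage ordering may differ slightly):<br />
<br />
<span style="font-family: courier new, courier, monospace;">> db.testresults.aggregate([</span><br />
<span style="font-family: courier new, courier, monospace;">      {$match: {test_mileage: {$gt: 500000}}},</span><br />
<span style="font-family: courier new, courier, monospace;">      {$project: {_id: 0, make: 1, model: 1, test_mileage: 1}}</span><br />
<span style="font-family: courier new, courier, monospace;">  ])</span><br />
<br />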
<div>
This is great to see. Most of the work to process the SQL query is being done at the database level, reducing the amount of data (fewer rows and fewer columns) returned to the ODBC client for final processing, and enabling database indexes to be leveraged for maximum performance.<br />
<br /></div>
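For example, a simple index on the filtered field is all the pushed-down $match stage needs to avoid a full collection scan:<br />
<br />
<span style="font-family: courier new, courier, monospace;">> db.testresults.createIndex({test_mileage: 1})</span><br />
<br />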
<div>
<i>Song for today: End Come Too Soon by <a href="https://en.wikipedia.org/wiki/Wild_Beasts">Wild Beasts</a></i></div>
<div>
<br /></div>
Paul Donehttp://www.blogger.com/profile/09556312012162376804noreply@blogger.com1tag:blogger.com,1999:blog-1304066656993695443.post-44215601937185914512015-12-15T09:36:00.000+00:002020-03-01T21:01:51.816+00:00Accessing MongoDB data using SQL / ODBC on Linux<i>[<b>EDIT 05-Apr-2018</b>: MongoDB BI Connector version 2+ uses a different mechanism for connecting (no longer using a PostgreSQL driver) - for more info see <a href="https://docs.mongodb.com/bi-connector/current/">here</a>]</i><br />
<br />
In my previous <a href="http://pauldone.blogspot.ie/2015/12/mongodbbiconctrwindowsodbc.html">post</a> I showed how generic Windows clients can use ODBC to query data from MongoDB using the new <a href="https://www.mongodb.com/products/bi-connector">BI Connector</a>. I'm not really a Windows type person though and feel more at home with a Linux desktop. Therefore, in this post, I will show how easy it is to use ODBC from Linux (Ubuntu 14.04 in my case) to query MongoDB data using SQL. I've used the same <a href="https://data.gov.uk/dataset/anonymised_mot_test">MOT UK car test results data set</a> as my <a href="http://pauldone.blogspot.ie/2015/12/mongodbbiconctrwindowsodbc.html">last post</a>.<br />
<br />
First of all I needed to install the Linux ODBC packages plus the PostgreSQL ODBC driver from the package repository.<br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">$ sudo apt-get install unixodbc-bin unixodbc odbc-postgresql</span><br />
<br />
Then I googled how to use ODBC and the PostgreSQL driver to query an ODBC data source. Surprisingly, I didn't find much quality information out there. However, looking at the contents of the "odbc-postgresql" package....<br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">$ apt-file list odbc-postgresql</span><br />
<br />
...it showed some bundled documents including...<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">/usr/share/doc/odbc-postgresql/README.Debian</span><br />
<br />
Upon opening this text file I found pretty much everything I needed to know to get going. I really should learn to RTFM more often!<br />
<br />
The next step, as documented in that README, was to register the PostgreSQL ANSI & Unicode ODBC drivers in the <i>/etc/odbcinst.ini</i> file, using a pre-supplied template file that contained the common settings.<br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span><span style="font-family: "courier new" , "courier" , monospace;">$ sudo odbcinst -i -d -f /usr/share/psqlodbc/odbcinst.ini.template</span><br />
<br />
Then I needed to create the <i>/etc/odbc.ini</i> file where data sources can be registered. Here I used a skeleton template again.<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ sudo cat /usr/share/doc/odbc-postgresql/examples/odbc.ini.template >> ~/.odbc.ini</span><br />
<span style="font-family: courier new, courier, monospace;">$ sudo mv ~/.odbc.ini /etc/odbc.ini</span><br />
<br />
However, this particular template only includes dummy configuration settings, so I needed to edit this file to register the remote MongoDB BI Connector's details properly.<br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">$ sudo vi /etc/odbc.ini</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">$ cat /etc/odbc.ini </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">[mot]</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Description = MOT Test Data</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Driver = PostgreSQL Unicode</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Trace = No</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">TraceFile = /tmp/psqlodbc.log</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Database = mot </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Servername = 192.168.43.173</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">UserName = mot</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Password = mot</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Port = 27032</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">ReadOnly = Yes</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">RowVersioning = No</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">ShowSystemTables = No</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">ShowOidColumn = No</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">FakeOidIndex = No</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">ConnSettings =</span><br />
<br />
<span style="font-size: x-small;">( forgive my insecure username/password of "mot/mot" ;) )</span><br />
<br />
Now I was ready to launch the "<b>isql</b>" (Interactive SQL) tool, which is bundled with the Linux/UNIX ODBC package, to issue SQL queries against my remote MongoDB BI Connector data source.<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ isql mot mot mot</span><br />
<br />
<span style="font-size: x-small;">( first param is data source name, second param is username, third param is password )</span><br />
<br />
As you can see in the screenshot below, using "isql" I was then able to easily issue arbitrary SQL commands against MongoDB and see the results.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUWI-wAEDUInbRcsteoIXYzQ_ixEMJXw8gthzAKTITXpHi5pxdWSSD3uv7dgzQi0EcKSjS_BYUEDWcaFeOXskSAOmGQRF3liYiiDlfO0_hZYq0RNrVqqUv-6pgORl36Y2x06eGd-OV9_4/s1600/shot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="416" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUWI-wAEDUInbRcsteoIXYzQ_ixEMJXw8gthzAKTITXpHi5pxdWSSD3uv7dgzQi0EcKSjS_BYUEDWcaFeOXskSAOmGQRF3liYiiDlfO0_hZYq0RNrVqqUv-6pgORl36Y2x06eGd-OV9_4/s640/shot.png" width="640" /></a></div>
<br />
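For anyone following along, queries are typed as plain text at the isql "SQL>" prompt; for example:<br />
<br />
<span style="font-family: courier new, courier, monospace;">SQL> SELECT make, model, test_mileage FROM testresults WHERE test_mileage > 500000;</span><br />
<br />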
<br />
And that's it. Very simple, and I now have a powerful command-line SQL tool at my disposal for future experiments. :-)<br />
<br />
<br />
<i>Song for today: Swallowtail by <a href="https://en.wikipedia.org/wiki/Wolf_Alice">Wolf Alice</a></i><br />
<br />
<br />Paul Donehttp://www.blogger.com/profile/09556312012162376804noreply@blogger.com0tag:blogger.com,1999:blog-1304066656993695443.post-13934807943554372632015-12-11T12:29:00.000+00:002020-03-01T21:05:34.018+00:00Accessing MongoDB data from SQL / ODBC on Windows, using the new BI Connector<i>[<b>EDIT 05-Apr-2018</b>: MongoDB BI Connector version 2+ uses a different mechanism for connecting (no longer using a PostgreSQL driver) - for more info see <a href="https://docs.mongodb.com/bi-connector/current/">here</a>]</i><br />
<br />
The latest enterprise version of MongoDB (<a href="https://www.mongodb.com/download-center#enterprise">3.2</a>) includes a new <a href="https://www.mongodb.com/products/bi-connector">BI Connector</a> that enables business intelligence, analytics, and reporting tools that only "speak" <a href="https://en.wikipedia.org/wiki/SQL">SQL</a> to access data in a MongoDB database, using <a href="https://en.wikipedia.org/wiki/Open_Database_Connectivity">ODBC</a>. Most of the <a href="https://www.youtube.com/watch?v=0kwopDp0bmg">examples</a> published so far show how to achieve this using rich graphical tools like Tableau. Therefore, I thought it would be useful to show here that the data is accessible from any tool capable of issuing SQL commands via an ODBC driver, even Microsoft's venerable Excel spreadsheet application. Believe it or not, I still come across organisations out there that are using Excel to report on the state of their business!<br />
<br />
For my example, I loaded a MongoDB database with the anonymised <a href="https://data.gov.uk/dataset/anonymised_mot_test">MOT test results data</a> that the <a href="https://www.gov.uk/government/statistical-data-sets/mot-testing-data-for-great-britain">UK government makes freely available</a>. For non-residents of the UK, <a href="https://en.wikipedia.org/wiki/MOT_test">MOT tests</a> are the annual inspections that all UK road-going cars and other vehicles must pass to be legal and safe. Millions of these test records are recorded every year, and they give a fascinating insight into the types and ages of cars people choose to drive in the UK.<br />
<br />
First, I loaded the <a href="https://en.wikipedia.org/wiki/Comma-separated_values">CSV</a>-based MOT data sets into a MongoDB 3.2 database using a small Python script I wrote, with each document representing one test result for a specific owner's car in a specific year. Below is an example of what one test result document looks like in MongoDB:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0URGr-NsnjQsWJN8iT4FB_h3QYCJ8PeIHL1z6H5sSMqjnr6jSr1ssuN-P88I-Ej7ZvC3_-h27LHppfvn8NHo82JdeuT9N8cfmEIDxEevtEsfyDkGN1f5bI4xVVsh-HUoWxNGFoIUlnW0/s1600/0.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="433" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0URGr-NsnjQsWJN8iT4FB_h3QYCJ8PeIHL1z6H5sSMqjnr6jSr1ssuN-P88I-Ej7ZvC3_-h27LHppfvn8NHo82JdeuT9N8cfmEIDxEevtEsfyDkGN1f5bI4xVVsh-HUoWxNGFoIUlnW0/s640/0.png" width="640" /></a></div>
<br />
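For reference, the loading script amounted to little more than the following sketch (a simplified reconstruction rather than my exact script, with no error handling; it assumes pymongo is installed and a comma-separated file with a header row, whereas the real files may differ):<br />
<br />
<span style="font-family: courier new, courier, monospace;">import csv</span><br />
<span style="font-family: courier new, courier, monospace;">from pymongo import MongoClient</span><br />
<br />
<span style="font-family: courier new, courier, monospace;"># target database and collection (file name below is hypothetical)</span><br />
<span style="font-family: courier new, courier, monospace;">coll = MongoClient("localhost", 27017)["mot"]["testresults"]</span><br />
<br />
<span style="font-family: courier new, courier, monospace;">with open("mot_results.csv") as f:</span><br />
<span style="font-family: courier new, courier, monospace;">    batch = []</span><br />
<span style="font-family: courier new, courier, monospace;">    for row in csv.DictReader(f):  # one dict per CSV row</span><br />
<span style="font-family: courier new, courier, monospace;">        row["test_mileage"] = int(row["test_mileage"])  # store numbers, not strings</span><br />
<span style="font-family: courier new, courier, monospace;">        batch.append(row)</span><br />
<span style="font-family: courier new, courier, monospace;">        if len(batch) == 1000:  # insert in batches for throughput</span><br />
<span style="font-family: courier new, courier, monospace;">            coll.insert_many(batch)</span><br />
<span style="font-family: courier new, courier, monospace;">            batch = []</span><br />
<span style="font-family: courier new, courier, monospace;">    if batch:</span><br />
<span style="font-family: courier new, courier, monospace;">        coll.insert_many(batch)</span><br />
<br />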
I then followed the online <a href="https://docs.mongodb.org/manual/products/bi-connector/">MongoDB BI Connector documentation</a> to configure a BI Connector server to listen for ODBC requests for the "mot" database and translate them into calls to the underlying MongoDB "testresults" collection. I just used the default DRDL ("Document Relational Definition Language") schema file that was automatically generated by the "mongodrdl" command-line utility (bundled with the BI Connector).<br />
<br />
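Generating that default DRDL file is a single command along the following lines (flags quoted from memory, so check the mongodrdl documentation for your version; the output file name is my choice):<br />
<br />
<span style="font-family: courier new, courier, monospace;">$ mongodrdl --host localhost -d mot -o mot.drdl</span><br />
<br />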
Then, on a separate desktop virtual machine running Windows 10, I <a href="http://www.postgresql.org/ftp/odbc/versions/msi/">downloaded</a> the latest PostgreSQL ODBC driver installer for Windows and installed it.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqiF9u4dICuVYH4K8-bPV8hea-qhuHF656N9PK9RPXbTxAutLyrQzdQth_O621nFBiuHUESt3ZjXhPHRlCWjG1hBr5E4abzsr_AmJg3M-xsEN7cCwqBiMIZVauxqZH7qPI0yj13PGTKes/s1600/1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="411" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqiF9u4dICuVYH4K8-bPV8hea-qhuHF656N9PK9RPXbTxAutLyrQzdQth_O621nFBiuHUESt3ZjXhPHRlCWjG1hBr5E4abzsr_AmJg3M-xsEN7cCwqBiMIZVauxqZH7qPI0yj13PGTKes/s640/1.png" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZ0dqUTTgwX6Ot1yGXes-tIDgNlSiAoTKUds2S8e0T02F8BTKDFlzTy8gkA9zfeuBbb36jzhW1IT2CqxrchhxOLXAE4MbEPMwQkI1XvAdFAhbqNJYjiWiFKJNq_f486679X0RJ8j_qY7A/s1600/2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="481" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZ0dqUTTgwX6Ot1yGXes-tIDgNlSiAoTKUds2S8e0T02F8BTKDFlzTy8gkA9zfeuBbb36jzhW1IT2CqxrchhxOLXAE4MbEPMwQkI1XvAdFAhbqNJYjiWiFKJNq_f486679X0RJ8j_qY7A/s640/2.png" width="640" /></a></div>
<br />
With the ODBC driver installed, I then proceeded to define a Windows ODBC Data Source referencing the MOT database that I was exposing via the BI Connector (running on a remote machine).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGkPbs-rRMd49GwwfE28Mk-8Vfunr9lRFglB_6OIn2BBDhcXCUP47z_AjOt67-eDLD8azOFyEfwhWC_vQwMabI_oN-iWAZ5kyauohQoYpJXusDuMA8rxvwx1atTHHT7W9pjO1WqPaXmKU/s1600/3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGkPbs-rRMd49GwwfE28Mk-8Vfunr9lRFglB_6OIn2BBDhcXCUP47z_AjOt67-eDLD8azOFyEfwhWC_vQwMabI_oN-iWAZ5kyauohQoYpJXusDuMA8rxvwx1atTHHT7W9pjO1WqPaXmKU/s640/3.png" width="360" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgaQo_-3jyhxRQ4kKZmcANF9pvxixS9bRUHA8ZIQr_PEP1ntLK6ONQrBlqCA-yyxNcKmTerBYrubI6kFHdgFyKp_QmsecftJJ800eO2fNN9mKOSgHa7D9lVWqhdfxwmlB9XXO8s4aaLYpY/s1600/4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="456" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgaQo_-3jyhxRQ4kKZmcANF9pvxixS9bRUHA8ZIQr_PEP1ntLK6ONQrBlqCA-yyxNcKmTerBYrubI6kFHdgFyKp_QmsecftJJ800eO2fNN9mKOSgHa7D9lVWqhdfxwmlB9XXO8s4aaLYpY/s640/4.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdUqQISJWD_53bcbNuRjJgXJwlHYJkpm5JtXqUaUN13IdaHC3fEwOMkuNodSwhq3ffeeLqkpKcOzvOjhG549wa3BVYO9jjOxM2-ttBP-p-FH9E6JsqsP7fcj4Ukd0jkWBIawKYd7N7nh0/s1600/5.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="456" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdUqQISJWD_53bcbNuRjJgXJwlHYJkpm5JtXqUaUN13IdaHC3fEwOMkuNodSwhq3ffeeLqkpKcOzvOjhG549wa3BVYO9jjOxM2-ttBP-p-FH9E6JsqsP7fcj4Ukd0jkWBIawKYd7N7nh0/s640/5.png" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkuOA16PWxi62CmhQO7CGsvoDT-adjj2YdPpPMoxC4B-UQOYStdUoET0vzSKwOuXLbZ0KE3llRDzEIK8oU7YPAa_B1dr2ew2aLkNGWfoS7z99R2KbO1occX3q5h24uRkpQ4BHkGkyHQLY/s1600/6.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="460" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkuOA16PWxi62CmhQO7CGsvoDT-adjj2YdPpPMoxC4B-UQOYStdUoET0vzSKwOuXLbZ0KE3llRDzEIK8oU7YPAa_B1dr2ew2aLkNGWfoS7z99R2KbO1occX3q5h24uRkpQ4BHkGkyHQLY/s640/6.png" width="640" /></a></div>
<br />
By default, the BI Connector (running on a machine with IP address 192.168.1.174 in my case) listens on port 27032. Before hitting the Save button, I hit the Test button to ensure that the Windows client could make a successful ODBC connection to the BI Connector.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4NnNfBFvOZuD-LbhFETOTGw4NUQwN_zAkMu8LFOp5qs9T4diWB517itQcmWdmvAFAKLed6Sj9LGapw8-CB4H6flsR_JWGo4R_z51ls_mz86COcNN_8cx636mD2D1YhdT9GPtLo3XX9LA/s1600/7.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="460" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4NnNfBFvOZuD-LbhFETOTGw4NUQwN_zAkMu8LFOp5qs9T4diWB517itQcmWdmvAFAKLed6Sj9LGapw8-CB4H6flsR_JWGo4R_z51ls_mz86COcNN_8cx636mD2D1YhdT9GPtLo3XX9LA/s640/7.png" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhuqYVyS-rWQulyJSQptS_scpco-XPN25Kabb-SbWCOjWZ7uQzXkXZXUJ8MsC5Ert-lEnto4Hxxwfn1P_8dJkGsyCGWDTnZxFSOUUlqWS2PsUFWE7JkjC-bi5Qx7J3JQNprMthLL0U5xDM/s1600/8.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="459" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhuqYVyS-rWQulyJSQptS_scpco-XPN25Kabb-SbWCOjWZ7uQzXkXZXUJ8MsC5Ert-lEnto4Hxxwfn1P_8dJkGsyCGWDTnZxFSOUUlqWS2PsUFWE7JkjC-bi5Qx7J3JQNprMthLL0U5xDM/s640/8.png" width="640" /></a></div>
<br />
With the new ODBC Data Source now configured (shown in the screenshot above), I then launched Microsoft Excel so that I could use this new data source to explore the MOT test data.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZS6zyZ6cGcpbj7xQQlo8kbpTiXjo_5LXHSJB-JBibwzlNqXk1_71Kte31Km0OG3uZ_iEstO6o9WWC6S2YTJhGMGbE0YAcrA6KP6b5OehrjIXD1N3oMYTTYSZDk6mboz9KOcFxuXIMNqw/s1600/10.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="465" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZS6zyZ6cGcpbj7xQQlo8kbpTiXjo_5LXHSJB-JBibwzlNqXk1_71Kte31Km0OG3uZ_iEstO6o9WWC6S2YTJhGMGbE0YAcrA6KP6b5OehrjIXD1N3oMYTTYSZDk6mboz9KOcFxuXIMNqw/s640/10.png" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJG0Q2_IEnb_YpFx08di8Gk-zmXlN-Te6mcH6y1HLVlahPbPrfLS22fVg2Cz2SOrZCquKQGqVcsfPEP4pivwWal9FcXcZuziwDZ16nlJko5X36RJgcuJxYuftN0pQdwb4Di-iLsi19cLE/s1600/11.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="348" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJG0Q2_IEnb_YpFx08di8Gk-zmXlN-Te6mcH6y1HLVlahPbPrfLS22fVg2Cz2SOrZCquKQGqVcsfPEP4pivwWal9FcXcZuziwDZ16nlJko5X36RJgcuJxYuftN0pQdwb4Di-iLsi19cLE/s640/11.png" width="640" /></a></div>
<br />
Excel's standard query wizard was able to use the ODBC data source to discover the MongoDB collection's "schema". I chose to include all the "fields" in the query.<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoQQCOonszpeCPKMTekYhdkO4G4rcXSJnhVDVMsjdkOu0uhuq9rhy_oXowFgOhdecjf9HAOCLhMRkKRmBytw7fDdVqZt66KBZfWJ3Z45BQtslv9S-lDe35DBGwIOkL28G8XWuk3ax0oKQ/s1600/12.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="412" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoQQCOonszpeCPKMTekYhdkO4G4rcXSJnhVDVMsjdkOu0uhuq9rhy_oXowFgOhdecjf9HAOCLhMRkKRmBytw7fDdVqZt66KBZfWJ3Z45BQtslv9S-lDe35DBGwIOkL28G8XWuk3ax0oKQ/s640/12.png" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1qgFHL17JhP-bwYFypOClc1FK5skhexbr9EGIC150dmM0JSdcfqsngCiiHOpVMfs8N1dRiVpY_QRxuBC1KoBJ_Wk8evypDfIMj70q0sK10-rv1Ey4f6Z0Vgt3dQQBAhq2rkGNbrCbDbQ/s1600/13.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="408" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1qgFHL17JhP-bwYFypOClc1FK5skhexbr9EGIC150dmM0JSdcfqsngCiiHOpVMfs8N1dRiVpY_QRxuBC1KoBJ_Wk8evypDfIMj70q0sK10-rv1Ey4f6Z0Vgt3dQQBAhq2rkGNbrCbDbQ/s640/13.png" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEicfdU2ZVAWwVlgkvV22-ZesrT9kVgw9g8Nonc4TIkdEWlTnxUgj0CxzbjLgAyrtE6YXAFOXivadcM0CpMI9En20CQ9Ixb0TRZuUju4LJdB5DJULGJmrT5aoQOw5eQiXFOw-VpXji_AGJs/s1600/14.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="403" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEicfdU2ZVAWwVlgkvV22-ZesrT9kVgw9g8Nonc4TIkdEWlTnxUgj0CxzbjLgAyrtE6YXAFOXivadcM0CpMI9En20CQ9Ixb0TRZuUju4LJdB5DJULGJmrT5aoQOw5eQiXFOw-VpXji_AGJs/s640/14.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
I thought it would be useful to ask for the MOT test results to be ordered by Test Year, followed by Car Make, followed by Car Model.</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUh7HnWmDo4sSmYm0t25uzKxjh7EjPNuGNdBrOMxtg_SeRkObsckLjZcR2PM4orRoA7irol26jmLqYcUs03RCKh4wfU5VYyly7wjvg-QkVPx-coyvbLwhfDC0TnyVB2rqfACFycpPpePM/s1600/15.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="392" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUh7HnWmDo4sSmYm0t25uzKxjh7EjPNuGNdBrOMxtg_SeRkObsckLjZcR2PM4orRoA7irol26jmLqYcUs03RCKh4wfU5VYyly7wjvg-QkVPx-coyvbLwhfDC0TnyVB2rqfACFycpPpePM/s640/15.png" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgipzjCWYUn5P6tksPjGJrJwt-Tv1XrfuhEUmb8aPnejqIROzL_v08gKARu_g9cme6fTbOW0zrz25xLyUwaCLjSH_QjUgAbZ5fq-4ZBDS_mlNLQQ7jZ8kUnKyYwzEsycT8lB1byppcTnYA/s1600/16.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="409" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgipzjCWYUn5P6tksPjGJrJwt-Tv1XrfuhEUmb8aPnejqIROzL_v08gKARu_g9cme6fTbOW0zrz25xLyUwaCLjSH_QjUgAbZ5fq-4ZBDS_mlNLQQ7jZ8kUnKyYwzEsycT8lB1byppcTnYA/s640/16.png" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNADRinyRWPRZlgv-h3UfU3lOlgGIwVK3PfpO3m3wBV-TjpmDDbBnDoCOIXLyvlsWit5p1ai_n4FAOoaeKbnA8kWHbwftU2GUyzxKC95WwAYFNCUi7s2G7SxuJhprrsGCXuJr8l4-iIUo/s1600/17.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="563" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgNADRinyRWPRZlgv-h3UfU3lOlgGIwVK3PfpO3m3wBV-TjpmDDbBnDoCOIXLyvlsWit5p1ai_n4FAOoaeKbnA8kWHbwftU2GUyzxKC95WwAYFNCUi7s2G7SxuJhprrsGCXuJr8l4-iIUo/s640/17.png" width="640" /></a></div>
<br />
Finally, upon pressing OK, Excel presented me with the results of the SQL/ODBC query, run directly against the MOT test data, sourced from the MongoDB collection.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSnC3uQgsAa9wrbSuR1dVt1sZSLUcYduQO9X0ctOoKEXb1cfJM05sXgno5ed4SSbqA5CjX0kRrxNzZVbDvw6jLzvRh8Z2v74nqnbC9x9RHGylejzR0j5GuNX-4nKMDImiI4q0Rzp4WHoE/s1600/18.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="457" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSnC3uQgsAa9wrbSuR1dVt1sZSLUcYduQO9X0ctOoKEXb1cfJM05sXgno5ed4SSbqA5CjX0kRrxNzZVbDvw6jLzvRh8Z2v74nqnbC9x9RHGylejzR0j5GuNX-4nKMDImiI4q0Rzp4WHoE/s640/18.png" width="640" /></a></div>
<br />
Excel then gave me many options, including settings for whether to periodically refresh the data from the source, and how often. I was also able to open Microsoft's built-in Query Builder tool to modify the query and execute it again.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiR0USe1dq7F82X7_Z2O1VCmtxFKK8kIjPTNucdb8LnxU-G8Q5zmRhRtpvHTQkCrB_UVtpPKY-qkQ1A5XlYYN1Ep37rFg9MewvXRVyxIRLLzi8Wlem3IHzcDu7vEmNElrS8WMcsgTvIh30/s1600/19.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="433" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiR0USe1dq7F82X7_Z2O1VCmtxFKK8kIjPTNucdb8LnxU-G8Q5zmRhRtpvHTQkCrB_UVtpPKY-qkQ1A5XlYYN1Ep37rFg9MewvXRVyxIRLLzi8Wlem3IHzcDu7vEmNElrS8WMcsgTvIh30/s640/19.png" width="640" /></a></div>
<br />
That's pretty much it. It's straightforward to configure a Windows client to access data from MongoDB, via MongoDB's new BI Connector, using ODBC.<br />
<br />
<br />
<i>Song for today: Dust and Disquiet by <a href="https://en.wikipedia.org/wiki/Caspian_(band)">Caspian</a></i>Paul Donehttp://www.blogger.com/profile/09556312012162376804noreply@blogger.com2tag:blogger.com,1999:blog-1304066656993695443.post-54075288977672540062014-09-26T10:10:00.000+01:002020-03-01T21:09:25.874+00:00Tracking Versions in MongoDBI'm honoured to have been asked by <a href="http://askasya.com/">Asya</a> to contribute a <a href="http://askasya.com/post/revisitversions">guest post</a> to her series on Tracking Versions in MongoDB.<br />
<br />
Here's Asya's full series on the topic, which I recommend reading in order:<br />
<br />
<ol>
<li><a href="http://askasya.com/post/trackversions">How to Track Versions with MongoDB</a></li>
<li><a href="http://askasya.com/post/mergeshapes">How to Merge Shapes with Aggregation Framework</a></li>
<li><a href="http://askasya.com/post/bestversion">Best Versions with MongoDB</a></li>
<li><a href="http://askasya.com/post/revisitversions">Further Thoughts on How to Track Versions with MongoDB</a> (my guest post)</li>
</ol>
<br />
<br />
<br />
<i>Song for today: Someday I Will Treat You Good by <a href="http://en.wikipedia.org/wiki/Sparklehorse">Sparklehorse</a></i>Paul Donehttp://www.blogger.com/profile/09556312012162376804noreply@blogger.com0tag:blogger.com,1999:blog-1304066656993695443.post-42084286479357419632014-09-11T11:54:00.002+01:002020-03-01T21:14:14.413+00:00Java Using SSL To Connect to MongoDB With Access Control<br />
In this post I've documented the typical steps needed to enable a Java application to connect to a MongoDB database over SSL. Specifically, I show the so-called "one-way SSL" pattern, where the server is required to present a certificate to the client but the client is not required to present a certificate to the server. Regardless of authentication, communication between client and server is encrypted in both directions. In this example, the client does authenticate with the database, albeit using a username/password rather than presenting a certificate. In addition, I show how the client application can run operations against a database that has access control rules defined for it.<br />
<br />
<b>Note</b>: Ensure you are running the <a href="http://www.mongodb.com/subscription/downloads">Enterprise version of MongoDB</a> to be able to configure SSL.<br />
<br />
<br />
<h3>
1. Generate Key+Certificate and Configure With MongoDB For SSL</h3>
<br />
Create a key and certificate:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ su -</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ cd /etc/ssl/</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ openssl req -new -x509 -days 365 -nodes -out mongodb-cert.crt -keyout mongodb-cert.key</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ cat mongodb-cert.key mongodb-cert.crt > mongodb.pem</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ exit</span><br />
<br />
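Optionally, sanity-check the generated certificate's subject and validity dates before wiring it into MongoDB:<br />
<br />
<span style="font-family: courier new, courier, monospace;">$ openssl x509 -in mongodb-cert.crt -noout -subject -dates</span><br />
<br />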
Add the following entries to <a href="http://docs.mongodb.org/manual/reference/configuration-options/">mongodb.conf</a> (or equivalent if <a href="http://docs.mongodb.org/manual/reference/program/mongod/">setting command line options</a>):<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">sslOnNormalPorts=true</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">sslPEMKeyFile=/etc/ssl/mongodb.pem</span><br />
<br />
Start the <i>mongod</i> database server.<br />
<br />
Test the SSL connection from the Mongo shell.<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ mongo --ssl</span><br />
<div>
<br />
<br /></div>
<h3>
2. Configure Java Client For SSL</h3>
<br />
Create a new Java trust store in the local application's root directory, importing the certificate generated in the previous section. This is necessary because this example uses a self-signed certificate.<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ keytool -import -alias "MongoDB-cert" -file "/etc/ssl/mongodb-cert.crt" -keystore truststore.ts -noprompt -storepass "mypasswd"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">$ keytool -list -keystore truststore.ts -storepass mypasswd</span><br />
<br />
In the client application's Java code, add the "ssl=true" parameter to the MongoDB URI to tell it to use an SSL connection, eg:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">MongoClientURI uri = new MongoClientURI("mongodb://localhost:27017/test<b>?ssl=true</b>");</span><br />
<br />
Modify the command line or script that runs the 'java' executable for the client application, adding the following JVM command-line property. This allows the application to validate the server's certificate against the new trust store when using SSL.<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">-Djavax.net.ssl.trustStore=truststore.ts</span><br />
<br />
Run the Java client application to test that the SSL connection works.<br />
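For illustration, a minimal client for this smoke test might look like the following (a sketch using the 2.x Java driver API of that era; the class name is my own invention):<br />
<br />
<span style="font-family: courier new, courier, monospace;">import com.mongodb.DB;</span><br />
<span style="font-family: courier new, courier, monospace;">import com.mongodb.MongoClient;</span><br />
<span style="font-family: courier new, courier, monospace;">import com.mongodb.MongoClientURI;</span><br />
<br />
<span style="font-family: courier new, courier, monospace;">public class SslSmokeTest {</span><br />
<span style="font-family: courier new, courier, monospace;">    public static void main(String[] args) throws Exception {</span><br />
<span style="font-family: courier new, courier, monospace;">        // "ssl=true" makes the driver open an SSL connection; the JVM trusts our</span><br />
<span style="font-family: courier new, courier, monospace;">        // self-signed certificate via -Djavax.net.ssl.trustStore=truststore.ts</span><br />
<span style="font-family: courier new, courier, monospace;">        MongoClient client = new MongoClient(</span><br />
<span style="font-family: courier new, courier, monospace;">                new MongoClientURI("mongodb://localhost:27017/test?ssl=true"));</span><br />
<span style="font-family: courier new, courier, monospace;">        try {</span><br />
<span style="font-family: courier new, courier, monospace;">            DB db = client.getDB("test");</span><br />
<span style="font-family: courier new, courier, monospace;">            // forces a server round trip, proving the SSL handshake succeeded</span><br />
<span style="font-family: courier new, courier, monospace;">            System.out.println(db.getCollectionNames());</span><br />
<span style="font-family: courier new, courier, monospace;">        } finally {</span><br />
<span style="font-family: courier new, courier, monospace;">            client.close();</span><br />
<span style="font-family: courier new, courier, monospace;">        }</span><br />
<span style="font-family: courier new, courier, monospace;">    }</span><br />
<span style="font-family: courier new, courier, monospace;">}</span><br />
<br />
<span style="font-family: courier new, courier, monospace;">$ java -Djavax.net.ssl.trustStore=truststore.ts SslSmokeTest</span><br />
<br />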
<br />
<br />
<h3>
3. Configure MongoDB database with Access Control Rules</h3>
<br />
Using the Mongo shell, connect to the database and define an 'administrator' user, plus a 'regular' user with an access control rule granting it read/write access to a database (called 'test' in this example).<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ mongo --ssl</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">> use admin</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">> db.addUser({user: "admin", pwd: "admin", roles: [ "userAdminAnyDatabase"]})</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">> use test</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">> db.addUser({user: "paul", pwd: "password", roles: ["readWrite", "dbAdmin"]})</span><br />
<br />
Add the following entry to <a href="http://docs.mongodb.org/manual/reference/configuration-options/">mongodb.conf</a> (or equivalent if <a href="http://docs.mongodb.org/manual/reference/program/mongod/">setting command line options</a>):<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">auth=true</span><br />
<br />
Restart the <i>mongod</i> database server to pick up this change.<br />
<br />
Test the SSL connection to the 'test' database with username/password authentication, from the Mongo shell.<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ mongo test --ssl -u paul -p password</span><br />
<br />
If the specified user doesn't have permission to read collections in the database, an error similar to the one below will occur when trying to run <i>db.mycollection.find()</i>, for example.<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">error: { "$err" : "not authorized for query on test.mycollection", "code" : 13 }</span><br />
<br />
<br />
<h3>
4. Configure Java Client To Authenticate</h3>
<br />
Modify the client application code to include the username and password in the URL, eg:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">MongoClientURI uri = new MongoClientURI("mongodb://<b>paul:password@</b>localhost:27017/test?ssl=true");</span><br />
<br />
Run the Java client application to test that using SSL with username/password authentication works and has the rights to access the sample database.<br />
<br />
If the user specified doesn't have correct permissions, an exception similar to below will occur.<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">Exception in thread "main" com.mongodb.MongoException: not authorized for query on test.system.namespaces</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> at com.mongodb.MongoException.parse(MongoException.java:82)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> at com.mongodb.DBApiLayer$MyCollection.__find(DBApiLayer.java:292)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> at com.mongodb.DBApiLayer$MyCollection.__find(DBApiLayer.java:273)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> at com.mongodb.DB.getCollectionNames(DB.java:400)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> at MongoTest.main(MongoTest.java:25)</span><br />
<br />
<br />
<b>Note: </b>In a real Java application, you would invariably build the <a href="http://api.mongodb.org/java/current/com/mongodb/MongoClientURI.html">URL</a> <a href="http://api.mongodb.org/java/current/com/mongodb/MongoCredential.html">dynamically</a> to avoid hard-coding a username and password in clear text in code.<br />
<div>
<br /></div>
<br />
<br />
<br />
<i>Song for today: My Sister in 94 by <a href="http://en.wikipedia.org/wiki/The_Paradise_Motel">The Paradise Motel</a></i>Paul Donehttp://www.blogger.com/profile/09556312012162376804noreply@blogger.com2tag:blogger.com,1999:blog-1304066656993695443.post-47227772523827972652014-05-02T14:45:00.001+01:002020-03-01T21:18:20.287+00:00MongoDB Connector for Hadoop with Authentication - Quick Tip<br />
If you are using the <a href="http://docs.mongodb.org/ecosystem/tools/hadoop">MongoDB Connector for Hadoop</a> and you have enabled authentication on your MongoDB database (eg. <i><b>auth=true</b></i>), you may find that you are prevented from getting data into or out of the database.<br />
<br />
You may have provided the username and password to the connector (eg. <i>mongo.input.uri = "mongodb://<b>myuser:mypassword@</b>host:27017/mytestdb.mycollctn"</i>) for a Hadoop job that pulls data from the database. The connector will authenticate to the database successfully, but early in the job run, the job will fail with an error message similar to the following:<br />
<br />
<span style="font-family: courier new, courier, monospace; font-size: x-small;">14/05/02 13:17:01 ERROR util.MongoTool: Exception while executing job...</span><br />
<span style="font-family: courier new, courier, monospace; font-size: x-small;">java.io.IOException: com.mongodb.hadoop.splitter.<b>SplitFailedException</b>: <b>Unable to calculate input splits</b>: need to login</span><br />
<span style="font-family: courier new, courier, monospace; font-size: x-small;">        at com.mongodb.hadoop.MongoInputFormat.getSplits(MongoInputFormat.java:53)</span><br />
<span style="font-family: courier new, courier, monospace; font-size: x-small;">        at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:493)</span><br />
<span style="font-family: courier new, courier, monospace; font-size: x-small;">        at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:510)</span><br />
<span style="font-family: courier new, courier, monospace; font-size: x-small;">        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)</span><br />
<span style="font-family: courier new, courier, monospace; font-size: x-small;">        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1295)</span><br />
<span style="font-family: courier new, courier, monospace; font-size: x-small;">......</span><br />
<br />
This is because the connector needs to run the MongoDB-internal <a href="http://docs.mongodb.org/manual/reference/command/splitVector">splitVector</a> DB command, under the covers, to work out how to split the MongoDB data into sections ready to distribute across the Hadoop cluster. However, by default, you are unlikely to have granted the connector's user sufficient privileges to run this DB command. You can simulate the issue easily by opening a mongo shell against the database, authenticating with your username and password, and then running the splitVector command manually. For example:<br />
<br />
<span style="font-family: courier new, courier, monospace;">> var result = db.runCommand({<b>splitVector</b>: 'mytestdb.mycollctn', keyPattern: {_id: 1}, maxChunkSizeBytes: 32000000})</span><br />
<span style="font-family: courier new, courier, monospace;">> result</span><br />
<span style="font-family: courier new, courier, monospace;">{</span><br />
<span style="font-family: courier new, courier, monospace;">    "ok" : 0,</span><br />
<span style="font-family: courier new, courier, monospace;">    "errmsg" : "<b>not authorized on mytestdb to execute command</b> { splitVector: \"mytestdb.mycollctn\", keyPattern: { _id: 1.0 }, maxChunkSizeBytes: 32000000.0 }",</span><br />
<span style="font-family: courier new, courier, monospace;">    "code" : 13</span><br />
<span style="font-family: courier new, courier, monospace;">}</span><br />
<br />
To address this issue, you first need to use the mongo shell, authenticated as your administration user, and run the <b>updateUser</b> command to give the connector's user the <b>clusterManager</b> role, enabling the connector to run the DB commands it requires. For example:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">use mytestdb</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">db.<b>updateUser</b>("myuser", {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> roles : [</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> { role: "readWrite", db: "mytestdb" },</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> { role : "<b>clusterManager</b>", db : "<b>admin</b>" }</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> ]</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> })</span><br />
<br />
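You can verify the fix directly from the mongo shell, re-authenticated as the connector's user; the splitVector command that failed earlier should now report success:<br />
<br />
<span style="font-family: courier new, courier, monospace;">> db.runCommand({splitVector: 'mytestdb.mycollctn', keyPattern: {_id: 1}, maxChunkSizeBytes: 32000000}).ok</span><br />
<span style="font-family: courier new, courier, monospace;">1</span><br />
<br />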
After this, your Hadoop jobs with the connector should run fine.<br />
<br />
<span style="font-size: x-small;"><br /></span>
<span style="font-size: x-small;">Note: In my test, I ran Cloudera CDH version 5, MongoDB version 2.6 and Connector version 1.2 (built with target set to 'cdh4').</span><br />
<br />
<br />
<i>Song for today: Spanish Sahara by <a href="http://en.wikipedia.org/wiki/Foals">Foals</a></i>Paul Donehttp://www.blogger.com/profile/09556312012162376804noreply@blogger.com2