Sunday, October 4, 2020

Rust & MongoDB - Perfect Bedfellows

I've been learning Rust over the last month or so and I'm really enjoying it. It's a really elegant and flexible programming language despite being the most strongly typed and compile-time strict programming language I've ever used (bearing in mind I used to be a professional C & C++ developer way back in the day). 

I'd recently read the really good and commonly referenced blog post Creating a REST API in Rust with warp, which shows how to create a simple example Groceries stock management REST API service, and which uses an in-memory HashMap as its backing store. As part of my learning I thought I'd have a go at porting this to use MongoDB as its data store instead, using the fairly new MongoDB Rust Driver.

It turned out that this was really easy to do, helped in large part by how well engineered the new MongoDB Rust Driver is, with its rich yet easy-to-use API.

You can see my resulting MongoDB version of this sample Groceries application in the GitHub project rust-groceries-mongo-api I created. Check out that project link to view the source code showing how MongoDB was integrated into the Groceries REST API and how to test the application using a REST client.

What was even more surprising was how easy it was to integrate MongoDB's flexible data model with a programming language as strict as Rust, and I encountered no friction between the two at all. In fact, this was even easier to achieve by leveraging the option of using the driver team's additional contribution of BSON translation to the open source Rust Serde framework, which makes it easy to serialize/deserialize Rust data structures to/from other formats (e.g. JSON, Avro and now BSON).

I plan to blog again in the future, in more detail, about how to combine Rust's strict typing and MongoDB's flexible schema, especially when the data model and consuming microservices inevitably change over time.

Sunday, May 3, 2020

Converting Gnarly Date Strings to Proper Date Types Using a MongoDB Aggregation Pipeline

I recently received some example bank payments data in a CSV file which had been exported from a relational database with that database's default export settings. After using mongoimport to import this data 'as-is', into a MongoDB database, I noticed that there was a particularly gnarly date string field in each record. For example:
  • 23-NOV-20 22.57.36.827000000
Why do I say gnarly? Well if you lived through Y2K you should be horrified by the 'year' field shown above. How would you know from the data, without any context, what century this applies to? Is it 1920? Is it 2020? Is it 2120? There's no way of knowing from just the exported data alone. Also, there is no indication of which time zone this applies to. Is it British Summer Time? Is it Eastern Daylight Time? Who knows? Also the month element appears to be an abbreviation of a month expressed in a specific spoken language. Which spoken language?

I needed to get this into a proper Date type in MongoDB so I could then easily index it, perform date range queries natively, sort by date natively, etc. My usual tool of choice for this is MongoDB's Aggregation pipeline, used to generate a new collection from the existing collection with the 'date' string fields converted to proper date type fields. To perform the string to date conversion, the usual operator of choice is $dateFromString (introduced in MongoDB 3.6).

However, $dateFromString [rightly] expects an input string which isn't missing crucial date related text, indicating things like the century or timezone. Also, the $dateFromString operator contains no format specifiers to indicate that the text 'NOV' maps to the 11th month of a year in a specific spoken language.

Therefore, armed with the extra context of knowing this exported data refers to dates in the 21st century (the '2000s') with a UTC 'time zone' and in the English language (only inferred by asking the owner of the data), I had to perform some additional string manipulation in the aggregation pipeline before using $dateFromString to generate a true and accurate date type. The rest of this blog post shows how I achieved this for date strings like '23-NOV-20'.

Converting Incomplete Date Strings to Date Types Example

In the Mongo Shell targeting a running MongoDB test database, run the following code to insert 12 sample 'payment' records, with example 'bad date string' fields for testing each month of a sample year.

use test;
db.rawpayments.insertMany([
  {'account_id': '010101', 'pymntdate': '01-JAN-20 01.01.01.123000000', 'amount': 1.01},
  {'account_id': '020202', 'pymntdate': '02-FEB-20 02.02.02.456000000', 'amount': 2.02},
  {'account_id': '030303', 'pymntdate': '03-MAR-20 03.03.03.789000000', 'amount': 3.03},
  {'account_id': '040404', 'pymntdate': '04-APR-20 04.04.04.012000000', 'amount': 4.04},
  {'account_id': '050505', 'pymntdate': '05-MAY-20 05.05.05.345000000', 'amount': 5.05},
  {'account_id': '060606', 'pymntdate': '06-JUN-20 06.06.06.678000000', 'amount': 6.06},
  {'account_id': '070707', 'pymntdate': '07-JUL-20 07.07.07.901000000', 'amount': 7.07},
  {'account_id': '080808', 'pymntdate': '08-AUG-20 08.08.08.234000000', 'amount': 8.08},
  {'account_id': '090909', 'pymntdate': '09-SEP-20 09.09.09.567000000', 'amount': 9.09},
  {'account_id': '101010', 'pymntdate': '10-OCT-20 10.10.10.890000000', 'amount': 10.10},
  {'account_id': '111111', 'pymntdate': '11-NOV-20 11.11.11.111000000', 'amount': 11.11},
  {'account_id': '121212', 'pymntdate': '12-DEC-20 12.12.12.999000000', 'amount': 12.12}
]);

Then execute the following Aggregation pipeline to copy the contents of the 'rawpayments' collection, populated above, into a new collection named 'payments', but with the 'pymntdate' field values converted from string types to date types.

db.rawpayments.aggregate([
  {$set: {
    pymntdate: {
      $dateFromString: {format: '%d-%m-%Y %H.%M.%S.%L', dateString:
        {$concat: [
          {$substrCP: ['$pymntdate', 0, 3]},  // USE FIRST 3 CHARS IN DATE STRING
          {$switch: {branches: [  // REPLACE MONTH 3 CHARS IN DATE STRING WITH 2 DIGIT MONTH
            {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'JAN']}, then: '01'},
            {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'FEB']}, then: '02'},
            {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'MAR']}, then: '03'},
            {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'APR']}, then: '04'},
            {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'MAY']}, then: '05'},
            {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'JUN']}, then: '06'},
            {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'JUL']}, then: '07'},
            {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'AUG']}, then: '08'},
            {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'SEP']}, then: '09'},
            {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'OCT']}, then: '10'},
            {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'NOV']}, then: '11'},
            {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'DEC']}, then: '12'},
          ], default: 'ERROR'}},
          '-20',  // ADD HYPHEN + HARD-CODED CENTURY 2 DIGITS
          {$substrCP: ['$pymntdate', 7, 15]}  // USE NEXT 15 CHARS (YEAR + TIME), IGNORING LAST 6 NANOSECOND DIGITS
        ]}
      }
    }
  }},
  {$out: 'payments'}
]);

In this pipeline, the string '23-NOV-20 22.57.36.827000000' will be converted to ISODate("2020-11-23T22:57:36.827Z") by concatenating the following four elements of text together before passing the result to the $dateFromString operator to convert to a date:
  1. '23-' (from the input string)
  2. '11' (replacing 'NOV')
  3. '-20' (hard-coded hyphen + century)
  4. '20 22.57.36.827' (the rest of the input string apart from the last 6 nanosecond digits)
Note: A $set stage is used in this pipeline, which is a type of stage first introduced in MongoDB 4.2. $set is an alias for $addFields, so if using an earlier version of MongoDB, replace $set with $addFields in the pipeline.
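For anyone who finds the string manipulation easier to follow outside of a pipeline, here is a rough plain JavaScript sketch of the same four-part concatenation. The function name convertGnarlyDate is hypothetical (it is not part of MongoDB), and it assumes the same things as the pipeline: input shaped like 'DD-MON-YY HH.MM.SS' followed by 9 fractional digits, English month abbreviations, the 21st century and UTC.

```javascript
// Month abbreviations are assumed to be English, as confirmed by the data owner.
const MONTHS = {
  JAN: '01', FEB: '02', MAR: '03', APR: '04', MAY: '05', JUN: '06',
  JUL: '07', AUG: '08', SEP: '09', OCT: '10', NOV: '11', DEC: '12',
};

// Hypothetical helper mirroring the pipeline's logic in plain JavaScript.
// Assumes input like 'DD-MON-YY HH.MM.SS.nnnnnnnnn', 21st century, UTC.
function convertGnarlyDate(raw) {
  const day = raw.substring(0, 2);
  const month = MONTHS[raw.substring(3, 6)];
  if (!month) {
    throw new Error(`Unrecognised month in date string: ${raw}`);
  }
  const year = '20' + raw.substring(7, 9);  // hard-coded century
  const [hh, mm, ss, fraction] = raw.substring(10).split('.');
  const millis = fraction.substring(0, 3);  // drop the last 6 nanosecond digits
  return new Date(`${year}-${month}-${day}T${hh}:${mm}:${ss}.${millis}Z`);
}

console.log(convertGnarlyDate('23-NOV-20 22.57.36.827000000').toISOString());
// prints 2020-11-23T22:57:36.827Z
```

Unlike the pipeline, this runs client-side one record at a time, so it is only meant to illustrate the logic rather than replace the aggregation.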

To see what the converted records look like, containing new date types, query the new collection:

db.payments.find({}, {_id:0});

Which will show the following results:

{ "account_id" : "010101", "pymntdate" : ISODate("2020-01-01T01:01:01.123Z"), "amount" : 1.01 }
{ "account_id" : "020202", "pymntdate" : ISODate("2020-02-02T02:02:02.456Z"), "amount" : 2.02 }
{ "account_id" : "030303", "pymntdate" : ISODate("2020-03-03T03:03:03.789Z"), "amount" : 3.03 }
{ "account_id" : "040404", "pymntdate" : ISODate("2020-04-04T04:04:04.012Z"), "amount" : 4.04 }
{ "account_id" : "050505", "pymntdate" : ISODate("2020-05-05T05:05:05.345Z"), "amount" : 5.05 }
{ "account_id" : "060606", "pymntdate" : ISODate("2020-06-06T06:06:06.678Z"), "amount" : 6.06 }
{ "account_id" : "070707", "pymntdate" : ISODate("2020-07-07T07:07:07.901Z"), "amount" : 7.07 }
{ "account_id" : "080808", "pymntdate" : ISODate("2020-08-08T08:08:08.234Z"), "amount" : 8.08 }
{ "account_id" : "090909", "pymntdate" : ISODate("2020-09-09T09:09:09.567Z"), "amount" : 9.09 }
{ "account_id" : "101010", "pymntdate" : ISODate("2020-10-10T10:10:10.890Z"), "amount" : 10.1 }
{ "account_id" : "111111", "pymntdate" : ISODate("2020-11-11T11:11:11.111Z"), "amount" : 11.11 }
{ "account_id" : "121212", "pymntdate" : ISODate("2020-12-12T12:12:12.999Z"), "amount" : 12.12 }

Song for today: For Everything by The Murder Capital

Sunday, December 29, 2019

Running MongoDB on ChromeOS (via Crostini)

In my previous post I explored Linux application support in ChromeOS and Chromebooks (a.k.a. Crostini). Of course I was bound to try running MongoDB in this environment, which I found to work really well (for development purposes). Here are my notes on running a MongoDB database and tools on a Chromebook with Linux (beta) enabled:
  • In ChromeOS, launch the Terminal app (which opens a Shell inside the 'Penguin' Linux container inside the 'Termina' Linux VM)
  • Run the following commands which are documented in the MongoDB Manual page on installing MongoDB Enterprise on Debian (following the manual's tab instructions titled “Debian 9 "Stretch"”):
wget -qO - | sudo apt-key add -
echo "deb stretch/mongodb-enterprise/4.2 main" | sudo tee /etc/apt/sources.list.d/mongodb-enterprise.list
sudo apt-get update
sudo apt-get install -y mongodb-enterprise
  • Start a MongoDB database instance running:
mkdir ~/data
mongod --dbpath ~/data
  • Launch a second Terminal window and then run the Mongo Shell against this database and perform a quick database insert and query test:

  • Install Python 3 and the PIP Python package manager (using Anaconda) and then install the MongoDB Python driver (PyMongo):
bash Anaconda3-*
source ~/.bashrc
python --version
pip --version
pip install --user pymongo
  • Test PyMongo by running a small ‘payments data generator’ Python script pulled down from a GitHub repository (this should insert records into the MongoDB local database’s “fs.payments” collection; after letting it run for a minute, continuously inserting new records, press Ctrl-C to stop it):
git clone
cd PaymentsWriteReadConcerns/
./ -p 1
  • Download MongoDB Compass (use the Ubuntu 64-bit 14.04+ version), install and run it against the 'localhost' MongoDB database and inspect the contents of the “fs.payments” collection:
sudo apt install ./mongodb-compass_*_amd64.deb

Song for today: Sun. Tears. Red by Jambinai

My Notes on Linux Application Support in ChromeOS (a.k.a. Crostini)

These are my own rough notes from spending a few days studying Chrome OS and its Linux app support on an HP Chromebook 14* I got for free (retails for about £150) when I recently purchased a Google Pixel 4 Android mobile phone. I thought I’d share the notes in case they are of use to others. I’m sure some corrections are needed, so feedback is welcome.
 * released: 2019, model: db0003na, codename: careena, board: grunt

Some references to other articles that I used to bootstrap my knowledge:

Below are some screenshots showing the ChromeOS Settings section where “Linux (beta)” (a.k.a. Crostini) can be enabled and the Linux apps that are then installed by default (essentially just the GNOME Help application and the Terminal application, from which many other Linux apps can subsequently be installed):

Here is a diagram I put together to attempt to capture the architecture of Crostini in ChromeOS as I understand it (the rest of this document digs into the details behind some of these layers):

ChromeOS & Crostini

  • Under the covers, ChromeOS is based on Gentoo and the Portage package manager
  • crosh (ChromeOS Developer Shell) is the pluggable command line shell/terminal for ChromeOS (in the Chrome browser, enter Ctrl-Alt-T to launch crosh inside a browser tab)
  • Crostini is the term for Linux application support in ChromeOS. It manages the specific Linux VM and the specific Linux container inside it, including the lifecycle of when to launch them, mounting the filesystem to show the container’s files in the ChromeOS Files app, etc. Crostini provides easy to use Linux application support integrated directly into the running ChromeOS desktop, rather than, for example, needing to dual boot or having to run a separate Linux VM and needing to explicitly switch, via the desktop, between ChromeOS and the Linux VM.
  • ChromeOS also has a Developer mode (verification is disabled when the OS boots) which is a special mode built into all Chromebooks to allow users and developers to access the code behind the Chrome Operating System and load their own builds of ChromeOS. This mode also allows users to install and run another Linux system like Ubuntu instead of ChromeOS (i.e. dual boot), but still have ChromeOS available to boot into too
  • As an alternative to Crostini, in addition to the dual-boot option, developer mode can also be used for Crouton which is a set of scripts that bundle up a chroot generator/environment to run both ChromeOS and Ubuntu at the same time. Here a Linux OS runs alongside ChromeOS, so users can switch between the ChromeOS desktop and Linux desktops via a keyboard shortcut. This gives users the ability to take advantage of both environments without needing to reboot. Unlike with virtualisation, a second OS is not being booted and instead the guest OS is running using the Chromium OS system. As a result any performance penalty is reduced because everything is run natively, and RAM is not being wasted to boot two OSes at the same time. Note, Crostini is different than this Crouton capability, as it enables the Linux shell and apps to be brought into the platform in verified (non-developer) mode with seamless user interface desktop integration and multi-layered security, in a supported way.
  • To use Crostini, from the ChromeOS Settings select ‘Linux (Beta)’ and choose to enable it, which, behind the scenes, will download and configure a specific Linux VM containing a specific Linux Container (see the next sections for more details) and it adds a launcher group to the ChromeOS desktop called ‘Linux Apps’. This launcher group includes a launcher to run a Linux shell/terminal application, called Terminal, which is displayed in the ChromeOS desktop but is connected directly inside the container

Crostini Linux VM Layer

  • crosvm (ChromeOS Virtual Machine Monitor) is a custom virtual machine manager written in Rust that runs guest VMs via Linux's KVM hypervisor virtualisation layer and manages the low-level virtual I/O device communication (Amazon’s Firecracker is a fork of crosvm)
  • A specific VM is used to run a container rather than ChromeOS running a container directly, for security reasons because containers do not provide sufficient security isolation on their own. With the two layers, an adversary has to exploit crosvm via its limited interactions with the guest, in addition to the container, and the VM itself is heavily sandboxed.
  • The VM (and its container) are tied to a ChromeOS login session and as soon as a user logs out, all programs are shut down/killed by design (all user data lives in the user’s encrypted home to ensure nothing is leaked when a user logs out). The VM, container and their data are persisted across user sessions and are kept in the same per-user encrypted storage as the rest of the browser's data.
  • KVM generally (rather than Crostini specifically) can execute multiple virtual machines running unmodified Linux or Windows images. Each virtual machine has private virtualised hardware: a network card, disk, graphics adapter, etc. The kernel component of KVM is included in mainline Linux codebase and the userspace component of KVM is included in mainline QEMU codebase
  • Termina is the VM launched by crosvm and is based on a ChromeOS (CrOS) image with a stripped-down ChromeOS Linux kernel and userland tools. The main goal is to just boot up Termina as quickly as possible, as a secure sandbox, and start running containers.
  • Currently, other custom VMs (other Linux variants, Windows, etc) cannot be run and only instances of the Termina VM image can be booted, although multiple VM instances can be run simultaneously based on the Termina image
  • vmc is the crosh command line utility to manually manage custom VM instances via Concierge (the ChromeOS daemon that manages VM/container life cycles)
  • To view the registered VM(s) from crosh (Ctrl-Alt-T), which may or may not be running, run:
vmc list
  • To launch the Termina VM as a VM instance called ‘termina’ and open a shell directly in the VM, run:
vmc start termina
  • With the above command, the default container in the VM will not be started automatically. However, if a Linux Shell (Terminal) or other Linux App is launched from the ChromeOS desktop (or the ‘Linux files’ app, Files, is launched), the Termina VM is automatically launched and the default container it owns is also automatically started
  • If the Termina VM is already running, to connect to it via a shell, run:
vsh termina
  • If the ‘vmc start’ command is run with a different VM name, a new VM of that name will be created, launched and its shell entered from the existing terminal command line. This will use the same Termina image, and when running, ‘vmc list’ will list both VMs (the new instance doesn’t have any containers defined in it by default, ready to run, unlike the main Termina VM)
  • To stop the main Termina VM, run:
vmc stop termina

Crostini Container Layer

  • The Termina VM only supports running containers using the “Linux Containers” (LXC) technology at the moment and doesn’t support Docker or other container technologies
  • The default container instance launched via Termina is called Penguin and is based on Debian 9 with some custom packages
  • Containers are run inside a VM rather than programs running directly in the VM to help keep VM startup times low, to help improve security sandboxing by providing a stateless immutable VM image and to allow the container, its applications and their dependencies to be maintained independently from the VM, which otherwise may have conflicting dependency requirements
  • LXC, generally, works in the vanilla Linux kernel requiring no additional patches to be applied to the kernel source and uses various kernel features to contain processes including kernel namespaces (ipc, uts, mount, pid, network and user), Apparmor and SELinux profiles, Seccomp policies, chroots (using pivot_root), CGroups (control groups). LXCFS provides the userspace (FUSE) filesystem providing overlay files for cpuinfo, meminfo, stat and uptime plus a cgroupfs compatible tree allowing unprivileged writes.
  • LXD is a higher level container framework, which Crostini uses. LXD uses its own specific image formats and also provides the ability to manage containers remotely. Although LXD uses LXC under the covers, it is based on more than just LXC. The Termina VM is configured to run the LXD daemon. Confusingly, the command line tool for controlling LXD is called ‘lxc’ (the ‘LXD Client’). If users are using LXD commands to manage containers, they should avoid using any commands that start with ‘lxc-’ as these are lower level LXC commands, and they should avoid mixing and matching both sets of commands in the same system. Crostini uses LXD to launch the Penguin container and LXD is configured to only allow unprivileged containers to be run, for added security. Therefore with Crostini, users should not use the lower level ‘lxc-’ commands because these can’t manage the LXD derived containers that Crostini uses. By default, LXD comes with 3 remote repositories providing images: 1) ubuntu: (for stable Ubuntu images), 2) ubuntu-daily: (for daily Ubuntu images), and 3) images: (for other distros)
  • In the Termina VM, the full LXC/LXD capabilities are provided, and remote images for many types of distros can be used to spawn multiple containers, in addition to the main Penguin container (these are not tested or certified though so may or may not work correctly)
  • Sommelier (a Wayland proxy compositor that provides seamless X forwarding integration for content, input events, clipboard data, etc. between Linux apps and the ChromeOS desktop) and Garcon (a daemon for passing requests between the container and ChromeOS) binaries are bind-mounted into the main Penguin container. The Penguin container’s systemd is automatically configured to start these daemons. The libraries for these daemons are already present in the LXD image used for Penguin (‘google:debian/stretch’). Other LXD containers launched in the VM don't seem to be enabled for their X based GUI apps to be displayed in the ChromeOS desktop, even if they use the special ‘google:debian/stretch’ LXD container image, as it seems Crostini won’t attempt to integrate with them at runtime. Note: Some online articles imply it may be possible to get X-forwarding working from multiple containers.
  • In the Penguin container (which users can access directly, via the Terminal app launcher in the ChromeOS desktop), users can query the IP address of the container which is accessible from ChromeOS and can then run crosh (Ctrl-Alt-T) in ChromeOS and ping the IP address of the container directly. Users can also SSH from the ChromeOS desktop to the Penguin container using Google’s official SSH client that can be installed in Chrome via Chrome Web Store
  • If other containers are launched and then Google’s official SSH client is installed in ChromeOS (install ‘Secure Shell Extension’ via the Chrome Web Store), users can then define SFTP mount-points to other non-Penguin containers and the files in these containers will automatically appear in the Files app too 
  • From the Termina VM, users can use the standard LXD lxc command line tool to list containers and then to see if the Penguin container is running, by running:
lxc list
lxc info penguin | grep "Status: "

  • To check the logs for the Penguin container, run:
lxc info --show-log penguin
  • To open a command line shell as root in the running container (note, the Terminal app has a different identity for connecting to the Penguin container, which is a non-root user), run:
lxc exec penguin -- /bin/bash
  • Within the Penguin container you can run GUI apps which automatically display in the main ChromeOS user interface. For example to install the GEdit text editor Linux application run the following (which also adds a launcher for GEdit in the ChromeOS desktop ‘Linux Apps’ launcher group):
sudo apt install gedit

  • It is even possible to install and run a new Google Chrome browser installation from the Linux container, by running the following (which also adds a launcher for this Linux version of Chrome in the ChromeOS desktop ‘Linux Apps’ launcher group):
sudo apt install ./google-chrome-stable_current_amd64.deb

  • From crosh (Ctrl-Alt-T), it is also possible to start the main container in the main VM (if not already started) and then connect a shell directly to the main container in the main VM, by running:
vmc container termina penguin
vsh termina penguin

Playing with Custom Containers

  • First of all launch crosh (Ctrl-Alt-T), and connect a shell to the Termina VM:
vsh termina
  • Import Google’s own image repository into LXD to include the special Debian image used by Penguin:
lxc remote list
lxc remote add google --protocol=simplestreams
lxc remote list
lxc image list google:
lxc image info google:debian/stretch

  • Launch and test a container using Google’s special Debian 9 image:
lxc launch google:debian/stretch mycrosdebiancontainer
lxc list
lxc exec mycrosdebiancontainer -- /bin/bash
cat /etc/*elease*
apt update && apt upgrade -y

  • Launch and test a container using a standard Ubuntu 18.04 image:
lxc launch ubuntu:18.04 myubuntucontainer
lxc list
lxc exec myubuntucontainer -- /bin/bash
cat /etc/*elease*
apt update && apt upgrade -y

  • Launch and test a container using a standard Centos 7 image:
lxc launch images:centos/7 mycentoscontainer
lxc list
lxc exec mycentoscontainer -- /bin/bash
cat /etc/*elease*
yum -y update

  • If the Chromebook is rebooted and the Termina VM restarted, these 3 containers still exist as they are persisted, but they will be in a stopped state. When the containers are then manually restarted they will still have the same settings, files and modifications that were made before they were stopped. To start a stopped container run (example shown for one of the containers):
lxc start myubuntucontainer
  • None of the containers launched above seem to enable GUI apps (e.g. GEdit) to be forwarded automatically to the ChromeOS desktop. Even though the ‘google:debian/stretch’ based container has the relevant X forwarding libraries bundled, the Crostini framework doesn't seem to integrate with it automatically at runtime to enable X forwarding
  • Another way to launch a new container is to use one of the following commands, although, again, neither seems to automatically configure X-forwarding, even though they use the ‘google:debian/stretch’ image. It seems that only the Penguin container specifically is being managed by Crostini and has X forwarding configured (the first command below should be launched from ChromeOS crosh; the second command, which is deprecated, performs the same action but should be run from inside the Termina VM):
vmc container termina mycontainer --container_name=mycontainer --user=jdoe --shell

  • Note, this may throw a timeout error similar to below, but the containers do seem to be created ok:
Error: routine at frontends/ `container_create(vm_name,user_id_hash,container_name,image_server,image_alias)` failed: timeout while waiting for signal

Song for today: The Desert Song, No.2 - live by Sophia

Thursday, December 19, 2019

Some Tips for Diagnosing Client Connection Issues for MongoDB Atlas


   [UPDATE 07-Sep-2020: I've now written an executable binary tool you can run which performs the equivalent of the checks in this blog post to diagnose connectivity issues to Atlas or any other type of MongoDB deployment, downloadable from here]

By default, for recent MongoDB drivers and client tools, MongoDB Atlas advertises the exposed URL for a deployed database cluster using a service name which maps to a set of DNS SRV records to provide an initial connection seed list. This results in a much more 'human digestible' URL, but more importantly, increases deployment flexibility and the ability for underlying database server hosts to migrate over time, without needing to subsequently reconfigure clients.

For example, an Atlas Cluster may be referenced in a connection string by: an alternative to the full connection endpoint list:,,

It is worth noting though, whichever approach is used (explicitly defining all endpoints in the connection string or having it discovered via the DNS SRV service name), the connection URL seed list is only ever used for bootstrapping a client application to the database cluster, when the client first starts or when it later needs to restart. On start-up, the client uses the connection seed list to attempt to attach to any member of the cluster, and in fact, all but one of the endpoints could be incorrect and a successful cluster connection will still be achieved. Once the initial connection is made, the true cluster member endpoint list is dynamically and continuously shared between the cluster and the client at runtime. This enables the client to continue operating against the database even if the members of the database cluster change locations or identities over time. For example, after a year of a database cluster and application continuously running, there could be the need to increase database capacity by dynamically rotating the database hosts to new higher processing capacity machines. This all happens dynamically and the already running client application automatically becomes aware and leverages the new hosts without downtime and without needing to consult the connection string again. If the client application restarts though, it will need to read the updated connection string to be able to bootstrap a connection back up to the database cluster.
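To make the bootstrapping behaviour a little more concrete, here is a purely conceptual JavaScript sketch of the idea that only one seed needs to be reachable. It is not how any MongoDB driver is actually implemented; bootstrapFromSeeds, the tryConnect callback and all of the host names are hypothetical and exist only for illustration.

```javascript
// Conceptual sketch only: real drivers perform server discovery and monitoring
// in a far more involved way. `tryConnect` is a hypothetical callback returning
// the cluster's current member list, or null if the seed is unreachable.
function bootstrapFromSeeds(seedList, tryConnect) {
  for (const seed of seedList) {
    const members = tryConnect(seed);
    if (members) {
      // From here on, the cluster-reported members are used, not the seed list.
      return { connectedVia: seed, members };
    }
  }
  throw new Error('No seed in the connection string was reachable');
}

// Simulated cluster where only one of the three seeds is still valid.
const liveMembers = ['host-a:27017', 'host-b:27017', 'host-c:27017'];
const fakeTryConnect = (seed) => (seed === 'host-c:27017' ? liveMembers : null);

const bootstrapResult = bootstrapFromSeeds(
  ['stale-1:27017', 'stale-2:27017', 'host-c:27017'],
  fakeTryConnect
);
console.log(bootstrapResult.connectedVia, bootstrapResult.members.length);
// prints host-c:27017 3
```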

In the rest of this post we will explore some of the ways initial client connectivity issues can be diagnosed and resolved when using DNS SRV based connection URLs. For reference, Joe Drumgoole provides a great explanation about how DNS SRV records work more generally, and how MongoDB drivers and tools can leverage these.

Naive Connectivity Diagnosis

If you are having connection problems with Atlas when using the SRV service name based URL, be wary of drawing the wrong conclusions regarding the cause of the connection problem...

For example, let's say you can't connect an application to a cluster with the Atlas advertised URL of 'mongodb+srv://' from your laptop. You may be tempted to try to debug the connection problem by running some of the following commands from your laptop:

$ ping
ping: Name or service not known

$ nc -zv -w 5 27017
nc: getaddrinfo for host "" port 27017: Name or service not known

Neither of these works, even if you actually do have Atlas connectivity configured correctly. This is because "" is not the DNS name of a specific host endpoint. It is actually used by the MongoDB drivers and tools to dynamically look up the DNS SRV records which have been populated for a service called ''.

Useful Connectivity Diagnosis

As documented in the MongoDB Drivers specification document and the MongoDB Manual, a DNS SRV query is performed by the drivers/tools by prepending the text '_mongodb._tcp.' to the service name. Therefore, to look up the list of real endpoints for the Atlas cluster from your laptop using the DNS nslookup tool, you should run:

$ nslookup -q=SRV

Non-authoritative answer:
 service = 0 0 27017
 service = 0 0 27017
 service = 0 0 27017
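As a rough illustration of the naming rule the drivers apply, the following JavaScript sketch derives the DNS SRV query name from a 'mongodb+srv://' URL. The helper name srvQueryName and the host name testcluster.example.net are made up for this example.

```javascript
// Hypothetical helper: derive the DNS SRV query name from a 'mongodb+srv://' URL.
function srvQueryName(connectionUrl) {
  const prefix = 'mongodb+srv://';
  if (!connectionUrl.startsWith(prefix)) {
    throw new Error('Not a mongodb+srv:// URL');
  }
  // Drop any trailing path or query options.
  const authority = connectionUrl.slice(prefix.length).split(/[/?]/)[0];
  // Drop any 'user:password@' credentials section.
  const serviceName = authority.includes('@')
    ? authority.slice(authority.lastIndexOf('@') + 1)
    : authority;
  return '_mongodb._tcp.' + serviceName;
}

console.log(srvQueryName('mongodb+srv://main_user@testcluster.example.net/test'));
// prints _mongodb._tcp.testcluster.example.net
```

In Node.js, the resulting name could then be passed to dns.promises.resolveSrv() from the built-in dns module to fetch the same records that nslookup shows.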

You can see that, in this case, the database service name maps to 3 endpoints (i.e. the hosts of the 3 replica set members). You can then look up the actual IP address of any one of these endpoints if you desire:

$ nslookup

Non-authoritative answer: canonical name =

So to now debug your connectivity issue further you can use ping but this time by specifying one of the underlying host server endpoints for the database cluster:

$ ping -c 3
PING ( 56(84) bytes of data.
64 bytes from ( icmp_seq=1 ttl=51 time=10.2 ms
64 bytes from ( icmp_seq=2 ttl=51 time=9.73 ms
64 bytes from ( icmp_seq=3 ttl=51 time=11.7 ms

--- ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 9.739/10.586/11.735/0.850 ms

If this is successful it still doesn't necessarily mean that you can connect to the database service. The next thing to try is to see if you can actually open a socket connection to the mongod (or mongos) daemon process running on one of the endpoints, which you can achieve from your laptop using the netcat utility:

$ nc -zv -w 5 27017
nc: connect to port 27017 (tcp) timed out: Operation now in progress

If this doesn't connect but you are able to ping the endpoint host (as is the case in this example), it probably indicates that the IP address of your client laptop has not been added to the Atlas project's whitelist, which is easy to remedy via the Atlas Console:

Once your laptop has been added to the whitelist, running netcat again should demonstrate that a socket connection can now be successfully made:

$ nc -zv -w 5 27017
Connection to 27017 port [tcp/*] succeeded!

If this connects, then it is advisable to move on to trying to connect to the database via the Mongo Shell.
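If netcat isn't installed on your machine, roughly the same socket check can be sketched in a few lines of Python. The function name and the default values are my own choices (mirroring the port and 5 second timeout used with netcat above), so treat this as an assumption-laden stand-in rather than an official diagnostic tool.

```python
import socket


def can_connect(host: str, port: int = 27017, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        # create_connection resolves the host name and attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False
```

Calling can_connect() with one of the endpoint host names returned by the SRV lookup should return False before your IP address is whitelisted and True afterwards.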

In this example screenshot, the Atlas console suggests the following Mongo Shell command line to use to connect:

 mongo "mongodb+srv://" --username main_user

With this connection string, some of you may be wondering how the Shell knows to connect to Atlas over SSL/TLS, which replica-set name it should request and which authentication source database it should specify to locate the user's credentials.

Well, in addition to querying the DNS SRV records for the service, when dynamically constructing the initial bootstrap URL for the cluster, the MongoDB drivers/tools also look up a DNS TXT record for the service, which Atlas also populates for the deployed cluster. This TXT record contains the set of connection options to be added as parameters to the dynamically constructed connection string (e.g. 'ssl=true&replicaSet=TestCluster-shard-0&authSource=admin'). You can view what these parameter settings are for a particular Atlas cluster yourself, by running the following DNS query:

$ nslookup -q=TXT

Non-authoritative answer:  text = "authSource=admin&replicaSet=TestCluster-shard-0"

Note, the default behaviour for MongoDB drivers/tools using a 'mongodb+srv' based URL is to enable SSL/TLS for the connection. As a result, 'ssl=true' doesn't have to be included in the DNS TXT record, as shown in the example above, because the drivers/tools will automatically add this parameter to the connection string on the fly.
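The way these settings combine can be sketched in Python as follows. This is only an illustration of the behaviour described above: the function name is hypothetical, and the precedence shown (options given explicitly in the URL override those from the TXT record, which in turn sit on top of the implicit SSL/TLS default) is my understanding of the drivers specification rather than code taken from any driver.

```python
from urllib.parse import parse_qsl


def effective_options(txt_record: str, uri_options: str = "") -> dict:
    """Combine the implicit SRV default, the TXT record and the URL options."""
    options = {"ssl": "true"}                # implied by a 'mongodb+srv' URL
    options.update(parse_qsl(txt_record))    # options published in the DNS TXT record
    options.update(parse_qsl(uri_options))   # options given explicitly in the URL win
    return options


txt = "authSource=admin&replicaSet=TestCluster-shard-0"
print(effective_options(txt))
```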


There are other potential causes of MongoDB Atlas connectivity issues that aren't covered in this post, but hopefully the tips highlighted here will help some of you, especially if you are diagnosing problems when using DNS SRV based service names in the connection URLs you use.

Song for today: Lose the Baby by Tropical Fuck Storm

Saturday, May 11, 2019

Running a Mongo Shell Script From Within A Larger Bash Script

If you have a Bash script that amongst other things needs to execute a set of multiple Mongo Shell commands together, there are a number of approaches that can be taken. This blog post contains nothing revelatory, but hopefully at least captures examples of these approaches in one single place for easy future reference. There are many situations where this is required, for example:
  • From within a Docker container image’s Entrypoint, running a Bash script which includes a section of Mongo Shell JavaScript code to configure a MongoDB replica-set, using rs.initiate() and associated commands.
  • From within a Continuous Integration process, running a Bash script which installs a MongoDB environment in a host Operating System (OS) and then populates the new MongoDB database with some sample data, using a set of Mongo Shell CRUD commands
  • From within a host system’s monitoring Bash script, which, in addition to gathering some host OS metrics, invokes a set of MongoDB’s server status and statistics commands to also capture database metrics.
The rest of this blog post shows some of the different approaches that can be taken to execute a block of Mongo Shell JavaScript code from within a larger Bash script. In these specific examples a trivial block of JavaScript code will insert 2 records into a ‘persons’ database collection, then query and print both the records belonging to the collection and then remove the 2 records from the collection.

It is worth noting that there is a difference in some of Mongo Shell’s behaviour when running a block of JavaScript code in the Mongo Shell’s Scripted mode rather than its Interactive mode, including the inability to run the Shell Helper commands (e.g. unable to utilise use db, show collections, etc.).


This option requires executing a separate file which contains the block of JavaScript code. First create a new JavaScript file called test.js with the following content:

db = db.getSiblingDB('testdb');
db.persons.insertOne({'firstname': 'Sarah', 'lastname': 'Smith'});
db.persons.insertOne({'firstname': 'John', 'lastname': 'Jones'});
db.persons.find({}, {'_id': 0, 'firstname': 1}).forEach(printjson);
print(db.persons.remove({}));

Then create, make executable, and run a new Bash .sh script file with the following content (this will run the Mongo Shell in Scripted mode):

echo "Doing some Bash script work first"
mongo --quiet ./test.js
echo "Doing some more Bash script work afterwards"


This option involves executing the Mongo Shell with its eval option, passing in a single line containing each of the JavaScript commands separated by a semicolon. Create, make executable, and run a new Bash .sh script file with the following content (this will run the Mongo Shell in Scripted mode):

echo "Doing some Bash script work first"
mongo --quiet --eval "db = db.getSiblingDB('testdb'); db.persons.insertOne({'firstname': 'Sarah', 'lastname': 'Smith'}); db.persons.insertOne({'firstname': 'John', 'lastname': 'Jones'}); db.persons.find({}, {'_id': 0, 'firstname': 1}).forEach(printjson); print(db.persons.remove({}));"
echo "Doing some more Bash script work afterwards"

Note: Depending on your desktop resolution, your browser may show the Mongo Shell command wrapping onto multiple lines. However, it is actually just a single line, which can be proved by copying the line into a text editor which has its ‘text wrapping’ feature disabled.


This option involves executing the Mongo Shell with its eval option, passing in a block of multiple lines of JavaScript code, where the start and end of the code block are delimited by single or double quotes. Create, make executable, and run a new Bash .sh script file with the following content (this will run the Mongo Shell in Scripted mode):

echo "Doing some Bash script work first"
mongo --quiet --eval "
    db = db.getSiblingDB('testdb');
    db.persons.insertOne({'firstname': 'Sarah', 'lastname': 'Smith'});
    db.persons.insertOne({'firstname': 'John', 'lastname': 'Jones'});
    db.persons.find({}, {'_id': 0, 'firstname': 1}).forEach(printjson);
    print(db.persons.remove({}));
"
echo "Doing some more Bash script work afterwards"

Note: Care has to be taken to ensure that any quotes used within the JavaScript code block are single-quotes, if the Mongo Shell’s eval delimiters are double-quotes, or vice versa.
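If a script is generated by another program, this quoting can be delegated to a library rather than done by hand. As a small sketch (the JavaScript snippet is purely illustrative), Python's standard shlex module can wrap a block of code that mixes both kinds of quotes so that a POSIX shell passes it to the eval option as a single argument:

```python
import shlex

# JavaScript that mixes single and double quotes (illustrative only).
js = """db = db.getSiblingDB("testdb"); db.persons.find({'firstname': 'Sarah'}).forEach(printjson);"""

# shlex.quote wraps the text so that a POSIX shell treats it as one argument.
command = "mongo --quiet --eval " + shlex.quote(js)
print(command)

# Splitting the command the way a POSIX shell would recovers the original JavaScript.
assert shlex.split(command) == ["mongo", "--quiet", "--eval", js]
```

The printed command looks ugly, but it can be pasted into a Bash script as-is.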


This option involves redirecting the content of a block of JavaScript multi-line code into the standard input (‘stdin’) stream of the Mongo Shell program, using a Bash Here-Document. Create, make executable, and run a new Bash .sh script file with the following content (unlike the other approaches this will run the Mongo Shell in Interactive mode):

echo "Doing some Bash script work first"
mongo --quiet <<EOF
    show dbs;
    db = db.getSiblingDB("testdb");
    db.persons.insertOne({'firstname': 'Sarah', 'lastname': 'Smith'});
    db.persons.insertOne({'firstname': 'John', 'lastname': 'Jones'});
    db.persons.find({}, {'_id': 0, 'firstname': 1}).forEach(printjson);
    print(db.persons.remove({}));
EOF
echo "Doing some more Bash script work afterwards"

In this case, because the Mongo Shell is run in Interactive mode, the output of the script will be more verbose. Also, by virtue of running in Interactive mode, the Shell Helper commands can now be used within the JavaScript code. The block of code above contains the additional line show dbs; as the first line, to illustrate this. However, don’t take this example as a recommendation to use Shell Helpers in your scripts. Generally you should avoid using Shell Helpers in any of your Mongo Shell scripts, regardless of which approach you use.

Also, because the Mongo Shell eval option is not being used, the JavaScript code can contain a mix of both single and double quotes, as illustrated by the modified line of code db = db.getSiblingDB("testdb"); shown above, which utilises double-quotes.

Another Observation

It is worth noting that for all four of these methods, apart from the External Script File method, you can reference Bash environment variables inline within the Mongo Shell JavaScript code (as long as double-quotes, rather than single-quotes, delimit the code for the eval methods). For example, from a Bash terminal if you have set a variable with the name of the database to write to...

export DBNAME=testdb

... you can then use the value of this environment variable from within the inline Mongo Shell JavaScript...

db = db.getSiblingDB('${DBNAME}');

...to factor out the database name. At face value this may not seem particularly powerful until you realise that many build frameworks (e.g. Docker Compose, Ansible, etc.) allow you to declare environment variables within configuration settings before invoking Bash scripts, to factor out environment specific settings.

One bit of caution though: MongoDB query operators include a dollar sign in their syntax (e.g. '$gt', '$exists'), which will need to be escaped in these scripts (e.g. '\$gt', '\$exists') whenever the code is delimited by double-quotes or passed via an unquoted Here-Document. Otherwise Bash will treat each dollar sign as the start of a variable reference which, in this case, will likely result in the operator being replaced with some empty text.
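Both effects, the useful expansion of an environment variable and the unwanted expansion of a query operator, can be observed without a database. The sketch below is a hedged illustration: it assumes a POSIX 'sh' is available on the PATH, the helper name is my own, and the 'age' field is made up. It simply asks the shell what it would hand to the Mongo Shell for a piece of text placed inside double quotes.

```python
import os
import subprocess


def shell_expand(text_in_double_quotes: str, env: dict) -> str:
    """Return what a POSIX shell produces for the text inside double quotes."""
    script = 'printf "%s" "' + text_in_double_quotes + '"'
    result = subprocess.run(["sh", "-c", script], env=env,
                            capture_output=True, text=True, check=True)
    return result.stdout


# Minimal environment: DBNAME is set, but no variable called 'gt' exists.
env = {"PATH": os.environ.get("PATH", "/usr/bin:/bin"), "DBNAME": "testdb"}

# The environment variable is substituted, as intended.
print(shell_expand("db.getSiblingDB('${DBNAME}')", env))

# Unescaped, the shell swallows the operator; escaped, the operator survives.
print(shell_expand("db.persons.find({age: {$gt: 30}})", env))
print(shell_expand("db.persons.find({age: {\\$gt: 30}})", env))
```

The second line of output shows the query with the operator missing, which is exactly the kind of silent breakage to watch out for.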


The following table summarises the main differences between the four approaches to running a JavaScript block of code with the Mongo Shell, from within a larger Bash script:

Song for today: D. Feathers by Bettie Serveert

Sunday, May 27, 2018

Database Support For Decimal Types That Don't Suffer Loss Of Precision

It's become clear to me that some developers naively trust using float/double types in programming languages and in databases, without even considering the loss of precision that will occur and any detrimental impact on their business applications. Maybe the loss of precision isn't an issue for some types of applications, but I believe it's important to assess that risk before choosing to accept it.

Here's a demonstration of what can happen in programming languages and why you need to consider it...

Python example:
$ python -c "print('%0.18f' % (0.1 * 0.2))"

Node (JavaScript) example:
$ node -e 'console.log(0.1 * 0.2)'

Mongo Shell (JavaScript) example:
$ mongo --nodb --quiet --eval '0.1 * 0.2'

Java example (using Java's newish JShell tool):
$ printf "System.out.println(0.1 * 0.2); \n/ex\n" | jshell -q

None of this is really a surprise given that most of these high-level programming languages are built using C (or C++), which invariably provides the fundamental building block types for floats and doubles.

C example:
$ printf "int main(void){ printf(\"%%.18lf\\\n\", (0.1 * 0.2)); return 0; }" | cc -w -x c -o multiply - && ./multiply

Of course, most modern programming languages have libraries for dealing with large decimals requiring exact representation and precision.

For example, Java has the BigDecimal library class for this reason:
$ printf "System.out.println((new BigDecimal(\"0.2\")).multiply(new BigDecimal(\"0.1\"))); \n/ex\n" | jshell -q
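For comparison, the same contrast can be reproduced in Python, whose standard library has a decimal module playing a similar role to Java's BigDecimal. This is just a small sketch to show the two results side by side:

```python
from decimal import Decimal

# Binary floating point: neither 0.1 nor 0.2 can be represented exactly.
float_product = 0.1 * 0.2
print(repr(float_product))          # 0.020000000000000004

# Decimal arithmetic on values built from strings stays exact.
decimal_product = Decimal("0.1") * Decimal("0.2")
print(decimal_product)              # 0.02

# Building a Decimal from a float exposes the value actually stored in the double.
print(Decimal(0.1))
```

Note that the decimals are constructed from strings; constructing them from float literals would carry the binary rounding error along with them.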

When it comes to databases, the same challenges exist when using fields with floating point values. There may be the need to store and retrieve such fields without loss of precision, whilst enabling arithmetic to be conducted on these fields.

Example of using a standard JavaScript/JSON float type for a field in a MongoDB database:
$ mongo
> db.records.drop()
> db.records.insert({val: 0.2})
> db.records.findOne()
{
    "_id" : ObjectId("5b0a7c24d3dac6c87c0d4a4b"),
    "val" : 0.2
}
> id = db.records.findOne()._id
> db.records.update({_id: id}, {$mul: {val: 0.1}})
> db.records.findOne()
{
    "_id" : ObjectId("5b0a7c24d3dac6c87c0d4a4b"),
    "val" : 0.020000000000000004
}
> db.records.findOne().val

Like programming languages, most traditional relational databases provide extra types and libraries for using decimals that don't suffer loss of precision. However, most of the so-called NoSQL databases don't. MongoDB is one of the exceptions.

Example of using a BSON decimal128 type for a field in a MongoDB database:
$ mongo
> db.records.drop()
> db.records.insert({val: NumberDecimal("0.2")})
> db.records.findOne()
{
    "_id" : ObjectId("5b0a7ce9d3dac6c87c0d4a4c"),
    "val" : NumberDecimal("0.2")
}
> id = db.records.findOne()._id
> db.records.update({_id: id}, {$mul: {val: NumberDecimal("0.1")}})
> db.records.findOne()
{
    "_id" : ObjectId("5b0a7ce9d3dac6c87c0d4a4c"),
    "val" : NumberDecimal("0.02")
}
> db.records.findOne().val

So as you can see, MongoDB can store fields as decimals without precision loss and enable arithmetic and sorting to be performed across these fields. The MongoDB manual provides a lot more information on how to use this decimal field type.
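To get a feel for why this matters once values are summed and sorted, here is a rough Python sketch. It assumes that a 34 significant digit decimal context is a reasonable stand-in for decimal128 arithmetic; it does not talk to MongoDB at all, and the prices are made-up values.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# Assumption: a 34-digit context approximates decimal128 arithmetic.
DECIMAL128 = Context(prec=34, rounding=ROUND_HALF_EVEN)

prices = ["0.10"] * 10   # ten items at 0.10 each (made-up values)

# Summing as doubles accumulates a small binary rounding error.
float_total = sum(float(p) for p in prices)

# Summing as decimals keeps the total exact.
decimal_total = Decimal("0")
for p in prices:
    decimal_total = DECIMAL128.add(decimal_total, Decimal(p))

print(float_total)      # 0.9999999999999999
print(decimal_total)    # 1.00

# Decimal values also compare and sort by their numeric value.
values = [Decimal("0.3"), Decimal("0.02"), Decimal("0.1")]
print(sorted(values))
```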

It could be that the loss of precision does not have a major impact on a particular application and the business purpose it is used for. However, in some cases, especially for financial or scientific applications, the ability to store and process decimal fields without loss of precision is likely to be critical.

Song for today: Quit It by Strand of Oaks

Friday, April 13, 2018

MongoDB Graph Query Example, Inspired by Designing Data-Intensive Applications Book


People who have worked with me recently are probably bored by me raving about how good this book is: Designing Data-Intensive Applications by Martin Kleppmann (O'Reilly, 2016). Suffice to say, if you are in IT and have any sort of interest in databases and/or data-driven applications, you should read this book. You will be richly rewarded for the effort.

In the second chapter of the book ('Data Models and Query Languages'), Martin has a section called 'Graph-Like Data Models' which explores 'graph use cases' where many-to-many relationships are typically modelled with tree-like structures, with indeterminate numbers of inter-connections. The book section shows how a specific 'graph problem' can be solved by using a dedicated graph database technology with associated query language (Cypher) and by using an ordinary relational database with associated query language (SQL). One thing that quickly becomes evident, when reading this section of the book, is how difficult it is in a relational database to model complex many-to-many relationships. This may come as a surprise to some people. However, this is consistent with something I've subconsciously learnt over 20 years of using relational databases, which is that, in the world of RDBMS, relationships ≠ relations.

The graph scenario illustrated in the book shows an example of two people, Lucy and Alain, who are married to each other, who are born in different places and who now live together in a third place. For clarity, I've included the diagram from the book, below, to best illustrate the scenario (annotated with the book's details, in red, for reference).

Throughout the book, numerous types of databases and data-stores are illustrated, compared and contrasted, including MongoDB in many places. However the book's section on graph models doesn't show how MongoDB can be used to solve the example graph scenario. Therefore, I thought I'd take this task on myself. Essentially, the premise is that there is a data-set of many people, with data on the place each person was born in and the place each person now lives in. Of course, any given place may be within a larger named place, which may in turn be within a larger named place, and so on, as illustrated in the diagram above. In the rest of this blog post I show one way that such data structures and relationships can be modelled in MongoDB and then leveraged by MongoDB's graph query capabilities (specifically using the graph lookup feature of MongoDB's Aggregation Framework). What will be demonstrated is how to efficiently answer the exam question posed by the book, namely: 'Find People Who Emigrated From US To Europe'.

Solving The Book's Graph Challenge With MongoDB

To demonstrate the use of MongoDB's Aggregation 'graph lookup' capability to answer the question 'Find People Who Emigrated From US To Europe', I've created the following two MongoDB collections, populated with data:
  1. 'persons' collection. Contains around one million randomly generated person records, where each person has 'born_in' and 'lives_in' attributes, which each reference a 'starting' place record in the places collection.
  2. 'places' collection. Contains hierarchical geographical places data, with the graph structure of: SUBDIVISIONS-->COUNTRIES-->SUBREGIONS-->CONTINENTS. Note: The granularity and hierarchy of the data-set is slightly different than illustrated in the book, due to the sources of geographical data I had available to cobble together.
Similar to the book's example, amongst the many 'persons' records stored in the MongoDB data-set are the following two records relating to 'Lucy' and 'Alain'.

{fullname: 'Lucy Smith', born_in: 'Idaho', lives_in: 'England'}
{fullname: 'Alain Chirac', born_in: 'Bourgogne-Franche-Comte', lives_in: 'England'}

Below is an excerpt of some of the records from the 'places' collection, which illustrates how a place record may refer to another place record, via its 'part_of' attribute.

{name: 'England', type: 'subdivision', part_of: 'United Kingdom of Great Britain and Northern Ireland'}
{name: 'United Kingdom of Great Britain and Northern Ireland', type: 'country', part_of: 'Northern Europe'}
{name: 'Northern Europe', type: 'subregion', part_of: 'Europe'}
{name: 'Europe', type: 'continent', part_of: ''}

If you want to access this data yourself and load it into the two MongoDB database collections, I've created JSON exports of both collections and made these available in a GitHub project (see the project's README for more details on how to load the data into MongoDB and then how to actually run the example's 'graph lookup' aggregation pipeline).

The MongoDB aggregation pipeline I created, to process the data across these two collections and to answer the question 'Find People Who Emigrated From US To Europe', has the following stages:
  1. $graphLookup: For every record in the 'persons' collection, using the person's 'born_in' attribute, locate the matching record in the 'places' collection and then walk the chain of ancestor place records building up a hierarchy of 'born in' place names.
  2. $match: Only keep 'persons' records, where the 'born in' hierarchy of discovered place names includes 'United States of America'.
  3. $graphLookup: For each of these remaining 'persons' records, using each person's 'lives_in' attribute, locate the matching record in the 'places' collection and then walk the chain of ancestor place records building up a hierarchy of 'lives in' place names.
  4. $match: Only keep the remaining 'persons' records, where the 'lives in' hierarchy of discovered place names includes 'Europe'.
  5. $project: For the resulting records to be returned, just show the attributes 'fullname', 'born_in' and 'lives_in'.
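Before looking at the real thing, the logic of the stages listed above can be mimicked in plain Python over a handful of in-memory records. This is only a sketch of the idea, not how MongoDB implements it: the 'England' chain comes from the excerpt shown earlier, whereas the 'Idaho' and 'Bourgogne-Franche-Comte' chains are assumed here for illustration.

```python
# Place records keyed by name; each value is the place's 'part_of' parent.
# The Idaho and Bourgogne-Franche-Comte chains are assumed for illustration.
PLACES = {
    "England": "United Kingdom of Great Britain and Northern Ireland",
    "United Kingdom of Great Britain and Northern Ireland": "Northern Europe",
    "Northern Europe": "Europe",
    "Europe": "",
    "Idaho": "United States of America",
    "United States of America": "Northern America",
    "Northern America": "Americas",
    "Americas": "",
    "Bourgogne-Franche-Comte": "France",
    "France": "Western Europe",
    "Western Europe": "Europe",
}

PERSONS = [
    {"fullname": "Lucy Smith", "born_in": "Idaho", "lives_in": "England"},
    {"fullname": "Alain Chirac", "born_in": "Bourgogne-Franche-Comte", "lives_in": "England"},
]


def hierarchy(start):
    """Walk the part_of chain from a starting place, collecting place names."""
    names, seen = [], set()
    current = start
    while current and current in PLACES and current not in seen:
        seen.add(current)          # guard against accidental cycles
        names.append(current)
        current = PLACES[current]
    return names


def emigrated(persons, born, lives):
    """Keep people born somewhere within 'born' who now live within 'lives'."""
    return [
        p["fullname"] for p in persons
        if born in hierarchy(p["born_in"]) and lives in hierarchy(p["lives_in"])
    ]


print(emigrated(PERSONS, "United States of America", "Europe"))  # ['Lucy Smith']
```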

The actual MongoDB Aggregation Pipeline for this is:

db.persons.aggregate([
    {$graphLookup: {
        from: 'places',
        startWith: '$born_in',
        connectFromField: 'part_of',
        connectToField: 'name',
        as: 'born_hierarchy'
    }},
    {$match: {'born_hierarchy.name': born}},
    {$graphLookup: {
        from: 'places',
        startWith: '$lives_in',
        connectFromField: 'part_of',
        connectToField: 'name',
        as: 'lives_hierarchy'
    }},
    {$match: {'lives_hierarchy.name': lives}},
    {$project: {
        _id: 0,
        fullname: 1,
        born_in: 1,
        lives_in: 1
    }}
]);

When this aggregation is executed, after first declaring values for the two variables it references (born and lives)...

var born = 'United States of America', lives = 'Europe'

...the following is an excerpt of the output that is returned by the aggregation:

{fullname: 'Lucy Smith', born_in: 'Idaho', lives_in: 'England'}
{fullname: 'Bobby Mc470', born_in: 'Illinois', lives_in: 'La Massana'}
{fullname: 'Sandy Mc1529', born_in: 'Mississippi', lives_in: 'Karbinci'}
{fullname: 'Mandy Mc2131', born_in: 'Tennessee', lives_in: 'Budapest'}
{fullname: 'Gordon Mc2472', born_in: 'Texas', lives_in: 'Tyumenskaya oblast'}
{fullname: 'Gertrude Mc2869', born_in: 'United States of America', lives_in: 'Planken'}
{fullname: 'Simon Mc3087', born_in: 'Indiana', lives_in: 'Ribnica'}

On my laptop, using the data-set of a million person records, the aggregation takes about 45 seconds to complete. However, if I first define the index...

db.places.createIndex({name: 1})

...and then run the aggregation, it only takes around 2 seconds to execute. This shows just how efficiently the 'graphLookup' capability is able to walk a graph of relationships, by leveraging an appropriate index.
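The intuition behind this speed-up can be sketched in Python by counting the work needed to resolve each hop of the 'part_of' chain, first by scanning every place record and then via a keyed lookup. A Python dict is only a loose stand-in for a MongoDB index, and the filler records are synthetic, so the numbers illustrate the shape of the difference rather than real timings.

```python
def build_places(extra=1000):
    """Return a list of (name, part_of) records: filler plus one real chain."""
    places = [("Filler %d" % i, "") for i in range(extra)]   # synthetic padding
    places += [
        ("England", "United Kingdom of Great Britain and Northern Ireland"),
        ("United Kingdom of Great Britain and Northern Ireland", "Northern Europe"),
        ("Northern Europe", "Europe"),
        ("Europe", ""),
    ]
    return places


def walk_with_scan(places, start):
    """Resolve each hop by scanning the records, counting comparisons."""
    comparisons, current, chain = 0, start, []
    while current:
        parent = None
        for name, part_of in places:
            comparisons += 1
            if name == current:
                parent = part_of
                break
        if parent is None:          # no such place: stop walking
            break
        chain.append(current)
        current = parent
    return chain, comparisons


def walk_with_index(index, start):
    """Resolve each hop with one keyed lookup, counting lookups."""
    lookups, current, chain = 0, start, []
    while current:
        lookups += 1
        if current not in index:
            break
        chain.append(current)
        current = index[current]
    return chain, lookups


places = build_places()
index = dict(places)   # a dict standing in for an index on 'name'

print(walk_with_scan(places, "England")[1])    # comparisons needed without an index
print(walk_with_index(index, "England")[1])    # lookups needed with an index
```

Multiply the difference by a million person records, each needing two such walks, and the gap between the two timings above becomes easier to believe.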


I've shown the expressiveness and power of MongoDB's aggregation framework, combined with 'graphLookup' pipeline stages, to perform a query of a graph of relationships across many records. A 'graphLookup' stage is efficient as it avoids the need to develop client application logic to programmatically navigate each hop of a graph of relationships, and thus avoids the network round trip latency that a client, traversing each hop, would otherwise incur. The 'graphLookup' stage can and should leverage an index, to enable the 'tree-walk' process to be even more efficient.

Although MongoDB may not be as rich in terms of the number of graph processing primitives it provides, compared with 'dedicated' graph databases, it possesses some key advantages for 'graph' use cases:
  1. Business Critical Applications. MongoDB is designed for, and invariably deployed as a realtime operational database, with built-in high availability and enterprise security capabilities to support realtime business critical uses. Dedicated graph databases tend to be built for 'back-office' and 'offline' analytical uses, with less focus on high availability and security. If there is a need to leverage a database to respond to graph queries in realtime for applications sensitive to latency, availability and security, MongoDB is likely to be a great fit.
  2. Cost of Ownership & Timeliness of Insight. Often, there may be requirements to satisfy CRUD random realtime operations on individual data records and satisfy graph-related analysis of the data-set as a whole. Traditionally, this would require an ecosystem containing two types of database, an operational database and a graph analytical database. A set of ETL processes would then need to be developed to keep the duplicated data synchronised between the two databases. By combining both roles in a single MongoDB distributed database, with appropriate workload isolation, the financial cost of this complexity can be greatly reduced, due to a far simpler deployment. Additionally, and as a consequence, there will be no lag that arises when keeping one copy of data in one system, up to date with the other copy of the data in another system. Rather than operating on stale data, the graph analytical workloads operate on current data to provide more accurate business insight.

Song for today: Cosmonauts by Quicksand