Paul Done's Technical Blog: June 2017

Friday, June 30, 2017

Using the Enterprise Version of MongoDB on GKE Kubernetes

[Part 3 in a series of posts about running MongoDB on Kubernetes, with the Google Kubernetes Engine (GKE). See the GitHub project gke-mongodb-demo for an example scripted deployment of MongoDB to GKE, that you can easily try yourself. The gke-mongodb-demo project combines the conclusions from all the posts in this series so far. Also see: http://k8smongodb.net/]

Introduction

In the previous two posts of my blog series (1, 2) about running MongoDB on GKE's Kubernetes environment, I showed how to ensure a MongoDB Replica Set is secure by default, resilient to system failures and how to ensure various best practice "production" environment settings are in place. In those examples, the community version of the MongoDB binaries were used. In this blog post I show how the enterprise version of MongoDB can be utilised, instead.

Referencing a Docker Image for Use in Kubernetes

For the earlier two blog post examples, a pre-built "mongo" Docker image was "pulled" by Kubernetes, from Docker Hub's "official" MongoDB repository. Below is an excerpt of the Kubernetes StatefulSet definition, that shows how this image was referenced (highlighted in bold):

$ cat mongodb-service.yaml
....
....
containers:
- name: mongod-container
image: mongo
command:
- "mongod"
....
....

By default, if no additional metadata is provided, Kubernetes will look in the Docker Hub repository for the image with the given name. Other repositories such as Google's Container Registry, AWS's EC2 Container Registry, Azure's Container Registry, or any private repository, can be used instead.

It is worth clarifying what is meant by the "official" MongoDB repository. This is a set of images that are "official" from Docker Hub's perspective because Docker Hub manages how the images are composed and built. They are not, however, official releases from MongoDB Inc.'s perspective. When the Docker Hub project builds an image, in addition to sourcing the "mongod" binary from MongoDB Inc's website, other components like the underlying Debian OS, plus various custom scripts, are generated into the image too.

At the time of this blog post, Docker Hub currently provides images for MongoDB community versions 3.0, 3.2, 3.4 and 3.5 (unstable). The Docker Hub repository only contains images for the community version of MongoDB and not the enterprise version.

Building a Docker Image Using the MongoDB Enterprise binaries

You can build a Docker image to run "mongod" in any way you want, using your own custom Dockerfile to define how the image should be generated. The Docker manual even provides a tutorial for creating a custom Docker image specifically for MongoDB, called Dockerize MongoDB. That process can be followed as the basis for building an image which pulls down and uses the enterprise version of MongoDB, rather than the community version.

However, to generate an image containing the enterprise MongoDB version, it isn't actually necessary to create your own custom Dockerfile. This is because, a few weeks ago, I created a pull request to add support for building MongoDB enterprise based images, as part of the normal "mongo" GitHub project that is used to generate the "official" Docker Hub "mongo" images. The GitHub project owner accepted and merged this enhancement a few days later. My enhancement essentially allows the project's Dockerfile, when used by Docker's build tool, to instruct that the enterprise MongoDB binaries should be downloaded into the image, rather than the community ones. This is achieved by providing some "build arguments" to the Docker build process, which I will detail further below. This doesn't mean the Docker Hub's "official repository" for MongoDB now contains pre-generated "enterprise mongod" images ready to use. It just means you can use the source Github project directly, without any changes to the project's Dockerfile, to generate your own image, containing the enterprise binary.

To build the Docker "mongo" image, that pulls in the enterprise version of MongoDB, follow these steps:

1. Install Docker locally on your own workstation/laptop.

2. Download the source files for the Docker Hub "mongo" project (on the project page, click the "Clone or download" green button, then click the "Download zip" link and once downloaded, unpack into a new local folder).

3. From a command line shell, in the project folder, change directory to the sub-folder of the major version you want to build (e.g. "3.4"), and run:

$ docker build -t pkdone/mongo-ent:3.4 --build-arg MONGO_PACKAGE=mongodb-enterprise --build-arg MONGO_REPO=repo.mongodb.com .

This tells Docker to build an image using the Dockerfile in the current directory (".") with the resulting image name being "pkdone/mongo-ent" and tag being ":3.4". By convention, the image name is prefixed by the author's username, which in my case is "pkdone" (obviously this should be replaced by a different prefix, for whoever follows these steps).

The two new "--build-arg" parameters, "MONGO_PACKAGE" and "MONGO_REP" are passed to the "mongo" Dockerfile. The version of the Dockerfile with my enhancements uses these two parameters to locate where to download the specific type of MongoDB binary from. In this case, the values specified mean that the enterprise binary is pulled down into the generated image. If no "build args" are specified, the community version of MongoDB is used, by default.

Important note: When running the "docker build" command above, because the enterprise version of MongoDB will be downloaded, it will mean you are implicitly accepting MongoDB Inc's associated commercial licence terms.

Once the image is generated, you can also quickly test the image by running it in a Docker container on your local machine (as just a "mongod" single instance):

$ docker run -d --name mongo -t pkdone/mongo-ent:3.4

To be sure this is running properly, connect to the container using a shell, check if the "mongod" process is running and check that the Mongo Shell can connect to the containerised "mongod" process.

$ docker exec -it mongo bash
$ ps -aux
$ mongo

The output of the Shell should include the prompt "MongoDB Enterprise >" which shows that the database is using the enterprise version of MongoDB. Exit out of the Mongo Shell, exit out of the container and then from the local machine, run the command to view the "mongod" container's logged output (with example results shown):

$ docker logs mongo | grep enterprise

2017-07-01T12:08:42.177+0000 I CONTROL [initandlisten] modules: enterprise

Again this result should demonstrate that the enterprise version of MongoDB has been used.

Using the Generated Enterprise Mongod Image from the Kubernetes Project

The easiest way to use the "mongod" container image that has just been created, from GKE Kubernetes, is to first register it with Docker Hub, using the following steps:

  1. Create a new free account on Docker Hub.

  2. Run the following commands to associate your local workstation environment with your new Docker Hub account, to list the built image registered on your local machine, and to push this newly generated image to your remote Docker Hub account:

$ docker login
$ docker images
$ docker push pkdone/mongo-ent:3.4

  3. Once the image has finished uploading, in a browser, return to the Docker Hub site and log-in to see the list of your registered repository instances. The newly pushed image should now be listed there.

Now back in the GKE Kubernetes project, for the "mongod" Service/StatefulSet resource definition, change the image reference to be the newly uploaded Docker Hub image, as shown below (highlighted in bold):

$ cat mongodb-service.yaml
....
....
containers:
- name: mongod-container
  image: pkdone/mongo-ent:3.4
command:
- "mongod"
....
....

Now re-perform all the normal steps to deploy the Kubernetes cluster and resources as outlined in the first blog post in the series. Once the MongoDB Replica Set is up and running, you can check the output logs of the first "mongod" container, to see if the enterprise version of MongoDB is being used:

$ kubectl logs mongod-0 | grep enterprise

2017-07-01T13:01:42.794+0000 I CONTROL [initandlisten] modules: enterprise

Summary

In this blog post I’ve shown how to use the enterprise version of MongoDB, when running a MongoDB Replica Set, using Kubernetes StatefulSets, on the Google Kubernetes Engine. This builds on the work done in previous blog posts in the series, around ensuring MongoDB Replica Sets are resilient and better tuned for production workloads, when running on Kubernetes.

[Next post in series: Deploying a MongoDB Sharded Cluster using Kubernetes StatefulSets on GKE]

Song for today: Standing In The Way Of Control by Gossip

Tuesday, June 27, 2017

Configuring Some Key Production Settings for MongoDB on GKE Kubernetes

[Part 2 in a series of posts about running MongoDB on Kubernetes, with the Google Kubernetes Engine (GKE). See the GitHub project gke-mongodb-demo for an example scripted deployment of MongoDB to GKE, that you can easily try yourself. The gke-mongodb-demo project combines the conclusions from all the posts in this series so far. Also see: http://k8smongodb.net/]

Introduction

In the first part of my blog series I showed how to deploy a MongoDB Replica Set to GKE's Kubernetes environment, whilst ensuring that the replica set is secure by default and resilient to various types of system failures. As mentioned in that post, there are number of other "production" considerations that need to be made when running MongoDB in Kubernetes and Docker environments. These considerations are primarily driven by the best practices documented in MongoDB’s Production Operations Checklist and Production Notes. In this blog post, I will address how to apply some (but not all) of these best practices, on GKE's Kubernetes platform.

Host VM Modifications for Using XFS & Disabling Hugepages

For optimum performance, the MongoDB Production Notes strongly recommend applying the following configuration settings to the host operating system (OS):

Use an XFS based Linux filesystem for WiredTiger data file persistence.
Disable Transparent Huge Pages.

The challenge here is that neither of these elements can be configured directly within normally deployed pods/containers. Instead, they need to be set in the OS of each machine/VM that is eligible to host one or more pods and their containers. Fortunately, after a little googling I found a solution to incorporating XFS, in the article Mounting XFS on GKE, which also provided the basis for deriving a solution for disabling Huge Pages too. It turns out that in Kubernetes, it is possible to run a pod (and its container) once per node (host machine), using a facility called a DaemonSet. A DaemonSet is used to schedule a "special" container to run on every newly provisioned node, as a one off, before any "normal" containers are scheduled and run on the node. In addition, for Docker based containers (the default on GKE Kubernetes), the container can be allowed to run in a privileged mode, which gives the "privileged" container access to other Linux Namespaces running in the same host environment. With heightened security rights the "privileged" container can then run a utility called nsenter ("NameSpace ENTER") to spawn a shell using the namespace belonging to the host OS ("/proc/1"). The script that the shell runs can then essentially perform any arbitrary root level actions on the underlying host OS.

So with this in mind, the challenge is to build a Docker container image that, when run in privileged mode, uses "nsenter" to spawn a shell to run some shell script commands. As luck would have it, such a container has already been created, in a generic way, as part of the Kubernetes "contributions" project, called startup-script. The generated "startup-script" Docker image has been registered and and made available in the Google Container Registry, ready to be pulled in and used by anyone's Kubernetes projects.

Therefore on GKE, to create a DaemonSet leveraging the "startup-script" image in privileged mode, we first need to define the DaemonSet's configuration:

$ cat hostvm-node-configurer-daemonset.yaml

kind: DaemonSet
apiVersion: extensions/v1beta1
metadata:
name: hostvm-configurer
labels:
app: startup-script
spec:
template:
metadata:
labels:
app: startup-script
spec:
hostPID: true
containers:
- name: hostvm-configurer-container
image: gcr.io/google-containers/startup-script:v1
securityContext:
privileged: true
env:
- name: STARTUP_SCRIPT
value: |
#! /bin/bash
set -o errexit
set -o pipefail
set -o nounset

# Disable hugepages
echo 'never' > /sys/kernel/mm/transparent_hugepage/enabled
echo 'never' > /sys/kernel/mm/transparent_hugepage/defrag

# Install tool to enable XFS mounting
apt-get update || true
apt-get install -y xfsprogs

Shown in bold at the base of the file, you will notice the commands used to disable Huge Pages and to install the XFS tools for mounting and formatting storage using the XFS filesystem. Further up the file, in bold, is the reference to the 3rd party "startup-script" image from the Google Container Registry and the security context setting to state that the container should be run in privileged mode.

Next we need to deploy the DaemonSet with its "start-script" container to all the hosts (nodes), before we attempt to create any GCE disks, that need to be formatted as XFS:

$ kubectl apply -f hostvm-node-configurer-daemonset.yaml

In the GCE disk definitions, described in the first blog post in this series (i.e. "gce-ssd-persistentvolume?.yaml"), an addition of a new parameter needs to be made (shown in bold below) to indicate that the disk's filesystem type needs to be XFS:

apiVersion: "v1"
kind: "PersistentVolume"
metadata:
name: data-volume-1
spec:
capacity:
storage: 30Gi
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Retain
storageClassName: fast
gcePersistentDisk:
fsType: xfs
pdName: pd-ssd-disk-1

Now in theory, this should be all that is required to get XFS working. Except on GKE, it isn't!

After deploying the DaemonSet and creating the GCE storage disks, the deployment of the "mongod" Service/StatefulSet will fail. The StatefulSet's pods do not to start properly because the disks can't be formatted and mounted as XFS. It turns out that this is because, by default, GKE uses a variant of Chromium OS as the underlying host VM that runs the containers, and this OS flavour doesn't support XFS. However, GKE can also be configured to use a Debian based Host VM OS instead, which does support XFS.

To see the list of host VM OSes that GKE supports, the following command can be run:

$ gcloud container get-server-config
Fetching server config for europe-west1-b
defaultClusterVersion: 1.6.4
defaultImageType: COS
validImageTypes:
- CONTAINER_VM
- COS
....

Here, "COS" is the label for the Chromium OS and "CONTAINER_VM" is the label for the Debian OS. The easiest way to start leveraging the Debian OS image is to clear out all the GCE/GKE resources and Kubernetes cluster from the current project and start deployment all over again. This time, when the initial command is run to create the new Kubernetes cluster, an additional argument (shown in bold) must be provided to define that the Debian OS should be used for each Host VM that is created as a Kubernetes node.

$ gcloud container clusters create "gke-mongodb-demo-cluster" --image-type=CONTAINER_VM

This time, when all the Kubernetes resources are created and deployed, the "mongod" containers correctly utilise XFS formatted persistent volumes.

If this all seems a bit complicated, it is probably helpful to view the full end-to-end deployment flow, provided in my example GitHub project gke-mongodb-demo.

There is one final observation to make before finishing the discussion on XFS. In Google's online documentation, it is stated that the Debian Host VM OS is deprecated in favour of Chromium OS. I hope that in the future Google will add XFS support directly to its Chromium OS distribution, to make the use of XFS a lot less painful and to ensure XFS can still be used with MongoDB, if the Debian Host VM option is ever completely removed.

UPDATE 13-Oct-2017: Google has recently updated GKE and in addition to upgrading Kubernetes to version 1.7, has removed the old Debain container VM option and added an Ubuntu container VM option, instead. In the new Ubuntu container, the XFS tools are already installed, and therefore do not need to be configured by the DaemonSet. The gke-mongodb-demo project has been updated accordingly, to use the new Ubuntu container VM and to omit the command to install the XFS tools.

Disabling NUMA

For optimum performance, the MongoDB Production Notes recommend that "on NUMA hardware, you should configure a memory interleave policy so that the host behaves in a non-NUMA fashion". The DockerHub "mongo" container image which has been used so far with Kubernetes in this blog series, already contains some bootstrap code to start the "mongod" process with the "numactl --interleave=all" setting. This setting makes the process environment behave in a non-NUMA way.

However, I believe it is worth specifying the "numactl" settings explicitly in the "mongod" Service/StatefulSet resource definition, anyway, just in case other users choose to use an alternative or self-built Docker image for the "mongod" container. The excerpt below shows the added "numactl" elements (in bold), required to run the containerised "mongod" process in a "non-NUMA" manner.

$ cat mongodb-service.yaml
....
....
containers:
- name: mongod-container
image: mongo
command:
- "numactl"
- "--interleave=all"
- "mongod"
....
....

Controlling CPU & RAM Resource Allocation Plus WiredTiger Cache Size

Of course, when you are running a MongoDB database it is important to size both CPU and RAM resources correctly for the particular database workload, regardless of the type of host environment. In a Kubernetes containerised host environment, the amount of CPU & RAM resource dedicated to a container can be defined in the "resource" section of the container's declaration, as shown in the excerpt of the "mongod" Service/StatefulSet definition below:

$ cat mongodb-service.yaml
....
....
containers:
- name: mongod-container
image: mongo
command:
- "mongod"
- "--wiredTigerCacheSizeGB"
- "0.25"
- "--bind_ip"
- "0.0.0.0"
- "--replSet"
- "MainRepSet"
- "--auth"
- "--clusterAuthMode"
- "keyFile"
- "--keyFile"
- "/etc/secrets-volume/internal-auth-mongodb-keyfile"
- "--setParameter"
- "authenticationMechanisms=SCRAM-SHA-1"
resources:
requests:
cpu: 1
memory: 2Gi
....
....

In the example (shown in bold), 1x virtual CPU (vCPU) and 2GB of RAM have been requested to run the container. You will also notice that an additional parameter has been defined for "mongod", specifying the WiredTiger internal cache size ("--wiredTigerCacheSizeGB"). In a containerised environment it is absolutely vital to explicitly state this value. If this is not done, and multiple containers end up running on the same host machine (node), MongoDB's WiredTiger storage engine may attempt to take more memory than it should. This is because of the way a container "reports" it's memory size to running processes. As per the MongoDB Production Recommendations, the default cache size guidance is: "50% of RAM minus 1 GB, or 256 MB". Given that the amount of memory requested is 2GB, the WiredTiger cache size here, has been set to 256MB.

If and when you define a different amount of memory for the container process, be sure to also adjust the WiredTiger cache size setting accordingly, otherwise the "mongod" process may not leverage all the memory reserved for it, by the container.

Controlling Anti-Affinity for Mongod Replicas

When running a MongoDB Replica Set, it is important to ensure that none of the "mongod" replicas in the replica set are running on the same host machine as each other, to avoid inadvertently introducing a single point of failure. In a Kubernetes containerised environment, if containers are left to their own devices, different "mongod" containers could end up running on the same nodes. Kubernetes provides a way of specifying pod anti-affinity to prevent this from occurring. Below is an excerpt of a "mongod" Services/StatefulSet resource file which declares an anti-affinity configuration.

$ cat mongodb-service.yaml
....
....
serviceName: mongodb-service
replicas: 3
template:
metadata:
labels:
replicaset: MainRepSet
spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: replicaset
operator: In
values:
- MainRepSet
topologyKey: kubernetes.io/hostname
....
....

Here, a rule has been defined that asks Kubernetes to apply anti-affinity when deploying pods with the label "replicaset" equal to "MainRepSet", by looking for potential matches on the host VM instance's hostname, and then avoiding them.

Setting File Descriptor & User Process Limits

When deploying the MongoDB Replica Set on GKE Kubernetes, as demonstrated in the current GitHub project gke-mongodb-demo, you may notice some warning about "rlimits" in the output of each containerised mongod's logs. These log entries can be viewed by running the following command:

$ kubectl logs mongod-0 | grep rlimits

2017-06-27T12:35:22.018+0000 I CONTROL [initandlisten] ** WARNING: soft rlimits too low. rlimits set to 29980 processes, 1000000 files. Number of processes should be at least 500000 : 0.5 times number of files.

The MongoDB manual provides some recommendations concerning the system settings for the maximum number of processes and open files when running a "mongod" process.

Unfortunately, thus far, I've not established an appropriate way to the enforce these thresholds using GKE Kubernetes. This topic will possibly be the focus of a blog post for another day. However, I thought that it would be informative to highlight the issue here, with the supporting context, to allow others the chance to resolve it first.

UPDATE 13-Oct-2017: Google has recently updated GKE and in addition to upgrading Kubernetes to version 1.7, has removed the old Debain container VM option and added an Ubuntu container VM option, instead. The default "rlimits" settings in the Ubuntu container VM, are already appropriate for running MongoDB. Therefore a fix is no longer required to address this issue.

Summary

In this blog post I’ve provided some methods for addressing certain best practices when deploying a MongoDB Replica Set to the Google Kubernetes Engine. Although this post does not provide an exhaustive list of best practice solutions, I hope it proves useful for others (and myself) to build upon, in the future.

[Next post in series: Using the Enterprise Version of MongoDB on GKE Kubernetes]

Song for today: The Mountain by Jambinai

Sunday, June 25, 2017

Deploying a MongoDB Replica Set as a GKE Kubernetes StatefulSet

[Part 1 in a series of posts about running MongoDB on Kubernetes, with the Google Kubernetes Engine (GKE). See the GitHub project gke-mongodb-demo for an example scripted deployment of MongoDB to GKE, that you can easily try yourself. The gke-mongodb-demo project combines the conclusions from all the posts in this series so far. Also see: http://k8smongodb.net/]

Introduction

A few months ago, Sandeep Dinesh of Google wrote an informative blog post about Running MongoDB on Kubernetes with StatefulSets on Google’s Cloud Platform. I found this to be a great resource to bootstrap my knowledge of Kubernetes’ new StatefulSets feature, and food for thought for approaches for deploying MongoDB on Kubernetes generally. StatefulSets is Kubernetes’ framework for providing better support for “stafeful applications”, such as databases and message queues. StatefulSets provides the capabilities of stable unique network hostnames and stable dedicated network storage volume mappings, essential for a database cluster to function properly and for data to exist and outlive the lifetime of inherently ephemeral containers.

My view of the approach in the Google blog post, is it is a great way for a developer to rapidly spin up a MongoDB Replica Set, to quickly test that their code still works correctly (it should) in a clustered environment. However, the approach cannot be regarded as a best practice for deploying MongoDB in Production, for mission critical use cases. This assertion is not a criticism, as the blog post is obviously intended to show the art of the possible (which it does very eloquently), and the author makes no claim to be a seasoned MongoDB administration expert.

So what are the challenges for Production deployments, in the approach outlined in the Google blog post? Well there are two problems, which I will address in this post:

Use of a MongoDB/Kubernetes sidecar per Pod, to control Replica Set configuration. Essentially, the sidecar wakes up every 5 seconds, checks which MongoDB pods are running and then reconfigures the replica-set, on the fly. It adds any MongoDB servers it can see, to the replica set configuration, and removes any servers it can no longer see. This is dangerous for many reasons. I’ve highlighted two of the most important reasons why here*:

This introduces the real risk of split-brain, in the event of a network partition. For example, normally, if there is a 3 node replica set configured and the primary is somehow separated from the secondaries, the primary will step down as it can’t maintain a majority. Normally, the two secondaries that can now only see each other, will form a quorum and one of these two will then become the primary. In the sidecar implementation, during a network split, the sidecar on the primary believes the two secondaries aren’t running and it re-configures the replica set on the fly, to now just have one member. This remaining member believes it can still act as primary (because it has achieved a majority of 1 out of 1 votes). The sidecars still running on the other two members, now also reconfigure the replica set to be just those two members. One of these two members automatically becomes a primary (because it has achieved a majority of 2 out of 2 votes). As a result there are now two primaries in existence for the same replica-set, which a normal and properly configured MongoDB cluster would never allow to occur. MongoDB’s strong consistency guarantees are subverted and non-deterministic things will start happening to the data. In a properly deployed MongoDB cluster, if there is a 3 node replica set and 2 nodes appear to be down, it doesn’t mean you now have a 1 node replica set, you don’t. You still have a 3 node replica-set, albeit only one replica appears to be currently running (and hence no primary is permitted, to guarantee safety and strong consistency).
Many applications updating data in MongoDB will use “WriteConcerns” set to a value such as “majority”, to provide levels of guarantee for safe data updates across a cluster. The whole notion of a “WriteConcern” would become meaningless in the sidecar controlled environment, because the constantly re-configured replica set would always reflect a total replica-set size of just those replicas currently active and reachable. For example, performing a database update operation with “WriteConcern” of “majority” would always be permitted, regardless of whether all 3 replicas are currently available, or just 2 replicas are or just 1 replica is.

Insecure by default, due to authentication not being enabled. In a Production environment, running MongoDB with authentication disabled should never be allowed. Even if the intention is to configure authentication as a later provisioning step, the database is potentially exposed and insecure for seconds, minutes or longer. As a result, the “mongod” process should always be started with authentication enabled (e.g. using “--auth” command line flag), even during any “bootstrap provisioning process”. MongoDB’s localhost exception should be relied upon to securely configure one or more database users.

* If this was such an easy and safe thing to do, MongoDB replicas would be built to automatically perform these re-configurations, themselves, in a separate background thread running inside each “mongod” replica process. The brain controlling what the replica-set configuration should look like, lives outside the cluster for good reason (e.g. inside the head of an administrator, or preferably, inside a configuration file that is used to drive a higher level orchestration tool which operates above the “containers layer”).

Additionally, there are number of other considerations that aren’t just specific to the approach in the referenced Google blog post, but are applicable to the use of Docker/Kubernetes with MongoDB, generally. These consideration can be categorised as ways to ensure that MongoDB’s best practices are followed, as documented in MongoDB’s Production Operations Checklist and Production Notes. I address some of these best practice omissions in the next post in this series: Configuring Some Key Production Settings for MongoDB on GKE Kubernetes. It is probably worth me being clear here, that I am not claiming my blog series will get users 100% to where they need to be, to deploy a fully operational, secure and well-performing MongoDB Clusters on GKE. Instead, what I hope the series will do, is enable users to build on my findings and recommendations, so there are less gaps for them to address, when planning their own production environment.

For the rest of this blog post, I will focus on the steps required to deploy a MongoDB Replica Set, on GKE, addressing the replica-set resiliency and security concerns that I've highlighted above.

Steps to Deploy MongoDB to GKE, using StatefulSets

The first thing to do, if you haven’t already, is sign up to use the Google Cloud Platform (GCP). To keeps things simple, you can sign up to a free trial for GCP. Note: The free trial places some restrictions on account resource quotas, in particular restricting storage to a maximum of 100GB. Therefore, in my series of the blog posts and my sample GitHub project, I employ modest disk sizes, to remain under this threshold.

Once your GCP account is activated, you should download and install GCP’s client command line tool, called “gcloud”, to your local Linux/Windows/Mac workstation.

With “gcloud” installed, run the following commands to configure the local environment to use your GCP account, to install the main Kubernetes command tool (“kubectl”), to configure authentication credentials, and to define the default GCP zone to be deployed to:

$ gcloud init
$ gcloud components install kubectl
$ gcloud auth application-default login
$ gcloud config set compute/zone europe-west1-b

Note: If you want to specify an alternative zone to deploy to in the above command, you can first view the list of available zones by running the command: $ gcloud compute zones list

You should now be ready to create a brand new Kubernetes cluster to the Google Kubernetes Engine. Run the following command to provision a new Kubernetes cluster called “gke-mongodb-demo-cluster”:

$ gcloud container clusters create "gke-mongodb-demo-cluster"

As part of this process, a set of 3 GCE VM instances are automatically provisioned, to run Kubernetes cluster nodes ready to host pods of containers.

You can view the state of the deployed Kubernetes cluster using the Google Cloud Platform Console (look at both the “Kubernetes Engine” and the “Compute Engine” sections of the Console).

Next, lets register GCE’s fast SSD persistent disks to be used in the cluster:

$ cat gce-ssd-storageclass.yaml

kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
name: fast
provisioner: kubernetes.io/gce-pd
parameters:
type: pd-ssd

$ kubectl apply -f gce-ssd-storageclass.yaml

Then run the commands to allocate 3 lots of Google Cloud storage, of size 30GB, using the fast SSD persistent disks, followed by a query to show the status of those newly created disks:

$ gcloud compute disks create --size 30GB --type pd-ssd pd-ssd-disk-1
$ gcloud compute disks create --size 30GB --type pd-ssd pd-ssd-disk-2
$ gcloud compute disks create --size 30GB --type pd-ssd pd-ssd-disk-3
$ gcloud compute disks list

Now, declare 3 Kubernetes “Persistent Volume” definitions, that each reference one of the storage disks just created:

$ cat gce-ssd-persistentvolume1.yaml

apiVersion: "v1"
kind: "PersistentVolume"
metadata:
name: data-volume-1
spec:
capacity:
storage: 30Gi
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Retain
storageClassName: fast
gcePersistentDisk:
pdName: pd-ssd-disk-1

$ kubectl apply -f gce-ssd-persistentvolume1.yaml

(repeat for Disks 2 and 3, using similar files, “gce-ssd-persistentvolume2.yml” and “gce-ssd-persistentvolume3.yml” respectively, with the fields “name: data-volume-?” and “pdName: pd-ssd-disk-?” set in each file)

Once the three Persistent Volumes are configured, their status can be viewed with the following command:

$ kubectl get persistentvolumes

This will show that the state of each volume is marked as “available” (i.e. no container has staked a claim on each yet).

A key deviation from the original Google blog post, is enabling MongoDB authentication immediately, before any "mongod" processes are started. Enabling authentication for a MongoDB replica set doesn’t just enforce authentication of applications using MongoDB, but also enforces internal authentication for inter-replica communication. Therefore, lets generate a keyfile to be used for internal cluster authentication and register it as a Kubernetes Secret:

$ TMPFILE=$(mktemp)
$ /usr/bin/openssl rand -base64 741 > $TMPFILE
$ kubectl create secret generic shared-bootstrap-data –from file=internal-auth-mongodb-keyfile=$TMPFILE
$ rm $TMPFILE

This generates a random key into a temporary file and then uses the Kubernetes API to register it as a Secret, before deleting the file. Subsequently, the Secret will be made accessible to each “mongod”, via a volume mounted by each host container.

For the final Kubernetes provisioning step, we need to prepare the definition of the Kubernetes Service and StatefulSet for MongoDB, which, amongst other things, encapsulates the configuration of the “mongod” Docker container to be run.

$ cat mongodb-service.yaml

apiVersion: v1
kind: Service
metadata:
name: mongodb-service
labels:
name: mongo
spec:
ports:
- port: 27017
targetPort: 27017
clusterIP: None
selector:
role: mongo
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
name: mongod
spec:
serviceName: mongodb-service
replicas: 3
template:
metadata:
labels:
role: mongo
environment: test
replicaset: MainRepSet
spec:
terminationGracePeriodSeconds: 10
volumes:
- name: secrets-volume
secret:
secretName: shared-bootstrap-data
defaultMode: 256
containers:
- name: mongod-container
image: mongo
command:
- "mongod"
- "--bind_ip"
- "0.0.0.0"
- "--replSet"
- "MainRepSet"
- "--auth"
- "--clusterAuthMode"
- "keyFile"
- "--keyFile"
- "/etc/secrets-volume/internal-auth-mongodb-keyfile"
- "--setParameter"
- "authenticationMechanisms=SCRAM-SHA-1"
ports:
- containerPort: 27017
volumeMounts:
- name: secrets-volume
readOnly: true
mountPath: /etc/secrets-volume
- name: mongodb-persistent-storage-claim
mountPath: /data/db
volumeClaimTemplates:
- metadata:
name: mongodb-persistent-storage-claim
annotations:
volume.beta.kubernetes.io/storage-class: "fast"
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 30Gi

You may notice that this Service definition varies in some key areas, from the one provided in the original Google blog post. Specifically:

A “Volume” called “secrets-volume” is defined, ready to expose the shared keyfile to each of the “mongod” replicas that will run.
Additional command line parameters are specified for “mongod”, to enable authentication (“--auth”) and to provide related security settings, including the path where “mongod” should locate the keyfile on its local filesystem.
In the “VolumeMounts” section, the mount point path is specified for the Volume that holds the key file.
The storage request for the Persistent Volume Claim that the container will make, has been reduced from 100GB to 30GB, to avoid issues if using the free trial of the Google Cloud Platform (avoids exhausting storage quotas).
No “sidecar” Container is defined for the same Pod as the “mongod” Container.

Now it’s time to deploy the MongoDB Service and StatefulSet. Run:

$ kubectl apply -f mongodb-service.yaml

Once this has run, you can view the health of the service and pods:

$ kubectl get all

Keep re-running the command above, until you can see that all 3 “mongod” pods and their containers have been successfully started (“Status=Running”).

You can also check the status of the Persistent Volumes, to ensure they have been properly claimed by the running “mongod” containers:

$ kubectl get persistentvolumes

Finally, we need to connect to one of the “mongod” container processes to configure the replica set and specify an administrator user for the database. Run the following command to connect to the first container:

$ kubectl exec -it mongod-0 -c mongod-container bash

This will place you into a command line shell directly in the container. If you fancy it, you can explore the container environment. For example you may want to run the following commands to see what processes are running in the container and also to see the hostname of the container (this hostname should always be the same, because a StatefulSet has been used):

$ ps -aux
$ hostname -f

Connect to the local “mongod” process using the Mongo Shell (it is only possible to connect unauthenticated from the same host that the database process is running on, by virtue of the localhost exception).

$ mongo

In the shell run the following command to initiate the replica set (we can rely on the hostnames always being the same, due to having employed a StatefulSet):

> rs.initiate({_id: "MainRepSet", version: 1, members: [
{ _id: 0, host : "mongod-0.mongodb-service.default.svc.cluster.local:27017" },
{ _id: 1, host : "mongod-1.mongodb-service.default.svc.cluster.local:27017" },
{ _id: 2, host : "mongod-2.mongodb-service.default.svc.cluster.local:27017" }
]});

Keep checking the status of the replica set, with the following command, until you see that the replica set is fully initialised and a primary and two secondaries are present:

> rs.status();

Then run the following command to configure an “admin” user (performing this action results in the “localhost exception” being automatically and permanently disabled):

> db.getSiblingDB("admin").createUser({
user : "main_admin",
pwd : "abc123",
roles: [ { role: "root", db: "admin" } ]
});

Of course, in a real deployment, the steps used above, to configure a replica set and to create an admin user, would be scripted, parameterised and driven by an external process, rather than typed in manually.

That’s it. You should now have a MongoDB Replica Set running on Kubernetes on GKE.

Run Some Quick Tests

Let just prove a couple of things before we finish:

1. Show that data is indeed being replicated between members of the containerised replica set.
2. Show that even if we remove the replica set containers and then re-create them, the same stable hostnames are still used and no data loss occurs, when the replica set comes back online. The StatefulSet’s Persistent Volume Claims should successfully result in the same storage, containing the MongoDB data files, being attached to by the same “mongod” container instance identities.

Whilst still in the Mongo Shell from the previous step, authenticate and quickly add some test data:

> db.getSiblingDB('admin').auth("main_admin", "abc123");
> use test;
> db.testcoll.insert({a:1});
> db.testcoll.insert({b:2});
> db.testcoll.find();

Exit out of the shell and exit out of the first container (“mongod-0”). Then using the following commands, connect to the second container (“mongod-1”), run the Mongo Shell again and see if the data we’d entered via the first replica, is visible to the second replica:

$ kubectl exec -it mongod-1 -c mongod-container bash
$ mongo
> db.getSiblingDB('admin').auth("main_admin", "abc123");
> db.setSlaveOk(1);
> use test;
> db.testcoll.find();

You should see that the two records inserted via the first replica, are visible to the second replica.

To see if Persistent Volume Claims really are working, use the following commands to drop the Service & StatefulSet (thus stopping the pods and their “mongod” containers) and re-create them again (I’ve included some checks in-between, so you can track the status):

$ kubectl delete statefulsets mongodb-statefulset
$ kubectl delete services mongodb-service
$ kubectl get all
$ kubectl get persistentvolumes
$ kubectl apply -f mongodb-service.yaml
$ kubectl get all

As before, keep re-running the last command above, until you can see that all 3 “mongod” pods and their containers have been successfully started again. Then connect to the first container, run the Mongo Shell and execute a query to see if the data we’d inserted into the old containerised replica-set is still present in the re-instantiated replica set:

$ kubectl exec -it mongod-0 -c mongod-container bash
$ mongo
> db.getSiblingDB('admin').auth("main_admin", "abc123");
> use test;
> db.testcoll.find();

You should see that the two records inserted earlier, are still present.

Summary

In this blog post I’ve shown how a MongoDB Replica Set can be deployed, using Kubernetes StatefulSets, to the Google Kubernetes Engine (GKE). Most of the outlined steps (but not all) are actually generic to any type of Kubernetes platform. Critically, I have shown how to ensure the Kubernetes based MongoDB Replica Set is secure by default, and how to ensure the Replica Set can operate normally, to be resilient to various types of system failures.

[Next post in series: Configuring Some Key Production Settings for MongoDB on GKE Kubernetes]

Song for today: Sun by The Hotelier