Sunday, June 25, 2017

Deploying a MongoDB Replica Set as a GKE Kubernetes StatefulSet

[Part 1 in a series of posts about running MongoDB on Kubernetes, with the Google Kubernetes Engine (GKE). See the GitHub project gke-mongodb-demo for an example scripted deployment of MongoDB to GKE, that you can easily try yourself. The gke-mongodb-demo project combines the conclusions from all the posts in this series so far. Also see: http://k8smongodb.net/]


Introduction

A few months ago, Sandeep Dinesh of Google wrote an informative blog post about Running MongoDB on Kubernetes with StatefulSets on Google’s Cloud Platform. I found this to be a great resource to bootstrap my knowledge of Kubernetes’ new StatefulSets feature, and food for thought for approaches for deploying MongoDB on Kubernetes generally. StatefulSets is Kubernetes’ framework for providing better support for “stafeful applications”, such as databases and message queues. StatefulSets provides the capabilities of stable unique network hostnames and stable dedicated network storage volume mappings, essential for a database cluster to function properly and for data to exist and outlive the lifetime of inherently ephemeral containers.

My view of the approach in the Google blog post, is it is a great way for a developer to rapidly spin up a MongoDB Replica Set, to quickly test that their code still works correctly (it should) in a clustered environment. However, the approach cannot be regarded as a best practice for deploying MongoDB in Production, for mission critical use cases. This assertion is not a criticism, as the blog post is obviously intended to show the art of the possible (which it does very eloquently), and the author makes no claim to be a seasoned MongoDB administration expert.

So what are the challenges for Production deployments, in the approach outlined in the Google blog post? Well there are two problems, which I will address in this post:
  1. Use of  a MongoDB/Kubernetes sidecar per Pod, to control Replica Set configuration. Essentially, the sidecar wakes up every 5 seconds, checks which MongoDB pods are running and then reconfigures the replica-set, on the fly. It adds any MongoDB servers it can see, to the replica set configuration, and removes any servers it can no longer see. This is dangerous for many reasons. I’ve highlighted two of the most important reasons why here*:
    • This introduces the real risk of split-brain, in the event of a network partition. For example, normally, if there is a 3 node replica set configured and the primary is somehow separated from the secondaries, the primary will step down as it can’t maintain a majority. Normally, the two secondaries that can now only see each other, will form a quorum and one of these two will then become the primary. In the sidecar implementation, during a network split, the sidecar on the primary believes the two secondaries aren’t running and it re-configures the replica set on the fly, to now just have one member. This remaining member believes it can still act as primary (because it has achieved a majority of 1 out of 1 votes). The sidecars still running on the other two members, now also reconfigure the replica set to be just those two members. One of these two members automatically becomes a primary (because it has achieved a majority of 2 out of 2 votes). As a result there are now two primaries in existence for the same replica-set, which a normal and properly configured MongoDB cluster would never allow to occur. MongoDB’s strong consistency guarantees are subverted and non-deterministic things will start happening to the data. In a properly deployed MongoDB cluster, if there is a 3 node replica set and 2 nodes appear to be down, it doesn’t mean you now have a 1 node replica set, you don’t. You still have a 3 node replica-set, albeit only one replica appears to be currently running (and hence no primary is permitted, to guarantee safety and strong consistency).
    • Many applications updating data in MongoDB will use “WriteConcerns” set to a value such as “majority”, to provide levels of guarantee for safe data updates across a cluster. The whole notion of a “WriteConcern” would become meaningless in the sidecar controlled environment, because the constantly re-configured replica set would always reflect a total replica-set size of just those replicas currently active and reachable. For example, performing a database update operation with “WriteConcern” of “majority” would always be permitted, regardless of whether all 3 replicas are currently available, or just 2 replicas are or just 1 replica is.
  2. Insecure by default, due to authentication not being enabled. In a Production environment, running MongoDB with authentication disabled should never be allowed. Even if the intention is to configure authentication as a later provisioning step, the database is potentially exposed and insecure for seconds, minutes or longer. As a result, the “mongod” process should always be started with authentication enabled (e.g. using “--auth” command line flag), even during any “bootstrap provisioning process”. MongoDB’s localhost exception should be relied upon to securely configure one or more database users.
* If this was such an easy and safe thing to do, MongoDB replicas would be built to automatically perform these re-configurations, themselves, in a separate background thread running inside each “mongod” replica process. The brain controlling what the replica-set configuration should look like, lives outside the cluster for good reason (e.g. inside the head of an administrator, or preferably, inside a configuration file that is used to drive a higher level orchestration tool which operates above the “containers layer”).

Additionally, there are number of other considerations that aren’t just specific to the approach in the referenced Google blog post, but are applicable to the use of Docker/Kubernetes with MongoDB, generally. These consideration can be categorised as ways to ensure that MongoDB’s best practices are followed, as documented in MongoDB’s Production Operations Checklist and Production Notes. I address some of these best practice omissions in the next post in this series: Configuring Some Key Production Settings for MongoDB on GKE Kubernetes. It is probably worth me being clear here, that I am not claiming my blog series will get users 100% to where they need to be, to deploy a fully operational, secure and well-performing MongoDB Clusters on GKE. Instead, what I hope the series will do, is enable users to build on my findings and recommendations, so there are less gaps for them to address, when planning their own production environment.

For the rest of this blog post, I will focus on the steps required to deploy a MongoDB Replica Set, on GKE, addressing the replica-set resiliency and security concerns that I've highlighted above.


Steps to Deploy MongoDB to GKE, using StatefulSets

The first thing to do, if you haven’t already, is sign up to use the Google Cloud Platform (GCP). To keeps things simple, you can sign up to a free trial for GCP. Note: The free trial places some restrictions on account resource quotas, in particular restricting storage to a maximum of 100GB. Therefore, in my series of the blog posts and my sample GitHub project, I employ modest disk sizes, to remain under this threshold.

Once your GCP account is activated, you should download and install GCP’s client command line tool, called “gcloud”, to your local Linux/Windows/Mac workstation.

With “gcloud” installed, run the following commands to configure the local environment to use your GCP account, to install the main Kubernetes command tool (“kubectl”), to configure authentication credentials, and to define the default GCP zone to be deployed to:

$ gcloud init
$ gcloud components install kubectl
$ gcloud auth application-default login
$ gcloud config set compute/zone europe-west1-b

Note: If you want to specify an alternative zone to deploy to in the above command, you can first view the list of available zones by running the command:  $ gcloud compute zones list

You should now be ready to create a brand new Kubernetes cluster to the Google Kubernetes Engine. Run the following command to provision a new Kubernetes cluster called “gke-mongodb-demo-cluster”:

$ gcloud container clusters create "gke-mongodb-demo-cluster"

As part of this process, a set of 3 GCE VM instances are automatically provisioned, to run Kubernetes cluster nodes ready to host pods of containers.

You can view the state of the deployed Kubernetes cluster using the Google Cloud Platform Console (look at both the “Kubernetes Engine” and the “Compute Engine” sections of the Console).


Next, lets register GCE’s fast SSD persistent disks to be used in the cluster:

$ cat gce-ssd-storageclass.yaml

kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: fast
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd

$ kubectl apply -f gce-ssd-storageclass.yaml

Then run the commands to allocate 3 lots of Google Cloud storage, of size 30GB, using the fast SSD persistent disks, followed by a query to show the status of those newly created disks:

$ gcloud compute disks create --size 30GB --type pd-ssd pd-ssd-disk-1
$ gcloud compute disks create --size 30GB --type pd-ssd pd-ssd-disk-2
$ gcloud compute disks create --size 30GB --type pd-ssd pd-ssd-disk-3
$ gcloud compute disks list

Now, declare 3 Kubernetes “Persistent Volume” definitions, that each reference one of the storage disks just created:

$ cat gce-ssd-persistentvolume1.yaml

apiVersion: "v1"
kind: "PersistentVolume"
metadata:
  name: data-volume-1
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: fast
  gcePersistentDisk:
    pdName: pd-ssd-disk-1

$ kubectl apply -f gce-ssd-persistentvolume1.yaml

(repeat for Disks 2 and 3, using similar files, “gce-ssd-persistentvolume2.yml” and “gce-ssd-persistentvolume3.yml” respectively, with the fields “name: data-volume-?” and “pdName: pd-ssd-disk-?” set in each file)

Once the three Persistent Volumes are configured, their status can be viewed with the following command:

$ kubectl get persistentvolumes

This will show that the state of each volume is marked as “available” (i.e. no container has staked a claim on each yet).

A key deviation from the original Google blog post, is enabling MongoDB authentication immediately, before any "mongod" processes are started. Enabling authentication for a MongoDB replica set doesn’t just enforce authentication of applications using MongoDB, but also enforces internal authentication for inter-replica communication. Therefore, lets generate a keyfile to be used for internal cluster authentication and register it as a Kubernetes Secret:

$ TMPFILE=$(mktemp)
$ /usr/bin/openssl rand -base64 741 > $TMPFILE
$ kubectl create secret generic shared-bootstrap-data –from file=internal-auth-mongodb-keyfile=$TMPFILE
$ rm $TMPFILE

This generates a random key into a temporary file and then uses the Kubernetes API to register it as a Secret, before deleting the file. Subsequently, the Secret will be made accessible to each “mongod”, via a volume mounted by each host container.

For the final Kubernetes provisioning step, we need to prepare the definition of the Kubernetes Service and StatefulSet for MongoDB, which, amongst other things, encapsulates the configuration of the “mongod” Docker container to be run.

$ cat mongodb-service.yaml

apiVersion: v1
kind: Service
metadata:
  name: mongodb-service
  labels:
    name: mongo
spec:
  ports:
  - port: 27017
    targetPort: 27017
  clusterIP: None
  selector:
    role: mongo
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: mongod
spec:
  serviceName: mongodb-service
  replicas: 3
  template:
    metadata:
      labels:
        role: mongo
        environment: test
        replicaset: MainRepSet
    spec:
      terminationGracePeriodSeconds: 10
      volumes:
        - name: secrets-volume
          secret:
            secretName: shared-bootstrap-data
            defaultMode: 256
      containers:
        - name: mongod-container
          image: mongo
          command:
            - "mongod"
            - "--bind_ip"
            - "0.0.0.0"
            - "--replSet"
            - "MainRepSet"
            - "--auth"
            - "--clusterAuthMode"
            - "keyFile"
            - "--keyFile"
            - "/etc/secrets-volume/internal-auth-mongodb-keyfile"
            - "--setParameter"
            - "authenticationMechanisms=SCRAM-SHA-1"
          ports:
            - containerPort: 27017
          volumeMounts:
            - name: secrets-volume
              readOnly: true
              mountPath: /etc/secrets-volume
            - name: mongodb-persistent-storage-claim
              mountPath: /data/db
  volumeClaimTemplates:
  - metadata:
      name: mongodb-persistent-storage-claim
      annotations:
        volume.beta.kubernetes.io/storage-class: "fast"
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 30Gi

You may notice that this Service definition varies in some key areas, from the one provided in the original Google blog post. Specifically:
  1. A “Volume” called “secrets-volume” is defined, ready to expose the shared keyfile to each of the “mongod” replicas that will run.
  2. Additional command line parameters are specified for “mongod”, to enable authentication (“--auth”) and to provide related security settings, including the path where “mongod” should locate the keyfile on its local filesystem.
  3. In the “VolumeMounts” section, the mount point path is specified for the Volume that holds the key file.
  4. The storage request for the Persistent Volume Claim that the container will make, has been reduced from 100GB to 30GB, to avoid issues if using the free trial of the Google Cloud Platform (avoids exhausting storage quotas).
  5. No “sidecar” Container is defined for the same Pod as the “mongod” Container.
Now it’s time to deploy the MongoDB Service and StatefulSet. Run:

$ kubectl apply -f mongodb-service.yaml

Once this has run, you can view the health of the service and pods:

$ kubectl get all

Keep re-running the command above, until you can see that all 3 “mongod” pods and their containers have been successfully started (“Status=Running”).

You can also check the status of the Persistent Volumes, to ensure they have been properly claimed by the running “mongod” containers:

$ kubectl get persistentvolumes

Finally, we need to connect to one of the “mongod” container processes to configure the replica set and specify an administrator user for the database. Run the following command to connect to the first container:

$ kubectl exec -it mongod-0 -c mongod-container bash

This will place you into a command line shell directly in the container. If you fancy it, you can explore the container environment. For example you may want to run the following commands to see what processes are running in the container and also to see the hostname of the container (this hostname should always be the same, because a StatefulSet has been used):

$ ps -aux
$ hostname -f

Connect to the local “mongod” process using the Mongo Shell (it is only possible to connect unauthenticated from the same host that the database process is running on, by virtue of the localhost exception).

$ mongo

In the shell run the following command to initiate the replica set (we can rely on the hostnames always being the same, due to having employed a StatefulSet):

> rs.initiate({_id: "MainRepSet", version: 1, members: [
       { _id: 0, host : "mongod-0.mongodb-service.default.svc.cluster.local:27017" },
       { _id: 1, host : "mongod-1.mongodb-service.default.svc.cluster.local:27017" },
       { _id: 2, host : "mongod-2.mongodb-service.default.svc.cluster.local:27017" }
 ]});

Keep checking the status of the replica set, with the following command, until you see that the replica set is fully initialised and a primary and two secondaries are present:

> rs.status();

Then run the following command to configure an “admin” user (performing this action results in the “localhost exception” being automatically and permanently disabled):

> db.getSiblingDB("admin").createUser({
      user : "main_admin",
      pwd  : "abc123",
      roles: [ { role: "root", db: "admin" } ]
 });

Of course, in a real deployment, the steps used above, to configure a replica set and to create an admin user, would be scripted, parameterised and driven by an external process, rather than typed in manually.

That’s it. You should now have a MongoDB Replica Set running on Kubernetes on GKE.


Run Some Quick Tests

Let just prove a couple of things before we finish:

1. Show that data is indeed being replicated between members of the containerised replica set.
2. Show that even if we remove the replica set containers and then re-create them, the same stable hostnames are still used and no data loss occurs, when the replica set comes back online. The StatefulSet’s Persistent Volume Claims should successfully result in the same storage, containing the MongoDB data files, being attached to by the same “mongod” container instance identities.

Whilst still in the Mongo Shell from the previous step, authenticate and quickly add some test data:

> db.getSiblingDB('admin').auth("main_admin", "abc123");
> use test;
> db.testcoll.insert({a:1});
> db.testcoll.insert({b:2});
> db.testcoll.find();

Exit out of the shell and exit out of the first container (“mongod-0”). Then using the following commands, connect to the second container (“mongod-1”), run the Mongo Shell again and see if the data we’d entered via the first replica, is visible to the second replica:

$ kubectl exec -it mongod-1 -c mongod-container bash
$ mongo
> db.getSiblingDB('admin').auth("main_admin", "abc123");
> db.setSlaveOk(1);
> use test;
> db.testcoll.find();

You should see that the two records inserted via the first replica, are visible to the second replica.

To see if Persistent Volume Claims really are working, use the following commands to drop the Service & StatefulSet (thus stopping the pods and their “mongod” containers) and re-create them again (I’ve included some checks in-between, so you can track the status):

$ kubectl delete statefulsets mongodb-statefulset
$ kubectl delete services mongodb-service
$ kubectl get all
$ kubectl get persistentvolumes
$ kubectl apply -f mongodb-service.yaml
$ kubectl get all

As before, keep re-running the last command above, until you can see that all 3 “mongod” pods and their containers have been successfully started again. Then connect to the first container, run the Mongo Shell and execute a query to see if the data we’d inserted into the old containerised replica-set is still present in the re-instantiated replica set:

$ kubectl exec -it mongod-0 -c mongod-container bash
$ mongo
> db.getSiblingDB('admin').auth("main_admin", "abc123");
> use test;
> db.testcoll.find();

You should see that the two records inserted earlier, are still present.


Summary

In this blog post I’ve shown how a MongoDB Replica Set can be deployed, using Kubernetes StatefulSets, to the Google Kubernetes Engine (GKE). Most of the outlined steps (but not all) are actually generic to any type of Kubernetes platform. Critically, I have shown how to ensure the Kubernetes based MongoDB Replica Set is secure by default, and how to ensure the Replica Set can operate normally, to be resilient to various types of system failures.

[Next post in series: Configuring Some Key Production Settings for MongoDB on GKE Kubernetes]


Song for today: Sun by The Hotelier

21 comments:

Unknown said...

Great job on your Kubernetes deployment, Clean, simple, and hits all the nails right on the head. https://github.com/pkdone/gke-mongodb-demo


Utkarsh Yeolekar said...

How should i write a script to configure the replica set and creation of admin user ?

Unknown said...

See the example project referenced at the top of the blog for an example script. See: https://github.com/pkdone/gke-mongodb-demo/blob/master/scripts/configure_repset_auth.sh

Utkarsh Yeolekar said...

thanks Paul,

This script we are running manually, is there any options where we can associate the script with the pod itself, and when it is up, it will execute the script first. In that way, we will not to fire the script externally.

jan said...

https://github.com/pkdone/gke-mongodb-demo
This likely has what you're looking for, it's basically the guide and extra within a bunch of scripts.

Anonymous said...

Nice follow up! Have you tried using Operators to remove some of the manual setup involved? I agree the sidecar that pings every 5 seconds is not ideal at all, but it does remove manual configuration. Trying to figure out something more stable yet operationally automated.

https://www.kubestack.com/catalog/mongodb

harsha said...

Hello,

Trying to understand how to expose the mongo outside the cluster as a service. Can you please shed some light on this.

cfan6161 said...

If I try to run as a non-root User, I get a permission denied error from mongo when it tries to read the keyFile. How can I run mongo as a non-root User while still being able to read the keyFile?

Samuel Birocchi said...

Hello there! First of all, nice post! Very helpful.

I'm trying to create de replica set in my cluster, however I keep getting the following error:

[main] Error reading file /etc/secrets-volume/internal-auth-mongodb-keyfile: No such file or directory

I created the secret and used the yaml in your git, is there something I'm missing?

Thx

Unknown said...

[main] Error reading file /etc/secrets-volume/internal-auth-mongodb-keyfile: No such file or directory

Unknown said...

I deployed my production replica set using the sidecar.
Can I remove the sidecar from the StatefulSet so that it doesn't try to reconfigure the replica set now that it's already set up?

Tom Q Wu said...

Speaking to production, how come it doesn't mention enabling the SSL?

Riken said...

Can I access the cluster using its public IP from the app hosted on some other service? If yes, how I can do that?

Anonymous said...

Hi you are missing a dash in your secret generation script (--from-file) in your blog post that causes no data to be generated along with the secret however the secret is still created. This leads to the secret (still, even without data) being mounted with the incorrect permissions which a lot of people were commenting on here.

If anyone is looking for a local (minikube) implementation here you go:
https://github.com/MichaelScript/kubernetes-mongodb

Derrick said...

Thanks for the excellent post!

@ariel I removed the sidecar without any issues. I thought it was best practice to use it but now I know better :).

bhabs said...

Great article Paul. Just wanted a little help. I followed all your step including the rs.initiate(...) command, but when I do rs.status() in the mongo shell I get the following message. I also see the prompt as rs0:OTHER instead of rs0.PRIMARY>. Looks like for me this is not going through. Has anyone faced something similar? Appreciate your help in advance.
rs0:OTHER> rs.status()
{
"state" : 10,
"stateStr" : "REMOVED",
"uptime" : 5713,
"optime" : {
"ts" : Timestamp(1529899600, 1),
"t" : NumberLong(1)
},
"optimeDate" : ISODate("2018-06-25T04:06:40Z"),
"ok" : 0,
"errmsg" : "Our replica set config is invalid or we are not a member of it",
"code" : 93,
"codeName" : "InvalidReplicaSetConfig",
"operationTime" : Timestamp(1529899600, 1),
"$clusterTime" : {
"clusterTime" : Timestamp(1529899600, 1),
"signature" : {
"hash" : BinData(0,"mvEtDgCJFGrtzzkWZJyxtjKjeeg="),
"keyId" : NumberLong("6567070506720165889")
}
}
}

bhabs said...

Please ignore my previous ask/comment. I cloned from Paul's github repo instead of using the 'yaml' file given on this web page. I just had to comment out the following in 'mongodb-service.yaml' and it worked great for me.
#resources:
#requests:
#cpu: 1
#memory: 2Gi

I am trying to set this up in AWS and not in GCE. I don't know, why this CPU negotiation failed in AWS.
Thanks

Unknown said...

Hi, Paul Done.
First of all thanks for the great tutorial. I've 2 questions:

1. Whenever I try to check replica set is working properly, I run this commands.

$ kubectl exec -it mongod-1 -c mongod-container bash
$ mongo
> db.getSiblingDB('admin').auth("main_admin", "abc123");
> db.setSlaveOk(1); # If I skip this command I'm getting below results. Does it mean MongoDB replica set is working as intended or not?
> use test;
> db.testcoll.find();


MainRepSet:SECONDARY> show dbs
2018-08-19T13:22:54.251+0000 E QUERY [thread1] Error: listDatabases failed:{
"operationTime" : Timestamp(1534684966, 1),
"ok" : 0,
"errmsg" : "not master and slaveOk=false",
"code" : 13435,
"codeName" : "NotMasterNoSlaveOk",
"$clusterTime" : {
"clusterTime" : Timestamp(1534684966, 1),
"signature" : {
"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
"keyId" : NumberLong(0)
}
}
} :
_getErrorWithCode@src/mongo/shell/utils.js:25:13
Mongo.prototype.getDBs@src/mongo/shell/mongo.js:65:1
shellHelper.show@src/mongo/shell/utils.js:849:19
shellHelper@src/mongo/shell/utils.js:739:15
@(shellhelp2):1:1

2. What's the mongodb uri to connect my app to this mongodb cluster?

Unknown said...

I'm having the same issue.

Sarthak said...

Indeed Good post.
My question is How Insert,Update,Delete query will go to same container when multiple container of mongod running in Kubernetes Cluster ? Any Help would be appreciated !!

Sarthak said...

What is prerequisite to run mongodb sharded cluster in inhouse kubernetes cluster ?