Thursday, December 19, 2019

Some Tips for Diagnosing Client Connection Issues for MongoDB Atlas

Introduction


   [UPDATE 07-Sep-2020: I've now written an executable binary tool you can run which performs the equivalent of the checks in this blog post to diagnose connectivity issues to Atas or any other type of MongoDB deployment, downloadable from here]

By default, for recent MongoDB drivers and client tools, MongoDB Atlas advertises the exposed URL for a deployed database cluster using a service name which maps to a set of DNS SRV records to provide an initial connection seed list. This results in a much more 'human digestible' URL, but more importantly, increases deployment flexibility and the ability for underlying database server hosts to migrate over time, without needing to subsequently reconfigure clients.

For example, an Atlas Cluster may be referenced in a connection string by:

 testcluster-abcd.mongodb.net

...as an alternative to the full connection endpoint list:

 testcluster-shard-00-00-abcd.mongodb.net:27017,testcluster-shard-00-01-abcd.mongodb.net:27017,testcluster-shard-00-02-abcd.mongodb.net:27017/test?replicaSet=TestCluster-shard-0

It is worth noting though, whichever approach is used (explicitly defining all endpoints in the connection string or having it discovered via the DNS SRV service name), the connection URL seed list is only ever used for bootstrapping a client application to the database cluster, when the client first starts or when it later needs to restart. On start-up, the client uses the connection seed list to attempt to attach to any member of the cluster, and in fact, all but one of the endpoints could be incorrect and a successful cluster connection will still be achieved. Once the initial connection is made, the true cluster member endpoint list is dynamically and continuously shared between the cluster and the client at runtime. This enables the client to continue operating against the database even if the members of the database cluster change locations or identities over time. For example, after a year of a database cluster and application continuously running, there could be the need to increase database capacity by dynamically rotating the database hosts to new higher processing capacity machines. This all happens dynamically and the already running client application automatically becomes aware and leverages the new hosts without downtime and without needing to consult the connection string again. If the client application restarts though, it will need to read the updated connection string to be able to bootstrap a connection back up to the database cluster.

In the rest of this post we will explore some of the ways initial client connectivity issues can be diagnosed and resolved when using DNS SRV based connection URLs. For reference, Joe Drumgoole provides a great explanation about how DNS SRV records work more generally, and how MongoDB drivers and tools can leverage these.

Naive Connectivity Diagnosis


If you are having connection problems with Atlas when using the SRV service name based URL, be weary of drawing the wrong conclusions regarding the cause of the connection problem...

For example, lets say you can't connect an application to a cluster with the Atlas advertised URL of 'mongodb+srv://testcluster-abcd.mongodb.net' from your laptop. You may be tempted to try to debug the connection problem by running some of the following commands from your laptop:

$ ping testcluster-abcd.mongodb.net
ping: testcluster-abcd.mongodb.net: Name or service not known

$ nc -zv -w 5 testcluster-abcd.mongodb.net 27017
nc: getaddrinfo for host "testcluster-abcd.mongodb.net" port 27017: Name or service not known

Neither of these work even if you actually do have Atlas connectivity configured correctly. This is because "testcluster-abcd.mongodb.net" is not the DNS name of a specific host endpoint. It is actually used by the MongoDB drivers and tools to dynamically lookup the DNS SRV records which have been populated for a service called 'testcluster-abcd.mongodb.net'.

Useful Connectivity Diagnosis


As documented in the MongoDB Drivers specification document and the MongoDB Manual, a DNS SRV query is performed by the drivers/tools by prepending the text '_mongodb._tcp.' to the service name. Therefore, to lookup the list of real endpoints for the Atlas cluster from your laptop using the DNS nslookup tool, you should run:

$ nslookup -q=SRV _mongodb._tcp.testcluster-abcd.mongodb.net
Server: 127.0.0.53
Address: 127.0.0.53#53

Non-authoritative answer:
_mongodb._tcp.testcluster-abcd.mongodb.net service = 0 0 27017 testcluster-shard-00-02-abcd.mongodb.net.
_mongodb._tcp.testcluster-abcd.mongodb.net service = 0 0 27017 testcluster-shard-00-01-abcd.mongodb.net.
_mongodb._tcp.testcluster-abcd.mongodb.net service = 0 0 27017 testcluster-shard-00-00-abcd.mongodb.net.

You can see that in this case that the database service name maps to 3 endpoints (i.e. the hosts of the 3 replica set members). You can then lookup the actual IP address of any one of these endpoints if you desire:

$ nslookup testcluster-shard-00-00-abcd.mongodb.net
Server: 127.0.0.53
Address: 127.0.0.53#53

Non-authoritative answer:
testcluster-shard-00-00-abcd.mongodb.net canonical name = ec2-35-178-15-240.eu-west-2.compute.amazonaws.com.
Name: ec2-35-178-15-240.eu-west-2.compute.amazonaws.com
Address: 35.178.14.238

So to now debug your connectivity issue further you can use ping but this time by specifying one of the underlying host server endpoints for the database cluster:

$ ping -c 3  testcluster-shard-00-00-abcd.mongodb.net
PING ec2-35-178-15-240.eu-west-2.compute.amazonaws.com (35.178.14.238) 56(84) bytes of data.
64 bytes from ec2-35-178-15-240.eu-west-2.compute.amazonaws.com (35.178.14.238): icmp_seq=1 ttl=51 time=10.2 ms
64 bytes from ec2-35-178-15-240.eu-west-2.compute.amazonaws.com (35.178.14.238): icmp_seq=2 ttl=51 time=9.73 ms
64 bytes from ec2-35-178-15-240.eu-west-2.compute.amazonaws.com (35.178.14.238): icmp_seq=3 ttl=51 time=11.7 ms

--- ec2-35-178-15-240.eu-west-2.compute.amazonaws.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 9.739/10.586/11.735/0.850 ms

If this is successful it still doesn't necessarily mean that you can connect to the database service. The next thing to try is to see if you can actually open a socket connection to the mongod (or mongos) daemon process running on one of the endpoints, which you can achieve from your laptop using the netcat utility:

$ nc -zv -w 5 testcluster-shard-00-00-abcd.mongodb.net 27017
nc: connect to testcluster-shard-00-00-abcd.mongodb.net port 27017 (tcp) timed out: Operation now in progress

If this doesn't connect but you are able to ping the endpoint host (as is the case in this example), it probably indicates that the IP address of your client laptop has not been added to the Atlas project's whitelist, which is easy to remedy via the Atlas Console:


Once your laptop has been added to the whitelist, running netcat again should demonstrate that a socket connection can now be successfully made:

$ nc -zv -w 5 testcluster-shard-00-00-abcd.mongodb.net 27017
Connection to testcluster-shard-00-00-abcd.mongodb.net 27017 port [tcp/*] succeeded!

If this connects, then it is advisable to move on to trying to connect to the database via the Mongo Shell.


In this example screenshot, the Atlas console suggests the following Mongo Shell command line to use to connect:

 mongo "mongodb+srv://testcluster-abcd.mongodb.net/test" --username main_user

With this connection string, some of you may be thinking how does the Shell know to connect to Atlas over SSL/TLS, what replica-set name it should request and what authentication source database it should specify to locate the user's credentials?

Well, in addition to querying the DNS SRV records for the service, when dynamically constructing the initial bootstrap URL for the cluster, the MongoDB drivers/tools also lookup a DNS TXT record for the service which Atlas also populates for the deployed cluster. This TXT record contains the set of connection options, to be added as parameters to the dynamically constructed connecting string (e.g. 'ssl=true&replicaSet=TestCluster-shard-0&authSource=admin'). You can view what these parameter settings are for a particular Atlas cluster, yourself, by running the following DNS query:

$ nslookup -q=TXT testcluster-abcd.mongodb.net
Server: 127.0.0.53
Address: 127.0.0.53#53

Non-authoritative answer:
testcluster-abcd.mongodb.net  text = "authSource=admin&replicaSet=TestCluster-shard-0"

Note, the default behaviour for MongoDB drivers/tools using a 'mongodb+srv' based URL is defined as to enable SSL/TLS for the connection. As a result, 'ssl=true' doesn't have to be included in the DNS TXT record, as shown in the example above, because the drivers/tools will automatically add this parameter to the connection string on the fly.

Summary


There's other potential causes of MongoDB Atlas connectivity issues that aren't covered in this post, but hopefully the tips highlighted here will help some of you, especially if you are diagnosing problems when using DNS SRV based service names in the connection URLs you use.


Song for today: Lose the Baby by Tropical Fuck Storm

1 comment:

lokriet said...

Thanks a lot for your post! It should definitely be made more visible by google somehow :) I've been struggling with a mongo connection problem for weeks, looking into rather useless error messages in console and unhelpful (to me at least) Atlas troubleshooting pages, and your post turned me in the right direction in the end. Yay