In today’s clusters made of commodity servers, failures are the norm. Having your code fail because one data node happened to be down when you were querying it is frustrating and no fun. Based on my experience hardening our product, I wanted to share a few quick and easy pointers that can help make your MongoDB queries more robust to node failures. Let’s get started.
Background for Beginners
Note: If you are completely new to MongoDB, I would recommend reading the following links. They should give you a high-level understanding of the components that make up a MongoDB cluster.
- Introduction to MongoDB Features
- How MongoDB Maintains Redundancy and Data Availability (a.k.a Replica Sets)
- How MongoDB Scales Horizontally (a.k.a Sharding)
Let’s first start by understanding PyMongo and its advantages. As you already know, Python is a high-level programming language that provides code readability with object-oriented programming. Its ease of syntax and code written are more readable and smaller than code written in other languages. And MongoDB is a document-oriented scalable database designed for handling big data applications (such as content management systems). Since MongoDB supports dynamic schemas, it also provides the flexibility of storing any kind of data into your database without any rigidity.
So, in case you have never used PyMongo before, the PyMongo distribution contains tools for interacting with MongoDB databases with Python. It is the most efficient tool for working with MongoDB using the utilities of Python. PyMongo was created to incorporate the advantages of Python as a programming language and MongoDB as a database. Since Python provides an easy-to-implement way of programming and MongoDB can handle large document repositories, the PyMongo tool can be of great use, since it provides the best of these two technologies.
At the very core of PyMongo is the MongoClient object, which is used to make connections and queries to a MongoDB database cluster. It can be used to connect to a standalone mongod instance, a replica set or mongos instances. To specify which mongo endpoints to contact, you simply pass in the endpoints into the host argument of the object. For example, if we had a standalone mongod node on 10.0.0.3 listening on port 27017, you can contact that node as follows:
import pymongo
client = pymongo.MongoClient(host=“10.0.0.3:27017”)
The node passed into the host is called a seed node. Once you initialize the MongoClient, it will connect to the specified seed node in the background. With some basics, let’s talk about how to make your queries more robust!
Tip 1: Increase Your Host Seed List
The first tip is very simple, but quite useful. Instead of passing in a single seed node, pass in a list of seed nodes. Passing in a list of seed nodes gives the MongoClient object more endpoints to contact in case the connection to one of the input nodes is down. This not only makes initializing a mongo connection more robust, but also makes the client robust to later connection failures occurring after initialization. Code-wise, it would look like the following:
client = pymongo.MongoClient(host=“10.0.0.3:27017”) # Fragile; single point of failure
client = pymongo.MongoClient(host=[”10.0.0.3:27017”, “10.0.0.3:27018”, “10.0.0.3:27019”]) # More robust; can contact two other nodes in case one goes down
Note that all nodes in your seed list must be part of the same logical group. For example, the seed list must contain mongod nodes belonging to the same replica set, or mongos query routers connected to the same config servers, or config servers in the same sharded cluster. Mixing these up will result in unpredictable behavior.
Tip 2: If at First You Don’t Succeed, Try, Try Again
After initializing a connection to a mongo node, the connection may fail at any time. This can be caused by a large variety of factors, ranging from network connectivity issues to actual mongo node failures. If one of these failures occurs, the next time a query is issued, the MongoClient will try to contact the unavailable node, determine it is unreachable, and then throw an exception. Afterward, the MongoClient object will automatically try to connect in the background to another live node in the seed list you provided, meaning your subsequent queries are likely to succeed. However, your code is still left with an exception to handle.
A clean and easy way to handle these situations is to wrap all your queries in a retry decorator that will catch the exception and then retry your query. Here is some sample code:
def retry(num_tries, exceptions):
def decorator(func):
def f_retry(*args, **kwargs):
for i in xrange(num_tries):
try:
return func(*args, **kwargs)
except exceptions as e:
continue
return f_retry
return decorator
# Retry on AutoReconnect exception, maximum 3 times
retry_auto_reconnect = retry(3, (pymongo.errors.AutoReconnect,))
def get_count(client, db, collection):
return client[db][collection].count()
By wrapping your queries in a retry decorator, you can make them more robust to node failures and other temporary failures such as primary re-elections. Granted, this method isn’t a guarantee your queries will succeed in all cases as it assumes that there are other healthy nodes in the cluster to fail over to. However, assuming a relatively stable cluster, this method handles a large number of failures.
Tip 3: Catch the Right Exceptions
In the above example, we retried on AutoReconnect exceptions, which is raised when a connection to a database is lost. While this is one of the exceptions to catch, it is by no means complete. Other exceptions can be found in the pymongo documentation (see link below). A couple ones you may want to also catch include:
- pymongo.errors.ConnectionFailure: thrown whenever a connection to a database cannot be made (actually the super-class of AutoReconnect above)
- pymongo.errors.ServerSelectionTimeoutError: thrown whenever the query you issue cannot find a node to serve the request (e.g. issuing a read on primary, when no primary exists)
These two exceptions are useful in both unsharded and sharded MongoDB clusters. However, there is some additional complexity in handling node failures for sharded clusters. In sharded clusters, most of your queries should be sent through the mongos because it can route your queries to the appropriate shard. If the mongos instance fails, then aConnectionFailure will be raised. However, what happens if your connection to the mongos is healthy, but the mongos fails to make a connection to a replica set? In these cases, the mongos will aggregate all the errors and send back an error which is then raised as an OperationFailure.
While it may seem simple to just also retry OperationFailures, it’s not quite that clean because OperationFailures can represent a large variety of errors, ranging from retryable errors such as primary re-elections to non-retryable errors such as authentication errors. You can distinguish between different types of errors by examining the code field of the error. Although at some point in time there was a list of error codes and their meanings, they have since become outdated and removed (see SERVER-24047), which means you’ll have to do some trial and error to discover which codes you should retry on.
Conclusion
Hopefully, these tips were helpful for all you MongoDB lovers! If you have any questions, comments, or have your own tips to share, feel free to comment below and get the discussion flowing.