Sharding

Sharding is one of those mystical aspects of MongoDB that it take awhile to wrap ones head around. Basically sharding is a mechanism by which one can scale writes by distributing the writing to multiple primaries (shards). Each document has a shard key associated with it which decides on what primary the document lives.

Sharding Topology

In MongoDB sharding happens at the collection level. That is to say that you can have a combination of sharded and unsharded collections. Let’s look at a simple topology.

Simple Two Shard Topology

The application talks to the Mongos proxies to write to the sharded system.

When to Shard

One of the typical errors is to shard to early. The reason this can be a problem is that sharding requires the developer to pick a shard key for distribution of the writes and one can easily pick the wrong key early due to not knowing how the data needs to be accessed. This can cause reads to be inefficiently spread out causing unnecessary IO and CPU usage to retrieve the data. Once the collection is sharded with a key it can be very time consuming to undo it as all the data will have to migrated from one sharded collection to another rewriting the all the documents.

Let’s look at some reason you might want to Shard.

  1. Your Working Set no longer fits in the memory of you computer. Sharding can help more of your Working Set to be in memory by pooling the RAM of all the shards. Thus if you have a 20GB Working Set on a 16GB machine, sharding can split this across 2 machines or 32GB instead, keeping all of the data in RAM.

  2. Scaling the write IO. You need to perform more write operations than what a single server can handle. By Sharding you can balance out the writes across multiple computers, scaling the total write throughput.

Choosing a Shard Key

It’s important to pick a Shard key based on the actual read/write profile of your application to avoid inefficiencies in the application. That said there are a couple of tips that can help finding the right shard key.

Easily Divisible Shard Key

If the picked shard key is easily divisible it makes it easier for MongoDB to distribute the content among the shards. If a key has a very limited number of possible values it can lead to inefficient distribution of the documents causing an uneven amount of reads and writes to go to a small set of the shards.

High Randomness Shard Key

A key with high randomness will evenly distribute the writes and reads across all the shards. This works great if documents are self contained entities such as Users. However queries for ranges of document such as all users with age less than 35 years will require a scatter gather.

Single Shard Targeted Key

Picking a shard key that groups the documents together will make most of the queries go to a specific Shard, meaning one can avoid scatter gather queries. One possible example might be a geo application for the UK where the first part of the key includes the postcode and the second is the address. Due to the first part of the shard key being the postcode all documents for that particular sort key will end up on the same Shard, meaning all queries for a specific postcode will be routed to a single Shard.

The UK postcode works as it has a lot of possible values due to the resolution of postcodes in the UK. This means there will only be limited amount of documents in each chunk for a specific postcode. However if we where to do this for a US postcode we might find that each postcode includes a lot of addresses causing the chunks to be hard to split into new ranges. The effect is that MongoDB is less able to spread out the documents and it thus impacts performance.

Routing Shard Keys

Depending on your Shard key the routing will work differently. This is important to keep in mind as it will impact performance.

Type Of Operation Query Topology
Insert Must have the Shard key
Update Can have the Shard key
Query with Shard Key Routed to nodes
Query without Shard Key Scatter gather
Indexed/Sorted Query with Shard Key Routed in order
Indexed/Sorted Query without Shard Key Distributed sort merge

Inbox Example

Imagine a social Inbox. In this case we have two main goals

  1. Send new messages to it’s recipients efficiently
  2. Read the Inbox efficiently

We want to ensure we meet two specific goals. The first one is to write to multiple recipients on separate shards thus leveraging the write scalability. However for a user to read their email box, one wants to read from a single shard avoid scatter/gather queries.

Fan out write, Single shard Read

How does one go about getting the correct shard key. Let’s assume we have two collections inbox and users in our social database. Let’s do the collection sharding.

var db = db.getSisterDB('social');
db.shardCollection('social.inbox', {owner: 1, sequence: 1});
db.shardCollection('social.users', {user_name: 1});

Let’s write and read to the collections with some test data to show how we can leverage the sharding.

var db = db.getSisterDB('social');
var msg = {
  from: 'Christian',
  to: ['Peter', 'Paul'],
  sent_on: new Date(),
  message: 'Hello world'
}

for(var i = 0; i < msg.to.length; i++) {
  var result = db.users.findAndModify({
    query: { user_name: msg.to[i] },
    update: { '$inc': {msg_count: 1} },
    upsert: true,
    new: true
  })

  var count = result.msg_count;
  var sequence_number = Math.floor(count/50);
  db.inbox.update({ owner: msg.to[i], sequence: sequence} ),
    { $push: {messages: msg} },
    { upsert:true });
}

db.inbox.find({owner: 'Peter'})
  .sort({sequence: -1})
  .limit(2);

The first part delivers the message to all it’s recipients. First it updates the message count for the recipient and then pushes the message to the recipients mailbox (which is a embedded document). The combination of the Shard key being {owner: 1, sequence: 1} means that all new messages get written to the same chunk for an owner. The Math.floor(count/50) generation will split up the inbox into buckets of 50 messages in each.

This last aspect means that the read will route by owner directly to a single chunk on a single Shard avoiding scatter/gather and speeding up retrieval.

Multiple Identities Example

What if we need to lookup documents by multiple different identities like a username or an email address.

Take the following document.

var db = db.getSisterDB('users');
db.users.insert({
  _id: 'peter',
  email: 'peter@example.com'
})

If we shard by _id it means that only _id queries will be routed directly to the right shard. If we wish to query by email we have to perform a scatter/gather query.

There is a possible solution called document per identity. Let’s look at a different way of representing the information.

var db = db.getSisterDB('users');

db.identities.ensureIndex({ identifier: 1 }, { unique: true });

db.identities.insert({
  identifier: {user: 'peter'}, id: 'peter'});

db.identities.insert({
  identifier: {email: 'peter@example.com', user: 'peter'},
  id: 'peter'});

db.shardCollection('users.identities', {identifier: 1});
db.users.ensureIndex({ _id: 1}, { unique: true });
db.shardCollection('users.users'. { _id: 1});

We create a unique index for the identities table to ensure we cannot map two entries into the same identity space. Since identifier is a compound index we can not actually query directly to a shard using this key. So it’s still a single read to retrieve a user by _id and now we can retrieve a user by it’s email by performing two direct queries using the correct identifier. Let’s see how to do this for an email user lookup.

var db = db.getSisterDB('users');

var identity = db.identities.findOne({
  identifier: {
    email: 'peter@example.com'}});

var user = db.users.find({ _id: identity.id });

The first query locates the identity using the email, which is a routed query to a single shard, and the second query uses the returned identitiy.id field to retrieve the user by the shard key.