MongoDB 2.8 – New WiredTiger Storage Engine Adds Compression

CAVEAT: This post deals with a development version of MongoDB and represents very early testing. The version used was not even a release candidate – 2.7.9-pre to be specific.  Therefore any and all details may change significantly before the release of 2.8, so please be aware that nothing is yet finalized and, as always, read the release notes once 2.8.0 ships.

Update (Nov 17th, 2014): Good news! I have re-tested with a patched version of 2.8.0-rc0 and the results are very encouraging compared to figure 2 below.  For full details (including an updated graph), see MongoDB 2.8: Improving WiredTiger Performance.

Anyone who follows the keynotes from recent MongoDB events will know that we have demonstrated the concurrency performance improvements coming in version 2.8 several times now.  This is certainly the headline performance improvement for MongoDB 2.8: the concurrency constraints in prior versions led to complex database/collection layouts, complicated deployments, and other workarounds for the per-database locking limitations.

However, the new WiredTiger storage engine announced at MongoDB London also adds another long-requested capability with a performance component: compression.

Eliot also gave a talk about the new storage engines at MongoDB London last week, after announcing the availability of WiredTiger in the keynote.  Prior to that, we had been discussing how best to structure that talk, and I suggested showing the effects and benefits of compression. Unfortunately there wasn’t enough time to put something meaningful together on the day, but the idea stuck with me, and I have put that information together for this blog post instead.

It’s not a short post, and it has graphs, so I’ve put the rest after the jump.

Some Explanation Required

(If you already know why you want compression and just want to see the testing figures, click here to skip ahead)

I’ll get to the figures and graphs showing how the new compression options perform versus the original implementation below, but first a bit of preamble is warranted to explain just why this is seen as a must-have feature for so many people. After all, storage bits are cheap these days, and getting cheaper – you can buy a couple of terabytes of disk for under $100. So why do people care so much about compressing data and saving bits?

As it turns out, the question is more subtle than it appears at first.  Compression does not just save you disk space; it also saves you IO bandwidth, and IO bandwidth can be extremely hard to scale up.  Additionally, it is often far easier to scale CPU capacity than IO bandwidth.  This is particularly true in the operational database space, where CPU is rarely your bottleneck, whereas IO capacity is something you regularly need to budget carefully.

Hence, if you could trade CPU capacity for IO capacity, you could potentially have a far more cost-effective and vertically scalable deployment.  That is essentially what compression gets you – you trade CPU cycles (compression) to reduce the amount of data you have to send to disk (and read from disk, for that matter).  There is generally a small performance hit in terms of latency, but this is usually negligible compared to the impact of saturated IO.

Therefore, prior to the addition of compression as a storage engine capability with WiredTiger, when a MongoDB user hit this kind of scenario, where limited IO bandwidth meant performance problems, their only solution was to get more IO bandwidth, either by increasing IO capacity or by adding shards for horizontal scaling, both of which add cost and/or complexity to the system.

Now, that same user can choose to compress data, spending CPU cycles instead, and they have some options for how much compression they wish to apply.

On to the Testing

With all that explained, on to the testing and the results.  For anyone not aware, WiredTiger as part of MongoDB 2.8 currently gives you three options in terms of compression, in ascending order of compression applied:

  • none – no compression applied to collection data
  • snappy – the default; lightweight compression with relatively low resource usage
  • zlib – heavier compression, at a higher resource cost

The first question to ask is: how much compression do I get?

To answer that question, we first need a set of data to test with.  This gist shows how I generated just under 16GiB (plus indexes) of data for each storage engine configuration.  I tested that data set with the following MongoDB configurations (a sketch of the corresponding config file settings follows the list):

  1. Engine: mmap v1 (default in 2.6), Compression: None (YAML Config)
  2. Engine: WiredTiger, Compression: None (YAML Config)
  3. Engine: WiredTiger, Compression: snappy (YAML Config)
  4. Engine: WiredTiger, Compression: zlib (YAML Config)

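For reference, the engine and compression choices above are made in the mongod YAML configuration file. The snippet below is a rough sketch of what configuration 3 (WiredTiger with snappy) looks like; the option names shown here are the ones that ultimately shipped, so the exact spelling in the 2.7.x pre-release builds (and in the linked gists) may differ.

```yaml
# Rough sketch of configuration 3 above: WiredTiger with snappy compression.
# Swap blockCompressor to "zlib" or "none" for configurations 4 and 2, and
# drop the engine/wiredTiger settings entirely for the mmapv1 default.
storage:
  dbPath: /data/db            # illustrative path
  engine: wiredTiger
  wiredTiger:
    collectionConfig:
      blockCompressor: snappy
```
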
The test was simple: start up the instance, insert the data, shut down the instance, and check the disk usage.  A rough sketch of the insert step follows, and the results in terms of disk space used are shown in figure 1 below.
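For readers who don't want to open the gist, here is a minimal sketch of the insert step using pymongo. The database and collection names, document shape, and loop counts are illustrative rather than the exact contents of the gist; tune them to reach roughly the ~16GiB target.

```python
from random import random

from pymongo import MongoClient  # assumes pymongo is installed

client = MongoClient("localhost", 27017)
coll = client["compression_test"]["data"]  # hypothetical db/collection names

# Insert batches of simple documents until the data set reaches roughly the
# target size. The payload repeats a short random string, so the compressors
# have something to work with.
for batch in range(20000):
    docs = [
        {"batch": batch, "n": n, "payload": str(random()) * 20}
        for n in range(1000)
    ]
    coll.insert_many(docs)
```
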

Fig 1: Disk usage for several MongoDB engine/compression combinations

I think the figures show that snappy was a decent choice for the WiredTiger default in MongoDB, given that it is generally considered to be significantly lighter in terms of resource usage versus zlib.  I’ll leave it to others to prove that out, but there are plenty of posts outlining the relative cost and performance of these two (and other) compression methods.

What the numbers certainly show is a significant quality of life improvement for those MongoDB users out there who are running into issues with disk space utilization and looking to be more efficient about how they persist data to disk.

Performance Impact

The next question is: how does this help my performance?

That’s really going to vary, a lot, depending on your use case.  If you are not constrained by IO bandwidth, then it likely won’t make much difference.  However, if IO is a constraint on your system, then we should be able to get a very basic approximation of the impact by simply seeing how long it takes to fetch that ~15.5GiB off disk and into memory.  To accomplish this, I did the following:

  • Placed each data set created above on a relatively slow, USB-attached portable drive
  • Started MongoDB with the same options as above
  • Dropped caches (just in case)
  • Read all of the data in the collection into memory (using an explain; a rough sketch of this step follows the list)
  • Recorded the time taken to complete the operation
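Here is a rough sketch of the read step, again using pymongo for illustration. The actual run used an explain to force the full read; in this sketch the cursor is simply iterated instead, which likewise forces every document to be pulled off disk. The names are hypothetical, and dropping the OS page cache is an OS-level step outside the script.

```python
import time

from pymongo import MongoClient  # assumes pymongo is installed

client = MongoClient("localhost", 27017)
coll = client["compression_test"]["data"]  # hypothetical db/collection names

# Before each run: restart mongod against the relevant data set and drop the
# OS page cache, e.g. on Linux (as root): sync && echo 3 > /proc/sys/vm/drop_caches

start = time.time()
count = 0
for _ in coll.find():  # walk the whole collection, forcing it to be read from disk
    count += 1
elapsed = time.time() - start

print("read %d documents in %.1f seconds" % (count, elapsed))
```
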

Figure 2 below shows the results of that testing:

Fig 2: Time taken to load the data set into memory

That’s somewhat surprising – there is an improvement between the different WiredTiger configurations themselves (expected), but not necessarily any overall improvement versus the old mmap engine (though zlib does in fact beat it).  However, upon reflection, this is probably not the best test case, even for an approximation: a sequential read from disk is, after all, exactly what spinning disks are good at, and even the external USB drive I used was able to outpace the benefits of compression, so I will need to come up with some better testing to reach a real conclusion.  A sequential scan is also not the most common access pattern for MongoDB data (random access is), but I’ll need to do some work to get a nice representation of that from my test data.  Once I have that done, I will update the post and add the numbers.