MongoDB Data Management

A MongoDB fitness plan to avoid death-by-disk-overflow

We all know the scenario: after your initial deployment you use your database without incident for a few months, and then you notice that your data has grown, and it’s huge! Now you need to delete some data. How do you do that safely in MongoDB? Here are a few tips and tricks.

There are different operations for getting rid of data from your collections. The first is pretty simple: `db.foo.drop()`. This drops a collection entirely, gets rid of all its indexes, and is very fast. It happens in one atomic operation and replicates to the secondaries as a single database command. Compare this to `db.foo.remove()`, which is dangerous in that, rather than just removing a reference to the collection, it deletes each document one by one. This operation will hurt your performance and hold write locks for long periods of time. Not only that, but the cost is the same on every member of your replica set: the command to remove each document is sent through the oplog to all members.
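To make the contrast concrete, here is what the two operations look like in the shell, using the same placeholder collection `foo`:

```js
// Drops the whole collection and its indexes in a single, fast command;
// only that one command is replicated to the secondaries.
db.foo.drop()

// Deletes every document one by one, holding write locks throughout;
// each individual delete is replayed via the oplog on every member.
db.foo.remove()
```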

`db.foo.remove({name:"nick"})` is a more selective operation, and you can apply a criteria to your deletes. This sort of strategy works ok for small collections, but as your data grows, then this has the same inherent risks as a `db.foo.remove()`. It will hold a write lock very aggressively and hurt performance on other operations against your database.

So what are my options?

Capped Collections

Deleting data can be painful, but MongoDB comes with a couple of useful tools that can help. Capped collections are collections with a fixed size; check out the documentation at http://docs.mongodb.org/manual/core/capped-collections/. Think of one as a circular buffer: you specify how big you want the collection to be, and you can keep inserting. When the collection is full, new data overwrites the oldest data. It’s a great tool, but there are restrictions: you can’t manually delete data from a capped collection, and you can’t shard one. Capped collections are great for use cases like storing logs.
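Creating one is a single command. For example, a capped collection named `log` (a hypothetical name) of roughly 100 MB:

```js
// Create a ~100 MB capped collection; when it fills up, the oldest
// documents are overwritten automatically as new ones arrive.
db.createCollection("log", {capped: true, size: 100 * 1024 * 1024})
```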

TTL indexes

MongoDB introduced a new feature in version 2.2 called TTL collections, or TTL indexes. This feature lets you tell MongoDB to expire data after a certain amount of time, or at a specific time. Have a look at the documentation at http://docs.mongodb.org/manual/tutorial/expire-data/. A background process within mongod watches the indexed collection and periodically removes expired documents.
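For example, assuming a hypothetical `events` collection with a `createdAt` field, the 2.2-era shell syntax to expire documents one day after creation looks like this:

```js
// A background thread in mongod periodically deletes documents whose
// createdAt value is more than 86,400 seconds (one day) in the past.
db.events.ensureIndex({createdAt: 1}, {expireAfterSeconds: 86400})
```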

Put your data in different collections

If you’re storing data in mongo that has a strong correlation to time, and you plan on deleting old data frequently, consider creating a new collection for each day’s worth of data rather than keeping one large collection. We have already discussed the difference between remove and drop: rather than running a painful remove against a large collection to get rid of last week’s data, you can simply drop the old collections and move on (see the sketch below). The downside is that your application needs to be able to find data across multiple collections, so there is a tradeoff. Also note that mongo has a limit on the number of collections in a database (strictly speaking, the limit is on namespaces: there is a default limit of 24k namespaces, which covers both indexes and collections). If you are creating thousands of collections, this probably isn’t the route for you.
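Here is a minimal sketch of the pattern, assuming hypothetical per-day collection names like `logs_20130115`:

```js
// Write today's data to a per-day collection (hypothetical naming scheme)...
var today = new Date();
var suffix = today.toISOString().slice(0, 10).replace(/-/g, ""); // e.g. "20130115"
db.getCollection("logs_" + suffix).insert({message: "hello", createdAt: today});

// ...and retire old data with a cheap drop instead of a painful remove.
var lastWeek = new Date(Date.now() - 7 * 24 * 3600 * 1000);
var oldSuffix = lastWeek.toISOString().slice(0, 10).replace(/-/g, "");
db.getCollection("logs_" + oldSuffix).drop();
```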

Be kind to your disks, usePowerOf2Sizes

`usePowerOf2Sizes` is the hand-grenade of disk reuse in MongoDB. With it enabled, every document is allocated disk space equal to the next power of 2 in bytes. Because records are created with predictable sizes, the space freed by deleted documents can be reused by new ones.

For instance, have a 78 KB document? The space allocated for it is 2^17 bytes (131,072 bytes), while 78 KB is only 79,872 bytes. We call it a hand-grenade because, while it lets you reuse space, it also has the potential to create a lot of unused space: in this example, 39% of the space allocated is never touched by the document.
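Enabling it on an existing collection is done with the `collMod` command (note the flag is spelled `usePowerOf2Sizes`):

```js
// From now on, record allocations in foo are rounded up to the next
// power of 2, so space freed by deletes is much easier to reuse.
db.runCommand({collMod: "foo", usePowerOf2Sizes: true})
```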

Summary

Whichever route you take to manage your data, make sure you understand your data and the constraints your database imposes. Test your operations on a test database before touching a production system, and try your data management strategy on small collections before you accidentally lock your production database by removing a few million documents.

  • mw

    good summary. thanks

  • Stemlaur

    What would you suggest for a database whose files keep growing because of massive deletes, and that reaches DEATH-BY-DISK-OVERFLOW every two weeks?

    - daily `db.repair`?
    - daily `compact`?

    • Chris

      Every two weeks, huh? I would try to do something with usePowerOf2Sizes. My gut says to have a disk that is about 3x the dataSize due to the extra padding.