There’s Always Money in the Banana Stand (MongoDB Oplog)

… make sure you don’t burn it down.

Anyone who’s ever done much work with databases has experienced the “oh shit” moment where data goes missing, or appears corrupt, and there’s really no obvious reason why. We, as a company, are morally opposed to vanishing data and have gotten quite good at finding out where it’s gone to.

Usually it’s easy to figure out what happened, since Mongo logs drop commands for databases and collections. If you have a Rails app backed by Mongo, and wake up to an empty database one morning, it’s worth asking who ran rake db:seed against a live database.

There are more insidious cases, though, where data is only partially missing, or a document you swear you looked at a few weeks ago has fewer fields than it used to. To MongoDB’s credit, this is almost never the fault of the DB engine; more often the cause is a developer error, a bug in a data library, or a race condition.

MongoDB’s oplog is one of the best tools for well-aimed finger pointing. The oplog is an operations log that’s primarily used for replication: inserts, updates, and deletes (and some commands) on the primary are written to a capped collection and used to “play forward” the data files on secondaries.
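
The oplog itself lives in the local database, in a capped collection named oplog.rs on replica set members. To peek at the most recent operation it has recorded, something like this works from the shell:

PRIMARY> use local
switched to db local
PRIMARY> db.oplog.rs.find().sort({$natural: -1}).limit(1)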

Figuring out what you’ve got

Since oplogs are capped collections, they forget what’s happened after a certain amount of time. We try to keep as large an oplog as possible, and for most databases there should be at least several days’ worth of logged write operations. You can check your oplog length from the MongoDB shell:

PRIMARY> db.printReplicationInfo()
configured oplog size:   2048MB
log length start to end: 8896166secs (2471.16hrs)
oplog first event time:  Sat Sep 28 2013 05:28:52 GMT+0000 (UTC)
oplog last event time:   Thu Jan 09 2014 04:38:18 GMT+0000 (UTC)
now:                     Thu Jan 09 2014 07:08:59 GMT+0000 (UTC)

This is the oplog for one of our internal databases, and it stores about 100 days’ worth of operations. It’s actually a log of the entire history of writes to the DB, which we can tell from the first entry, a replica set configuration no-op rather than an ordinary write that has since aged out:

PRIMARY> db.oplog.rs.find().sort({$natural: 1}).limit(1)[0]
{
  "ts" : Timestamp(1380346132, 1),
  "h" : NumberLong("-6895578725396474892"),
  "v" : 2,
  "op" : "n",
  "ns" : "",
  "o" : {
    "msg" : "Reconfig set",
    "version" : 2
  }
}

Changes for a specific document

The oplog is just a collection with a well-defined schema, so you can run queries against it like any other collection. If we know the _id of a document that’s suspiciously missing data, we can get a summary of the write operations that affected it with a simple aggregation command:

PRIMARY> db.oplog.rs.aggregate([{$match: {ns: "compose.projects", $or: [{"o._id": ObjectId("527da1f73664650015000000")},{"o2._id": ObjectId("527da1f73664650015000000")}]}}, {$group: {_id: "$op", count: {$sum: 1}}}])
{
  "result" : [
    {
      "_id" : "u",
      "count" : 557
    },
    {
      "_id" : "i",
      "count" : 1
    }
  ],
  "ok" : 1
}

Write operations are converted to idempotent operations before they’re written to the oplog, which means the o2._id field of an update will always exist, even if the original update’s conditions used entirely different fields. That’s a useful property for this kind of digging.
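
As an illustration (the collection fields and values here are made up), an update issued as db.projects.update({name: "banana stand"}, {$set: {on_fire: true}}) would land in the oplog keyed by the matched document’s _id, roughly like this:

// hypothetical oplog entry; field names and values invented for illustration
{
  "ts" : Timestamp(1389248298, 1),
  "h" : NumberLong("1234567890123456789"),
  "v" : 2,
  "op" : "u",
  "ns" : "compose.projects",
  "o2" : { "_id" : ObjectId("527da1f73664650015000000") },
  "o" : { "$set" : { "on_fire" : true } }
}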

Recognizing what went wonky

We’ve seen a few kinds of write patterns result in unexpected data for developers. Most are pretty simple to identify: there’s almost always a remove op (or a sequence of remove ops). Applications will also send whole-document updates (db.accounts.update({_id: 12345}, {hi: true})) when they should be using the atomic update operators ($set, $inc, etc.). MongoMapper did this for a long time, leading to many cross-eyed moments for first-time users.
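
If you suspect a stray remove, scanning the oplog for delete ops ("d") against the collection in question is a quick first check (reusing the compose.projects namespace from the earlier example):

PRIMARY> db.oplog.rs.find({ns: "compose.projects", op: "d"}).sort({$natural: -1}).limit(10)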

More nefarious, though, is the race condition, which frequently happens when a set of concurrent web or worker processes manipulate the same document. In a typical race, writes that happen in quick succession can leave the document in an unexpected state.

Here’s a contrived command sequence from a pretend financial app you should never use for anything ever (it seems like many Bitcoin apps are built with this level of care):

// worker 1 loads an account doc
account = db.accounts.findOne({_id: 12345})
{_id: 12345, balance: 100, status: "college"} // retrieves account with a $100 balance

// worker 2 finds same id because the value hasn't changed yet
account = db.accounts.findOne({_id: 12345, balance: {$gte: 100}})
{_id: 12345, balance: 100, status: "college"} // retrieves the same doc

// worker 1 deposits $1mm, updates balance and status
db.accounts.update({_id: account._id}, {$set: {status: "yolo", balance: 1000100}})

// worker 2 withdraws $100, unsets balance and status
db.accounts.update({_id: account._id}, {$unset: {status: 1, balance: 1}})

Suddenly, the account document is inconsistent, and we’ve just caused a big pile of money to vanish (not to mention the short-lived status label). This is really difficult to debug in most cases, but it’s pretty obvious in the oplog.
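
Pulling the last few operations for that _id shows the back-to-back $set and $unset with near-identical timestamps. Assuming the accounts collection lives in a database called bank (a made-up name for this example):

PRIMARY> db.oplog.rs.find({ns: "bank.accounts", "o2._id": 12345}).sort({$natural: -1}).limit(5)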

Don’t burn it down

The oplog is only useful for this kind of analysis if it spans “the right” amount of time. This is something we monitor closely on our DBs (for a variety of reasons), and you should too. Keep as much around as you can.
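
db.printReplicationInfo() is fine for eyeballing the window, and the shell’s db.getReplicationInfo() returns the same numbers as a document, which makes it easy to script a check. A minimal sketch, using an arbitrary three-day threshold:

// warn if the oplog window drops below three days (72 hours is an arbitrary threshold)
var info = db.getReplicationInfo()
if (info.timeDiffHours < 72) {
  print("oplog only spans " + info.timeDiffHours + " hours, time to grow it")
}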