Bulk updates for all

TL;DR MongoDB’s new Bulk API is remarkably consistent and fluent across the different languages’ drivers, giving everyone the chance to speed up those larger operations.

The new Bulk Update API for MongoDB is opening the way for performance improvements when using the latest MongoDB Shell or drivers. Bulk operations cut down the to-and-fro between client and database, allowing the database to consume commands much faster. Of course, anything that makes your MongoDB-based application quicker is something that we at MongoHQ want to encourage.

What is especially notable about this new functionality is that the MongoDB team took the opportunity to introduce it as a fluent API. Fluent APIs are designed to be more readable. Rather than having functions which take a list of parameters, a function is broken down into smaller functions that take one parameter, chained together. For example, a fictional function that takes a name, threshold, and new value that traditionally would have read:

    updateNameFunction(name,threshold,value)

now expands out to:

    updateName(name).overThreshold(threshold).with(value)

Fluent APIs nudge developers into writing more readable code that can withstand future API changes better. More readable code and less fragile code … both good things in our book.
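
To make the chaining idea concrete, here is a minimal sketch of how the fictional call above could be built in JavaScript. Every name in it (updateName, overThreshold, with, and the sample values) is hypothetical and has nothing to do with MongoDB's own API; it only shows each step taking one parameter and returning the next link in the chain.

```javascript
// Hypothetical fluent builder for the fictional updateName example above.
// None of these names belong to MongoDB; they only illustrate the chaining style.
function updateName(name) {
  var request = { name: name };
  return {
    overThreshold: function (threshold) {
      request.threshold = threshold;
      return {
        // "with" is a reserved word, but ES5 and later allow it as a property name.
        with: function (value) {
          request.value = value;
          return request;
        }
      };
    }
  };
}

// Each call reads like a phrase and supplies exactly one parameter.
var request = updateName("temperature").overThreshold(30).with(25);
console.log(JSON.stringify(request));
```

Because each step only exposes the methods that make sense next, the chain also guides the caller towards a valid sequence of calls.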

For this post, we will start our tour of the various implementations of the Bulk API with the one everyone has easy access to: the Mongo Shell.

Bulk Shell

To start a bulk operation you need to get a BulkOp from the collection the operation applies to. These come in both ordered and unordered forms.

Ordered Bulk operations

Ordered bulk operations are stepped through in order (thus the name), halting when there’s an error.

    var bulkop=db.collection.initializeOrderedBulkOp() 

Unordered bulk operations

Unordered bulk operations are executed in no particular order (potentially in parallel) and these operations do not stop when an error occurs.

    var bulkop=db.collection.initializeUnorderedBulkOp()

The returned Bulk operation builder can have many commands added to it, but they will only be run when the .execute() method is invoked on it. Once executed, you will not be able to re-execute the same Bulk Op and will have to create a new one – this is particularly relevant when recovering from an error of some sort.
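
Since an executed builder cannot be run again, one way to prepare for retries is to keep the queuing of commands in a function and create a fresh builder on every attempt. The sketch below assumes a shell-style, synchronous execute(). The runBulk helper, the addComets callback and the stand-in fakeCollection are all hypothetical, and are there only so the example runs without a database.

```javascript
// Build a brand-new bulk operation on every call, because an executed
// builder cannot be executed again.
// "collection" is anything with initializeOrderedBulkOp(); "addOperations"
// is a hypothetical callback that queues the commands.
function runBulk(collection, addOperations) {
  var bulkop = collection.initializeOrderedBulkOp();
  addOperations(bulkop);
  return bulkop.execute();
}

// Stand-in collection so the sketch runs without a database (an assumption
// for illustration only). It mimics the one-shot behaviour described above.
function fakeCollection() {
  return {
    initializeOrderedBulkOp: function () {
      var queued = [], executed = false;
      return {
        insert: function (doc) { queued.push(doc); },
        execute: function () {
          if (executed) { throw new Error("bulk operation already executed"); }
          executed = true;
          return { nInserted: queued.length };
        }
      };
    }
  };
}

// The commands live in one place, so they can be queued again for a retry.
function addComets(bulkop) {
  bulkop.insert({ lastname: "Halley", firstname: "Edmond" });
  bulkop.insert({ lastname: "Bopp", firstname: "Hale" });
}

var coll = fakeCollection();
var first = runBulk(coll, addComets);
var second = runBulk(coll, addComets);   // a retry gets a fresh builder
console.log(first.nInserted, second.nInserted);
```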

Commands

Now we can start adding database commands to our bulk operation. First up is the insert, which is the simplest of all the commands:

Inserting with bulk operations

Unlike the standalone insert command, this takes just one document for insertion rather than offering the option of passing it an array – with the Bulk API you just keep adding insert operations to get the same effect.

    bulkop.insert({ lastname:"Haley", firstname:"Bill" });
    bulkop.insert({ lastname:"Halley", firstname:"Edmond" });
    bulkop.insert({ lastname:"Bopp", firstname:"Hale"});

All other operations on the database start the same way, with a .find() function. Whether you are updating, removing or replacing one or many documents, it all starts with a find.

Find with bulk operations

The find operation takes one parameter, a query which can return any number of documents. The next command in the chain sets the scope of the operation, not the find part. The first set of commands work on one document: .updateOne(), .replaceOne() and .removeOne(). They will operate on the first document found to match the .find() query and, respectively, update the document using the given parameter, replace it with another document specified in the parameter, and finally, just remove the document, with no parameter needed.

    bulkop.find({ lastname:"Haley" }).updateOne( { $set: { year:1925 } } );
    bulkop.find({ firstname:"Edmond"}).replaceOne({ lastname:"Blackadder", firstname:"Edmund", year:1455 });
    bulkop.find({ lastname:"Bopp" }).removeOne();

The .updateOne() and .removeOne() commands also have versions, .update() and .remove(), that work on multiple documents and are the equivalent of specifying “{ multi:true }” in the options on a non-bulk operation.

    bulkop.find({ lastname:"Blackadder" }).remove( );
    bulkop.find({ lastname:"Haley" }).update( {$set: { comet:false } } );

Upsert with bulk operations

There’s one other command you’ll want to know about, and that’s the .upsert() command. It goes into the chain of commands after the find and before an update or a replace. Important to note: this command overrides the default of not upserting, so the operation behaves as if { upsert: true } were set in the options. It therefore creates a document based on the query and update/replace parameters if no document is found by the query.

The details of how the new document is created are explained on the MongoDB Bulk upsert manual page. The command looks like this:

    bulkop.find({ lastname:"Halle" }).upsert().update( { $set: { firstname:"Berry" }});

Execute and results

Once we’ve prepared our bulk operation, we can execute it. But, before we do, it is worth having a look at the bulkop itself. If we ran all the preceding commands, in the order they appear, the output would look like this:

> bulkop
{
 "nInsertOps": 3,
 "nUpdateOps": 4,
 "nRemoveOps": 2,
 "nBatches": 4
}

These stats are specific to the shell’s Bulk operation implementation; despite this, they are informative.

So, what are the numbers telling us?

The numbers tell us our operation has three inserts, four update operations (the replace and the upsert are counted as updates) and two removes, and that they have been split up into four batches. The shell batching mechanism was covered in an earlier part of this series, but for this example, the three insert statements form one batch, the next update and replace form another batch, the two consecutive removes make another batch, and finally the last update and the upsert share the final batch. With nBatches at 4, it means four round trips to the database when this bulk operation is executed. Important to note: an unordered bulk operation containing all three kinds of operation should always come out at three nBatches, because its operations are grouped by type rather than by position.
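
The batch arithmetic can be checked with a small, simplified model. This is only a sketch of the grouping rule described here, not the shell's actual implementation: it treats the replace and the upsert as updates (matching the nUpdateOps count above), starts a new ordered batch whenever the operation type changes, and ignores any size limits a real batch may have. The function names are hypothetical.

```javascript
// Simplified model of the batching described above; it ignores any
// size limits the shell may also apply to a batch.
function countOrderedBatches(opTypes) {
  var batches = 0, previous = null;
  opTypes.forEach(function (type) {
    // A change of operation type closes the current batch and opens a new one.
    if (type !== previous) { batches += 1; previous = type; }
  });
  return batches;
}

// Unordered operations are grouped by type, so count the distinct types.
function countUnorderedBatches(opTypes) {
  var seen = {};
  opTypes.forEach(function (type) { seen[type] = true; });
  return Object.keys(seen).length;
}

// The nine commands from this post, in the order they were added.
var ops = ["insert", "insert", "insert", "update", "update",
           "remove", "remove", "update", "update"];
console.log(countOrderedBatches(ops));   // 4
console.log(countUnorderedBatches(ops)); // 3
```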

We are now ready to execute our bulk operation. The output, in the Mongo shell, looks like this:

> results=bulkop.execute();
BulkWriteResult({
  "writeErrors": [ ],
  "writeConcernErrors": [ ],
  "nInserted": 3,
  "nUpserted": 1,
  "nMatched": 3,
  "nModified": 3,
  "nRemoved": 2,
  "upserted": [
    {
      "index": 8,
      "_id": ObjectId("536258ad3649b3e9dc49c313")
    }
  ]
})

What we get back is a result full of information. Let’s start with the statistics:

nInserted & nUpserted

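nInserted tells us how many new documents were created by insert operations, and nUpserted tells us how many more were created by an update or replace operation with upsert() whose query matched nothing. In the output above, that is three inserts plus one upsert.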

nMatched

nMatched gives us the total number of existing documents that matched the find() clauses of the update and replace operations. Here it is 3 even though two documents were also removed, so removals are not part of this count.

nModified & nRemoved

If documents were changed or removed, counts are available in nModified and nRemoved.

upserted

Beyond the statistics, though, there’s some potentially useful information available, such as the upserted array, which lists each upserted document’s given _id and an index number which could point at the particular operation in the Bulk op that caused the upsert to occur.

We say ‘could’ above because if you look at the contents of bulkop.getOperations() in the shell, you’ll find that those operations have already been batched up and only the index of the first element of the batch is preserved. Therefore, you’ll either want to write code to walk the operations table (which will depend on driver implementation) or adopt a strategy which doesn’t involve getting the id value from the upsert operation.

The shell, of course, is hardly the place where you’d be writing code with error recovery strategies, but as we’ll see, other drivers are no better on mapping these indexes to the bulk operations.

More on the errors array

The writeErrors array is made up of documents which list, for each error, the index number of the operation (index), an identifying code for the error (code), a readable error message (errmsg), and a document (op) which contains the values that made up the operation.

For an ordered bulk operation, there will only be one document here, by design, for as soon as an error occurs, the ordered bulk operation stops processing. For unordered bulk operations, there could be many error documents present as it continues to process after errors occur.
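
As a rough illustration of working with these error documents, the sketch below turns each one into a readable line. It relies only on the field names described above (index, code, errmsg, op). The summarizeWriteErrors helper is hypothetical and the sample document is made up for the example rather than captured from a server.

```javascript
// Turn the error documents described above into one line each.
// Field names (index, code, errmsg, op) follow the description in the text.
function summarizeWriteErrors(writeErrors) {
  return writeErrors.map(function (e) {
    return "op #" + e.index + " failed with code " + e.code + ": " + e.errmsg;
  });
}

// Hypothetical sample data, not output captured from a real server.
var sampleErrors = [
  { index: 1, code: 11000, errmsg: "duplicate key error",
    op: { _id: 1, lastname: "Halley" } }
];

var lines = summarizeWriteErrors(sampleErrors);
lines.forEach(function (line) { console.log(line); });
```

With an ordered bulk operation the array passed in would hold a single document; with an unordered one the same function simply produces more lines.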

Finally, there’s writeConcernErrors, which details any write concern issues which have occurred but which didn’t stop the processing of the bulk update.

Other drivers

We’ve now covered the MongoDB shell’s implementation of the Bulk API, but if you are writing code which needs to have some level of error recovery, then you aren’t going to be relying on the shell. So, let’s take a quick look at the other drivers available and see how they compare to the baseline that the MongoDB shell provides.

Node.js

We start with the official Node.js driver. You’ll need version 1.4.2 or later of the driver to get Bulk API support. The Node.js driver is, syntactically, very similar to the shell, with the exception that the execute function takes a callback to return results. What’s interesting is that the driver is implemented differently from the shell and, as such, currently offers no equivalent of .getOperations() and apparently no visibility at all of the operations queued up.

Node.js example

var bulkop=collection.initializeOrderedBulkOp();
bulkop.insert({ lastname:"Haley", firstname:"Bill" });
bulkop.insert({ lastname:"Halley", firstname:"Edmond" });
bulkop.insert({ lastname:"Bopp", firstname:"Hale" });
bulkop.find({ lastname:"Haley" }).updateOne( { $set: { year:1925 } } );
bulkop.find({ firstname:"Edmond" }).replaceOne({ lastname:"Blackadder", firstname:"Edmund", year:1455 });
bulkop.find({ lastname:"Bopp" }).removeOne();
bulkop.find({ lastname:"Blackadder" }).remove( );
bulkop.find({ lastname:"Haley" }).update( { $set: { comet:false } } );
bulkop.find({ lastname:"Halle" }).upsert().update( { $set: { firstname:"Berry" }});
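// The result arrives in the callback, so execute's return value is not assigned
bulkop.execute(function(err,result) {
    if (err) { return console.log(err); }
    console.log(JSON.stringify(result));
});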

Ruby

The Ruby Bulk API was introduced with version 1.10 of the native driver. Beyond the obvious difference that the API commands are underscored rather than camel-cased, so that initializeOrderedBulkOp() becomes initialize_ordered_bulk_op(), it is worth noting that the errors from the execute method are raised as an exception rather than as a result. Before execute is called, calling the inspect method on the BulkWriteCollectionView (http://api.mongodb.org/ruby/current/Mongo/BulkWriteCollectionView.html) will return details of all the queued up operations. After execution, that will be reset and contain the bulk operation statistics.

Ruby example

collection = db['updatetest']
begin
    bulkop=collection.initialize_ordered_bulk_op();

    bulkop.insert({  :lastname  => "Haley",  :firstname  =>"Bill" });
    bulkop.insert({  :lastname  => "Halley",  :firstname  => "Edmond" });
    bulkop.insert({  :lastname  => "Bopp", :firstname   => "Hale" });
    bulkop.find({  :lastname  => "Haley" }).update_one( { "$set" => { :year => 1925 } } );
    bulkop.find({  :firstname  => "Edmond" }).replace_one({  :lastname  => "Blackadder",  :firstname  => "Edmund", :year => 1455 });
    bulkop.find({  :lastname  => "Bopp" }).remove_one;
    bulkop.find({  :lastname  => "Blackadder" }).remove;
    bulkop.find({  :lastname  => "Haley" }).update( { "$set" => { :comet => false } } );
    bulkop.find({ :lastname => "Halle" }).upsert().update( { "$set" => { :firstname => "Berry" }});
    result=bulkop.execute;
    puts result.inspect;
rescue Mongo::BulkWriteError => bwe
    puts bwe.result;
end
mongo_client.close;

Python

The Python driver saw the Bulk API added in version 2.7. Like Ruby, method calls are down-cased with underscores in Python. The execute method returns its results in a format very much as described in the shell and throws an exception when there is an error. Behind the scenes a BulkOperationBuilder class is used to compose the operation. For further examples consult the driver’s tutorial.

Python example

from pprint import pprint
from pymongo.errors import BulkWriteError

collection=db['updatetest']
bulkop=collection.initialize_ordered_bulk_op()
bulkop.insert({ 'lastname':'Haley', 'firstname':'Bill' })
bulkop.insert({ 'lastname':'Halley', 'firstname':'Edmond' })
bulkop.insert({ 'lastname':'Bopp', 'firstname':'Hale' })
bulkop.find({ 'lastname':'Haley' }).update_one( { '$set': { 'year':1925 } } )
bulkop.find({ 'firstname':'Edmond' }).replace_one({ 'lastname':'Blackadder', 'firstname':'Edmund', 'year':1455 })
bulkop.find({ 'lastname':'Bopp' }).remove_one()
bulkop.find({ 'lastname':'Blackadder' }).remove( )
bulkop.find({ 'lastname':'Haley' }).update( { '$set': { 'comet':False } } )
bulkop.find({ 'lastname':'Halle' }).upsert().update( { '$set':{ 'firstname':'Berry' }})
try:
    result=bulkop.execute()
    pprint(result)
except BulkWriteError as bre:
    # details holds the same statistics and writeErrors as a normal result
    pprint(bre.details)

Java

Last but not least, we highlight the Bulk API implementation of the Java driver. The Bulk API was implemented in version 2.12 and, as you will see, there are no obvious differences from the other implementations. However, the fluency of the API is somewhat drowned out by the need to create BasicDBObjects when creating queries and defining updates.

Java Example

DBCollection collection=db.getCollection("updatetest");
BulkWriteOperation bulkop=collection.initializeOrderedBulkOperation();
bulkop.insert(new BasicDBObject("lastname","Haley").append("firstname","Bill" ));
bulkop.insert(new BasicDBObject("lastname","Halley").append("firstname","Edmond" ));
bulkop.insert(new BasicDBObject("lastname","Bopp").append("firstname","Hale" ));
bulkop.find(new BasicDBObject("lastname","Haley")).updateOne(new BasicDBObject("$set",new BasicDBObject("year",1925)));
bulkop.find(new BasicDBObject("firstname","Edmond")).replaceOne(new BasicDBObject("lastname","Blackadder").append("firstname","Edmund").append("year",1455));
bulkop.find(new BasicDBObject("lastname","Bopp")).removeOne();
bulkop.find(new BasicDBObject("lastname","Blackadder")).remove();
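bulkop.find(new BasicDBObject("lastname","Haley")).update(new BasicDBObject("$set",new BasicDBObject("comet",false)));
bulkop.find(new BasicDBObject("lastname","Halle")).upsert().update(new BasicDBObject("$set",new BasicDBObject("firstname","Berry")));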

try {
    BulkWriteResult result=bulkop.execute();
    System.out.println(result);
} catch (MongoException me) {
    System.out.println(me);
}       

Wrapping up

What is most interesting in this new functionality is how MongoDB has implemented one common, fluent API across all of the MongoDB drivers. Apart from some language-centric casing and variations in how the results and errors are handled, the consistency of the API implementations is remarkably high.

From what we’ve found, the only major variation is around the returned index numbers for operation errors and upserts which can’t easily or consistently be used to look at the bulk operation as issued. If you want to write code that will do partial error recovery, you will probably have to keep your own index number tally when operations are added.
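
One way to keep such a tally is to wrap the bulk builder and record a label for every operation as it is added, so that an index reported in writeErrors or upserted can be mapped back to the call that produced it. The sketch below is only an illustration under stated assumptions: trackedBulk, its label arguments and describe() are hypothetical helpers, each find() is assumed to be completed by exactly one operation, and the stand-in fakeBulkop exists purely so the example runs without a database.

```javascript
// Wrap a bulk builder and remember what was queued at each position.
// trackedBulk and its "label" arguments are hypothetical helpers.
function trackedBulk(bulkop) {
  var tally = [];
  return {
    insert: function (doc, label) {
      tally.push(label || "insert");
      bulkop.insert(doc);
    },
    find: function (query, label) {
      // Assumes each find() is completed by exactly one operation.
      tally.push(label || "find " + JSON.stringify(query));
      return bulkop.find(query);
    },
    execute: function () { return bulkop.execute(); },
    describe: function (index) { return tally[index]; }
  };
}

// Stand-in builder so the sketch runs without a database (illustration only).
var fakeBulkop = {
  insert: function () {},
  find: function () {
    var chain = {
      upsert: function () { return chain; },
      update: function () {},
      updateOne: function () {}
    };
    return chain;
  },
  execute: function () { return { upserted: [ { index: 2 } ] }; }
};

var bulk = trackedBulk(fakeBulkop);
bulk.insert({ lastname: "Haley", firstname: "Bill" }, "insert Bill Haley");
bulk.find({ lastname: "Haley" }, "set year on Haley").updateOne({ $set: { year: 1925 } });
bulk.find({ lastname: "Halle" }, "upsert Halle").upsert().update({ $set: { firstname: "Berry" } });

var result = bulk.execute();
result.upserted.forEach(function (u) {
  console.log("upsert came from: " + bulk.describe(u.index));
});
```

The tally lives entirely on the client side, so it does not depend on how a particular driver batches or exposes its queued operations.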

Remember not to mix different types of operations too liberally anyway (see a previous article in this series for more on that) and keep your bulk operations fairly homogeneous.

If you do all that, you will see a huge performance boost for your bulkier updates. Hopefully, this detailed look has increased your MongoDB skills and given you good insight into this powerful new functionality.

Written by Dj Walker-Morgan

Content Curator at MongoHQ, Dj has been both a developer and writer since Apples came in ][ flavors and Commodores had Pets.