From JSON to BSON and back

Although MongoDB is known as a JSON document database, under the hood those documents are actually stored as BSON, a binary variant of JSON. In this article we’ll look at why BSON is used and how the differences between BSON and JSON can become visible to a MongoDB user.

Understanding JSON is key to understanding why BSON exists. JSON was discovered by Douglas Crockford at the turn of the century as a way of easily moving data around. The JSON format was derived from JavaScript’s syntax and gave a simple way to move data between clients and servers, especially compared to the “popular” format of the time, XML. By extracting from JavaScript the curly and square brackets that define objects and arrays, along with the quoted strings and undecorated numbers, Crockford showed how you could have a simple serialization format.

So simple, you could de-serialize a JSON document by running it through JavaScript’s eval function. Never do that though as it lets bad guys potentially add JavaScript code to your JSON data and you end up having to filter all incoming JSON. If you don’t want to think about the can of worms that opens up, use a JSON library to safely, and more efficiently, parse JSON data and sidestep the entire problem.
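As a minimal sketch of that advice, here is the library route using the JSON.parse function built into modern JavaScript engines. The parseJson wrapper and the sample strings are made up for illustration; the point is only that the parser rejects anything that isn't JSON rather than running it.

```javascript
// Text as it might arrive from an untrusted client (both strings are invented).
const safeText = '{"name": "My Big Document", "pages": 42}';
const hostileText = '(function () { globalThis.hacked = true; return {}; })()';

// JSON.parse only accepts JSON syntax; it never executes the text as code.
function parseJson(text) {
  try {
    return { ok: true, value: JSON.parse(text) };
  } catch (err) {
    return { ok: false, error: err.message };
  }
}

const good = parseJson(safeText);
const bad = parseJson(hostileText);

console.log(good.value.name);          // My Big Document
console.log(bad.ok);                   // false
console.log(typeof globalThis.hacked); // undefined
```

Passing hostileText to eval would have run the function; the parser simply refuses it with a syntax error.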

Anyway, JSON was designed to be generated and decoded serially, and it’s great at that. What it’s less good at, though, is being stored and traversed. Consider a JSON document which looks like this:

[
  {
    "name":"My Big Document",
    "content":"There's a huge block of text here and it goes on for miles and miles...."
  },
  {
    "name":"My Important Document",
    "content":"More content..."
  }
]

If we were reading through this looking for “My Important Document”, like a JSON database might be asked to do, we would read the first “name”, it wouldn’t match and we’d have to carry on reading looking for the end of that “content” to get to the next document. Nobody wants to spend all their time reading through data on disk (or in memory) looking for boundaries.

That’s where BSON comes in. BSON is a binary version of JSON which adds two things to the mix: lengths and types. JSON documents are easy to encode as BSON and BSON can easily, with caveats, be turned back into JSON. The length information is added so that when we look at BSON data, we can see how long the current data item is and, if we need to skip it, do so much more efficiently. The length itself is encoded as a signed 32 bit integer, giving a notional capacity of 2GB, but MongoDB caps the maximum document size at 16MB to avoid giant documents swamping memory.
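To make the skipping concrete, here is a rough sketch for Node.js that hand-encodes two string-only documents in the BSON layout (a 32 bit length, then the elements, then a zero byte) and hops between them using the length prefix. The function names encodeStringDoc, firstString and findByName are invented for this example, and it only understands string fields; a real application would use a BSON library.

```javascript
// Encode a flat object of string fields as a BSON document (strings only).
function encodeStringDoc(obj) {
  const parts = [];
  for (const [key, value] of Object.entries(obj)) {
    const name = Buffer.from(key + "\0", "utf8");   // field name as a C string
    const text = Buffer.from(value + "\0", "utf8"); // value plus trailing NUL
    const len = Buffer.alloc(4);
    len.writeInt32LE(text.length, 0);               // string length prefix
    parts.push(Buffer.from([0x02]), name, len, text); // 0x02 marks a string
  }
  const body = Buffer.concat([...parts, Buffer.from([0x00])]); // terminator
  const header = Buffer.alloc(4);
  header.writeInt32LE(body.length + 4, 0);          // total size includes header
  return Buffer.concat([header, body]);
}

// Read the first field of the document at `offset`, if it is a string.
function firstString(buf, offset) {
  let pos = offset + 4;                   // step over the document length
  if (buf[pos] !== 0x02) return null;     // sketch: strings only
  pos = buf.indexOf(0x00, pos + 1) + 1;   // step over the field name
  const len = buf.readInt32LE(pos);
  return buf.toString("utf8", pos + 4, pos + 4 + len - 1);
}

// Find a document by its first field, hopping from length to length.
function findByName(buf, wanted) {
  let offset = 0;
  let hops = 0;
  while (offset < buf.length) {
    const docLength = buf.readInt32LE(offset);
    if (firstString(buf, offset) === wanted) return { offset, hops };
    offset += docLength; // skip the whole document without reading its content
    hops += 1;
  }
  return null;
}

const docs = Buffer.concat([
  encodeStringDoc({ name: "My Big Document", content: "miles and miles ".repeat(1000) }),
  encodeStringDoc({ name: "My Important Document", content: "More content..." }),
]);
const found = findByName(docs, "My Important Document");
console.log(found.hops); // 1
```

The 16,000 characters of content in the first document are never examined; one addition to the offset gets us to the second document.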

While the length helps software navigate through the documents, the other addition, types, makes it easier for software to read particular values. To show how this works, let’s look at a date in a JSON document created in the mongo shell:

> a={ "created": "Wed Feb 12 2014 15:28:31 GMT+0000 (GMT)" }
{ "created" : "Wed Feb 12 2014 15:28:31 GMT+0000 (GMT)" }
> typeof a.created
string

This is purely a string representation of a date and time plus time-zone. The field only has string manipulation functions and if we are using this information to search a collection based on date or time, it’s not going to be very efficient because each time the database examines a date, it will have to parse the string to create a date value which it can compare with other date values. This need to parse would be hard work for queries and would be even more burdensome when the database tried to sort or index on a date. It’s a lot of extra work that needs to be avoided.
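A small sketch in plain JavaScript shows the problem. The three date strings are invented, in the same format as the shell output above. Sorting them as text puts them in weekday-name order, and getting chronological order means parsing every string on every comparison.

```javascript
// Three dates stored as strings (sample values made up for this example).
const asStrings = [
  "Wed Feb 12 2014 15:28:31 GMT+0000 (GMT)",
  "Mon Mar 03 2014 09:00:00 GMT+0000 (GMT)",
  "Fri Apr 04 2014 18:45:00 GMT+0000 (GMT)",
];

// Sorting as text compares characters, so "Fri" < "Mon" < "Wed" decides it.
const byText = [...asStrings].sort();

// Sorting by time works, but only by parsing two strings per comparison.
const byTime = [...asStrings].sort((a, b) => Date.parse(a) - Date.parse(b));

console.log(byText[0].slice(0, 10)); // Fri Apr 04
console.log(byTime[0].slice(0, 10)); // Wed Feb 12
```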

There is, of course, a better way to represent dates which is eminently amenable to being compared with other date values. JavaScript has a Date class which stores the date as a single number, the count of milliseconds since Jan 1, 1970 UTC, with negative values representing dates before 1970, and adds a wealth of methods to make it easy to extract any part of the date.

The problem, though, is that despite the JS in JSON standing for JavaScript, JSON lacks any way of representing these dates. When Crockford was discovering JSON, he kept it simple and that meant not letting JavaScript’s class system seep into the format. This smart move has helped make JSON widely used, but does mean that if you want to store JavaScript dates efficiently, you have to step outside what JSON offers.

There’s no standard way of serializing a date either; a Date could be saved as an integer of milliseconds since the start of 1970 or as any one of the many string representations of dates, such as an ISO 8601 string.
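Here is a quick illustration of both options in plain JavaScript. The ISO string is simply what JavaScript’s own JSON.stringify happens to pick; nothing in JSON itself asks for it, and nothing on the way back in knows it was ever a date.

```javascript
const original = { created: new Date(Date.UTC(2014, 1, 12, 15, 28, 31)) };

// JSON.stringify calls the Date's toJSON method, which yields an ISO 8601 string.
const text = JSON.stringify(original);
console.log(text); // {"created":"2014-02-12T15:28:31.000Z"}

// Parsing it back gives a plain string, not a Date.
const roundTripped = JSON.parse(text);
console.log(typeof roundTripped.created); // string

// The other common choice: the raw millisecond count.
const millis = original.created.getTime();
console.log(millis); // 1392218911000

// Dates before 1970 come out negative.
const beforeEpoch = new Date(Date.UTC(1969, 11, 31)).getTime();
console.log(beforeEpoch); // -86400000
```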

BSON takes on the problem and has a date type which stores just that millisecond count, as a signed 64 bit integer, and restores the value as a JavaScript Date. In the MongoDB shell, it’s actually presented wrapped as an ISODate…

> b={ "created": new Date("Feb 12 2014 15:28:31 GMT") }
{ "created" : ISODate("2014-02-12T15:28:31Z") }
> typeof b.created
object

Note that, when working with MongoDB dates, ISODate only appears in the shell and can only be consumed by the shell. Everywhere else, dates are Date instances. With BSON dates being stored, date fields become easily sortable, comparable and indexable numbers for the database.

Internally, the BSON format includes a single byte type value to indicate that the binary data stored is a date and, similarly, there are values for this byte to denote binary storage of floating point numbers, integers, strings, documents and arrays, all of which can be found in JSON. There are also other types of data that are outside the scope of JSON, such as ObjectIDs, binary data, regular expressions and internal timestamps, each of which has its own type identifier. Because, like the Date handling, they are outside the scope of JSON, there are special functions to create these data types in the shell.
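For the curious, here is a sketch for Node.js that builds a one-field BSON document containing a date by hand, so the type byte is visible. The type values in the table are taken from the BSON specification; the BSON_TYPE name and the encodeDateDoc helper are invented for this example and are no substitute for a real BSON library.

```javascript
// Some of the type bytes defined by the BSON specification.
const BSON_TYPE = {
  double: 0x01, string: 0x02, document: 0x03, array: 0x04, binary: 0x05,
  objectId: 0x07, boolean: 0x08, date: 0x09, null: 0x0a, regex: 0x0b,
  int32: 0x10, timestamp: 0x11, int64: 0x12,
};

// Encode { <name>: <Date> } as a complete one-field BSON document.
function encodeDateDoc(name, date) {
  const key = Buffer.from(name + "\0", "utf8");      // field name as a C string
  const value = Buffer.alloc(8);
  value.writeBigInt64LE(BigInt(date.getTime()), 0);  // ms since 1970, signed 64 bit
  const body = Buffer.concat([
    Buffer.from([BSON_TYPE.date]), // the single type byte
    key,
    value,
    Buffer.from([0x00]),           // document terminator
  ]);
  const header = Buffer.alloc(4);
  header.writeInt32LE(body.length + 4, 0);           // total length, header included
  return Buffer.concat([header, body]);
}

const bytes = encodeDateDoc("created", new Date(Date.UTC(2014, 1, 12, 15, 28, 31)));
console.log(bytes.length);             // 22
console.log(bytes[4] === 0x09);        // true
console.log(bytes.readBigInt64LE(13)); // 1392218911000n
```

Twenty-two bytes in all: four for the length, one for the type, eight for the name, eight for the date and one to close the document.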

Where the difference between JSON and BSON is most visible, though, is when data is imported into or exported from MongoDB using the mongoimport and mongoexport commands. These commands deal in what the MongoDB documentation calls “MongoDB Extended JSON”, which represents these types in a number of ways, including a strict JSON-compliant form that any JSON parser should understand. For example, if a field had been populated with

{ "created": new Date("Feb 12 2014 15:28:31 GMT") }

… when we exported it using mongoexport it would appear as…

{
  "created": { "$date": 1392218911000 }
}

The $date key is a flag to consuming applications that what follows is the milliseconds-since-1970 value. One consuming application that understands that is mongoimport which takes the strict mode “MongoDB Extended JSON” as its default input format. When creating JSON data from another source for importing into MongoDB, using this data structure will ensure that MongoDB will use a BSON date type to store the field.
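If you are generating that import data from JavaScript, something like the following sketch would do the wrapping. The toExtendedJson and fromExtendedJson helpers are invented for this example and only handle dates, in the millisecond form shown above; check the Extended JSON documentation for your version of the tools before relying on it.

```javascript
// Serialize a document, writing Date values as { "$date": <milliseconds> }.
function toExtendedJson(doc) {
  return JSON.stringify(doc, function (key, value) {
    // JSON.stringify has already turned a Date into a string by the time it
    // reaches `value`, so look at the untouched property on the holder object.
    const raw = this[key];
    return raw instanceof Date ? { $date: raw.getTime() } : value;
  });
}

// The reverse: turn { "$date": <milliseconds> } wrappers back into Dates.
function fromExtendedJson(text) {
  return JSON.parse(text, (key, value) => {
    const isWrapper =
      value !== null && typeof value === "object" &&
      Object.keys(value).length === 1 && typeof value.$date === "number";
    return isWrapper ? new Date(value.$date) : value;
  });
}

const line = toExtendedJson({ created: new Date(Date.UTC(2014, 1, 12, 15, 28, 31)) });
console.log(line); // {"created":{"$date":1392218911000}}

const revived = fromExtendedJson(line);
console.log(revived.created instanceof Date); // true
```

The replacer uses a regular function rather than an arrow function so that `this` refers to the object holding the current key.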

Although MongoDB does a pretty good job of hiding the fact that it is transforming JSON documents to BSON format, knowing that it is doing that better equips you for the cases where that transformation runs into the real world.

Bonus references:

http://www.yuiblog.com/blog/2009/08/11/video-crockford-json/ (Crockford on JSON)
http://bsonspec.org/ (the BSON specification, covering the internal layout of BSON data files)

This post was written by Dj Walker-Morgan.