First Steps of an Analytics Platform With MongoDB

MongoDB excels at analytics. Every week, we have customers asking for insight on building an analytics and metrics platform. We have seen outstanding performance from good practices and have seen issues with common bad practices.

Customers run many different types of analytics engines on our hosted MongoDB platform, ranging from usage metrics and business-domain-specific metrics to financial platforms. The most generic metric that most clients start tracking is events (e.g. “how many people walked into my stores” or “how many people opened an iPhone application”).

A proper schema is the first step to getting off the ground quickly on a platform that will scale.

The Naive Approach

With MongoDB, the first urge is to begin inserting documents quickly. The first documents typically have the following schema:

{
  store_id: ObjectId(), // Object id of a store
  event: "door open", // will be one of "door opened", "sale made", or "phone calls" 
  created_at: new Date("2013-01-29T08:43:00Z")
}

To query on store_id and created_at, you run the following:

db.events.find({store_id: ObjectId("aaa"), created_at: {$gte: new Date("2013-01-29T00:00:00Z"), $lte: new Date("2013-01-30T00:00:00Z")}}) 

These queries are deceptive. When you build the query in your local environment, it is fast; when the data scales to 10 GB, it becomes slow. Typically, to increase speed, compound indexes are added:

db.events.ensureIndex({store_id: 1, created_at: 1})
db.events.ensureIndex({event: 1, created_at: 1})
db.events.ensureIndex({store_id: 1, event: 1, created_at: 1} )

Each of these indexes must entirely fit in RAM.  Any new document will have a seemingly randomly chosen “store_id”.  An insert command will have a high probability of inserting the document record to the middle of an index.  To minimize RAM usage, it is best to insert sequentially: termed “writing to the right side of the index”.  Any new key is greater than or equal to the previous index key.
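As a quick sanity check (not part of the schema itself), the shell's collection stats show how large these indexes are:

db.events.totalIndexSize()   // total size, in bytes, of all indexes on events
db.events.stats().indexSizes // per-index breakdown, in bytes

If those totals approach the RAM available to mongod, randomly ordered inserts will be paging index sections in and out of memory.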

An Optimized Document Schema

To optimize your document schema, create a time_bucket attribute that breaks the acceptable date ranges down into hour, day, week, month, quarter, and/or year.

{
  store_id: ObjectId(), // Object id of a store
  event: "door open",
  created_at: new Date("2013-01-29T08:43:00Z"),
  time_bucket: [
    "2013-01-29 08-hour",
    "2013-01-29-day",
    "2013-04-week",
    "2013-01-month",
    "2013-01-quarter",
    "2013-year"
  ]
}
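
Populating the time_bucket array is application-side work done at insert time. Here is a minimal sketch in the shell; the timeBuckets helper name, the UTC-based fields, and the simple week-of-year calculation are assumptions, so adjust the labels to whatever bucket conventions your reports use (week numbering in particular varies by convention):

function timeBuckets(d) {
  var pad = function (n) { return (n < 10 ? "0" : "") + n; };
  var year = d.getUTCFullYear();
  var month = pad(d.getUTCMonth() + 1);
  var day = pad(d.getUTCDate());
  var hour = pad(d.getUTCHours());
  // One simple week-of-year convention: full weeks elapsed since Jan 1, plus 1.
  var week = pad(Math.floor((d - Date.UTC(year, 0, 1)) / 604800000) + 1);
  // Label the quarter by its starting month: 01, 04, 07, or 10.
  var quarter = pad(Math.floor(d.getUTCMonth() / 3) * 3 + 1);
  return [
    year + "-" + month + "-" + day + " " + hour + "-hour",
    year + "-" + month + "-" + day + "-day",
    year + "-" + week + "-week",
    year + "-" + month + "-month",
    year + "-" + quarter + "-quarter",
    year + "-year"
  ];
}

var created = new Date("2013-01-29T08:43:00Z");
db.events.insert({
  store_id: ObjectId(),
  event: "door open",
  created_at: created,
  time_bucket: timeBuckets(created)
});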

With the optimized schema, you would create the following indexes:

db.events.ensureIndex({time_bucket: 1, store_id: 1, event: 1})
db.events.ensureIndex({time_bucket: 1, event: 1})

With this document schema, we use a practice called “bucketing”. Instead of building a query on a range, we run the query:

db.events.find({store_id: ObjectId("aaa"), "time_bucket": "2013-01-29-day"}) 

The drawback of this optimized document schema is that every non-_id query must include a time_bucket attribute. However, most reporting systems require a date specification anyway.
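
Switching the reporting granularity is just a different bucket value against the same indexes. For example (the ObjectId here is a placeholder, as above):

db.events.find({store_id: ObjectId("aaa"), time_bucket: "2013-01-month"}).count()
db.events.find({event: "sale made", time_bucket: "2013-01-29 08-hour"}).count()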

Better Use of RAM

Using the optimized time_bucket, new documents are added to the right side of the index: any inserted document has a greater time_bucket value than previous documents. By adding to the right side of the index and querying on time_bucket, MongoDB can swap rarely accessed older documents to disk and run with minimal RAM usage. Your “hot data” will be the most recently accessed documents (typically 1 to 3 months with most analytics applications), and the older data will settle nicely onto disk.

Neither queries nor inserts will access the middle of the index, and older index chunks can swap to disk.

Bonus Points: Using the Aggregation Framework

Using aggregation to find the number of specific events per day, we can run:

db.events.aggregate(
  {$match: {time_bucket: "2013-01-month"}},
  {$unwind: "$time_bucket"},
  {$project: {time_bucket_event: {$concat: ["$time_bucket", "/", "$event"]}}},
  {$group: {_id: "$time_bucket_event", event_count: {$sum: 1}}}
)

Disclaimer: I used the $concat operator that will be available in MongoDB 2.4. To use a hosted MongoDB 2.4 Development Branch, take a look at our experimental databases.

When using the aggregation framework, cache the final run for any completed day in a summary collection. Use this summary collection to present data via any application or reporting server.
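
As a sketch of that caching step (the daily_event_summaries collection name and its fields are assumptions, and this version groups by event directly instead of using $concat, so it also runs on 2.2):

var day = "2013-01-29-day";
var out = db.events.aggregate(
  {$match: {time_bucket: day}},
  {$group: {_id: "$event", event_count: {$sum: 1}}}
);
// Before 2.6, the shell's aggregate() returns its output in the "result" array.
out.result.forEach(function (doc) {
  db.daily_event_summaries.update(
    {day: day, event: doc._id},
    {$set: {event_count: doc.event_count}},
    {upsert: true}
  );
});

Reporting queries then read from daily_event_summaries instead of re-aggregating the raw events.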

Follow Up Materials

The premier presentation on MongoDB analytics is by John Nunemaker of GitHub. He regularly makes the rounds at MongoDB conferences covering how to do MongoDB analytics properly, and 10gen makes these presentations available: MongoDB for Analytics (at GitHub).

When building a scaling analytics system, look for logical “buckets” for data. Avoid using “$in”, “$gte”, “$lte” operators when possible. MongoDB is fun because it rewards creativity for good schema design.

This post was written by Chris Winslett.

  • Rob Shepherd

    This is an interesting article that I found useful. However, when I read up some more on the subject, I am not sure I understand what is going on here. Please permit me to express what I have understood and ask you for clarification.

    You describe the limitation of the first form (indexed raw timestamp) as such: “New documents do not have greater values for store_id and event than perviously[sic] inserted documents. Any insert command will add a document record to the middle of an index. Any query will have to venture into the middle of the index. Thus, all indexes must fit in RAM”

    …But in the second form (using bucketed date fields in an array), it still includes the store_id as before.

    How will this change the index to only “right-side” the data?

    Surely just reversing the arguments to ensureIndex from

    db.events.ensureIndex({store_id: 1, created_at: 1})

    …to…

    db.events.ensureIndex({time_bucket: 1, store_id: 1})

    should be the trick to make the right-hand side of the index “hot”, isn’t it? The time-based portion of the index is now the first argument to the index…

    Also, conversely to what you suggest: by using multi-keys on the bucket array, all documents entered this year will include

    “2013-year” as a multi-key entry, as well as more specific granules like “2013-01-29 08-hour”.

    Surely this dirties the index, as the specific granules are added to the right, but the less specific entries need to be placed in the index beforehand…

    I think it would work better by having 6 different fields in the doc, e.g. time_bucket_year: “2013”, time_bucket_month: “201308”, time_bucket_day: “20130812”, etc.

    Then having an index on each independently.

    Your comments on this subject in response would be most appreciated.

    Thank you. Rob

    • Chris Winslett

      Rob,

      Thank you for the request for clarification. I re-worded the paragraph to:

      Each of these indexes must entirely fit in RAM. Any new document will have a seemingly randomly chosen “store_id”. An insert command will have a high probability of inserting the document record to the middle of an index. To minimize RAM usage, it is best to insert sequentially: termed “writing to the right side of the index”. Any new key is greater than or equal to the previous index key.