MongoHQ Blog

MongoDB news, tips and tricks from the MongoDB as a Service Company

Converting Data to Metrics - MongoDB Analytics Part 2

| Comments

In part one of ”First Steps of an Analytics Platform with MongoDB”, we discussed how to build an efficient logging portion of an analytics system using time buckets or time dimension cubes. The next logical step is summating the data from the logs into cacheable values. Toward the end, we showed a simple common example:

1
2
3
4
5
6
db.events.aggregate(
  {$match: {time_bucket: "2013-01-month"}},
  {$unwind: "$time_bucket"},
  {$project: {time_bucket_event: {$concat: ["$time_bucket", "/", "$event"]}}},
  {$group: {_id: "$time_bucket_event", event_count: {"$sum": 1}}}
)

Let’s take a step back and look at the different options: Aggregation Framework or Map Reduce?

Aggregation Framework or Map Reduce?

Since 2.2.x release, the aggregation framework has been the de facto MongoDB number cruncher. If you are familiar with a SQL db’s group by functions, you will be at home with the functions. Performance wise, Aggregation Framework smokes MapReduce, like not even close.

Unless your required data manipulation functions are not present in the current versions, chose Aggregation Framework. For instance, different types of statistical analysis may be difficult with the Aggregation Framework. Averages are easy. Any type of deviation calculations will require aggregation process calls: one for averaging and the other for deviation calculations.

In sharded MongoDB environments with aggregation framework, you will get the benefit of distributed processing at the data node level as you did with map reduces. Thus, each data node will return the summated results, and the mongos will concatenate and process the results returned from data nodes.

Getting started with Aggregation Framework

Aggregation framework is a series of actions, also known as a “pipeline”. This pipeline is processed in order and each action can filter or manipulate data. For instance, if you wanted to filter data and concat a variable called name:

1
2
3
4
db.my_collection.aggregate([
  {$match: {first_name: "Clark"}},
  {$project: {_id: "$id", full_name: {$concat: [$first_name, " ", $last_name]}}}
])

The above will return a full name for all individuals with first_name “Clark”. For a complete list of the functions with the aggregation framework, see the 10gen Aggregation Framework Reference.

Processing and Caching

Repeat after me: cache aggregation queries. While running aggregation queries is fast, it is best not to run these commands for every application action. As with most activities, running this command once is fast. Running this command 1000s of times per second will deteriorate performance in the rest of the stack. Also, these analytics systems form the backbone of big data; it is small now, but it production these systems grow quickly.

So, we will run these aggregation commands once, and cache the results. For best results on application performance, try running these actions asynchronously. Trigger the Aggregation Framework through background worker that will run and save the updated values to your aggregation collection.

You will need to run two commands to save the output: 1) the aggregation command and 2) the upsert command. Upsert is a good use case here to update, or insert if does not exist.

Measurement and Summary

After storing data in a best practice way, summating and caching data is the next step for any analytics platform. There is a right way and many wrong ways, and simple algorithm changes can yield good performance gains. While you are measuring your ability to run these analytics systems, you should also measure the performance of these summation queries. Build the measurement of the queries while you are building the query – perhaps in the same background job.

I can guarantee your summation jobs will behave differently with 50 GB of data versus 200 MB of data in a development environment. If you do not measure these summation jobs, you will wake up with performance that you have no historical record. Knowing how you got to that point is unknownable without metrics from the beginning. Furthermore, knowing what is “good” is also a shot in the dark. Measure. Measure. Measure.

Go forth, take your data, and summate.

MongoDB 2.4.1 Maintenance Window Complete

| Comments

The maintenance window to upgrade MongoDB for all of our shared environments has been completed.

Why the forced upgrade?

This emergency upgrade was required because of a vulnerability affecting older versions of MongoDB (MongoDB 2.0 and MongoDB 2.2). The vulnerability (SERVER-9124) was caused by deficiencies with javascript sandboxing, allowing the possibility of system commands to be run and if properly exploited, could have resulted in access to other data on the host.

Working with 10gen and understanding the core issue, we made the decision to move quickly to upgrade all of our shared database plans, starting with large databases and finishing, today, with our sandbox/free plans.

Affected environments

The vast majority of our hosts (including all custom deployments) are not at risk from this vulnerability. We’ve isolated Mongo processes on most of our shared hosts in such a way that user data is isolated and individual database vulnerabilities are contained. Some of our older shared hosts haven’t yet been migrated to the new containers, we’re now speeding up our efforts to replace those. Sandbox database hosts are, by definition, going to be more susceptible to problems like this since multiple users share a single Mongo process, we will continue to aggressively upgrade Mongo versions on these.

Email notification

We received reports of some customers not receiving notifications of the upgrade. We are very sorry about this and have resolved the issue. We strive to provide an excellent customer experience and proper communication is an absolute must.

Thank you again for your patience. We have talked to many of you today as we worked through this upgrade together, but if you are having any additional issues or have questions, we’d love to help. You can reach our team at: support@mongohq.com.

MongoDB 2.4.1 Now Available on MongoHQ

| Comments

MongoDB 2.4.1 has been released and it is an important update to MongoDB 2.4.0, addressing secondary index issues with MongoDB replica sets. This should only affect any replica sets that were upgraded and then had a member re-synced or added after the upgrade to MongoDB 2.4.0.

We have updated MongoHQ to include MongoDB 2.4.1 and you can safely update your replica sets. If you are using 2.4.0 and have a single/standalone instance of MongoDB, this release does not affect you.

To upgrade, visit the MongoHQ version selector on MongoHQ.com or sign up for a MongoHQ account and try out MongoDB 2.4.1.

If you are interested in finding out more information about this issue, view SERVER-9087 on the MongoDB Jira site.