In part one of “First Steps of an Analytics Platform with MongoDB”, we discussed how to build an efficient logging portion of an analytics system using time buckets, or time-dimension cubes. The next logical step is summating the data from the logs into cacheable values. Toward the end of part one, we showed a simple, common example of incrementing counters inside a time-bucketed document with an upsert.
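As a reminder of the style of upsert part one covered, here is a sketch in plain JavaScript that builds the query and `$inc` update for an hourly time bucket. The collection layout and field names (`page`, `day`, `hours.N.views`) are illustrative, not part one's exact code; in the mongo shell the two returned documents would be passed to `update(...)` with `upsert: true`.

```javascript
// Build the query and $inc update for a time-bucketed page-view counter.
// One document per page per day; each hour is a sub-field under "hours".
function hourlyIncrement(pageId, date) {
  const hour = date.getUTCHours();
  return {
    query: { page: pageId, day: date.toISOString().slice(0, 10) },
    update: { $inc: { ["hours." + hour + ".views"]: 1 } },
  };
}
```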
Let’s take a step back and look at the different options.
Aggregation Framework or Map Reduce?
Since the 2.2.x release, the aggregation framework has been the de facto MongoDB number cruncher. If you are familiar with a SQL database’s GROUP BY, you will feel at home with its functions. Performance-wise, the Aggregation Framework smokes MapReduce; it is not even close.
Unless your required data-manipulation functions are missing from the current versions, choose the Aggregation Framework. For instance, some types of statistical analysis may be difficult with it. Averages are easy, but any kind of deviation calculation will require two aggregation calls: one for the average and another for the deviation itself.
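To see why deviation takes two passes, here is a sketch in plain JavaScript, with each function standing in for a separate aggregation call (the aggregation framework of this era has no built-in deviation operator, so the averaging and deviation steps must be issued separately):

```javascript
// First "aggregation call": compute the mean of the values.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Second "aggregation call": average the squared differences from the
// mean computed in the first pass, then take the square root.
function stdDevPop(values) {
  const m = mean(values);
  const variance = mean(values.map((v) => (v - m) ** 2));
  return Math.sqrt(variance);
}
```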
In a sharded MongoDB environment, the aggregation framework gives you the benefit of distributed processing at the data-node level, just as map-reduce did. Each data node returns its summated partial results, and the mongos merges and processes the results returned from the data nodes.
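A toy sketch of that merge step, in plain JavaScript (the partial-result shape with `sum` and `count` fields is illustrative): each shard returns a partial sum and count, and the mongos-style merge combines them into one global average without touching the raw documents again.

```javascript
// Combine per-shard partial results into a single global average.
function mergePartials(partials) {
  const total = partials.reduce(
    (acc, p) => ({ sum: acc.sum + p.sum, count: acc.count + p.count }),
    { sum: 0, count: 0 }
  );
  return total.sum / total.count;
}
```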
Getting started with Aggregation Framework
The aggregation framework is a series of stages, also known as a “pipeline”. The pipeline is processed in order, and each stage can filter or reshape the data. For instance, if you wanted to filter on a name and concatenate a full_name field, the pipeline might look like this (collection and field names are illustrative):

```javascript
db.people.aggregate([
  // Keep only documents whose first_name is "Clark".
  { $match: { first_name: "Clark" } },
  // Build full_name from the first and last names.
  { $project: { full_name: { $concat: ["$first_name", " ", "$last_name"] } } }
]);
```

The above will return a full name for all individuals with the first_name “Clark”. For a complete list of the operators available in the aggregation framework, see the 10gen Aggregation Framework Reference.
Processing and Caching
Repeat after me: cache aggregation queries. While running aggregation queries is fast, it is best not to run these commands for every application action. As with most activities, running this command once is fast; running it thousands of times per second will degrade performance across the rest of the stack. Also, these analytics systems form the backbone of big data; the data is small now, but in production these systems grow quickly.
So, we will run these aggregation commands once and cache the results. For the best application performance, run these actions asynchronously: trigger the Aggregation Framework through a background worker that runs the query and saves the updated values to your aggregation collection. You will need to run two commands to save the output: 1) the aggregation command and 2) the upsert command. An upsert is a good fit here: it updates the cached document, or inserts it if it does not exist.
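The aggregate-then-upsert pair can be sketched as follows, in plain JavaScript with an in-memory map standing in for the cache collection (the cache key and field names are illustrative; in the shell the save step would be an `update(...)` with `upsert: true`):

```javascript
// Simulated cache collection: key -> cached summary document.
const cache = new Map();

// Upsert: overwrite the cached summary if present, insert it otherwise.
function upsertSummary(cache, key, summary) {
  cache.set(key, Object.assign({ _id: key, updated_at: new Date() }, summary));
}

// Background-worker step: run the (simulated) aggregation over the rows,
// then save the result into the cache with an upsert.
function refreshDailyViews(cache, rows) {
  const total = rows.reduce((sum, r) => sum + r.views, 0);
  upsertSummary(cache, "daily_views", { total: total });
}
```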
Measurement and Summary
After storing data in a best-practice way, summating and caching that data is the next step for any analytics platform. There is a right way and many wrong ways, and simple algorithm changes can yield good performance gains. While you are measuring your ability to run these analytics systems, you should also measure the performance of the summation queries themselves. Build the measurement of a query while you are building the query – perhaps in the same background job.
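A minimal sketch of building that measurement into the background job (the job name and returned field names are illustrative): wrap the summation job with a timer and record the duration alongside the result, so a performance history exists from day one.

```javascript
// Run a summation job and record how long it took, so each run can be
// saved as a data point in the job's performance history.
function timedRun(jobName, job) {
  const start = Date.now();
  const result = job();
  const durationMs = Date.now() - start;
  return { jobName: jobName, result: result, durationMs: durationMs, ranAt: new Date() };
}
```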
I can guarantee your summation jobs will behave differently with 50 GB of data than with 200 MB in a development environment. If you do not measure these summation jobs, you will wake up to degraded performance with no historical record. Knowing how you got to that point is unknowable without metrics from the beginning, and knowing what “good” looks like is also a shot in the dark. Measure. Measure. Measure.
Go forth, take your data, and summate.