Post-Mortem of a Rough Day
It is good to take a moment to reflect on what happened on April 21, 2011, and how our services were affected by the well-publicized outage at our service provider, Amazon EC2.
Before proceeding, I want to take a moment, even on a day like this, to offer praise to Amazon, not only for the vision and leadership it took to make an environment like this available, but also for how solid the platform has become in almost all cases, considering its size and magnitude.
We experienced a partial outage of our services starting at approximately 4:41am EST. After that, our Arrow, Swan and Hydra environments became unavailable, and for a brief period we saw limited outages on our Pearl and Tempest environments. Pearl and Tempest, along with Flame and Rose, were otherwise stable throughout the day.
Because of the extended downtime, we decided on the evening of April 21 (after 15 hours of downtime) to revert to data snapshots taken roughly three hours before our services became unavailable. Any data written between the time of the snapshot and the start of the outage should be available for restoration, and we will work with our customers to restore that data on an as-needed basis.
We had to revert to the most recent snapshots of our environments because the drives each environment used continued to be unavailable to us and, after much consultation with Amazon, we still did not have a clear picture of when they would be available again.
Where We Are Now
(Updated: 4/22/2011 2:31AM EST) Apart from remaining issues with the MongoHQ.com web site, all our environments are online and available again. We are still waiting for word from Amazon, but we may need to do hard reboots on our Rose and Tempest environments to conclude this event, even though those environments are currently available. We hope to avoid this, but will not know until we hear from Amazon in a few hours.
Where to Next
Obviously, this was a rough day for all of our customers, and we certainly regret the downtime that occurred. We are moving in the right direction, as we are already working to make use of the new additions available for Replica Sets in MongoDB 1.8. Until now, we have focused on replication for data durability; going forward, we will offer replication for high availability as well. These steps will help keep service provider outages from affecting our services for long periods of time.
If you have additional questions about the outage, we definitely want to hear from you. Please send an email to firstname.lastname@example.org and we will answer it promptly. Thank you for your patience during this event.