
Friday, June 28, 2013

A Difficult Month - Post Mortem

As some of you may know, we had a pretty bad month starting May 15th and ending June 14th. IronMQ on AWS us-east had intermittent outages and degraded performance due to an edge-case flaw that was exposed after a year and a half of solid service. Generally these were pretty short outages, but they were happening daily, causing grief for our customers and long days and late nights for us. All in all, we ended that time span with 99.31% uptime, which amounts to 5 hours and 8 minutes of downtime. That is unacceptable and completely contrary to the core tenets of our company.

What Happened?

IronMQ is a multi-tenant service providing message queues for thousands of users and handling hundreds of millions of requests per day. It was built to scale to a very large number of users and a lot of data, and it did that very well for a good year and a half. But there was one thing that wasn't ready to scale, and that was the ability to shard a single queue. When a single queue got too big, we weren't able to split its data across shards. We kept all messages for a specific queue on a single shard so we could support our IronMQ guarantees, such as FIFO and once-and-only-once delivery. This wasn't an issue in the past because typical usage of a queue is to put messages on, then take them off shortly after.
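To make the trade-off concrete, here is a minimal sketch (not Iron.io's actual code) of the single-shard-per-queue routing described above: every message for a queue hashes to the same shard, which makes FIFO and once-and-only-once delivery easy to enforce, but also means one very large queue cannot be split.

```python
import hashlib

def shard_for(queue_id: str, num_shards: int) -> int:
    # Hash the queue id, not the message, so all messages for a
    # given queue land on one shard. That shard owns the whole
    # queue, so ordering and single-delivery are local decisions.
    digest = hashlib.md5(queue_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

# Every message for "billing-events" maps to the same shard,
# no matter how many messages pile up on the queue.
assert shard_for("billing-events", 8) == shard_for("billing-events", 8)
```

The queue name `"billing-events"` is invented for illustration; the point is that the routing key is the queue, so a single hot queue is pinned to a single shard.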

Then, of course, a customer started to push millions of messages onto a single queue, tens of thousands per minute, and didn't take them off. It wasn't the rate or the volume that was the problem; it was the fact that the messages stayed on the queue, and it just kept growing.

You might ask, why doesn't SQS have this problem? We asked ourselves that same question, since SQS is really the only other provider offering an unlimited-size queue, so it would seem like a good comparison. The issue with this comparison is that they don't make the guarantees we do. They don't guarantee things like single-message delivery, FIFO, or strong consistency (they offer eventual consistency). We do. If we didn't have those guarantees, the problem probably wouldn't have existed in the first place, or if it did, it would have been much easier to fix.

And when it rains, it pours. On June 7th, there was an outage of the entire .io TLD, so every .io domain was offline. We published alternative DNS entries for all our services in case this happens again. Then AWS had issues with internal DNS resolution on June 13th, with a similar effect. Not to pass the blame; I just thought I'd share everything we had to deal with.

How We Handled It

We had to figure out a way to distribute the messages in a single queue across shards while maintaining our IronMQ guarantees and without affecting performance. It's a difficult problem without a lot of real-world, cloud-scale prior work to learn from, so we had to come up with something new.

We figured out a solution that both solved our sharding problem and kept our guarantees. Unfortunately it wasn't something we could do quickly, as it added a lot of new complexity and required major backend changes. Fortunately, affected users had a fallback in the meantime: our Rackspace endpoints. IronMQ on Rackspace was not affected by this issue.

After figuring out a solution and writing the code for it, we had to thoroughly test it (which we did over a couple of weeks), write seamless migration code (which would get messages from both the old and new systems, but put messages only into the new system), deploy it, and finally migrate old data to the new system. We completed the migration on the evening of June 14th.
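The dual-read/single-write migration pattern described above can be sketched as follows. This is a hypothetical illustration, not Iron.io's code; the store names and in-memory lists stand in for the real backends.

```python
class MigratingQueue:
    """Reads consult both backends; writes go only to the new one,
    so the old backend drains naturally and can then be retired."""

    def __init__(self, old_store, new_store):
        self.old = old_store   # legacy single-shard backend (drain only)
        self.new = new_store   # new sharded backend (all writes)

    def put(self, message):
        # Messages are put only into the new system.
        self.new.append(message)

    def get(self):
        # Prefer the old system so FIFO order holds while it drains.
        if self.old:
            return self.old.pop(0)
        if self.new:
            return self.new.pop(0)
        return None

old, new = ["m1", "m2"], []
q = MigratingQueue(old, new)
q.put("m3")  # lands only in the new store
assert [q.get(), q.get(), q.get()] == ["m1", "m2", "m3"]
```

Once the old store is empty, every read and write is served by the new system and the migration is complete.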

Moving On and Moving Forward

After a month of long days and late nights, we can finally move on. There's always a silver lining when these types of things happen, though, and in this case we came out with a much stronger and more scalable product. Moving forward, we're going to spend a lot of time ensuring we cover any edge cases that could cause issues like this in the future, and continue to work on reliability and performance to make IronMQ the best message queue in the cloud.

In fact, we've already improved several other things over the past month, including better hardware. We're now 100% SSD-backed, with higher I/O and a lot more memory, to serve requests faster and with more consistent performance. We also continue to split apart the services (MQ, Worker, Cache, Auth, HUD, Metrics, etc.) at all tiers, meaning they don't share data stores, hardware, or anything else, and they communicate with each other via their APIs as if they were regular consumers of each other. This helps isolate issues to one service without affecting the others. We've also added the ability to do a slow rollout for major upgrades, meaning we can roll a change out gradually to subsets of users and stop or roll back if something doesn't work right in production, without affecting all customers. And finally, we're soon going to be offering Virtual Private Queues (VPQ): isolated, private instances of IronMQ to ensure consistency and reliability for people who need it.
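One common way to implement the slow-rollout mechanism mentioned above is percentage-based bucketing; the sketch below illustrates the idea (an assumption about the general technique, not Iron.io's implementation): each user hashes to a stable bucket, and a change is enabled only for buckets below the current rollout percentage.

```python
import hashlib

def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    # A user's bucket is stable across calls, so the same user sees
    # a consistent on/off decision. Raising `percent` gradually widens
    # the audience; dropping it to 0 is an instant rollback.
    digest = hashlib.sha1(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99
    return bucket < percent

# At 100% everyone is in; at 0% everyone is out.
assert in_rollout("user42", "new-sharding", 100) is True
assert in_rollout("user42", "new-sharding", 0) is False
```

The feature name `"new-sharding"` and user id are invented; the salient property is that rollback never requires redeploying code, only changing the percentage.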

Thanks to those who provided support, encouragement, and understanding through this rough time. Our sincere apologies to the customers who were affected. We trust you are enjoying a much-improved IronMQ (and IronWorker).

For customers, I'm always available if you'd like to reach out: +Travis Reeder.

Wednesday, June 26, 2013

Iron.io Announces Agent Processing for Powering New Relic Plugins

Today Iron.io announces powerful agent processing capabilities for powering third-party plugins for the New Relic Platform. New Relic has opened up its SaaS service to provide building blocks for creating monitoring capabilities for any technology or service. Performance metrics from IT components and cloud services can now be brought directly into New Relic and viewed alongside existing metrics and graphs.

The New Relic Platform
Many third-party IT components can power their own uploading to the New Relic Platform. Plugins for cloud services, however, need outside processing capabilities to move data from these third-party services to the New Relic Platform. Iron.io provides a simple and flexible scheduling and processing capability that can be used to power third-party plugins – all without having to create an application or set up or manage servers. Iron.io already powers plugins from top service companies such as Twilio, Airbrake, Stripe, Parse, and Desk, and can extend this to any cloud service or API-enabled metric provider. (Plugins for Iron.io's own IronMQ and IronWorker services are also powered by Iron.io agents.)

A Library of New Relic Plugins – Powered by Iron.io

Users of these third-party services can choose from a library of plugins, or create new ones, and use a simple dashboard to run and operate them. The Iron.io platform performs the continual extraction, transformation, and loading of metrics from third-party APIs into New Relic.
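The transform step of that extract-transform-load cycle might look like the sketch below. Everything here is illustrative: the component GUID and metric names are made up, and the payload shape follows the general form of the New Relic Platform metric API of that era (an "agent" block plus a list of "components"), which is an assumption on my part rather than something stated in this post.

```python
def to_newrelic_payload(host: str, raw_error_counts: dict) -> dict:
    # Map raw counts from a third-party API (e.g. an error tracker)
    # into the metric naming scheme New Relic plugins use, where the
    # unit is carried in brackets at the end of the metric name.
    metrics = {
        f"Component/Errors/{name}[errors]": count
        for name, count in raw_error_counts.items()
    }
    return {
        "agent": {"host": host, "version": "1.0.0"},
        "components": [{
            "name": "Airbrake",
            "guid": "com.example.airbrake",  # hypothetical GUID
            "duration": 60,                  # seconds covered by this poll
            "metrics": metrics,
        }],
    }

payload = to_newrelic_payload("worker-1", {"production": 12, "staging": 3})
assert payload["components"][0]["metrics"]["Component/Errors/production[errors]"] == 12
```

On each scheduled run, the agent would extract fresh counts from the third-party API, build a payload like this, and POST it to New Relic.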

Starting an Airbrake Agent in Iron.io
The Iron.io platform eliminates the complexity associated with operating a plugin and provides a simple and immediate way to benefit from New Relic's new open performance monitoring platform.

Using Iron.io to upload Airbrake metrics into the New Relic Platform is a big win for us. Our users gain because they can get the Airbrake plugin running with very little effort. Our engineering team benefits because we don't have to do anything to move and transform data on behalf of our thousands of users. Iron.io handles all of this effortlessly and will automatically scale to handle the workload.
Ben Arent, Product Development, Airbrake

→ Special Offer!! What's even better is that Iron.io users get New Relic Standard for free, forever (instead of the normal price of $49/month/host) by signing up here.

Thursday, June 20, 2013

Iron.io Metrics + New Relic Platform = Increased Nerd Power

Part 1: Iron.io Metrics Now in the New Relic Platform

Iron.io is pleased to announce its participation as a New Relic Platform partner. As of today, Iron.io has opened up a gateway for customers to send IronMQ performance data to the New Relic Platform. This means that Iron.io metrics can be viewed within New Relic's dashboard, allowing all app-critical information to be on hand in one location. Developers win because they get increased simplicity and availability of data, as well as an easier and faster way to manage and scale their applications.

The New Relic Platform
New Relic already sets the bar as a performance management tool for web, server, and mobile app monitoring. The release of the New Relic Platform, which lets third-party services integrate metrics into New Relic, makes for an even more comprehensive performance monitoring offering.

IronMQ: The Message Queue for the Cloud

IronMQ is a high-availability message queuing service that makes it easy to create scalable cloud applications, buffer data inputs, and interface with third-party systems. Every production-scale application in the cloud needs message queuing to distribute load, buffer streaming data, and coordinate between internal and external processes. Message queues serve alongside app servers and databases as primary components within an application stack.
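To illustrate the coordination role described above, here is a toy in-memory model of the reserve-then-delete pattern that queues like IronMQ use for once-and-only-once delivery. This is a simplified sketch, not Iron.io code: getting a message hides it for a timeout, the consumer must delete it to acknowledge, and unacknowledged messages reappear for redelivery.

```python
import time

class ToyQueue:
    def __init__(self, timeout=60):
        self.timeout = timeout
        self.messages = []    # pending: list of (id, body), FIFO order
        self.reserved = {}    # in-flight: id -> (body, reserved_until)
        self.next_id = 0

    def post(self, body):
        self.messages.append((self.next_id, body))
        self.next_id += 1

    def get(self, now=None):
        now = time.time() if now is None else now
        # Return expired reservations to the front of the queue.
        for mid, (body, until) in list(self.reserved.items()):
            if until <= now:
                self.messages.insert(0, (mid, body))
                del self.reserved[mid]
        if not self.messages:
            return None
        mid, body = self.messages.pop(0)
        self.reserved[mid] = (body, now + self.timeout)
        return mid, body

    def delete(self, mid):
        # Acknowledge: the message is gone for good.
        self.reserved.pop(mid, None)

q = ToyQueue(timeout=30)
q.post("job-1")
mid, body = q.get(now=0)
assert body == "job-1" and q.get(now=0) is None  # hidden while reserved
assert q.get(now=31) == (mid, "job-1")           # redelivered after timeout
```

A consumer that crashes before calling `delete` simply lets the timeout expire, and another worker picks the message up, which is how the queue coordinates unreliable internal and external processes.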

Tuesday, June 18, 2013

Iron.io @ HackTheMidwest → Half Empty App Wins API Prize

The Hack the Midwest hackathon was this past weekend (June 15-16th) in Kansas City. The theme was "Build Something Epic" and from the looks of the projects the 30+ teams put together, that certainly proved true. 

Among the many deserving winners was Half Empty? – which won the API Prize (a GoPro camera). The team consisted of Joseph Andaverde and Scott Smerchek. For such a small team, they certainly did some big things. You can check out their application here.

Half Empty? – Winner of the API Prize at HackTheMidwest 2013
Here's a description of the app:
Tweets We Like to See
Analyze a person's tweets to find out if they are a Negative Nancy or a Positive Polly in public. Extract information from tweets to see what kinds of topics the person talks about. The application shows a trend of a person's negativity and positivity over time.