As some of you may know, we had a pretty bad month starting May 15th and ending June 14th. IronMQ on AWS us-east had intermittent outages and degraded performance due to an edge-case flaw that was exposed after a year and a half of solid service. Generally these were pretty short outages, but they were happening daily, causing grief for our customers and long days and late nights for us. All in all, we ended that time span with 99.31% uptime, which works out to 5 hours and 8 minutes of downtime. That is unacceptable and completely contrary to the core tenets of our company.
What Happened?
IronMQ is a multi-tenant service providing message queues for thousands of users and handling hundreds of millions of requests per day. It was built to scale to a very large number of users and a lot of data, and it did that very well for a good year and a half. But there was one thing that wasn’t ready to scale: the ability to shard a single queue. When a single queue got too big, we weren’t able to split its data across shards. We kept all messages for a specific queue on a single shard so we could support our IronMQ guarantees such as FIFO and once-and-only-once delivery. This wasn’t an issue in the past because the typical usage of a queue is to put messages on, then take them off shortly after.
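To make that design concrete, here’s a much-simplified sketch (illustrative only, not our actual code) of what pinning a queue to a single shard looks like: hash the queue name, and every message for that queue lands on the same shard. That makes FIFO and once-and-only-once delivery straightforward, but it also means one huge queue can’t be split across machines.

```go
// Simplified, illustrative sketch of queue-to-shard routing -- not IronMQ's actual code.
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a queue name to a shard index. Because the mapping depends
// only on the queue name, every message for that queue shares one shard.
func shardFor(queueName string, shardCount int) int {
	h := fnv.New32a()
	h.Write([]byte(queueName))
	return int(h.Sum32()) % shardCount
}

func main() {
	for _, q := range []string{"orders", "emails", "huge-backlog-queue"} {
		fmt.Printf("queue %q -> shard %d\n", q, shardFor(q, 8))
	}
}
```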
Then, of course, a customer started pushing millions of messages onto a single queue, tens of thousands per minute, and didn't take them off. It wasn’t the rate or the volume that was the problem; it was that the messages stayed on the queue, and the queue just kept growing.
You might ask, why doesn’t SQS have this problem? We asked ourselves the same question, since SQS is really the only other provider offering an unlimited-size queue, so it would seem like a good comparison. The issue with that comparison is that they don’t have the guarantees we do. They don’t guarantee things like once-and-only-once delivery, FIFO ordering, or strong consistency (they offer eventual consistency). We do. If we didn’t have those guarantees, the problem probably wouldn’t have existed in the first place, or if it did, it would have been much easier to fix.
And when it rains, it pours. On June 7th, there was an outage of the entire .io TLD, so every .io domain was offline. We published alternative DNS entries for all our services in case this happens again. Then on June 13th, AWS had issues with internal DNS resolution, with a similar effect. Not to pass the blame, I just thought I'd share everything we had to deal with. When these happened, we were all like...
How We Handled It
We had to figure out a way to distribute the messages in a single queue across shards while maintaining our IronMQ guarantees and without affecting performance. It’s a difficult problem without a lot of real-world, cloud-scale prior work to learn from, so we had to come up with something new.
We figured out a solution that both solved our sharding problem and kept our guarantees.
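I won’t go into the full details here, but to give a rough sense of the general direction, here’s a much-simplified sketch (illustrative only, not our actual implementation) of one way to spread a single FIFO queue across shards: break the queue into an ordered chain of segments, always append to the newest segment, and always drain from the oldest, so global order is preserved even though segments live on different shards.

```go
// Illustrative sketch of a segmented queue spread across shards -- not our actual implementation.
package main

import "fmt"

const segmentCap = 3 // tiny for demo; real segments would hold many messages

type segment struct {
	shardID  int      // which shard this segment's messages live on
	messages []string // messages in FIFO order within the segment
}

type segmentedQueue struct {
	segments   []*segment // ordered oldest -> newest
	nextShard  int
	shardCount int
}

func (q *segmentedQueue) push(msg string) {
	// Start a new segment (possibly on a different shard) when the tail is full.
	if len(q.segments) == 0 || len(q.segments[len(q.segments)-1].messages) >= segmentCap {
		q.segments = append(q.segments, &segment{shardID: q.nextShard})
		q.nextShard = (q.nextShard + 1) % q.shardCount
	}
	tail := q.segments[len(q.segments)-1]
	tail.messages = append(tail.messages, msg)
}

func (q *segmentedQueue) pop() (string, bool) {
	// Always drain the oldest segment first to keep global FIFO order.
	for len(q.segments) > 0 {
		head := q.segments[0]
		if len(head.messages) > 0 {
			msg := head.messages[0]
			head.messages = head.messages[1:]
			return msg, true
		}
		q.segments = q.segments[1:] // drop the empty segment and move on
	}
	return "", false
}

func main() {
	q := &segmentedQueue{shardCount: 4}
	for i := 1; i <= 7; i++ {
		q.push(fmt.Sprintf("msg-%d", i))
	}
	for i, s := range q.segments {
		fmt.Printf("segment %d on shard %d holds %v\n", i, s.shardID, s.messages)
	}
	for msg, ok := q.pop(); ok; msg, ok = q.pop() {
		fmt.Println(msg) // prints msg-1 ... msg-7 in order, even though they span shards
	}
}
```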
Unfortunately, it wasn’t something we could do quickly, as it added a lot of new complexity and required major backend changes. Fortunately, affected users had a fallback: our Rackspace endpoints. IronMQ on Rackspace was not affected by this issue.
After figuring out a solution and writing the code for it, we then had to thoroughly test it (which we did over a couple of weeks), write seamless migration code (which would get messages from both the old and new systems, but put messages only into the new system), deploy it, and finally migrate the old data to the new system. We completed the migration on the evening of June 14th.
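To illustrate the migration idea (a simplified sketch with hypothetical interfaces, not our actual internals): reads check the old system first and then the new one, while writes go only to the new system, so the old backlog drains itself during the cutover and no new data lands on the old backend.

```go
// Simplified sketch of a dual-read, single-write migration -- hypothetical interfaces only.
package main

import "fmt"

// store is a minimal message-store interface for the sketch.
type store interface {
	get(queue string) (string, bool)
	put(queue, msg string)
}

// migratingStore reads from both backends but writes only to the new one.
type migratingStore struct {
	oldStore store
	newStore store
}

func (m *migratingStore) put(queue, msg string) {
	m.newStore.put(queue, msg) // writes go only to the new system
}

func (m *migratingStore) get(queue string) (string, bool) {
	// Drain the old system first so ordering is preserved across the cutover.
	if msg, ok := m.oldStore.get(queue); ok {
		return msg, true
	}
	return m.newStore.get(queue)
}

// memStore is a toy in-memory backend used to exercise the sketch.
type memStore struct{ queues map[string][]string }

func newMemStore() *memStore { return &memStore{queues: map[string][]string{}} }

func (s *memStore) put(queue, msg string) { s.queues[queue] = append(s.queues[queue], msg) }

func (s *memStore) get(queue string) (string, bool) {
	q := s.queues[queue]
	if len(q) == 0 {
		return "", false
	}
	msg := q[0]
	s.queues[queue] = q[1:]
	return msg, true
}

func main() {
	oldSys, newSys := newMemStore(), newMemStore()
	oldSys.put("jobs", "existing-message")

	m := &migratingStore{oldStore: oldSys, newStore: newSys}
	m.put("jobs", "new-message") // lands in the new system only

	for msg, ok := m.get("jobs"); ok; msg, ok = m.get("jobs") {
		fmt.Println(msg) // existing-message (old), then new-message (new)
	}
}
```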
Moving On and Moving Forward
After a month of long days and late nights, we can finally move on. There’s always a silver lining when these types of things happen, though, and in this case we came out with a much stronger and more scalable product. Moving forward, we’re going to spend a lot of time making sure we cover any edge cases that could cause issues like this in the future, and we’ll continue to work on reliability and performance to make IronMQ the best message queue on the cloud.
In fact, we've already improved several other things over the past month, including better hardware. We're now 100% SSD-backed, with higher I/O and a lot more memory, to serve requests faster and with more consistent performance. We also continue to split apart the Iron.io services (MQ, Worker, Cache, Auth, HUD, Metrics, etc.) at all tiers, meaning they don’t share data stores, hardware, or anything else, and they communicate with each other via their APIs as if they were just regular consumers of each other. This helps contain issues within one service without affecting the others. We've also added the ability to do slow rollouts of major upgrades, meaning we can roll a change out gradually to subsets of users and stop or roll back if something doesn't work right in production, without affecting all customers (see the sketch below). And finally, we're soon going to be offering Virtual Private Queues (VPQ) to provide isolated, private instances of IronMQ for customers who need guaranteed consistency and reliability.
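As a rough illustration of how a slow rollout can work (a simplified sketch, not our actual deployment tooling; the names here are hypothetical): hash each account into a bucket and enable the new code path only for accounts under the current rollout percentage. Bumping the percentage widens the audience, and dropping it back to zero is an instant rollback.

```go
// Illustrative sketch of a percentage-based rollout -- hypothetical names, not Iron.io's tooling.
package main

import (
	"fmt"
	"hash/fnv"
)

// rolloutPercent controls how many accounts see the new backend (0-100).
var rolloutPercent uint32 = 10

// useNewBackend deterministically assigns an account to the old or new path,
// so a given account gets a consistent experience at a given percentage.
func useNewBackend(accountID string) bool {
	h := fnv.New32a()
	h.Write([]byte(accountID))
	return h.Sum32()%100 < rolloutPercent
}

func main() {
	for _, id := range []string{"acct-1", "acct-2", "acct-3", "acct-4"} {
		fmt.Printf("%s -> new backend: %v\n", id, useNewBackend(id))
	}
}
```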
Thanks to those who provided support, encouragement, and understanding through this rough time. Our sincere apologies to the customers who were affected. We trust you are enjoying a much improved IronMQ (and IronWorker).
For customers, I'm always available if you'd like to reach out:
+Travis Reeder.