The Original Setup
First, a little background: we built the first version of IronWorker, originally called SimpleWorker (clever name, right?), with Ruby. We were a consulting company building apps for other companies, and there were two really hot things at that time: Amazon Web Services and Ruby on Rails. So we built apps using Ruby on Rails on AWS and we got some great clients. The reason we built IronWorker was to scratch our own itch. We had several clients building hardware devices that constantly sent in data, 24/7, and we had to collect it and process it into something useful. Typically that meant scheduling big jobs to run through the data every hour, then every day, and so on. We decided to build something we could use for all our clients without having to manage a bunch of infrastructure for each of them to process this data. So we built "workers as a service" that we used internally for a while. Then we thought there must be other people who need this, so we decided to make it public, and thus IronWorker was born.
Our sustained CPU usage on our servers was approximately 50-60%. When it increased a bit, we'd add more servers to keep it around 50%. This was fine as long as we didn't mind paying for a lot of servers (we did mind). The bigger problem was dealing with big traffic spikes. When a big spike in traffic came in, it created a domino effect that would take down our entire cluster. At some threshold above 50%, our Rails servers would spike up to 100% CPU usage and become unresponsive. This would in turn cause the load balancer to think the server had failed and take it out of the pool, thereby applying the load that the unresponsive server would have been handling to the remaining servers. And since the remaining servers were now handling the load of the lost server plus the spike, inevitably a second server would go down, the load balancer would take it out of the pool, and so on. Pretty soon every server in the pool is toast. This phenomenon is otherwise known as a colossal clusterf**k (ref: +Blake Mizerany).
[Image: a simple drawing of the domino failure effect.]
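To make the domino effect concrete, here's a minimal sketch of the mechanism described above. The numbers and the `cascade` function are hypothetical, not Iron.io's actual setup: load is assumed to spread evenly across whatever servers remain in the pool, and any server pushed past its CPU capacity is pulled out by the load balancer, dumping its share on the survivors.

```go
package main

import "fmt"

// cascade simulates the domino failure. Total load is spread evenly
// across the remaining servers; any server pushed past its capacity
// pegs at 100% CPU, the load balancer drops it from the pool, and its
// share lands on the survivors. Returns how many servers survive.
func cascade(servers int, capacityPerServer, totalLoad float64) int {
	for servers > 0 && totalLoad/float64(servers) > capacityPerServer {
		servers-- // one server becomes unresponsive and leaves the pool
	}
	return servers
}

func main() {
	// Hypothetical numbers: 10 servers, each able to sustain 1.0 "unit"
	// of load. At 50% utilization (5 units total) everything is fine.
	fmt.Println(cascade(10, 1.0, 5.0)) // prints 10: all servers survive

	// A spike to 11 units exceeds full-cluster capacity. Each server
	// that drops only raises the per-server load on the rest, so the
	// cascade runs all the way down.
	fmt.Println(cascade(10, 1.0, 11.0)) // prints 0: the whole pool is toast
}
```

Note the all-or-nothing shape of the result: once the spike pushes per-server load past capacity, removing servers only makes the per-server load worse, which is exactly why the only defense with this architecture was keeping utilization far below capacity.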
The only way to avoid this with the setup we had was to carry a ton of extra capacity, keeping our server utilization much lower than it was, but that meant spending a ton of money. Something had to change.
We Rewrote It
We decided to rewrite the API. This was an easy decision: clearly our Ruby on Rails API wasn't going to scale well. Coming from many years of Java development, and having written a bunch of things that handled tons of load with way fewer resources than this Ruby on Rails setup could handle, I knew we could do a lot better. So then the decision came down to which language to use.
Choosing a Language
I was open to new ideas, since the last thing I wanted to do was go back to Java. Java is (was?) a great language in a lot of ways, such as performance, but after writing Ruby code for a few years, I was hooked on how productive I could be. Ruby is fun, plain and simple.
After Go
We've never had another colossal clusterf**k since.
We have grown a lot since those early days. We now get way more traffic, we added two more services (IronMQ and IronCache), and we run hundreds of servers to handle our customers' needs. And it's all powered by Go on the backend. In retrospect, it was a great decision to choose Go: it's allowed us to build great products, to grow and scale, and to attract grade A talent. And I believe it will continue to help us grow for the foreseeable future.
UPDATE: Lots of conversation about this article at Hacker News.