Learning from Facebook’s Outage

Image: Making the most of Facebook's outage. Thanks to Kārlis Dambrāns for providing the base image (CC BY 2.0).

Facebook has suffered three outages this month, two of which occurred within the span of a week. Ouch. If you know any folks on the FB ops team, now’s a good time to buy ‘em a beer.

Whenever a blip like this appears, it’s a good time for all of us to look at our own infrastructure. Are you prepared?

First things first: what caused the FB outage? Facebook links the most recent one to an issue with the Graph API. The September 22nd outage was due to a hiccup with the Realtime Update service. It’s the sort of thing that could happen at any company.

Despite the impact, it’s good to see Facebook has a sense of humor about the downtime. Their response to the update service issue reads, “will post an update here as soon as we know more.” I love the sly wink.

Jokes aside, outages are quite serious. Facebook has years of loyalty, brand recognition, and a healthy cash reserve. An outage costs them money, but it’s not the end of the run. For a lot of other businesses, the stakes are much higher. If you or your customers are involved in finance or medicine, an outage simply cannot happen.

The good news is most of us aren’t in the same boat as Facebook. Facebook’s problems tend to be unique. In addition, Facebook has difficulty leveraging SaaS solutions that are otherwise used by a wide array of companies of all sizes and industries.

Another challenge for Facebook is that it makes a lot of sacrifices at the altar of complexity in order to scale to its current size. In software, complexity is quite an ugly beast. Fred Brooks wrote floridly about it in his 1986 paper “No Silver Bullet.” The gist, in Brooks’s own words:

The complexity of software is an essential property, not an accidental one.

The longer version is an entertaining read too, for the morbidly curious:

Many of the classic problems of developing software products derive from this essential complexity and its nonlinear increases with size. From the complexity comes the difficulty of communication among team members, which leads to product flaws, cost overruns, schedule delays. From the complexity comes the difficulty of enumerating, much less understanding, all the possible states of the program, and from that comes the unreliability. From complexity of function comes the difficulty of invoking function, which makes programs hard to use. From complexity of structure comes the difficulty of extending programs to new functions without creating side effects. From complexity of structure come the unvisualized states that constitute security trapdoors.

Not only technical problems, but management problems as well come from the complexity. It makes overview hard, thus impeding conceptual integrity. It makes it hard to find and control all the loose ends. It creates the tremendous learning and understanding burden that makes personnel turnover a disaster.

In short, complexity is painful. As Brooks notes, some complexity is also unavoidable. The good news is it’s not 1986. We’ve developed a lot of beautiful, simple solutions since then. So, use them!

The most obvious one is to have a backup plan. Some folks call this High Availability. There’s enough involved in HA that you could write a book about it. And, sure enough, someone has written a book about it. It’s a worthwhile read if you have the time (and are prepared for a mental workout).

If that doesn’t sound appealing, no worries! Since the HA book was published, a good bit of its wisdom has come to resemble common sense.

Redundancy

Redundancy seems like a no-brainer. Are you redundant? Sure! Well, maybe. Sort of.

Take a look at your database. Do you have a good failover plan? OK, cool. What’s the plan if your cloud region has a hiccup? You’ll need a cross-region failover plan for that. This sort of thinking continues up the chain from machines to regions to cloud providers.

In a nutshell, eliminate single points of failure.
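To make the failover chain a little more concrete, here is a minimal Python sketch. It assumes you keep a list of database endpoints ordered from most to least preferred (primary, same-region replica, cross-region replica) and that you can supply your own health check. The endpoint names and the `pick_endpoint` helper are hypothetical, made up for illustration.

```python
from typing import Callable, Optional, Sequence

# Hypothetical endpoint names, ordered from most to least preferred.
ENDPOINTS = [
    "db-primary.us-east",   # primary
    "db-replica.us-east",   # same-region failover
    "db-replica.eu-west",   # cross-region failover
]


def pick_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy endpoint in priority order, or None."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A health check that raises is treated as an unhealthy endpoint.
            continue
    return None


if __name__ == "__main__":
    # Simulate a regional hiccup: everything in us-east is down.
    down = {"db-primary.us-east", "db-replica.us-east"}
    print(pick_endpoint(ENDPOINTS, lambda e: e not in down))
```

If `pick_endpoint` ever returns `None`, every basket was the same basket, and that is the single point of failure to go hunting for.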

Hybrids

Part of the wisdom of redundancy is “Don’t put all your eggs in one basket.” An easy way to ensure that happens is to hybridize your infrastructure. That is, run on both the cloud and your own hardware.

Why? Well, first, that’ll help satisfy a lot of the redundancy points mentioned above. Second, hybridizing pushes you to keep your infrastructure platform agnostic. Build once, deploy anywhere.

That’s something we believe in pretty strongly at Iron.io. It’s about choice and control. Whether you choose a solution like Iron.io or some other platform to simplify your pipeline, an agnostic solution means your future decision making is unshackled.
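One way to picture “platform agnostic” in code: application logic talks to a small interface, and a configuration value decides which backend sits behind it. The Python sketch below is only an illustration. The `Queue` interface, the `InMemoryQueue` stand-in, and the `make_queue` factory are all hypothetical names, and a real backend would call a cloud service or your own hardware rather than a list in memory.

```python
from typing import List, Protocol


class Queue(Protocol):
    """The only surface the application code is allowed to depend on."""

    def push(self, message: str) -> None: ...
    def pop(self) -> str: ...


class InMemoryQueue:
    """Stand-in backend; a real one would wrap a cloud or on-prem service."""

    def __init__(self, location: str) -> None:
        self.location = location
        self._items: List[str] = []

    def push(self, message: str) -> None:
        self._items.append(message)

    def pop(self) -> str:
        return self._items.pop(0)


def make_queue(target: str) -> Queue:
    """Pick a backend from configuration (hypothetical target names)."""
    if target not in ("cloud", "on_prem"):
        raise ValueError(f"unknown deployment target: {target}")
    return InMemoryQueue(location=target)


def enqueue_job(queue: Queue, job: str) -> None:
    # Application code: no idea where the queue actually lives.
    queue.push(job)


if __name__ == "__main__":
    for target in ("cloud", "on_prem"):
        q = make_queue(target)
        enqueue_job(q, "resize-image-42")
        print(target, q.pop())
```

Because `enqueue_job` never mentions a provider, moving between the cloud and your own hardware is a configuration change rather than a rewrite.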

Monitoring

The final tidbit is monitoring! If any part of your stack goes upside down, good monitoring will save the day. It’ll either alert you to the problem before it grows to catastrophic proportions, or significantly simplify the process of identifying and fixing it.

Again, SaaS-style products tend to win here. Most come bundled with wonderful troubleshooting and monitoring tools. Iron.io certainly does.

When you find yourself talking to any SaaS provider, definitely ask about monitoring and ease of troubleshooting.
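For a feel of what “alert before it grows” can mean in practice, here is a small Python sketch of a rolling error-rate check. It is a toy under stated assumptions, not how any particular monitoring product works: the `ErrorRateMonitor` class, the window size, and the threshold are all made up for illustration.

```python
from collections import deque


class ErrorRateMonitor:
    """Track the last `window` request outcomes and flag a high error rate."""

    def __init__(self, window: int = 100, threshold: float = 0.05) -> None:
        self.results = deque(maxlen=window)  # oldest outcomes fall off automatically
        self.threshold = threshold

    def error_rate(self) -> float:
        if not self.results:
            return 0.0
        failures = sum(1 for ok in self.results if not ok)
        return failures / len(self.results)

    def record(self, ok: bool) -> bool:
        """Record one outcome; return True when an alert should fire."""
        self.results.append(ok)
        return self.error_rate() > self.threshold


if __name__ == "__main__":
    monitor = ErrorRateMonitor(window=10, threshold=0.2)
    outcomes = [True] * 8 + [False] * 3
    for ok in outcomes:
        if monitor.record(ok):
            print(f"ALERT: error rate {monitor.error_rate():.0%}")
```

The point of the rolling window is that a handful of recent failures trips the alarm quickly, even if the service has been healthy for days.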

Wrapping up

Facebook’s recent foibles are definitely fun watercooler talk. More importantly, they’re also an opportunity to reflect on our own infrastructure decisions.

Don’t let the opportunity slip by!