»The best way to avoid failure is to fail constantly«
Actually, this isn't anything new. You have to do the very same thing in a classic data center scenario; only the way fault tolerance is achieved is usually a bit different. In the cloud you try to scale horizontally as much as possible, and due to the nature of how that scaling works (e.g. auto scaling web servers in a shared nothing architecture) a single failing component can be easily and automatically replaced. The most common non-cloud approach to fighting failure is to duel it with duality. In other words: dual servers, power supplies, network links, switches, HBAs, SAN links, fabrics, controllers, and storage arrays, up to entire data centers, often deployed twice or, in larger setups, with n+1 (n>1) redundancy.
The SPOF Monster is often introduced during changes to existing infrastructure
Feeling uncomfortable with that thought in mind? Precisely because we have the tendency to avoid failure, there is almost no chance to spot weaknesses in the design until something bad happens. In reality there is always something that has been overlooked, and the lurking SPOF Monster is waiting for it, striking hard when you are not prepared.
This is why Netflix runs a script that randomly terminates components within their infrastructure. This script is called Chaos Monkey. The fact that they run in a cloud environment makes it easy to kill something automatically; there is no need to walk into a data center and unplug something.
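The core idea can be sketched in a few lines. This is a minimal illustration, not Netflix's actual implementation: `chaos_monkey`, the `fleet` list, and the `kill` callback are hypothetical stand-ins for what would, in a real cloud setup, be the provider's instance-listing and terminate APIs.

```python
import random

def chaos_monkey(instances, kill):
    """Pick one running instance at random and terminate it.

    `instances` is a list of instance identifiers and `kill` is a
    callback that performs the actual termination -- both are
    placeholders for real cloud API calls in this sketch.
    """
    if not instances:
        return None
    victim = random.choice(instances)
    kill(victim)
    return victim

# Simulated fleet: if the architecture is truly fault tolerant,
# losing any single instance must not take the service down.
fleet = ["web-1", "web-2", "web-3"]
terminated = []

victim = chaos_monkey(fleet, terminated.append)
fleet.remove(victim)
print(f"terminated {victim}, {len(fleet)} instances still serving")
```

Run on a schedule during business hours (so engineers are around to react), this continuously proves, rather than assumes, that no single instance is load-bearing.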
At first this approach might sound horrifying, but think about it: if your architecture is indeed fault tolerant, why not prove it by constantly challenging it in a controlled way? That way the SPOF Monster has no chance to grow and will be revealed immediately.
/jr