»The best way to avoid failure is to fail constantly«
Actually, this isn't anything new. You have to do the very same thing in a classic data center scenario; only the way fault tolerance is achieved is usually a bit different. In the cloud you try to scale horizontally as much as possible, and due to the nature of how that scaling works (e.g. auto scaling web servers in a shared nothing architecture) a single failing component can be easily and automatically replaced. The most common non-cloud approach to fighting failure is to duel it with duality. In other words: dual servers, power supplies, network links, switches, HBAs, SAN links, fabrics, controllers, and storage arrays, up to entire data centers, often deployed twice or, in larger setups, with n+1 (n>1) redundancy.
The SPOF Monster is often introduced during changes to existing infrastructure
Feeling uncomfortable with that thought in mind? Precisely because we have the tendency to avoid failure, there is almost no chance to spot weaknesses in the design until something bad happens. In reality there is always something that has been overlooked, and the lurking SPOF Monster is waiting for it, striking hard when you are not prepared.
This is why Netflix runs a script that randomly terminates components within their infrastructure. This script is called Chaos Monkey. The fact that they run in a cloud environment makes it easy to kill something automatically; there is no need to walk into a data center and unplug something.
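The core idea can be sketched in a few lines. This is a minimal illustration, not Netflix's actual implementation: `chaos_monkey`, the `fleet` list, and the `kill` callback are hypothetical stand-ins for what would, in a real cloud setup, be the provider's instance-listing and terminate APIs.

```python
import random

def chaos_monkey(instances, kill):
    """Pick one running instance at random and terminate it.

    `instances` is a list of instance identifiers and `kill` is a
    callback that performs the actual termination -- both are
    placeholders for real cloud API calls in this sketch.
    """
    if not instances:
        return None
    victim = random.choice(instances)
    kill(victim)
    return victim

# Simulated fleet: if the architecture is truly fault tolerant,
# losing any single instance must not take the service down.
fleet = ["web-1", "web-2", "web-3"]
terminated = []

victim = chaos_monkey(fleet, terminated.append)
fleet.remove(victim)
print(f"terminated {victim}, {len(fleet)} instances still serving")
```

Run on a schedule during business hours (so engineers are around to react), this continuously proves, rather than assumes, that no single instance is load-bearing.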
At first this approach might sound horrifying, but think about it: if your architecture is indeed fault tolerant, why not prove it by constantly challenging it in a controlled way? That way the SPOF Monster has no chance to grow and will be revealed immediately.
/jr