The New Normal: Embracing Failure with Netflix, the Chaos Monkey, and Chaos Kong

"The master has failed more times than the beginner has even tried."
- Stephen McCranie

Budapest riot, Cyphunk,  https://flic.kr/p/o13sL

In my earlier post Minimize Risk by Maximizing Change, I talked about embracing change and touched on three paths towards antifragility. Here’s a bit more on game day and chaos, and how they relate not only to software development but to the organization as a whole.

Netflix Sets the Bar

Netflix is the archetypal example of an organization that has taken the Game Day exercise to a higher level in order to achieve antifragility. Netflix created the Chaos Monkey, which is named for the way it wreaks havoc like a wild monkey set loose in a data center. Chaos Monkey is the continuous version of the game day exercise—conducted all the time, every day. It works on the principle that the best way to avoid major failures is to constantly create minor ones. The software simulates failures of instances of services by randomly shutting down virtual machines. 

Under the old view of risk management, setting the Chaos Monkey loose sounds insane. We traditionally used all of our operations and systems capabilities to prevent failure. The Chaos Monkey does the opposite: deliberately damaging the production environment—and your software must survive it! If you did this without permission, you would be prosecuted and jailed. But as "routine sabotage", the Chaos Monkey helps ensure that Netflix’s production systems are antifragile.

Netflix has taken the Chaos Monkey and driven it to higher and higher levels. Today, the Chaos Monkey is just one in the collection of open source cloud testing tools created by Netflix, which is known as the Simian Army. Another tool, Chaos Kong, takes chaos engineering even further. Where the Chaos Monkey shuts down an individual server, Chaos Kong kills everything in an entire AWS region. Check out the Principles of Chaos Engineering, a document Netflix published that synthesizes their efforts in this area.

What this Means to the Rest of Us

The Simian Army toolset operates in the current production environment. Knowing this, software developers recognize that their software must be antifragile when it reaches production. While their individual pieces of software may be antifragile, that doesn’t mean that the organization as a whole is particularly antifragile.

So how do we take these ideas and translate them to the organization that’s running right now? When you do game day all the time, and combine it with the idea of injecting chaos, you get unplanned, spontaneous events that can happen at any time. The organization must keep working. We could have something like the HR equivalent of the Chaos Monkey. The "HR Bot" would run every few weeks, randomly dismissing several employees. Of course we wouldn’t do that, but it does provide a nice analogy—every few weeks, some people get an unplanned vacation and we watch to see what happens. It's a variant on the old disaster preparedness exercise where we would mark people as "sick or disabled," except doing it continuously.

If you knew that anyone–even you–could disappear for a week at any time, you would operate very differently. Well, the truth is that any of us can disappear at any time. Your organization must adapt and find ways to continue when any given person at any given time could be gone for a few days. That’s real life. People get sick. Kids get sick. People quit, sometimes with short notice. Someone has a car accident, or wins the lottery and disappears forever. The usual approach is to let the missing person's work sit and wait until they return. If they don't return, we treat it as a crisis. We should, instead, treat it as the inevitable outcome of working with people. We are fallible and mortal.

Knowing that any employee could be gone on any given day suggests that the organization should probably work on sharing information so people are better able to figure out their tasks when management is not there. Responsible collaborators would know that they couldn’t keep all their job knowledge in their heads; they would need to have everything documented and accessible so others could pick it up and do a procedure. Contrast this with traditional management techniques, which focus on giving people tasks, with the manager acting as the nexus for communication.

Work as if you are in the midst of a really bad flu season. Treat disappearing people as standard operating procedure.

Further Down the Path

Looking back at my previous posts in this series, I’ve been building the case that antifragility is about more than just agile software development or a lean initiative. It’s a change in the way you communicate and work with each other. And it’s about organizations creating change to get really good at it and use it as a competitive advantage. In coming posts I’ll explore some positive steps I see being taken toward antifragility at the organizational level.


Read all of Michael Nygard's The New Normal series here.