The New Normal: From Resilient to Antifragile

Some things benefit from shocks; they thrive and grow when exposed to volatility, randomness, disorder, and stressors...

We all recognize the traditional approach to risk management: prevent problems from ever occurring. If a process fails once, institute a review step to make sure it never fails that way again. The next level is resilient systems that can cope with change and survive failures. I propose that the new normal is the exact opposite of preventing problems: survive problems, and if you’re not getting enough problems from the outside, make problems for yourself! This new approach is still counterintuitive in most organizations.

Charles Blount

In his book "Antifragile: Things That Gain from Disorder," Nassim Taleb introduces the concept of antifragility with the assertion that "...things benefit from shocks; they thrive and grow when exposed to volatility, randomness, disorder, and stressors..." Taleb continues by noting that while this phenomenon is present virtually everywhere, there is no word for the exact opposite of fragile. So he calls it antifragile, or antifragility.

The difference between fragile and antifragile is whether something is harmed by exposure to volatility or gains from it. For example, fine bone china or a Waterford crystal goblet benefits from being left alone and untouched; one drop and it may shatter. "Antifragility" suggests that we should create things that improve by being exposed to random events. Over the long run, the antifragile will survive when the fragile does not.

Antifragile loves uncertainty and risk. Imagine that you have a very stable IT system. It runs for years without any operator intervention. One day, it crashes. Who is going to know how to repair it? Nobody will have experience with it crashing. This is the paradox of automated operations: the more stable the system is, the less you know about how to fix it when it breaks. If it breaks a lot, you learn how to fix it. If it breaks even more, you create systems that know how to fix it: an auto-repair system. This system moves past resilience to gain from exposure because we learn to address small failures along the way.

Random noise is not an exception; it's the normal condition. As we move to building systems from "better, faster, lighter" services, we also see much more frequent deployments to those services. When a service is down, a consumer doesn't know or care whether it's from a deployment or an outage. Either way, the consumer must still deal with it. (Hence the widespread popularity of my Circuit Breaker pattern.) Frequent deployments look exactly like continuous partial failure. They create randomness in the system, a forcing function that makes the whole enterprise deal with stability in a different and better way. Noise is normal, and our systems get better as we adapt them to tolerate the noise.
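To make the consumer's side of this concrete, here is a minimal Python sketch of a circuit breaker. It is an illustration under stated assumptions, not a reference implementation: the names CircuitBreaker and CircuitOpenError, the default thresholds, and the injectable clock parameter are all invented for this example. The idea is that after a few consecutive failures the consumer stops calling the service and fails fast, then lets a single trial call through once a timeout has passed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    # Hypothetical defaults chosen for illustration only.
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the breaker can be tested without sleeping
        self.failures = 0           # consecutive failures seen so far
        self.opened_at = None       # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still open: fail fast without touching the service.
                raise CircuitOpenError("service presumed down; failing fast")
            # Timeout elapsed: half-open, so let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Trip the breaker, or re-trip it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Success: close the breaker and forget past failures.
        self.failures = 0
        self.opened_at = None
        return result
```

A consumer would wrap each remote call, for example breaker.call(fetch_price, item_id), where fetch_price stands in for whatever remote call the consumer makes, and treat CircuitOpenError as "service unavailable right now." From the consumer's point of view, a deployment and an outage then look the same and are handled the same way.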

Resilience is a good thing, but it’s not enough. Resilience should be our bare minimum requirement. We can do a lot better by aiming for antifragility.

Read all of Michael Nygard's The New Normal series here.