The New Normal: Failure Domains and Safety

Snapdragons 070416, Dwight Sipler,

Through this series, we've talked about antifragility, disposable code, high leverage, and team-scale autonomy. Earlier, we looked at the benefits of team-scale autonomy: It breaks dependencies between teams, allowing the average speed of the whole organization to increase. People on the teams will be more fulfilled and productive, too. These are nice benefits you can expect from this style, but it's not all unicorns and rainbows. There is some very real, very hard work that has to be done to get there. It should already be clear that you must challenge assumptions about architecture and development processes. But we also need to talk about critical issues of failure domains and safety.

Here are my fundamental theses:

  1. When failure domains are large and overlapping, you will not sustain team-scale autonomy.
  2. When the system does not provide safety, you will not sustain team-scale autonomy.

It comes down to those risk equations we looked at before. When the spatial extent of a failure is large, each incident costs much more money. With a larger spatial extent, it also takes longer to trace the problem and restore service, which means the duration of the failure and its attendant costs will be larger. The two factors multiply, so the cost of an incident grows much faster than either factor alone. The organization will quickly determine that these costs are too large to sustain, and you will get review processes and tight change controls. Team-scale autonomy goes out the window as an absurd idea.

Therefore, if team-scale autonomy is desirable, you must work to make the failure domains as small and independent as possible. This is not a one-time transformation effort. It requires ongoing work. Services that interact with each other have a kind of attraction that tends to increase their coupling over time. Resisting that attraction requires effort.

Shrinking Failure Domains

I think about safety in terms of "failure domains." That is, when something goes wrong in a service or subsystem, how much of the organization is affected? Think about a database: if it is unavailable, then obviously every application that uses the database is affected to some degree. But the damage extends farther: every application that calls one of those applications is also affected. This is the cascading failure case. The failure domain of that database is the set of all applications that are affected when it goes away.
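To make the database example concrete, here is a minimal sketch that computes a failure domain as the set of everything that depends, directly or transitively, on a failed component. The topology and service names are hypothetical, and a real dependency map would come from your own service registry or tracing data.

```python
from collections import deque


def failure_domain(failed, depends_on):
    """Return every component affected when `failed` goes away.

    `depends_on` maps each component to the components it calls.
    """
    # Invert the edges: for each provider, which consumers call it?
    consumers = {}
    for consumer, providers in depends_on.items():
        for provider in providers:
            consumers.setdefault(provider, set()).add(consumer)

    # Breadth-first walk upward from the failed component.
    affected = set()
    pending = deque([failed])
    while pending:
        current = pending.popleft()
        for consumer in consumers.get(current, ()):
            if consumer not in affected:
                affected.add(consumer)
                pending.append(consumer)

    # The failed component itself is the cause, not part of the blast radius.
    affected.discard(failed)
    return affected


# Hypothetical topology, for illustration only.
depends_on = {
    "orders-app": ["orders-db"],
    "billing-app": ["orders-db"],
    "storefront": ["orders-app", "search"],
    "mobile-api": ["storefront"],
    "search": [],
}

print(sorted(failure_domain("orders-db", depends_on)))
```

In this made-up topology, losing the database reaches the two applications that use it directly, plus the storefront and the mobile API that sit above them.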

There are some purely technical solutions we can apply to reduce the size of our failure domains.


Bulkheads

The idea behind bulkheads is "damage containment." We assume that failures will happen. A service may crash on its own, or it may get overloaded by demand from a consumer. We create bulkheads to stop that damage from spreading. For example, instead of creating a single large pool of machines to run a critical service, we can create multiple instances of it. Each instance is backed by its own cluster of machines. That way, if one instance goes down, other consumers can keep using the other instances. What had been one failure domain is now two, or three, or fifty. In the extreme case, you stop running your own service and give consumers the ability to provision and launch their own private instances of it.
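As a rough sketch of the partitioning idea, the toy model below assigns each consumer to its own instance of a service and then asks who is hurt when one instance fails. The class, instance, and consumer names are all hypothetical; real bulkheads would live in your provisioning and routing layers, not in a Python dictionary.

```python
class Bulkheads:
    """Toy model: consumers partitioned across independent service instances."""

    def __init__(self, instance_names):
        self.healthy = {name: True for name in instance_names}
        self.assignment = {}  # consumer name -> instance name

    def assign(self, consumer, instance):
        if instance not in self.healthy:
            raise KeyError(f"unknown instance: {instance}")
        self.assignment[consumer] = instance

    def fail(self, instance):
        # Simulate one instance (and its whole cluster) going down.
        self.healthy[instance] = False

    def affected_consumers(self):
        # Only consumers pinned to an unhealthy instance are affected.
        return {c for c, i in self.assignment.items() if not self.healthy[i]}


# Hypothetical names, for illustration only.
pool = Bulkheads(["pricing-a", "pricing-b", "pricing-c"])
pool.assign("web-storefront", "pricing-a")
pool.assign("mobile-api", "pricing-b")
pool.assign("batch-reports", "pricing-c")

pool.fail("pricing-c")
print(pool.affected_consumers())
```

With a single shared pool, the same failure would have reached all three consumers. Here it stops at the one consumer assigned to the failed instance.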

Circuit Breakers

A circuit breaker is a safety feature that allows a consumer to detect failure in a provider and stop making calls to it. The circuit breaker allows the consumer to decouple itself from a malfunctioning provider. That means circuit breakers separate layers into different failure domains. A consumer does not necessarily fail when its provider does. The consumer may not be completely healthy, since important features can be disabled while the circuit breaker is popped. But at least it still runs.
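Here is a minimal sketch of the pattern, assuming a simple policy: open after a fixed number of consecutive failures, then allow one trial call after a timeout. The thresholds, class names, and the `fetch_recommendations` fallback are illustrative assumptions, not a reference implementation.

```python
import time


class CircuitOpenError(Exception):
    """Raised when calls to the provider are currently suspended."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still open: fail fast without touching the provider.
                raise CircuitOpenError("provider calls suspended")
            # Timeout elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def fetch_recommendations(breaker, provider):
    """Hypothetical consumer: degrade to an empty list when the provider is down."""
    try:
        return breaker.call(provider)
    except Exception:
        return []  # feature disabled, but the consumer still runs
```

The fallback in `fetch_recommendations` is the "not completely healthy, but still runs" case: the page renders without recommendations instead of failing outright.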

Lambda or Containers

Take note: whenever you have two features in the same code, running in the same processes, those features are coupled into a single failure domain. If one of them has a crash bug or memory leak, the other feature will also be damaged. The more that you run in a single process, the larger the failure domain.

Infrastructure like AWS Lambda or Docker containers can help with this. They reduce the per-process overhead, thereby allowing you to decouple features into smaller deployment units. In the case of Lambda, each request handler can be its own deployable unit.


Asynchronous Processes

Turning synchronous request/reply calls into asynchronous processes also cleaves apart failure domains. When consumer and provider are isolated by a queue, the consumer can create a surge of demand without damaging the provider. Likewise, the provider may be unavailable for a short time (either for a deployment or due to a bug) and the consumer is unaffected. Consumer and provider can scale independently. If the consumer is web- or app-facing, then it must scale according to the amount of concurrent load. The provider needs to scale in terms of throughput or latency.
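The sketch below imitates that isolation inside one process, using Python's standard `queue` and `threading` modules as a stand-in for a real message broker. The consumer enqueues a surge of requests while the provider is still offline, and the provider works through the backlog at its own pace once it starts. The `run_demo` function and the doubling "work" are placeholders.

```python
import queue
import threading


def run_demo(surge_size=100):
    work = queue.Queue()  # stands in for a durable queue or broker
    results = []

    def provider():
        # Drain the queue at the provider's own pace until told to stop.
        while True:
            item = work.get()
            if item is None:  # sentinel: no more work
                break
            results.append(item * 2)  # placeholder for real processing

    # The consumer creates a surge of demand while the provider is down.
    # Enqueueing succeeds anyway; nothing blocks and nothing fails.
    for n in range(surge_size):
        work.put(n)

    # The provider comes back (say, after a deployment) and catches up.
    worker = threading.Thread(target=provider)
    worker.start()
    work.put(None)
    worker.join()
    return results


print(len(run_demo()))
```

An in-memory queue like this one is unbounded and disappears with the process, so it only illustrates the decoupling. It does not provide the durability a real queue would.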

Improving Safety

The other factor that erodes team-scale autonomy is team error. When most organizations investigate an incident with a post mortem or root cause analysis, they will conclude that human error either triggered the incident or at least aggravated an existing issue. The telling part is what happens next. The organization may respond with human-driven solutions: training, review, or in the worst cases, penalties and punishment. Such an organization will have zero team-scale autonomy.

Other organizations will treat the human error as a system problem. Instead of saying that the human let the system down, they say that the system let the humans down by allowing or encouraging the error. These organizations strive to create safety in their systems.


Recoverability

What is the most important feature in a word processor? It has nothing to do with text formatting. It's the "undo" feature. Undo allows a human to act with confidence, knowing that any error they make is recoverable. Undo is what lets a human click a toolbar button, discover that it just translated the whole document into Klingon, and then get their work back. Undo is a safety feature. It keeps users from harming themselves. Undo provides recoverability.

Sustaining team-scale autonomy requires a similar kind of recoverability. Did you just deploy a broken version of your service? You really want some way to recover from that. You have a few options available, depending on your infrastructure and operations:

  1. Push a new version that fixes the bug.
  2. Redeploy the previous version.
  3. Switch the production IP address back to the cluster that's still running the previous version.
  4. Same as #3 but with a load balancer, DNS service, or API gateway.
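Options 3 and 4 in the list above amount to keeping the previous version running and moving a pointer. Here is a minimal sketch of that idea, assuming two clusters and a router that remembers where traffic pointed before the last switch. The `Router` class, cluster names, and version strings are hypothetical; in practice the pointer is an IP address, load balancer rule, DNS record, or gateway route.

```python
class Router:
    """Toy model of a production pointer that can be switched back."""

    def __init__(self, clusters, live):
        self.clusters = dict(clusters)  # cluster name -> deployed version
        self.live = live
        self.previous = None

    def switch_to(self, name):
        if name not in self.clusters:
            raise KeyError(f"unknown cluster: {name}")
        # Remember where traffic was, so the switch can be undone.
        self.previous, self.live = self.live, name

    def undo(self):
        if self.previous is None:
            raise RuntimeError("nothing to undo")
        # Swap back to the cluster that is still running the old version.
        self.previous, self.live = self.live, self.previous

    def live_version(self):
        return self.clusters[self.live]


# Hypothetical clusters and versions, for illustration only.
router = Router({"blue": "1.4.2", "green": "1.5.0"}, live="blue")
router.switch_to("green")       # release the new version
broken = router.live_version()  # "1.5.0" turns out to be broken
router.undo()                   # traffic returns to the old cluster
print(broken, "->", router.live_version())
```

The undo is cheap only because the old cluster was left running. Tearing it down right after the switch turns option 3 back into option 2.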


Visibility

If there's a feature that I like better than "undo," maybe it's "preview." Show me what the document will look like before I waste money on ink and paper. Preview gives me the ability to understand the consequences of the action I'm about to take.

This is the idea of "think globally, act locally" as applied to our systems. When a team is about to make a change, they need to know who might come pounding on their door.

In our world, preview does not come for free. Few tools have a comprehensive "dry run" option. But as we build these tools into an infrastructure, we should keep an eye on visibility. Are you about to spend a small nation's defense budget on virtual machines? Are you about to make a firewall change that blocks every user that comes through a URL shortener?
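One way to build preview into a tool is to separate planning from applying, and to make the dry run the default. The sketch below does this for a made-up scale-up command: it computes the cost of the change first, refuses if the cost exceeds a budget, and only changes anything when `dry_run=False`. Every name, price, and budget here is an assumption for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPlan:
    instances_added: int
    hourly_cost_delta: float


def plan_scale_up(current, desired, hourly_price):
    """Describe the change without making it."""
    added = max(desired - current, 0)
    return ScalingPlan(added, added * hourly_price)


def apply_scale_up(current, desired, hourly_price, budget_per_hour, dry_run=True):
    """Return (plan, resulting instance count). Defaults to a dry run."""
    plan = plan_scale_up(current, desired, hourly_price)
    if plan.hourly_cost_delta > budget_per_hour:
        # Surface the consequence before it shows up on the bill.
        raise ValueError(
            f"plan adds {plan.hourly_cost_delta:.2f}/hour, "
            f"over the {budget_per_hour:.2f}/hour budget"
        )
    if dry_run:
        return plan, current  # preview only: nothing changes
    return plan, current + plan.instances_added


# Preview first, then apply the same request for real.
preview, count = apply_scale_up(10, 50, hourly_price=0.25, budget_per_hour=100.0)
print(preview, count)
plan, count = apply_scale_up(10, 50, hourly_price=0.25, budget_per_hour=100.0, dry_run=False)
print(plan, count)
```

Because the preview and the real run share the same planning function, the preview cannot drift away from what the tool will actually do.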


Traceability

After a major incident, there is usually a cloud of free-floating blame wafting around looking for someone to condense around. Environments like these get finger-pointy very quickly.

When faults and failures are hard to diagnose, suspicion lands first on the least understood technology. That is why the network team gets blamed so often. To software people, the network is a mysterious series of tubes.

Put these effects together, and you have blamestorming and blameshifting, which erode safety for everyone, and team-scale autonomy goes out the window.

Traceability means that a fault or failure can be isolated to the service that had the problem. Traceability is wrapped up with logging, monitoring, and system topology. We often don't have quite the information we want during the incident, so there's a certain degree of sleuthing about too. It is important that everyone is on a level playing field with respect to traceability. Don't just give Splunk access to your Unix operations team and Cacti to your network team. When everyone has their own partial information, they're not diagnosing the problem, they're playing Clue.

Extend traceability all the way to the app and service teams. They should be the first ones to see an anomaly and the best ones to interpret it. They might even be able to keep the anomaly from becoming a problem.
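As a small illustration of isolating a fault to the service that had the problem, the sketch below assumes every service writes call records to one shared log, tagged with a request ID and the name of the caller. Given those records, the origin of a failure is a failed service that did not itself call another failed service. The event format and service names are invented for this example.

```python
def origin_of_failure(events, request_id):
    """Return the failed services that had no failed callee for this request."""
    trace = [e for e in events if e["request_id"] == request_id]
    failed = {e["service"] for e in trace if e["status"] == "error"}
    # Any caller of a failed service may only be passing the error along.
    callers_of_failed = {e["caller"] for e in trace if e["service"] in failed}
    return sorted(failed - callers_of_failed)


# Hypothetical shared log: one record per call, visible to every team.
events = [
    {"request_id": "r1", "caller": None, "service": "storefront", "status": "error"},
    {"request_id": "r1", "caller": "storefront", "service": "orders-app", "status": "error"},
    {"request_id": "r1", "caller": "orders-app", "service": "orders-db", "status": "error"},
    {"request_id": "r1", "caller": "storefront", "service": "search", "status": "ok"},
    {"request_id": "r2", "caller": None, "service": "storefront", "status": "ok"},
]

print(origin_of_failure(events, "r1"))
```

Three services reported errors for request `r1`, but only one of them is where the fault began. That answer is available only because all of the records sit in one place that every team can query.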


Responsibility

Actually preventing an anomaly from becoming a problem means that the team also has responsibility, in two senses of the word. First, they know that it is their problem to solve. There is no one else. There's no backstop. They can't let it slide. Second, they are able to respond. They have the necessary access and wherewithal to effect change in the running system. This mandates a higher degree of access than many organizations are comfortable with. Nevertheless, it is a necessary component of safety to maintain team-scale autonomy.


Here we encounter some of the hard parts of the new normal. This is not playing on "easy mode." Easy mode looks like command and control, everything proceeding slowly enough that no visible harm happens. (Until you notice your competitors overtaking you left and right!) Sustaining team-scale autonomy means being vigilant. There will be constant efforts to rein in the chaos. These efforts often resemble good governance, but they're rooted in old assumptions. Staying in the new normal requires work to find solutions that make the systems enforce their own safety. When there is a problem, you must find a solution that further empowers the team instead of removing their privileges.

Read all of Michael Nygard's The New Normal series here.