What DevOps Can Learn from High Reliability Organizations

High reliability organizations work in very hazardous or complex areas, and yet manage to have a lower accident rate than would be expected for that environment. So what can DevOps practitioners learn from the research on high reliability organizations?

David Woods and Sidney Dekker have both done great research on resilience engineering and high reliability organizations. They continually emphasize, “Safety is not something an organization is. Safety is something an organization does.”

The first thing that DevOps practitioners can learn from high reliability organizations is that safety is not a sterile environment with no dangers or hazards. Research into these organizations shows that safety is actually the presence of things that help manage or mitigate risk, such as systems, people and thought processes. For DevOps organizations, these could be a trusted CI/CD system, automated testing or a code review process that brings in more of the organization (such as infosec) early.

Research shows that a complex system can’t be judged safe merely from the reliability of its components. For instance, if you take one part of the system, such as a fan in a car, and that fan is performing exactly as you expected, that doesn’t mean the overall system is safe.

Complex technological systems have developed to the point where assembling individually reliable components is no longer enough to produce a safe system: the interactions between those components, and the effects of those interactions, can still be unsafe. Of course, in our context, unlike nuclear power or aviation, accidents typically involve software systems. From a financial, job or livelihood perspective, though, these accidents are potentially no less significant.

Another lesson that DevOps can learn from high reliability organizations is not to take past success as a guarantee of future safety. Research has shown that even if the people at the controls feel very confident in their training or in the tooling they’re using, they still need to look for signs that they could be overconfident or that their confidence is misplaced.

This might happen when you find yourself saying, “Well, we’ve done a deploy like this a million times. It will be fine.” That’s probably a good attitude from a morale perspective, but it would be a mistake: Although it looks like a task or project you’ve done before, there could still be risks involved. There’s a difference between assuming there’s little or no risk and being confident that you can manage the risk because you have practiced managing it and remain aware of it.

The next lesson is to not create differences where they don’t apply. Research has shown that people tend to see incidents in other organizations or other departments of the company as not applicable to their area. Those groups are far away. They’re different and have different restrictions. They have a different operational environment with different people operating in it. It’s tempting to dismiss the potential lessons raised there, but organizations that want to increase their resilience and reliability purposely look deeper into the underlying patterns in those incidents and try to learn from the parts that do apply to them, as opposed to focusing on how different the other group is.

When investigating issues or problem-solving, people must have ways to communicate and share information freely, as well as to update their mental model of how the system works, to avoid fragmented problem-solving. If, instead, silos are enforced across an organization (regardless of how or where these artificial boundaries are drawn), then there is a high risk that no one person or group will be able to notice if the system design slowly trends away from safety.

Fresh perspectives and diversity on teams matter. Research into high reliability organizations shows that people from different backgrounds who have different viewpoints actually generate more theories regarding why something might or might not work. They end up mitigating more risky situations, revealing more hidden assumptions and allowing for minority viewpoints. These points of view also can provide new information or reveal pre-existing hazards that no one was looking at.

We need to narrow the gap between how practitioners operate the system and how management imagines the system to be. The more accurate the non-practitioner view, the easier it is for those people—whether management or other stakeholders—to make decisions that help create safety.

Most organizations don’t change until there’s an overwhelming amount of evidence that says they should. This is usually something such as an incident, a disaster or an accident of some sort. If we continue that pattern, these organizations are always learning too late. By applying some of these resilience and high reliability organization principles to our practice, we’re able to move this learning earlier and change sooner, without having to incur the high costs of learning from a disaster.

— Thai Wood