DevSecOps: Digging into Root Cause Analysis

We have all been there in a postmortem when someone says, “Let’s get to the root of the problem.” And we all know what that means: Who or what is to blame? We also all know that no one wants to play the blame game, yet we all do. But it isn’t our fault (no blame, see what I did there?). It has been the default system for solving problems in business for decades. It is called root cause analysis.

We can change—for the better.

I recently watched a presentation from Matthew Boeckman (@matthewboeckman) titled, “There is No Root Cause: Emergent Behavior in Complex Systems.” Matthew is a developer advocate with VictorOps and a technology strategist with Dryan.io. He grew up a systems guy and jokes that he has been in DevOps for 18 years, even though DevOps wasn’t around, because he has always been nice to developers.

Digging in (pun intended), root cause analysis focuses on what went wrong, and how we can prevent it from happening again.

The core problems with root cause analysis for development are that it doesn’t account for enough complexity and that its natural focus is blame, which can undermine a positive DevOps culture.

Root cause analysis was more applicable when waterfall was the development methodology because states stayed consistent for months or even years at a time. In the age of Agile, DevOps, CI/CD, microservices, etc., states of work are in constant flux. Root cause analysis can’t provide solutions quickly enough. As Matthew notes, in root cause analysis, things are either good or bad, working or broken, uptime or failure. The reality is that our world is more nuanced.

Matthew recommends looking at our systems through the principle of emergence instead, because it “separates judgment from the good and the bad binary approach to our system health, and instead focuses on behaviors and interactions, patterns and complexities of our system. With practice and effort we can manage them to more desirable states.”

But what does this look like in practice?

Getting back to the analogy of the tree and its roots, the answer is more of a forest than a tree. A tree is a single living organism; a forest is an ecosystem.

Matthew takes this philosophy and mental picture and gives us a better system: Cynefin. The name is a Welsh word that means habitat, and the framework was created by Dave Snowden (@snowded), originally for managing IBM’s intellectual capital. It draws on research in systems, complexity, network and learning theories.

Starting in the bottom right quadrant and working counter-clockwise, it goes from simple to more complex.

Simple

These are patterns or behaviors that don’t require a great deal of understanding. DevOps is increasingly setting up automated systems to respond to simple issues.

Complicated

These are known unknowns. You can imagine a set of realities where they can occur and they are probable, but not certain. For instance, a busy harbor might get a storm that causes damage to boats, docks, etc. The harbor manager can’t handle that by rote. It requires people to do some thinking, and it is difficult, if not impossible, to automate.

Complex

This is where we start to see emergent behaviors occur. We don’t have the metrics needed to understand or manage these problems, or we haven’t looked at those metrics before. We start with probing: going into the system and exploring. Think of any collection of humans at any scale. Things are still in the scope of probable, but they change quickly. There are many moving parts that aren’t predictable and that we didn’t fully encounter in our test methodology.

Chaotic

This is, well, chaos. Matthew’s real-world example was when an entire AWS region went down, causing other regions to be overloaded as system admins moved services. In chaos, you act first, then get a sense of where things are, and then respond.

Disorder

In DevOps, this is where you have a lack of communication and collaboration. Here teams need to reduce (figure out what you agree on), analyze (build consensus), and iterate (move to a quadrant and continue).

Matthew notes that knowledge and practice move patterns toward more favorable quadrants, but complacency erodes the process. Complex systems that are poorly managed will require increasingly complex processes to manage them.

How to adopt Cynefin

In the moment: Ask, what quadrant does this map to?
In the post-incident report: How did we manage the pattern? Was it complicated, complex, simple? What can we do to change it?
In your sprint planning: Devote time to manage your patterns clockwise. What can we move with a little bit of work?
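
If you want to make those three questions concrete, one option is to tag each incident with a Cynefin domain in your tooling. The Python below is only a rough sketch of that idea, not something from Matthew’s talk: the names Domain, Incident, next_target and report_line are hypothetical, and the response sequences for simple and complicated come from the way Cynefin is commonly described rather than from this article. The clockwise ordering assumes the quadrant layout described above.

```python
from dataclasses import dataclass
from enum import Enum


class Domain(Enum):
    SIMPLE = "simple"
    COMPLICATED = "complicated"
    COMPLEX = "complex"
    CHAOTIC = "chaotic"
    DISORDER = "disorder"


# How to respond in each domain. The complex, chaotic and disorder entries
# follow the article; simple and complicated are the commonly cited Cynefin
# sequences (an assumption, not stated in the article).
RESPONSE = {
    Domain.SIMPLE: ("sense", "categorize", "respond"),
    Domain.COMPLICATED: ("sense", "analyze", "respond"),
    Domain.COMPLEX: ("probe", "sense", "respond"),
    Domain.CHAOTIC: ("act", "sense", "respond"),
    Domain.DISORDER: ("reduce", "analyze", "iterate"),
}

# "Clockwise" order, from least to most manageable (assumed layout).
CLOCKWISE = [Domain.CHAOTIC, Domain.COMPLEX, Domain.COMPLICATED, Domain.SIMPLE]


@dataclass
class Incident:
    title: str
    domain: Domain


def next_target(domain):
    """Return the next domain clockwise, or None if there is nowhere to move."""
    if domain not in CLOCKWISE:
        return None  # disorder: the team has to agree on a quadrant first
    i = CLOCKWISE.index(domain)
    return CLOCKWISE[i + 1] if i + 1 < len(CLOCKWISE) else None


def report_line(incident):
    """Build a one-line summary for a post-incident report."""
    steps = " -> ".join(RESPONSE[incident.domain])
    target = next_target(incident.domain)
    if target is not None:
        goal = f"work toward {target.value}"
    elif incident.domain is Domain.DISORDER:
        goal = "agree on a quadrant"
    else:
        goal = "keep it simple"
    return f"{incident.title}: {incident.domain.value} ({steps}); {goal}"


if __name__ == "__main__":
    print(report_line(Incident("Region outage", Domain.CHAOTIC)))
```

A list of lines like this could feed sprint planning: anything with a “work toward” goal is a candidate for a little clockwise effort.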

The reality is that root cause analysis only happens after the fact. Cynefin calls us to action.

Convinced that Cynefin might be just what your organization needs or want to dig a little deeper? Share and watch Matthew’s full talk here. You can watch any of the 2017 AllDayDevOps sessions free of charge here.

— Derek E. Weeks