DevOps storytime: The tale of the secret passageway named MTTR

When DevOps practitioners and evangelists talk about adoption, they mostly fall into one of two camps. Some talk about driving DevOps adoption by helping people understand the core ethical reason for it: improving the world for those who work in it (I’m usually in this camp, with Gene Kim).

Others talk about driving DevOps adoption by helping people understand the monetarily valuable things that come out of DevOps, like improved user engagement and conversion via better performance, more resilient operations via adaptive architectures, or an increase in IT throughput delivered by the use of automation technologies. There is, however, a third way to drive adoption: rather than introducing DevOps directly, introduce mean time to resolve (MTTR).

MTTR is an Industrial Age metric that measures the time from when a problem arises to when it is solved. It just so happens that shops that make a nuanced version of MTTR a very visible metric and seek to drive it down have no choice but to adopt DevOps methods, because those methods and beliefs are increasingly required to push MTTR closer and closer to zero.
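To make the definition concrete, here is a minimal sketch of the arithmetic in Python. The incident timestamps and the function name `mean_time_to_resolve` are hypothetical, made up purely for illustration; a real shop would pull these from its ticketing or monitoring system, and the "nuanced version" of the metric would likely track more than a plain average.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (time the problem arose, time it was resolved).
# These values are invented for illustration only.
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 11, 0)),      # 2 hours
    (datetime(2024, 1, 10, 14, 30), datetime(2024, 1, 10, 15, 0)),  # 30 minutes
    (datetime(2024, 1, 21, 22, 0), datetime(2024, 1, 22, 1, 0)),    # 3 hours
]


def mean_time_to_resolve(incidents):
    """Return the average of (resolved - arose) across all incidents."""
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    # Sum the individual resolution times, starting from a zero timedelta.
    total = sum((resolved - arose for arose, resolved in incidents), timedelta())
    return total / len(incidents)


mttr = mean_time_to_resolve(incidents)
print(mttr)  # prints 1:50:00
```

A plain mean like this hides outliers, so a single long outage can dominate the number. That is one reason to look at the individual resolution times alongside the average.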

A Story We Are All Too Familiar With

Let me tell you a story of Alice, Bill and Carl. Alice and Bill work for Carl inside an IT shop where nothing of DevOps is known or practiced. All is manual in the land of Alice and Bill, and the tyranny of work in progress keeps productivity and throughput capped.

One day, Carl asks Alice and Bill to get the cost of system defects and outages under control. Alice and Bill both start to research methods for managing defects. Bill, the more linearly aligned of the two, chooses to focus on prevention and starts to introduce checkpoints and controls on how changes get made, with the aim of preventing defects from being introduced.

The Plot Thickens

Alice, on the other hand, thinks a little differently than most. Alice believes the way to reduce costs related to defects is to reduce the impact of each defect introduced. In her mind, the only way to reduce defect counts to zero is either to add more and more gates, barriers and checkpoints, thus providing additional layers of certainty, or to stop all change completely. Neither of these options smells right to Alice, and she decides the first step for her is to start measuring the costs of the outages.

In her efforts to measure the costs, Alice discovers and then introduces a metric that neither Bill nor Carl has seen before, called MTTR. Alice explains to Bill and Carl that she doesn’t believe that speed and quality are naturally opposed to each other, but rather that the game to be played is how to be both fast and good. After Carl hears from both Bill, who wants to limit the number of defects introduced, and Alice, who wants to limit the impact of each defect, he understands the logic of Alice’s approach but thinks it’s just plain crazy not to try to reduce the number of defects. Carl instructs the team to go after both objectives: limit the number of defects, and reduce the impact of each one by measuring and driving down MTTR. When leaving the room, Carl, like every other executive, says, “Remember that you can’t add any new headcount. We have to get costs down, not move them up!”

As Alice and Bill work together to establish the metric and reduce the number of defects, a funny thing happens. Both Alice and Bill see that automation of deployment tasks will positively impact both of their goals without requiring any additional headcount. Alice gets to reduce lead time on applying fixes, while Bill gets to decrease the likelihood of errors as the automation scripts improve. While working through the deployment automation, it becomes clear that the exact same effect happens when the QA process is automated. Even funnier, when the previously manual processes are automated, people’s attention begins to free up, which accelerates the pace of automating the remaining manual processes.

Again and again and again it happens, first in automation moving backwards through the development lifecycle and then in instrumentation, both to lower the time to notice that a defect has occurred and to catch defects earlier in the lifecycle, before they get deployed. Defects and outages go down, speed and throughput go up, and before Alice and Bill can say “Bob’s your uncle”, their teams are doing everything they can to speed up the automation processes themselves to drive cycle time even lower. What a wonderful coincidence! Performance engineering has become something the teams actively engage in.

The Moral of the Story

Did Alice, Bill and Carl live happily ever after? Time will tell, but they are more likely to get better results out of their teams given that they have visibility into, and are grounded by, a metric that treats how you respond to an error as more important than error prevention. Next week, I’ll go under the hood yet again and give a little more insight into how to create a robust, mature and almost qualitative view of my favorite metric: MTTR.