Metrics for DevOps

Your organization is now committed to DevOps methodology. You’ve integrated development, operations, and QA, and you have moved drastically away from waterfall and into a fire hose of application development. How do you find out how well it works? How do you know if it’s working at all? Many organizations that are new to DevOps are surprised to find that it was easier to start than they thought, but hard to keep alive. This can be because of bad habits creeping back in, or just because the team is too far into the weeds. In a word, you need metrics.

You need metrics not just as a way to measure the success (or lack of success) of your DevOps program, but also as a way to find out how it can be improved, modified, or extended. Without metrics, you’re flying blind. With metrics, you have a holistic point of view: you know where you are, where you’re going, where you can go, and how to get there. But I am not talking about analytics tools which measure the activities within the pipeline. I’m talking about measuring the pipeline itself.

What kinds of factors do DevOps metrics measure? For the most part, they measure such things as the speed of development, deployment, and customer response; the frequency of deployments and failures; repair time; the volume of repair requests; and the rate of change in these indicators. The focus of DevOps metrics tends to be on deployment, operations, and support (as opposed to design and early-stage development, for example), since most of the ongoing effort associated with DevOps is in these areas.

Most DevOps metrics fall into three general categories:

People. People are an intrinsic part of any DevOps process. People-oriented metrics measure such things as turnover, capability, and response time. Always start with people. They are the hardest element of any process to measure, and their influence is sometimes hard to spot.
Process. In some ways, DevOps is all about process: the continual deployment/operations/support cycle is an ongoing suite of interwoven processes. But some metrics are more clearly process-oriented than others, particularly those involving continuous delivery, response, and repair. Development-to-deployment lead time, for example, is a largely process-oriented metric, as are deployment frequency and response time. Process metrics can be a measure of speed (Where are the bottlenecks, and is the process itself a bottleneck?), appropriateness (Are all steps relevant?), effectiveness (Does it get the job done?), or efficiency (Are the steps in the optimum sequence? Is there a smooth flow within the process?).
Technology. Technology metrics also play a major role in DevOps, measuring such things as uptime (What percentage of the time is the system running? What about the network, and support applications?) and failure rate (What is the percentage of failed deployments, changes, or units?).
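
To make the uptime question concrete, here is a minimal sketch in Python of how an uptime percentage might be computed from a total observation window and the downtime recorded within it. The function name and the sample figures are assumptions made up for illustration, not output from any particular monitoring tool.

```python
def uptime_percent(total_minutes, downtime_minutes):
    """Return the percentage of the observation window the system was up."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    if not 0 <= downtime_minutes <= total_minutes:
        raise ValueError("downtime_minutes must be between 0 and total_minutes")
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


# Hypothetical figures: a 30-day month with 432 minutes of recorded downtime.
minutes_in_month = 30 * 24 * 60
uptime = uptime_percent(minutes_in_month, 432)
print(f"Uptime: {uptime:.2f}%")
```

The same calculation can be run separately for the network and for each supporting application, which helps show where the downtime is actually coming from.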

Of course, many DevOps metrics involve all three categories to a greater or lesser degree. Perhaps the easiest way to see how metrics play out in practice is to look at the key metrics used by Puppet Labs (Puppet is not a customer of mine, nor am I a customer of theirs):

Deployment (or Change) Frequency

DevOps practices make frequent or continuous deployment possible; large, high-traffic web sites and cloud-based services make it a necessity. With fast feedback and small-batch development, updated software can be deployed every few days, or even several times per day. In a DevOps environment, deployment frequency can be a direct or indirect measure of response time, team cohesiveness, developer capabilities, development tool effectiveness, and overall DevOps team efficiency.
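
As a rough illustration, deployment frequency can be derived from nothing more than a list of deployment timestamps. The sketch below groups them by ISO week. The function name and the sample log are hypothetical, and in practice the timestamps would come from whatever your CI/CD system records.

```python
from collections import Counter
from datetime import datetime


def deployments_per_week(timestamps):
    """Count deployments per (ISO year, ISO week)."""
    counts = Counter()
    for ts in timestamps:
        iso = ts.isocalendar()  # (year, week, weekday)
        counts[(iso[0], iso[1])] += 1
    return dict(counts)


# Hypothetical deployment log (not real data).
deploys = [
    datetime(2024, 1, 1, 10, 0),
    datetime(2024, 1, 3, 15, 30),
    datetime(2024, 1, 4, 9, 0),
    datetime(2024, 1, 9, 11, 0),
]

weekly = deployments_per_week(deploys)
for (year, week), count in sorted(weekly.items()):
    print(f"{year}-W{week:02d}: {count} deployment(s)")
```

Weekly buckets are only one possible choice. A team deploying several times per day would probably want daily buckets instead.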

Change Lead Time

The time from the start of a development cycle (the first new code) to deployment is the change lead time. It’s a measure of the efficiency of the development process, of the complexity of the code and the development systems, and also (like deployment frequency) of team and developer capabilities. If the change lead time is too long, it may be an indication that the development/deployment process is inefficient in certain stages, or that it includes performance bottlenecks.
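
One way to track this, sketched below under simplifying assumptions, is to record the time of the first commit and the time of deployment for each change and then summarize the differences. The median is used here rather than the mean so that a single unusually slow change does not dominate the figure; that is a design choice, not a rule. All names and sample values are hypothetical.

```python
from datetime import datetime
from statistics import median


def lead_time_hours(first_commit, deployed):
    """Hours elapsed between the first commit of a change and its deployment."""
    return (deployed - first_commit).total_seconds() / 3600


# Hypothetical (first_commit, deployed) pairs, one per change.
changes = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 2, 9, 0)),
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 15, 0)),
    (datetime(2024, 1, 4, 9, 0), datetime(2024, 1, 6, 9, 0)),
]

lead_times = [lead_time_hours(start, end) for start, end in changes]
median_lead_time = median(lead_times)
print(f"Median change lead time: {median_lead_time:.1f} hours")
```

Breaking the same interval into stages (commit to build, build to test, test to deploy) is a natural extension when you need to find which stage holds the bottleneck.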

Change Failure Rate

One of the main goals of DevOps is to turn rapid, frequent deployments into an everyday affair. Needless to say, in order for such deployments to have value, the failure rate must be low. It should, in fact, decrease over time, as the experience and capabilities of the DevOps teams increase. An increasing failure rate, or one that is high and does not go down over time, is a good indication of problems in the overall DevOps process.
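
The failure rate and its trend can be sketched as follows. The example assumes each deployment has already been labeled as failed or not, which in reality depends on how your team defines a failed change. The function name and the monthly data are invented for illustration.

```python
def change_failure_rate(outcomes):
    """Fraction of deployments that failed.

    `outcomes` is a list of booleans, True meaning the deployment failed.
    """
    if not outcomes:
        return 0.0
    return sum(outcomes) / len(outcomes)


# Hypothetical deployment outcomes per month (True = failed).
monthly_outcomes = {
    "2024-01": [True, False, False, True],
    "2024-02": [False, True, False, False, False],
    "2024-03": [False] * 9 + [True],
}

rates = {month: change_failure_rate(o) for month, o in monthly_outcomes.items()}

# Check whether the rate dropped from each month to the next.
values = list(rates.values())
is_declining = all(earlier > later for earlier, later in zip(values, values[1:]))

for month, rate in rates.items():
    print(f"{month}: {rate:.0%} of deployments failed")
print("Failure rate declining:", is_declining)
```

A strict month-over-month comparison like this is deliberately simple. With noisy real-world data, a moving average would likely give a steadier picture of the trend.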

Mean Time To Recover (MTTR)

This is the time from a failure to recovery from that failure. It’s generally a good measure of team capabilities, and like the failure rate, it should show an overall decrease over time (allowing for occasional longer recovery times when the team encounters a technically unfamiliar problem). MTTR can also be affected by such things as code (or platform) complexity, the number of new features being implemented, and changes in the operating environment (such as migration to a new cloud server).
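
As a minimal sketch, MTTR can be computed by averaging the interval between failure and recovery across incidents. The incident records below are hypothetical, and the example assumes your incident tracker gives you both timestamps for each incident.

```python
from datetime import datetime, timedelta


def mean_time_to_recover(incidents):
    """Average of (recovered - failed) over a list of (failed, recovered) pairs."""
    if not incidents:
        raise ValueError("at least one incident is required")
    durations = [recovered - failed for failed, recovered in incidents]
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incidents: (time of failure, time of recovery).
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),
    (datetime(2024, 1, 12, 14, 0), datetime(2024, 1, 12, 15, 30)),
    (datetime(2024, 1, 20, 8, 0), datetime(2024, 1, 20, 9, 0)),
]

mttr = mean_time_to_recover(incidents)
print("MTTR:", mttr)
```

Computing this per month or per quarter, rather than over all incidents at once, is what makes the downward trend described above visible.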

Based on these metrics, the ideal DevOps team would produce frequent, rapid deployments with a low (and declining) failure rate and a short (and shrinking) recovery time. In practice, of course, there may be factors which run counter to these trends — less need or opportunity for frequent deployments, for example, or frequent changes in operating conditions or requirements. In general, however, these metrics do cover some of the most important performance issues in DevOps.

The bottom line is that if you’re using DevOps methodology, you need metrics, and you need good metrics tools. Without metrics, you won’t have any way of knowing if your DevOps implementation is doing what you want it to do, or if it includes problem areas that require your attention.

At the very least, good DevOps metrics should cover the four areas outlined above (deployment frequency, lead time, failure rate, and time to recovery), along with other key indicators of performance which matter to your organization. And good DevOps metrics tools should present that information to you in ways which are clear, detailed, accurate, and easily configurable, allowing you to quickly focus on potential problem areas and other key points in the DevOps process.

Tools that measure DevOps performance will typically monitor real-time processes and conditions (such as uptime and downtime, or volume of traffic) and log events (builds, deployments, failures, repair tickets, etc.). More generalized project-management tools which track factors such as project schedule and goals vs. performance can also provide valuable DevOps metrics.

A note of caution: when you start to look closely at DevOps metrics, you’ll find that there are quite a lot of them available. Before you dive in, it’s important to understand clearly which metrics are most important to you, and how they apply to your DevOps program. If you don’t know what you’re looking for, it’s all too easy to get lost in a maze of not-very-relevant data, but if you stay focused, you can head straight for the DevOps metrics gold, and put it to good use.