Measuring DevOps Performance

As IT becomes increasingly central to our organizations, it is increasingly important to improve our ability to deliver innovations efficiently and safely. DevOps is a movement to reimagine the way we deliver software, with an emphasis on delivering value to end users through automation and collaboration. In the midst of complex changes to complex processes, it’s easy to lose sight of the most important point: Our “improvements” must deliver actual improvements. Measuring the performance of a software delivery team is the basic foundation on which you can assess the impact of changes.

One of the main contributions of the State of DevOps Report has been to focus consistently on the same key metrics year after year. Although the questions in its survey have evolved and new conclusions have emerged over time, the four key metrics used as benchmarks have remained in place:

Lead time (from code committed to code deployed)
Deployment frequency (to production)
Change fail percentage (for production deployments)
Mean time to restore (from a production failure)

The book “Accelerate” provides a detailed explanation of each of these metrics, and why they were chosen; those points are summarized here.

The first two of these metrics pertain to innovation, and the fast release of new capabilities. The third and fourth metrics pertain to stability, and the reduction of defects and downtime. As such, these metrics align with the dual goals of DevOps, to “move fast, and not break things.”

These also align with the two core principles of lean management, derived from the Toyota Production System: “Just in time” and “Stop the line.” “Just in time” is the principle that maximum efficiency comes from reducing waste in the system of work; and that the way to reduce waste is to optimize the system to handle smaller and smaller batches, and to deliver them with increasing speed. “Stop the line” means the system of work is tuned not just to expedite delivery, but also to immediately identify defects to prevent them from being released, thus increasing the quality of the product and reducing the likelihood of production failures.

Lead time is important because the shorter the lead time, the more quickly feedback can be received on the software, and thus the faster innovation and improvements can be released. “Accelerate” points out that one challenge in measuring lead time is that it consists of two parts: the time to develop a feature and the time to deliver it.

The time to develop a feature begins from the moment a feature is requested, but there are some legitimate reasons why a feature might be deprioritized and remain in a product’s backlog for months or years. There is a high inherent variability in the amount of time it takes to go from feature requested to feature developed. Thus, lead time in the State of DevOps Report focuses on measuring only the time to deliver a feature once it has been developed.

The software delivery part of the lifecycle is an important part of total lead time, and is also much more consistent. By measuring the lead time from code committed to code deployed, you can begin to experiment with process improvements that will reduce waiting and inefficiency, and thus enable faster feedback.
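As a minimal sketch of this measurement, the following Python computes the commit-to-deploy lead time for a handful of changes. The timestamps and the names `changes`, `lead_times`, and `median_lead_time` are hypothetical; in practice the data would come from your version control and deployment tooling. Reporting the median rather than the mean is an assumption made here so that one unusually slow change does not dominate the result.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical records: (time committed, time deployed to production).
changes = [
    (datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 15, 0)),   # 6 hours
    (datetime(2024, 3, 2, 10, 0), datetime(2024, 3, 4, 10, 0)),  # 48 hours
    (datetime(2024, 3, 5, 8, 0), datetime(2024, 3, 5, 20, 0)),   # 12 hours
]


def lead_times(records):
    """Return the commit-to-deploy duration for each change."""
    return [deployed - committed for committed, deployed in records]


def median_lead_time(records):
    """Return the median commit-to-deploy duration as a timedelta."""
    return median(lead_times(records))


print("Median lead time:", median_lead_time(changes))
```

Tracking this number over time, rather than reading any single value in isolation, is what lets you judge whether a process change actually reduced waiting.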

Deployment frequency measures how often code or configuration changes are deployed to production. It is important because it is inversely related to batch size: teams that deploy to production once per month deploy a larger batch of changes in each deployment than teams that deploy once per week. Not all changes are created equal. Within any batch of changes there will be some that are extremely valuable and others that are almost insignificant.

Large batch sizes imply that valuable features are waiting in line with all the other changes, thus delaying the delivery of value and benefit. Large batches also increase the risk of deployment failures, and make it much harder to diagnose which of the many changes was responsible if a failure occurs. Teams naturally tend to batch changes together when deployments are painful and tedious. By measuring deployment frequency you can track your team’s progress as you work on making deployments less painful and enabling smaller batch sizes.
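One simple way to track this, sketched below in Python, is to count production deployments over a reporting period and express the result per week. The dates and the function name `deployments_per_week` are hypothetical, and the list of deployment dates is assumed to come from whatever tool performs your deployments.

```python
from datetime import date

# Hypothetical dates on which a production deployment took place.
deploy_dates = [
    date(2024, 3, 4),
    date(2024, 3, 6),
    date(2024, 3, 6),   # two deployments on the same day both count
    date(2024, 3, 13),
    date(2024, 3, 27),
]


def deployments_per_week(dates, period_start, period_end):
    """Average deployments per 7-day week between two dates (inclusive)."""
    days = (period_end - period_start).days + 1
    in_period = [d for d in dates if period_start <= d <= period_end]
    return len(in_period) / (days / 7)


# Five deployments over a 28-day period.
rate = deployments_per_week(deploy_dates, date(2024, 3, 4), date(2024, 3, 31))
print("Deployments per week:", rate)
```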

Change fail percentage measures how frequently a deployment to production fails. Failure here means that a deployment causes a system outage or degradation, or requires a subsequent hotfix or rollback. Modern software systems are complex, fast-changing systems, so some amount of failure is inevitable. Traditionally it’s been felt that there’s a trade-off between frequency of changes and stability of systems, but the highly effective teams identified in the State of DevOps Report are characterized by both a high rate of innovation and a low rate of failures. Measuring failure rate allows the team to track and tune their processes so that testing catches most defects before they reach production.
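The definition of failure above can be turned into a small calculation. The following Python is a sketch only: the `Deployment` record and its field names are hypothetical, and how you detect an incident, hotfix, or rollback depends on your own tooling.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """One production deployment and its outcome (hypothetical fields)."""
    name: str
    caused_incident: bool = False  # outage or degraded service
    needed_hotfix: bool = False
    rolled_back: bool = False

    @property
    def failed(self) -> bool:
        # A deployment counts as failed once, however many flags are set.
        return self.caused_incident or self.needed_hotfix or self.rolled_back


def change_fail_percentage(deployments):
    """Percentage of production deployments that failed."""
    if not deployments:
        raise ValueError("no deployments in the period")
    failures = sum(1 for d in deployments if d.failed)
    return 100.0 * failures / len(deployments)


history = [
    Deployment("r1"),
    Deployment("r2", rolled_back=True),
    Deployment("r3"),
    Deployment("r4"),
    Deployment("r5", caused_incident=True, needed_hotfix=True),
    Deployment("r6"),
    Deployment("r7"),
    Deployment("r8"),
]

print("Change fail percentage:", change_fail_percentage(history))  # 25.0
```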

Mean time to restore (MTTR) is closely related to the lead time to release features. In effect, teams that can quickly release features can also quickly release patches. Time to restore indicates the amount of time that a production system remains down, in a degraded state, or with non-working functionality. Such incidents are typically stressful situations, and often have financial implications. Resolving such incidents quickly is a key priority for operations teams. Measuring this metric allows your team to set a baseline on time to recover, and to work to resolve incidents with increasing speed.
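A baseline for this metric can be computed from a simple incident log. The sketch below assumes each incident is recorded as a pair of timestamps (when service was impaired, when it was restored); the sample data and the name `mean_time_to_restore` are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (service impaired, service restored).
incidents = [
    (datetime(2024, 3, 3, 14, 0), datetime(2024, 3, 3, 14, 30)),  # 30 minutes
    (datetime(2024, 3, 11, 9, 0), datetime(2024, 3, 11, 10, 30)), # 90 minutes
    (datetime(2024, 3, 20, 22, 0), datetime(2024, 3, 20, 23, 0)), # 60 minutes
]


def mean_time_to_restore(records):
    """Average duration between impairment and restoration."""
    if not records:
        raise ValueError("no incidents in the period")
    durations = [restored - started for started, restored in records]
    # Start the sum from a zero timedelta so durations add correctly.
    return sum(durations, timedelta()) / len(durations)


print("MTTR:", mean_time_to_restore(incidents))
```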

In 2018, the State of DevOps Report added a fifth metric, system uptime, which is inversely related to how much time teams spend recovering from failures. The system uptime metric is an important addition for several reasons. First of all, it aligns with the traditional priorities and key performance indicators of sysadmins (the operations team). The number one goal of sysadmins is keeping the lights on, that is, ensuring that systems remain available. The reason for this is simple: the business depends on these systems, and when the systems go down, the business goes down. Outages are expensive.
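To make the inverse relationship between uptime and recovery time concrete, the following Python sketch derives an uptime percentage from the total time spent in outages during a period. The outage durations and the name `uptime_percentage` are hypothetical, and this simple formula assumes outages do not overlap.

```python
from datetime import timedelta


def uptime_percentage(outages, period):
    """Share of the period during which the system was available."""
    downtime = sum(outages, timedelta())
    return 100.0 * (1 - downtime / period)


# Hypothetical outages in a 30-day period: 3 hours of downtime in total.
outages = [timedelta(minutes=30), timedelta(hours=2, minutes=30)]
uptime = uptime_percentage(outages, timedelta(days=30))
print(f"Uptime: {uptime:.3f}%")
```

The longer each restoration takes, the larger the total downtime and the lower the resulting percentage.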

Tracking system uptime is also central to the discipline of site reliability engineering (SRE). SRE is the evolution of the traditional sysadmin role, expanded to encompass web-scale or cloud-scale systems where one engineer might be responsible for managing 10,000 servers. SRE emerged from Google, who shared their practices in the influential book Site Reliability Engineering. One innovation shared in that book is the concept of an error budget, which is the recognition that there is a trade-off between reliability and innovation, and that there are acceptable levels of downtime.

According to the Site Reliability Engineering Book, Chapter 3, “Put simply, a user on a 99% reliable smartphone cannot tell the difference between 99.99% and 99.999% service reliability. With this in mind, rather than simply maximizing uptime, Site Reliability Engineering seeks to balance the risk of unavailability with the goals of rapid innovation and efficient service operations, so that users’ overall happiness — with features, service, and performance — is optimized.”
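The error budget idea can be illustrated with a short calculation. In the Python sketch below, the availability target of 99.9%, the 30-day period, and the function names are all assumptions chosen for illustration; they are not values prescribed by the Site Reliability Engineering book.

```python
from datetime import timedelta


def error_budget(availability_target, period):
    """Downtime permitted in the period for a target such as 0.999."""
    if not 0 < availability_target <= 1:
        raise ValueError("availability_target must be a fraction such as 0.999")
    return period * (1 - availability_target)


def remaining_budget(availability_target, period, downtime_so_far):
    """Budget left after subtracting downtime already incurred."""
    return error_budget(availability_target, period) - downtime_so_far


period = timedelta(days=30)
budget = error_budget(0.999, period)  # about 43 minutes of allowed downtime
left = remaining_budget(0.999, period, timedelta(minutes=20))

print("Error budget:", budget)
print("Remaining:", left)
```

A team might treat a positive remaining budget as room to keep releasing, and an exhausted budget as a signal to prioritize stability work until the next period.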

The State of DevOps Report shows how these five metrics are interrelated (see Figure 1). The timer starts on lead time the moment a developer finishes and commits a feature to version control. How quickly that feature is released depends on the team’s deployment frequency. While frequent deployments are key to fast innovation, they also increase the risk of failures in production. Change fail percentage measures this risk, although frequent small deployments tend to reduce the risk of any given change. If a change fails, the key issue is then the mean time to restore service. The final metric, availability, captures the net stability of the production system.

Figure 1: Correlation between the five metrics.

How the Five Key Software Delivery and Operations Performance Metrics Tie Together

Together, these metrics constitute a team’s software delivery performance. The goal of any DevOps initiative should be to improve software delivery performance by strategically developing specific capabilities such as continuous delivery and the use of automated testing.

How your team measures these capabilities is another challenge, but “Accelerate” makes a compelling argument for the validity of surveys. Automated metrics can be implemented over time, although the mechanism to do this will depend on how you do your deployments. Salesforce production organizations track past deployments, but it’s not currently possible to query those deployments, so you would need to measure deployment frequency (for example) using the tools you use to perform the deployments. Salesforce publishes its own service uptime on Trust, but that gives no indication of whether critical custom services that customers build on Salesforce are in a working state or not.

Surveys provide a reasonable proxy for these metrics, especially if responses are given by members of the team in different roles. Guidelines for administering such surveys are beyond the scope of this book, but your teams’ honest responses are the most critical factor. Avoid any policies that could incent the team to exaggerate their answers up or down. Never use these surveys to reward or punish; they should be used simply to inform. Allow teams to track their own progress and to challenge themselves to improve for their own benefit and for the benefit of the organization. As the Agile Manifesto says, “At regular intervals, the team reflects on how to become more effective, then tunes and adjusts its behavior accordingly.”

Metrics provide a reliable, long-term indicator of how your software delivery team is performing. They open the door for your team to experiment with different approaches and assess their impact using a common standard. The key metrics described here are important because they emphasize end-to-end performance, and thus incent teams to focus on collaboration towards this common goal. Balancing velocity with reliability is critical, so these metrics should be viewed together to ensure that one goal is never emphasized at the expense of the other.