Measuring DevOps Performance

By: Andrew Davis on June 15, 2020

As IT becomes increasingly central to our organizations, it is ever more important to improve our ability to deliver innovations efficiently and safely. DevOps is a movement to reimagine the way we deliver software, with an emphasis on delivering value to end users through automation and collaboration. In the midst of complex changes to complex processes, it’s easy to lose sight of the most important point: our “improvements” must deliver actual improvements. Measuring the performance of a software delivery team is the foundation on which you can assess the impact of changes.

One of the main contributions of the State of DevOps Report has been to focus consistently on the same key metrics year after year. Although the questions in their survey have evolved and new conclusions have emerged over time, the four key metrics used as benchmarks have remained in place:

  1. Lead time (from code committed to code deployed)
  2. Deployment frequency (to production)
  3. Change fail percentage (for production deployments)
  4. Mean time to restore (from a production failure)

The book “Accelerate” provides a detailed explanation of each of these metrics, and why they were chosen; those points are summarized here.

The first two of these metrics pertain to innovation, and the fast release of new capabilities. The third and fourth metrics pertain to stability, and the reduction of defects and downtime. As such, these metrics align with the dual goals of DevOps, to “move fast, and not break things.”

These also align with the two core principles of lean management, derived from the Toyota Production System: “Just in time” and “Stop the line.” “Just in time” is the principle that maximum efficiency comes from reducing waste in the system of work; and that the way to reduce waste is to optimize the system to handle smaller and smaller batches, and to deliver them with increasing speed. “Stop the line” means the system of work is tuned not just to expedite delivery, but also to immediately identify defects to prevent them from being released, thus increasing the quality of the product and reducing the likelihood of production failures.

Lead time is important because the shorter the lead time, the more quickly feedback can be received on the software, and thus the faster innovation and improvements can be released. The book “Accelerate” revealed that one challenge in measuring lead time is that it consists of two parts: the time to develop a feature and the time to deliver it.

The time to develop a feature begins from the moment a feature is requested, but there are some legitimate reasons why a feature might be deprioritized and remain in a product’s backlog for months or years. There is a high inherent variability in the amount of time it takes to go from feature requested to feature developed. Thus, lead time in the State of DevOps Report focuses on measuring only the time to deliver a feature once it has been developed.

The software delivery part of the lifecycle is an important part of total lead time, and is also much more consistent. By measuring the lead time from code committed to code deployed, you can begin to experiment with process improvements that will reduce waiting and inefficiency, and thus enable faster feedback.
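
To make this concrete, here is a minimal sketch (in Python, with a hypothetical deployment-log structure and illustrative timestamps, not tied to any particular CI/CD tool) of how lead time from code committed to code deployed might be computed:

```python
from datetime import datetime
from statistics import median

# Hypothetical deployment log: each production deployment records when it ran
# and the commit timestamps of the changes it shipped.
deployments = [
    {
        "deployed_at": datetime(2020, 6, 1, 16, 30),
        "commit_times": [datetime(2020, 5, 29, 14, 0), datetime(2020, 6, 1, 9, 15)],
    },
    {
        "deployed_at": datetime(2020, 6, 3, 11, 0),
        "commit_times": [datetime(2020, 6, 2, 17, 45)],
    },
]

# Lead time per change: elapsed time from code committed to code deployed.
lead_times_hours = [
    (d["deployed_at"] - commit).total_seconds() / 3600
    for d in deployments
    for commit in d["commit_times"]
]

print(f"Median lead time: {median(lead_times_hours):.1f} hours")
```

Reporting the median rather than the mean keeps a few long-delayed changes from dominating the figure.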

Deployment frequency measures how often code or configuration changes are deployed to production. Deployment frequency is important since it is inversely related to batch size. Teams that deploy to production once per month deploy a larger batch of changes in each deployment than teams that deploy once per week. All changes are not created equal. Within any batch of changes there will be some which are extremely valuable, and others that are almost insignificant.

Large batch sizes imply that valuable features are waiting in line with all the other changes, thus delaying the delivery of value and benefit. Large batches also increase the risk of deployment failures, and make it much harder to diagnose which of the many changes was responsible if a failure occurs. Teams naturally tend to batch changes together when deployments are painful and tedious. By measuring deployment frequency you can track your team’s progress as you work on making deployments less painful and enabling smaller batch sizes.
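
As an illustration, deployment frequency and average batch size can be derived from the same kind of deployment log; the records and observation window below are hypothetical:

```python
# Hypothetical log: (deployment id, number of changes included in that deployment)
deployments = [("d1", 3), ("d2", 5), ("d3", 2), ("d4", 4)]

observation_weeks = 4  # the measurement window, including weeks with no deployments

deploys_per_week = len(deployments) / observation_weeks
avg_batch_size = sum(changes for _, changes in deployments) / len(deployments)

print(f"Deployment frequency: {deploys_per_week:.2f} deployments/week")
print(f"Average batch size:   {avg_batch_size:.1f} changes/deployment")
```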

Change fail percentage measures how frequently a deployment to production fails. Failure here means that a deployment causes a system outage or degradation, or requires a subsequent hotfix or rollback. Modern software systems are complex, fast-changing systems, so some amount of failure is inevitable. Traditionally it has been assumed that there is a trade-off between the frequency of changes and the stability of systems, but the highly effective teams identified in the State of DevOps Report are characterized by both a high rate of innovation and a low rate of failures. Measuring the failure rate allows a team to track and tune its processes so that testing weeds out most failures before they reach production.
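
The arithmetic is simple; a sketch, assuming each deployment record carries a flag for whether it caused an outage, degradation, hotfix or rollback (values here are made up):

```python
# Hypothetical outcomes for the last eight production deployments.
deployment_failed = [False, False, True, False, False, False, True, False]

change_fail_percentage = 100 * sum(deployment_failed) / len(deployment_failed)
print(f"Change fail percentage: {change_fail_percentage:.0f}%")  # 25%
```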

Mean time to restore (MTTR) is closely related to the lead time to release features. In effect, teams that can quickly release features can also quickly release patches. Time to restore indicates the amount of time that a production system remains down, in a degraded state, or with non-working functionality. Such incidents are typically stressful situations, and often have financial implications. Resolving such incidents quickly is a key priority for operations teams. Measuring this metric allows your team to set a baseline on time to recover, and to work to resolve incidents with increasing speed.
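
Again as a minimal sketch, MTTR can be computed from incident records that capture when degradation began and when service was restored; the incidents below are hypothetical:

```python
from datetime import datetime

# Hypothetical incident records: (degradation began, service restored)
incidents = [
    (datetime(2020, 6, 2, 3, 10), datetime(2020, 6, 2, 4, 40)),     # 1.5 hours
    (datetime(2020, 6, 10, 14, 0), datetime(2020, 6, 10, 14, 45)),  # 0.75 hours
]

restore_hours = [(restored - began).total_seconds() / 3600 for began, restored in incidents]
mttr = sum(restore_hours) / len(restore_hours)

print(f"Mean time to restore: {mttr:.2f} hours")
```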

In 2018, the State of DevOps Report added a fifth metric, system uptime, which is inversely related to how much time teams spend recovering from failures. The system uptime metric is an important addition for several reasons. First of all, it aligns with the traditional priorities and key performance indicators of sysadmins (the operations team). The number one goal of sysadmins is keeping the lights on or ensuring that systems remain available. The reason for this is simple: the business depends on these systems and when the systems go down, the business goes down. Outages are expensive.

Tracking system uptime is also central to the discipline of site reliability engineering (SRE). SRE is the evolution of the traditional sysadmin role, expanded to encompass web-scale or cloud-scale systems where one engineer might be responsible for managing 10,000 servers. SRE emerged at Google, which shared its practices in the influential book Site Reliability Engineering. One innovation shared in that book is the concept of an error budget: the recognition that there is a trade-off between reliability and innovation, and that there are acceptable levels of downtime.

According to Chapter 3 of the Site Reliability Engineering book, “Put simply, a user on a 99% reliable smartphone cannot tell the difference between 99.99% and 99.999% service reliability. With this in mind, rather than simply maximizing uptime, Site Reliability Engineering seeks to balance the risk of unavailability with the goals of rapid innovation and efficient service operations, so that users’ overall happiness — with features, service, and performance — is optimized.”
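
The error-budget idea reduces to simple arithmetic: the budget is the unreliability permitted by the availability target. A sketch with illustrative numbers (real targets are set per service):

```python
# Illustrative availability objective over a 30-day window.
slo_target = 0.999                 # 99.9% availability
minutes_in_window = 30 * 24 * 60   # 43,200 minutes

error_budget = 1 - slo_target
allowed_downtime_minutes = error_budget * minutes_in_window

print(f"Error budget: {error_budget:.1%} of the window, "
      f"or about {allowed_downtime_minutes:.0f} minutes of downtime per 30 days")
```

As long as the service stays within its budget, the team is free to spend the remaining unreliability on shipping changes faster.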

The State of DevOps Report shows how these five metrics are interrelated (see Figure 1). The timer starts on lead time the moment a developer finishes and commits a feature to version control. How quickly that feature is released depends on the team’s deployment frequency. While frequent deployments are key to fast innovation, they also increase the risk of failures in production. Change fail percentage measures this risk, although frequent small deployments tend to reduce the risk of any given change. If a change fails, the key issue is then the mean time to restore service. The final metric, availability, captures the net stability of the production system.

Figure 1: How the five key software delivery and operations performance metrics tie together.

Together, these metrics constitute a team’s software delivery performance. The goal of any DevOps initiative should be to improve software delivery performance by strategically developing specific capabilities such as continuous delivery and the use of automated testing.

How your team measures these capabilities is another challenge. But “Accelerate” makes a compelling argument for the validity of surveys. Automated metrics can be implemented over time, although the mechanism for doing so will depend on how you do your deployments. Salesforce production organizations track past deployments, but it’s not currently possible to query those deployments, so you would need to measure deployment frequency (for example) using the tools you use to perform the deployments. Salesforce publishes its own service uptime on Trust, but that gives no indication of whether the critical custom services that customers build on Salesforce are in a working state.

Surveys provide a reasonable proxy for these metrics, especially if responses are given by members of the team in different roles. Guidelines for administering such surveys are beyond the scope of this article, but your teams’ honest responses are the most critical factor. Avoid any policies that could incent the team to exaggerate their answers up or down. Never use these surveys to reward or punish; use them simply to inform. Allow teams to track their own progress and to challenge themselves to improve, for their own benefit and for the benefit of the organization. As one of the principles behind the Agile Manifesto says: “At regular intervals, the team reflects on how to become more effective, then tunes and adjusts its behavior accordingly.”

Metrics provide a reliable, long-term indicator of how your software delivery team is performing. They open the door for your team to experiment with different approaches and assess their impact against a common standard. The key metrics described here are important because they emphasize end-to-end performance, and thus incent teams to collaborate toward that common goal. Balancing velocity with reliability is critical; these metrics should therefore be viewed together, so that one goal is never emphasized at the expense of the other.
