Measuring Success in DevOps: Top Four DevOps KPIs

The DevOps concept is nothing new. Many companies have adopted DevOps to improve and accelerate software development and help drive their digital transformation, and that number continues to grow. There are now entire tool ecosystems, methodologies and transformation models, as well as endless resources, available to guide companies along the DevOps journey.

But DevOps success can be difficult to measure. DevOps isn’t a formal framework; it’s more of a culture and a set of practices, and there is limited guidance available to ensure you’re doing it properly or accurately measuring your successes and failures. DevOps also looks different in every organization—no two DevOps shops will be the same.

However, despite the nebulous definition of what a DevOps organization looks like, there are some core key performance indicators (KPIs) that should be common to all DevOps environments: asset management, monitoring, continuous integration/continuous deployment and continuous security.

Asset Management

In this context, asset management means measuring what is automated and how effectively it is controlled. It doesn’t matter whether you are using Puppet, Chef, Ansible or Terraform; what matters is how effectively you use them. These tools allow you to automatically provision bare-metal or cloud assets, join them to a load balancer and have them start receiving traffic. You can achieve similar results with Kubernetes or by chaining lambda functions. Automation is really about controlling assets to achieve your business goals.

Whether you’re using on-premises virtualization or cloud workloads, taking the time to streamline your processing cycles down to the minimum hardware required can significantly reduce monthly costs. The higher the percentage of your servers, containers, services, etc. that is automated, the less time you’ll spend fixing and maintaining your infrastructure. Some of your applications may be so uniquely configured for your organization that they cannot be automated. If that is the case, it might be time to retire them and build next-generation applications.
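As a rough illustration of the automation percentage described above, the following Python sketch computes it from a simple inventory. The Asset record, its fields and the sample asset names are hypothetical; a real inventory would come from your configuration management or cloud tooling.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    kind: str        # e.g. "server", "container", "service"
    automated: bool  # True if provisioned and configured by automation


def automation_coverage(assets):
    """Return the percentage of assets managed by automation (0.0 if empty)."""
    if not assets:
        return 0.0
    automated = sum(1 for asset in assets if asset.automated)
    return 100.0 * automated / len(assets)


# Hypothetical inventory, for illustration only.
inventory = [
    Asset("web-1", "server", True),
    Asset("web-2", "server", True),
    Asset("billing-legacy", "service", False),
    Asset("api", "container", True),
]

coverage = automation_coverage(inventory)
manual = [asset.name for asset in inventory if not asset.automated]
print(f"{coverage:.0f}% automated; still manual: {manual}")
```

Tracking the list of manual assets alongside the percentage shows which applications are candidates for automation or retirement.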

Monitoring

Should you be monitoring all of your resources? Yes. Should you also filter out the noise? Absolutely. Monitoring agents should be deployed to all of your servers, and if your PaaS tools and other providers offer integrations, you should be using them and filtering out unneeded data. The only thing worse than not monitoring anything is monitoring everything without proper filtering.
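One way to picture noise filtering is a small rule that drops events from known noisy sources and anything below a minimum severity. This is only a sketch: the event shape, the source names and the WARNING threshold are assumptions, and most monitoring tools provide their own filtering rules that would replace this logic.

```python
# Severity names mapped to numeric ranks so they can be compared.
LEVELS = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40, "CRITICAL": 50}

# Hypothetical sources that produce high-volume, low-value events.
NOISY_SOURCES = {"healthcheck", "heartbeat"}

# Assumed threshold: keep WARNING and above.
MIN_LEVEL = "WARNING"


def is_signal(event):
    """Return True if the event is worth forwarding to the monitoring tool."""
    if event["source"] in NOISY_SOURCES:
        return False
    # Unknown levels rank as 0 and are therefore dropped.
    return LEVELS.get(event["level"], 0) >= LEVELS[MIN_LEVEL]


events = [
    {"source": "healthcheck", "level": "ERROR", "message": "probe timeout"},
    {"source": "api", "level": "INFO", "message": "request served"},
    {"source": "api", "level": "ERROR", "message": "database unreachable"},
    {"source": "worker", "level": "WARNING", "message": "queue depth rising"},
]

signal = [event for event in events if is_signal(event)]
for event in signal:
    print(event["source"], event["level"], event["message"])
```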

Good monitoring should always be coupled with acting on the information acquired. Capturing logs, parsing them and feeding them into tools such as ELK, Datadog or Sumo Logic gives you a large toolkit that can not only glean useful information from a tremendous amount of data, but can also make decisions and perform predictive analyses. You might learn that your stack is operating inefficiently, or not scaling fast enough during peak utilization periods.

Disparate systems should be aggregated into a common tool, lest you spend valuable time combining information across systems or pulling information from data warehouses to get your insight. Let the tools do that work for you. You should also be using tools such as Logstash to pull and understand application, database and other logs in your stack for meaningful data. If not, you’re limiting your insight into the service’s health.

Continuous Integration/Continuous Deployment

Continuous integration and continuous deployment (CI/CD) is the difference between releasing software once a quarter (or year) and performing dozens of releases to production within a day.

CI/CD is also one of the most difficult indicators to measure, since every environment is different. It is important to understand how many code commits don’t make it past your integration environment. These blockages could be the result of poor programming practices, poor code quality or poor testing. Whatever the cause, if features stay in dev or integration for too long, there is surely an underlying issue. In most cases, it centers on automated testing.
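The two measurements mentioned above, the share of commits that never leave integration and the commits that have sat there too long, could be sketched as follows. The record fields, the sample commits and the one-day limit are assumptions made for illustration; in practice the data would come from your CI server or repository.

```python
from datetime import datetime, timedelta


def blocked_rate(commits):
    """Percentage of commits that have not been promoted past integration."""
    if not commits:
        return 0.0
    blocked = sum(1 for commit in commits if commit["promoted_at"] is None)
    return 100.0 * blocked / len(commits)


def stalled_commits(commits, now, max_age):
    """Return the IDs of unpromoted commits older than max_age."""
    return [
        commit["sha"]
        for commit in commits
        if commit["promoted_at"] is None and now - commit["entered_at"] > max_age
    ]


# Hypothetical commit records, for illustration only.
commits = [
    {"sha": "a1", "entered_at": datetime(2024, 1, 1, 9, 0),
     "promoted_at": datetime(2024, 1, 1, 10, 0)},
    {"sha": "b2", "entered_at": datetime(2024, 1, 1, 9, 30), "promoted_at": None},
    {"sha": "c3", "entered_at": datetime(2024, 1, 3, 9, 0), "promoted_at": None},
    {"sha": "d4", "entered_at": datetime(2024, 1, 3, 10, 0),
     "promoted_at": datetime(2024, 1, 3, 10, 20)},
]

now = datetime(2024, 1, 3, 12, 0)
rate = blocked_rate(commits)
stalled = stalled_commits(commits, now, max_age=timedelta(days=1))
print(f"{rate:.0f}% of commits blocked; stalled more than a day: {stalled}")
```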

So, how do you measure good testing? This is one of the most difficult questions enterprises face, especially when automated testing is added after monolithic applications are already written. Essentially, good testing is about creating good business logic and integrating testing that meets most use cases, while covering each function and class. Immediate feedback on failed or successful tests expedites the automation process. The less automated testing you build, the more QA has to test by hand. Failing to integrate good business and integration testing significantly slows down the entire CD part of the pipeline. Also remember that shooting for 100% test coverage is unrealistic; over 80% should be sufficient.
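The 80% guideline can be turned into a simple pipeline gate. The sketch below assumes you already have covered and total line counts per module from your coverage tool; the module names and numbers are made up, and treating exactly 80% as a pass is a choice made here, not something the guideline specifies.

```python
def percent(covered, total):
    """Coverage percentage; an empty module counts as fully covered."""
    if total == 0:
        return 100.0
    return 100.0 * covered / total


def coverage_gate(modules, threshold=80.0):
    """Return (overall percentage, gate passed, modules below threshold)."""
    covered = sum(c for c, _ in modules.values())
    total = sum(t for _, t in modules.values())
    overall = percent(covered, total)
    below = sorted(
        name for name, (c, t) in modules.items() if percent(c, t) < threshold
    )
    return overall, overall >= threshold, below


# Hypothetical per-module line counts: name -> (covered lines, total lines).
modules = {
    "billing": (180, 200),
    "auth": (45, 100),
    "api": (275, 300),
}

overall, passed, below = coverage_gate(modules)
print(f"overall {overall:.1f}%, gate passed: {passed}, below threshold: {below}")
```

Reporting the modules below the threshold matters because a healthy overall number can hide a poorly tested component, as the hypothetical auth module does here.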

It’s also important to remember that applications that are not in a pipeline and must be manually tested will never keep up with the pace of cloud-born innovation. If you are dealing with a giant, monolithic legacy app, it’s time to design its replacement (starting with testing). Pipelines should also have notifications built into them, with chaining logic to react to failures. Tools such as Jenkins, CircleCI, Travis CI and most code repository solutions, such as GitLab, offer rich feature sets to aid in pipeline creation and management.

Continuous Security

The easiest way to ensure security compliance is to shift left and resolve security issues in the development stage. Too often, security is applied at production, which means it’s not part of the environment’s end-to-end process. Securing applications and networks at the development level will give you more confidence that applications will interoperate properly at the production level.

After shifting left, make sure you continuously deploy and monitor your clusters, nodes and pods in a secure manner. Ideally, you have a tool that provides a real-time summary of your cluster’s compliance and security status by looking at workload security and governance checks, cluster worker node CIS Kubernetes Benchmark checks, cluster Ingress controller security best practices and Istio security checks.
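To make the idea of a compliance summary concrete, here is a minimal sketch that rolls individual check results up by category. The category labels mirror the four areas listed above, but the result format and check names are hypothetical placeholders, not the output of any particular scanner or real benchmark identifiers.

```python
from collections import defaultdict


def summarize_compliance(results):
    """Count passed and failed checks per category."""
    summary = defaultdict(lambda: {"passed": 0, "failed": 0})
    for result in results:
        outcome = "passed" if result["passed"] else "failed"
        summary[result["category"]][outcome] += 1
    return dict(summary)


def is_compliant(summary):
    """A cluster is treated as compliant only if no category has failures."""
    return all(counts["failed"] == 0 for counts in summary.values())


# Hypothetical check results, for illustration only.
results = [
    {"category": "workload", "check": "example-check-1", "passed": True},
    {"category": "workload", "check": "example-check-2", "passed": False},
    {"category": "worker-node-cis", "check": "example-check-3", "passed": True},
    {"category": "ingress", "check": "example-check-4", "passed": True},
    {"category": "istio", "check": "example-check-5", "passed": True},
]

summary = summarize_compliance(results)
for category, counts in summary.items():
    print(category, counts)
print("compliant:", is_compliant(summary))
```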

DevOps Success Never Ends

Success in measuring your core DevOps KPIs is not about getting to full automation immediately—it’s about understanding your biggest issues, continuously securing them and improving them over time. DevOps has no end; success requires continuously measuring your progress and improving your environment with the most effective and up-to-date tools. Take control, implement thorough and meaningful monitoring, deliver with confidence and do it securely.

— Gadi Naor