Continuous Reliability: How QA and SREs Can Improve Their CI/CD Workflow

A key challenge in maintaining and accelerating the speed of software delivery is balancing the rate of change with the reliability of your software. With the move toward an increasingly automated workflow, as well as teams of dozens to hundreds of engineers introducing changes at a high frequency, the ability to identify the issues that will have a severe impact early on becomes critical.

This challenge raises several questions. With so many errors, is resolving a particular issue really worth my time? How do I detect severe issues in code or infrastructure before they have an impact? Once a release has moved forward, how can I know how well it’s doing, and whether it should be fully deployed, hotfixed or even rolled back?

That’s why we need our CI/CD workflow to work both ways, helping us upgrade and improve our application as well as allowing us to learn and adapt according to how it performs with our users. Enter continuous reliability.

A Feedback Loop Between Everyone

Code quality gates and contextual feedback loops are the CI/CD building blocks that define the emerging practice of continuous reliability. The current and most common feedback loop is between dev and QA teams, and it’s usually the backbone of every good development cycle. However, feedback loops are relevant to the whole development cycle, and they don’t end with QA/dev deploying and testing elements in staging environments.

As we all know, production tends to behave in unexpected and surprising ways; some might even say that issues and bugs appear out of thin air. That’s why QA and SRE teams need accurate answers about the actual quality of releases. They need to identify when critical anomalies move from staging to production, and ideally have the ability to gate them before they harm the application, workflow or customers.

Defining the Anomalies That Matter

Anomaly detection is critical to continuous reliability. In most applications, error and performance anomalies can be mapped into three core types:

New Defects: the introduction of new errors into a system, usually as a result of code changes.
Increasing Errors: an increase in the relative volume (i.e. as compared to throughput) of certain errors, usually as a result of a code or infrastructure change.
Slowdowns: an increase in the response time of critical parts of the application due to inefficient code or improper provisioning or configuration of infrastructure.
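
As a rough illustration, the three anomaly types above can be sketched as a comparison between a baseline release and a candidate release. The names ErrorStats, classify_errors and find_slowdowns, the input shapes, and the 1.5x and 1.25x thresholds are all assumptions made for this example; they do not come from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class ErrorStats:
    """Hypothetical per-release stats for a single error type."""
    count: int  # how many times the error occurred
    calls: int  # calls into the code that can raise it (throughput)

    @property
    def rate(self) -> float:
        # Relative volume: errors normalized by throughput.
        return self.count / self.calls if self.calls else 0.0


def classify_errors(baseline, current, increase_factor=1.5):
    """Return (new_errors, increasing_errors) as sorted lists of error names.

    increase_factor is an assumed threshold, not a recommended value.
    """
    new, increasing = [], []
    for name, stats in current.items():
        if name not in baseline:
            new.append(name)  # never seen in the baseline release
        elif stats.rate > baseline[name].rate * increase_factor:
            increasing.append(name)  # relative volume grew past the threshold
    return sorted(new), sorted(increasing)


def find_slowdowns(baseline_ms, current_ms, slowdown_factor=1.25):
    """Return entry points whose response time grew past the assumed threshold."""
    return sorted(
        name
        for name, ms in current_ms.items()
        if name in baseline_ms and ms > baseline_ms[name] * slowdown_factor
    )


baseline = {"NullPointerException": ErrorStats(count=10, calls=10_000)}
current = {
    "NullPointerException": ErrorStats(count=40, calls=20_000),
    "TimeoutError": ErrorStats(count=5, calls=20_000),
}
print(classify_errors(baseline, current))
print(find_slowdowns({"/checkout": 200.0}, {"/checkout": 320.0}))
```

Comparing rates rather than raw counts is what keeps a traffic spike from being mistaken for an increasing error: doubling the throughput doubles the counts without changing the rate.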

The introduction of any of these can lead to CPU, GC and IO increases. As such, it is imperative that QA and SRE teams not only gate them from moving into production, but also prioritize them and assign severities that help tackle the most important issues first.

However, detecting anomalies in large-scale systems can be nearly impossible due to a high noise-to-signal ratio, and we’re often left to rely on our users and customers to alert us when something goes wrong. That’s why there’s a need for an objective measure of release quality: one that combines machine learning with critical code-level data to automatically gate bad releases from moving up the chain.

Continuous Reliability Through Quality Gates

Once we understand what should count as an anomaly within our application, the next step is setting a quality gate to help us identify, understand and hopefully stop bad code from being promoted to production.

Taking the three core types of anomalies (new defects, increasing errors and slowdowns), we can evaluate four core quality gates that our code should pass before being promoted:

Error Volume: The normalized error rate of an application, based on calls into the code, should never increase between releases.
Unique Error Count: The number of unique errors, especially in key applications, should not increase between releases.
New Errors: New errors of a critical type or occurring at a high rate should block a release.
Increasing Errors and Slowdowns: Severe regressions and slowdowns should block a release.
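
To make the four gates concrete, here is a minimal sketch of how a pipeline step might evaluate them. It assumes the per-release figures have already been collected elsewhere; ReleaseReport, its fields, evaluate_gates and the gate names are hypothetical and only illustrate the rules listed above.

```python
from dataclasses import dataclass, field


@dataclass
class ReleaseReport:
    """Hypothetical summary of one release; all field names are illustrative."""
    total_errors: int
    total_calls: int
    unique_errors: int
    critical_new_errors: list = field(default_factory=list)
    severe_regressions: list = field(default_factory=list)
    severe_slowdowns: list = field(default_factory=list)

    @property
    def error_rate(self) -> float:
        # Error volume normalized by calls into the code.
        return self.total_errors / self.total_calls if self.total_calls else 0.0


def evaluate_gates(previous, candidate):
    """Return the names of the failed gates; an empty list means promote."""
    failed = []
    if candidate.error_rate > previous.error_rate:
        failed.append("error_volume")
    if candidate.unique_errors > previous.unique_errors:
        failed.append("unique_error_count")
    if candidate.critical_new_errors:
        failed.append("new_errors")
    if candidate.severe_regressions or candidate.severe_slowdowns:
        failed.append("increasing_errors_and_slowdowns")
    return failed


previous = ReleaseReport(total_errors=100, total_calls=100_000, unique_errors=12)
candidate = ReleaseReport(
    total_errors=150,
    total_calls=100_000,
    unique_errors=12,
    critical_new_errors=["PaymentDeclinedError"],
)
failed = evaluate_gates(previous, candidate)
print("BLOCK" if failed else "PROMOTE", failed)
```

Returning the list of failed gates, rather than a single pass or fail flag, lets the pipeline report why a release was blocked, which is the kind of contextual feedback the earlier sections call for.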

These four gates can help you define a benchmark that is powerful and broadly applicable to complex environments, yet not so complex that it remains an academic exercise. They will help you understand whether your code is ready to be deployed to your pre-production or production environment, and help reduce or even eliminate errors, issues, slowdowns and anomalies within your application.

— Tal Weiss