7 Important Truths About Chaos Engineering

As a relatively new practice, chaos engineering has plenty of myths surrounding it, from the idea that it means randomly shutting down production systems to the belief that it requires huge investments of time and money. There’s a lot of confusion over the purpose, the value and the practice of chaos engineering. This presents a problem for DevOps teams, especially since more than half of the teams surveyed in the 2019 Gartner DevOps Survey listed improving system reliability and release quality as one of their top five DevOps objectives.

In this article, we’ll clear up some of this confusion by presenting seven important truths to help you make an informed decision about chaos engineering and how it can help your engineering organization.

Truth 1: Chaos Engineering Is Not Chaotic

What comes to mind when you think of chaos engineering as a practice? If the answer is causing random production outages, you’re not alone. Tools such as Chaos Monkey popularized this idea, but for most teams practicing chaos engineering, it is a well-planned, controlled process that aims to mitigate chaos rather than cause it.

The goal of chaos engineering isn’t to add chaos, but to mitigate chaos.

It’s true that chaos engineering involves creating potentially harmful conditions on otherwise healthy systems. For example, we might test our application’s ability to handle load by increasing CPU usage on our servers. With enterprise solutions, we have full control over which systems the test affects (known as the blast radius), how much CPU it consumes (known as the magnitude) and how long it runs. We can also stop the test immediately and roll back its impact if it has unexpected consequences.

Because we know exactly which conditions we’re introducing into our systems, we can revert them at any time. Yes, we’re still causing harm, but we’re doing so in a way that helps us learn about our systems and reduces the risk of failures, both real-world and induced.
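To make the idea of a controlled experiment concrete, here is a minimal Python sketch of an experiment runner. It is not the API of any particular chaos engineering tool: the `CpuExperiment` class, the `run_experiment` function and the `inject`, `rollback` and `is_healthy` callbacks are all hypothetical names chosen for illustration. The sketch only shows how blast radius, magnitude, duration, halting and rollback might fit together.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class CpuExperiment:
    """Hypothetical description of a CPU experiment (illustration only)."""
    targets: Sequence[str]  # blast radius: which hosts are affected
    cpu_percent: int        # magnitude: how much CPU to consume
    duration_s: float       # how long the experiment is allowed to run


def run_experiment(
    exp: CpuExperiment,
    inject: Callable[[str, int], None],  # assumed hook that starts CPU load on a host
    rollback: Callable[[str], None],     # assumed hook that removes the load again
    is_healthy: Callable[[], bool],      # assumed health check for the overall system
    poll_interval_s: float = 1.0,
) -> str:
    """Run the experiment and return "completed" or "halted"."""
    if not 0 < exp.cpu_percent <= 100:
        raise ValueError("cpu_percent must be between 1 and 100")

    started: List[str] = []
    outcome = "completed"
    try:
        for host in exp.targets:
            # Record the host first so rollback also covers a failed injection.
            started.append(host)
            inject(host, exp.cpu_percent)

        deadline = time.monotonic() + exp.duration_s
        while time.monotonic() < deadline:
            if not is_healthy():
                outcome = "halted"  # stop early on unexpected consequences
                break
            time.sleep(poll_interval_s)
    finally:
        # Always revert the change, whether the run finished, halted or raised.
        for host in started:
            rollback(host)
    return outcome
```

In this sketch the rollback lives in a `finally` block, so the injected load is removed even if the health check or the injection itself raises an error.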

Truth 2: Developers Care About Reliability

Developers aren’t only interested in building new features. While development teams often prioritize feature development, this is likely due to business-driven initiatives resulting in rapid release schedules. Developers—especially those who have responded to incidents or worked through bug reports—understand the value of resilient software. They just don’t have time to build it.

The problem is that this creates a reliability gap in our applications. Rapidly building features without adequately testing their resilience creates failure modes, leading to problems that developers need to go back and fix at the expense of new projects. The longer these failure modes go undiscovered, the more likely we are to experience unexpected behaviors and outages in our applications, and the more expensive they become to fix.

Increasing development velocity without testing for reliability increases the likelihood of failures, causing our expected availability to drop.
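As a rough illustration of how more frequent failures lower expected availability, the Python sketch below uses the common steady-state formula, availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair. The formula and the numbers are assumptions added for illustration; they do not come from the survey or from any measured system.

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: uptime / (uptime + downtime)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected hours of downtime in a year at the given failure and repair rates."""
    return (1 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical figures: failures become twice as frequent, repair time stays the same.
    scenarios = [
        ("tested releases", 720.0, 2.0),
        ("faster, untested releases", 360.0, 2.0),
    ]
    for label, mtbf, mttr in scenarios:
        print(
            f"{label}: availability {availability(mtbf, mttr):.4%}, "
            f"about {downtime_hours_per_year(mtbf, mttr):.1f} hours of downtime per year"
        )
```

With the repair time held constant, halving the time between failures roughly doubles the expected downtime, which is the gap this truth describes.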

To prevent this gap from forming, we need to prioritize reliability and give developers the appropriate tooling to test their code and improve the resilience of our applications. This leads to our next truth.

Truth 3: You Have Enough Time for Chaos Engineering

Chaos engineering saves you the time you would spend responding to, troubleshooting and fixing production incidents. As you build resilience through chaos engineering, you’ll reduce the engineering time, effort and costs spent on incidents and can focus more on your core competencies.

That said, adopting chaos engineering does require some time investment. Engineers will need to learn the principles and practices, adopt new tools, run experiments and implement fixes. However, this is relatively minor compared to the cost of production outages. Without chaos engineering, your engineers will need to pause their regular work and rush to restore service whenever an outage occurs. This is time they could be spending on feature development and other value-creating tasks.

A few minutes of proactive testing can uncover defects that would’ve taken hours to fix in production.

Truth 4: Your Systems Support Chaos Engineering

You don’t need to use specific hardware, software, or cloud platforms to benefit from chaos engineering. While some tools have specific requirements, most only require access to the host operating system. Some can even target specific resources, such as containers.

Truth 5: You Don’t Need Comprehensive Observability (But It Helps)

You don’t need a comprehensive monitoring or observability practice in place to benefit from chaos engineering. Basic metrics provided for free by your cloud provider can still provide important insights into how your systems are responding to experiments, and some chaos engineering tools provide basic monitoring for you.

That said, chaos engineering provides significantly more value when used with monitoring and observability. We want to know how our systems are responding to our experiments, and the way we do that is by collecting metrics and traces. Even if you’re still implementing a monitoring solution, chaos engineering can streamline the process by helping you determine which metrics to track, fine-tune your alerts and avoid alarm fatigue.
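One simple way to use metrics during an experiment is to compare them against a baseline collected beforehand. The Python sketch below is a minimal example under stated assumptions: the metric is request latency in milliseconds, the comparison uses the 95th percentile, and the 20% tolerance is an arbitrary threshold. The function names `p95` and `within_tolerance` are hypothetical and not part of any monitoring tool.

```python
from statistics import quantiles
from typing import Sequence


def p95(samples: Sequence[float]) -> float:
    """95th percentile of the samples (needs at least two data points)."""
    # n=20 splits the data into 20 groups; the last of the 19 cut points is p95.
    return quantiles(samples, n=20, method="inclusive")[18]


def within_tolerance(
    baseline_ms: Sequence[float],
    experiment_ms: Sequence[float],
    tolerance: float = 0.2,  # assumed threshold: allow up to 20% slower than baseline
) -> bool:
    """Return True if latency during the experiment stayed close to the baseline."""
    return p95(experiment_ms) <= p95(baseline_ms) * (1 + tolerance)


if __name__ == "__main__":
    # Hypothetical latency samples, in milliseconds.
    baseline = [100.0, 105.0, 98.0, 110.0, 102.0, 99.0, 101.0, 104.0]
    during_experiment = [130.0, 180.0, 125.0, 210.0, 140.0, 135.0, 150.0, 190.0]
    print("within tolerance:", within_tolerance(baseline, during_experiment))
```

A check like this can also feed an alert threshold: if the experiment pushes a metric past the tolerance without an alert firing, that suggests the alert needs tuning.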

Truth 6: Chaos Engineering Pays for Itself

Like any new tool or practice, chaos engineering requires an upfront investment of time and money. Solutions need to be evaluated and purchased, and engineers need time to warm up to the practice. However, the value it provides far exceeds these costs. Using chaos engineering can help you:

Avoid the revenue and productivity lost to incidents, along with the engineering time spent resolving them.
Improve your ability to generate sales by helping you build reliable systems and prepare for periods of increased demand.
Save costs by helping you optimize resource allocation and identify non-essential infrastructure.

Truth 7: You Can Get Started Today

Getting started with chaos engineering can be as easy as deploying an agent to your systems and choosing a resource to target for experimentation. You don’t need to have a stable, mature SRE practice or enterprise-grade infrastructure. In fact, you can use chaos engineering to help establish your SRE practice or accelerate migration to a new platform.

The most important truth is that no system is perfect, and there are failure modes hidden throughout our applications and infrastructure. Chaos engineering lets us uncover these failure modes and transform reliability from myth to reality.