In the digital economy, preventing downtime is paramount. When digital systems fail, the consequences for businesses can be huge. For large businesses, the cost of downtime can run to thousands of dollars per minute, and that is before taking into account customer dissatisfaction and the reputational damage to the company and to the IT careers involved.
No matter how you measure it, IT failure is costly. It’s also largely unavoidable due to the increasing complexity and interdependence of today’s distributed IT systems. The combination of cloud computing, microservices architectures and bare-metal infrastructure creates many moving parts and potential points of failure, making those systems anything but predictable.
That IT faults will occur is easy to acknowledge. Understanding how to fix them is harder, especially when interdependencies are not always obvious. Until recently, build testing has been the go-to method for assuring quality and resilience, but this kind of testing does not account for environmental factors that cause unpredictable events, or for latent faults that lie dormant and unnoticed until they cascade into larger failures.
How Chaos Engineering Can Help
Chaos engineering is a relatively new approach to enterprise software development and testing designed to eliminate that unpredictability.
Introducing chaos into a system may sound counterintuitive if your end goal is clarity and improved resilience. Indeed, if you have heard anything about chaos engineering, you may have been alarmed by some of the terminology: “blast radius,” “random terminations,” “fault injection” and “storms,” to name a few.
In practice, chaos engineering is about performing controlled experiments in a distributed environment so that digital engineering teams can build confidence in the system’s ability to tolerate inevitable future failures.
How It Works
The process of chaos engineering involves stressing applications in testing or production environments by creating disruptive events in a controlled manner, such as server outages or API throttling. By observing how the system responds, teams can fix the weaknesses they uncover before those weaknesses affect real customers.
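As a minimal sketch of what a controlled disruptive event can look like, the Python snippet below wraps calls to a hypothetical health endpoint and injects artificial latency into a small share of them; the URL, fault rate and delay are illustrative assumptions rather than settings from any particular tool.

```python
import random
import time
import urllib.request

# Hypothetical target and parameters, used only for illustration.
TARGET_URL = "http://localhost:8080/health"
FAULT_RATE = 0.2        # inject a fault into roughly 20% of requests
INJECTED_DELAY_S = 2.0  # simulated latency for a slow or throttled dependency


def call_with_fault_injection(url: str) -> tuple[bool, float]:
    """Call the target service, occasionally injecting latency, and record the outcome."""
    start = time.monotonic()
    if random.random() < FAULT_RATE:
        time.sleep(INJECTED_DELAY_S)  # the controlled disruptive event
    try:
        with urllib.request.urlopen(url, timeout=1.0) as resp:
            ok = 200 <= resp.status < 300
    except OSError:  # covers connection errors, timeouts and HTTP errors
        ok = False
    return ok, time.monotonic() - start


if __name__ == "__main__":
    results = [call_with_fault_injection(TARGET_URL) for _ in range(50)]
    successes = sum(1 for ok, _ in results if ok)
    slowest = max(latency for _, latency in results)
    print(f"success rate: {successes / len(results):.0%}, slowest call: {slowest:.2f}s")
```

A dedicated chaos tool would typically inject the same kind of fault at the infrastructure or network layer rather than in the client, but the principle of disrupting a little and observing the response is the same.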
Experiments are meticulously planned from initial scoping to execution and the insights they deliver are far-reaching.
Chaos planning starts with identifying the target deployment for the experiment. This process requires a comprehensive review of the application architecture and infrastructure components to first define what we call steady-state behavior. In other words, you need to understand what “normal” looks like before you start experimenting.
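To make that concrete, here is a small sketch, using made-up baseline numbers, of how steady state can be codified as explicit thresholds that later experiments are measured against:

```python
import statistics

# Made-up baseline observations gathered while the system runs normally:
# request latencies in milliseconds and an error count over a sampling window.
baseline_latencies_ms = [42, 38, 51, 45, 40, 47, 39, 44, 50, 43]
baseline_errors = 1
baseline_requests = 1000

# Codify "normal" as explicit, measurable thresholds.
steady_state = {
    "p95_latency_ms": statistics.quantiles(baseline_latencies_ms, n=20)[-1],
    "max_error_rate": max(baseline_errors / baseline_requests, 0.01),
}


def within_steady_state(p95_latency_ms: float, error_rate: float) -> bool:
    """Return True if current metrics still look like the measured baseline."""
    return (p95_latency_ms <= steady_state["p95_latency_ms"] * 1.5
            and error_rate <= steady_state["max_error_rate"])


print(steady_state)
print(within_steady_state(p95_latency_ms=70.0, error_rate=0.005))
```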
You can then form a hypothesis about how the system will behave during the disruptive event. You’ll need buy-in from the business areas you are potentially disrupting and you’ll need to plan the parameters of the test carefully, reducing the scope if necessary.
It’s a good idea to start small with chaos experiments. You’ll need to replicate them many times over in any given system to properly test its resiliency.
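The sketch below simulates one such small experiment; the blast radius, request counts and recovery rates are invented for illustration, but it shows the shape of a hypothesis (the error rate stays under 2% while 5% of requests are disrupted) being checked across repeated runs:

```python
import random

# Illustrative experiment parameters, kept deliberately small to start.
BLAST_RADIUS = 0.05      # disrupt only 5% of simulated requests
RUNS = 20                # repeat the experiment to build confidence
REQUESTS_PER_RUN = 500
MAX_ERROR_RATE = 0.02    # hypothesis: the error rate stays under 2% during the fault


def simulated_request(fault_injected: bool) -> bool:
    """Stand-in for a real request; an assumed retry path absorbs most injected faults."""
    if fault_injected:
        return random.random() < 0.7   # assume retries recover ~70% of faulted calls
    return random.random() < 0.999     # assumed normal background success rate


def run_experiment() -> bool:
    """Run one small experiment and report whether the hypothesis held."""
    failures = 0
    for _ in range(REQUESTS_PER_RUN):
        fault = random.random() < BLAST_RADIUS
        if not simulated_request(fault):
            failures += 1
    return failures / REQUESTS_PER_RUN <= MAX_ERROR_RATE


if __name__ == "__main__":
    passed = sum(run_experiment() for _ in range(RUNS))
    print(f"hypothesis held in {passed}/{RUNS} runs")
```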
Tooling for Experiments
Luckily, many different tools already exist to help organizations implement and manage planned disruptions. Chaos engineering as we know it today originated in 2010 with experiments conducted at Netflix using the tool Chaos Monkey, which is still in use today.
Nowadays, there are many more chaos offerings available, including managed services from Microsoft Azure and AWS as well as tools such as Gremlin, ToxiProxy and Litmus. Organizations can choose tools tailored to the size of their environment and decide just how automated they want the process to be. Tool selection will also depend on whether experiments are designed to test the system at the infrastructure, network or application level.
Chaos Culture
Chaos engineering is much more than a set of tools and rules. It involves adopting a culture in which teams trust each other and collaborate to build resiliency and advance innovation.
When it comes to thinking about this culture shift, it can be helpful to think back to when DevOps was new. Sure, people would say they were using DevOps tools, but that did not necessarily mean they were actually practicing DevOps. DevOps involves breaking down silos between different groups in an organization, creating an atmosphere of trust and enabling collaboration, some of the same attributes a chaos engineering culture needs to have.
Why You Need Chaos Engineering
Although chaos engineering sounds like a disruptive or uncontrolled exercise, it is actually the opposite.
Chaos experiments require meticulous planning, with an emphasis firmly on rooting out failures before they become outages. Far from lacking control, chaos testing is a closely coordinated process, and the organization retains a firm grip on everything from the speed at which testing happens to which components are tested. Chaos engineering doesn’t create problems; it reveals them.
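One way that grip shows up in practice is as automated guardrails around every experiment. The sketch below is a rough illustration rather than any tool’s real API: the metrics query and rollback are stand-ins, but the pattern of halting the moment steady state is breached is the point.

```python
import random
import time

# Illustrative guardrails: halt the experiment automatically the moment the
# observed behaviour drifts too far from the defined steady state.
MAX_ERROR_RATE = 0.05
CHECK_INTERVAL_S = 1.0
EXPERIMENT_DURATION_S = 10.0


def current_error_rate() -> float:
    """Stand-in for querying a real monitoring system during the experiment."""
    return random.uniform(0.0, 0.08)


def stop_fault_injection() -> None:
    """Stand-in for tearing down the injected fault and restoring normal traffic."""
    print("guardrail breached: rolling back the experiment")


def run_guarded_experiment() -> None:
    deadline = time.monotonic() + EXPERIMENT_DURATION_S
    while time.monotonic() < deadline:
        if current_error_rate() > MAX_ERROR_RATE:
            stop_fault_injection()
            return
        time.sleep(CHECK_INTERVAL_S)
    print("experiment completed within guardrails")


if __name__ == "__main__":
    run_guarded_experiment()
```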