Chaos Engineering for ITOps

Chaos engineering (CE) is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. This approach is becoming commonplace in software development and operations (DevOps) practices. But how would its application extend to ITOps? CE for ITOps offers a similar framework for stress-testing a technology platform to understand its weak points and performance pitfalls under heavy pressure.

CE tends to be used primarily in DevOps during bug testing: setting up experiments to run software under different conditions, such as peak traffic, and monitoring how it functions and performs. This becomes increasingly necessary in cloud-based systems where failure to understand extreme load responses could result in runaway cascade failures or, worse yet, spinning up thousands of extra nodes handling error conditions while not doing any actual work. These same principles, applied to IT operations management (ITOM), help define a functional baseline plus tolerances for infrastructure, policies and processes by clarifying both steady-state and chaotic outputs when extremes are reached.

Applications in IT

The theory of CE in DevOps gained early traction at Netflix as the company moved from physical to virtual infrastructure, with the team that implemented it on AWS breaking off to form Gremlin. However, CE is not typically used in ITOps, because ITOM has historically been separated from development: generally, IT monitors system dynamics, and when a problem occurs, engineering change management or ITSM is brought in to remediate it.

With the growth of containerization in cloud applications, IT infrastructure today looks more like a development environment than a classical multi-tier architecture. But the seemingly limitless scale of the cloud means failures can scale just as far. Microservices benefit from having their elasticity, scalability, data flows and resiliency tested by stressing the system to the edge of its tolerances, so that shortcomings are fixed before a public crash.

One, Two, Three…Chaos

Implementing chaos engineering for ITOM provides a systematic approach to identifying weaknesses in a microservices world. In a monolithic environment, you have visibility into performance and event metrics that may be lost with microservices designs. As a result, the need for operational insights becomes even more critical when scaling to unknown workloads.

Netflix’s Chaos Monkey grew out of CE principles developed within the company’s own cloud-native community, and it was meant to address gaps in common dev tools’ ability to manage extreme complexity. This methodology extends to infrastructure and helps set guardrails on platform behavior as a whole.

Here are five fundamental steps to follow in order to bring this thinking into your team’s ITOM.

Define the Current Steady State

Performing baseline analysis is a standard concept in capacity planning, upgrade strategies and other high-impact functions. Start with something relatively simple (and small) so you don’t get overwhelmed by the data or risk interfering with the business if something goes wrong (such as security Red Teaming). For example, monitor CPU and network utilization, which are common bottlenecks in any IT shop.
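
As a rough illustration of what a small baseline might look like, the sketch below summarizes a handful of CPU utilization readings. The sample values and the function name summarize_baseline are made up for this example; in practice the readings would come from whatever monitoring tool your team already uses.

```python
import statistics


def summarize_baseline(samples):
    """Summarize utilization samples (percent, 0-100) into a steady-state baseline."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to describe a baseline")
    ordered = sorted(samples)
    # n=20 yields 19 cut points; index 18 is the 95th percentile.
    # method="inclusive" interpolates within the observed range only.
    p95 = statistics.quantiles(ordered, n=20, method="inclusive")[18]
    return {
        "mean": statistics.mean(ordered),
        "stdev": statistics.stdev(ordered),
        "p95": p95,
        "max": ordered[-1],
    }


# Hypothetical CPU readings (percent) collected during ordinary operation.
cpu_samples = [31.0, 35.5, 29.8, 40.2, 38.1, 33.3, 36.7, 34.9]
baseline = summarize_baseline(cpu_samples)
```

The same function could be pointed at network utilization readings. Keeping the first baseline to one or two metrics is what keeps the data manageable.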

Define Optimal Conditions

There’s how your system generally operates, and then there’s how it should operate; these typically aren’t the same thing. CPU utilization and network latency are always affected by application efficiencies, hardware conditions and a host of other factors. Create a standard that outlines what engineers should expect on an easy day, on a normal day and on a very hard day. The easy and normal days are the control groups, and the very hard day will be the stress test.
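
One way to write such a standard down is as a small table of limits per kind of day. The sketch below is only an assumption about how that might look: the DayProfile type, the threshold numbers and the classify function are all invented for illustration and would need to be replaced with values from your own baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DayProfile:
    name: str
    cpu_max_pct: float
    latency_max_ms: float


# Hypothetical limits, ordered from lightest to heaviest expected load.
PROFILES = [
    DayProfile("easy", 30.0, 20.0),
    DayProfile("normal", 60.0, 50.0),
    DayProfile("hard", 85.0, 120.0),
]


def classify(cpu_pct, latency_ms):
    """Return the lightest profile whose limits cover the observation."""
    for profile in PROFILES:
        if cpu_pct <= profile.cpu_max_pct and latency_ms <= profile.latency_max_ms:
            return profile.name
    # Beyond even the "hard" day: outside the documented tolerances.
    return "out-of-tolerance"
```

For instance, classify(25.0, 40.0) falls under the "normal" profile because the latency exceeds the easy-day limit even though CPU does not.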

Form a Hypothesis

Where will the system break? Suppose you run an application scenario that doubles the peak traffic of even your worst day so far. Will your CPU maintain optimum utilization as in the control groups (or will the container provisioning engine smoothly deploy additional nodes)? Or will it spike so severely that processes grind to a halt because there isn’t enough memory or network bandwidth left to manage the load?
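
A hypothesis is easier to validate later if it is stated as something that can be checked against numbers. The sketch below shows one possible shape for that. The Hypothesis class, the worst-peak figure of 1200 requests per second and the 85% CPU limit are assumptions chosen for the example, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    description: str
    metric: str
    limit: float

    def holds(self, observations):
        """True if every observed value for the metric stays at or below the limit."""
        values = observations[self.metric]
        if not values:
            raise ValueError("no observations recorded for " + self.metric)
        return all(value <= self.limit for value in values)


WORST_PEAK_RPS = 1200  # hypothetical worst-day peak, in requests per second
test_load_rps = 2 * WORST_PEAK_RPS  # the scenario: double the worst peak

hypothesis = Hypothesis(
    description=f"At {test_load_rps} requests/s, CPU stays at or below 85%",
    metric="cpu_pct",
    limit=85.0,
)
```

Writing the limit down before the experiment runs makes it harder to rationalize a surprising result afterward.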

Execute a Real-World Event (But Contain the Blast Radius)

Do something extreme, such as taking down a firewall that severs connectivity to one internet service provider. This will confuse the application as it tries to respond to requests with repeated failures, ramping up CPU processes as errors return from a dead network endpoint. Log events will mount, filling the database and saturating the backbone.
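
Containing the blast radius usually comes down to two things: a condition that stops the experiment early, and a rollback that runs no matter what. The sketch below illustrates that pattern with simulated values. The function run_experiment and its parameters are hypothetical, the fault injection is a stand-in (no real firewall is touched), and a real run would pause between polls.

```python
def run_experiment(inject, rollback, read_error_rate, abort_above, max_checks):
    """Inject a fault, poll a health metric, and always roll back.

    Returns (aborted, readings). All callables are supplied by the caller;
    rollback is assumed to be safe to call even if inject failed partway.
    """
    readings = []
    aborted = False
    try:
        inject()
        for _ in range(max_checks):
            value = read_error_rate()
            readings.append(value)
            if value > abort_above:
                aborted = True  # blast radius exceeded: stop early
                break
    finally:
        rollback()  # runs even if injection or polling raises
    return aborted, readings


# Simulated run: a dictionary stands in for the real fault, and the
# error rates are canned values rather than live measurements.
state = {"fault": False}
fake_readings = iter([0.01, 0.04, 0.30, 0.50])
aborted, readings = run_experiment(
    inject=lambda: state.update(fault=True),
    rollback=lambda: state.update(fault=False),
    read_error_rate=lambda: next(fake_readings),
    abort_above=0.25,
    max_checks=4,
)
```

In this simulated run the third reading crosses the abort threshold, so polling stops before the fourth reading and the fault is rolled back.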

Validate the Hypothesis

What happened? Monitor utilization and network throughput during the test and see where the system fell over. Is it what you expected, or did something never previously considered take place? Did new chaos erupt from the fissures in your infrastructure? Stabilize, document and remediate.
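
Part of validation is separating what you predicted from what you did not. The sketch below compares the metrics expected to exceed their limits against those that actually did. The metric names, limits and readings are invented for illustration, and the validate function is a hypothetical helper rather than part of any tool.

```python
def first_breaches(series, limits):
    """Map each metric to the index of its first sample above its limit (None if never)."""
    result = {}
    for metric, values in series.items():
        limit = limits[metric]
        result[metric] = next(
            (i for i, value in enumerate(values) if value > limit), None
        )
    return result


def validate(series, limits, expected_to_breach):
    """Split breached metrics into confirmed predictions and surprises."""
    breaches = first_breaches(series, limits)
    breached = {metric for metric, index in breaches.items() if index is not None}
    return {
        "confirmed": sorted(breached & expected_to_breach),
        "surprises": sorted(breached - expected_to_breach),
        "did_not_occur": sorted(expected_to_breach - breached),
        "first_breach_index": breaches,
    }


# Hypothetical readings taken at regular intervals during the test.
series = {
    "cpu_pct": [40, 70, 92, 97],
    "net_mbps": [200, 450, 700, 820],
    "disk_pct": [50, 55, 91, 99],
}
limits = {"cpu_pct": 85, "net_mbps": 900, "disk_pct": 90}
report = validate(series, limits, expected_to_breach={"cpu_pct", "net_mbps"})
```

Here the CPU prediction is confirmed, the network prediction did not occur, and disk usage shows up as a surprise, which is the kind of finding worth documenting and remediating first.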

Never Stop Not Being Afraid

Stressing a system to its absolute max, and a little bit further, to see where things go wrong allows you to understand steady-state behavior and error handling, so you can fix weaknesses before something breaks in new and unexpected ways. What do traffic spikes look like? Which real-world events could occur, and what would their impact be on your organization?

CE is not just for DevOps. It should be a systemic practice for load-testing (out of your comfort zone) to the point of failure. It’s a responsibility for more than microservices deployments and applies to all sorts of disciplines within the IT organization.

— Michael Fisher