Chaos Engineering for ITOps

Chaos engineering (CE) is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. This approach is becoming commonplace in software development and operations (DevOps) practices. But how would its application extend to ITOps? CE for ITOps offers a similar framework for stress-testing a technology platform to understand its weak points and performance pitfalls under heavy pressure.

CE tends to be used primarily in DevOps during bug testing: setting up experiments to run software under different conditions, such as peak traffic, and monitoring how it functions and performs. This becomes increasingly necessary in cloud-based systems where failure to understand extreme load responses could result in runaway cascade failures or, worse yet, spinning up thousands of extra nodes handling error conditions while not doing any actual work. These same principles, applied to IT operations management (ITOM), help define a functional baseline plus tolerances for infrastructure, policies and processes by clarifying both steady-state and chaotic outputs when extremes are reached.

Applications in IT

The theory of CE in DevOps gained early traction at Netflix as the company moved from physical to virtual infrastructure on AWS; engineers who worked on that failure-injection effort later went on to found Gremlin. However, CE is not typically used in ITOps, because ITOM has historically been separated from development: generally, IT monitors system dynamics, and when a problem occurs, engineering change management or ITSM is brought in to remediate the issue.

With the growth of containerization in cloud applications today, IT infrastructure looks more like development environments than classical multi-tier architectures. But the limitless scale of the cloud means failures can also be limitless: Microservices are well-served by testing elasticity and scalability, data flows and resiliency, stressing the system to the edge of its tolerances and fixing shortcomings before a public crash.

One, Two, Three…Chaos

Implementing chaos engineering for ITOM provides a systematic approach to identifying weaknesses in a microservices world. In a monolithic environment, you have visibility into performance and event metrics that may be lost in microservices designs. As a result, operational insight becomes even more critical when scaling to unknown workloads.

Netflix’s Chaos Monkey grew out of CE principles in the company’s own cloud-native engineering culture, built to address gaps in common development tools’ ability to manage extreme complexity. The same methodology extends to infrastructure and helps set guardrails on platform behavior as a whole.

Here are five fundamental steps to follow in order to bring this thinking into your team’s ITOM.

Define the Current Steady State

Performing baseline analysis is a standard concept in capacity planning, upgrade strategies and other high-impact functions. Start with something relatively simple (and small) so you don’t get overwhelmed by the data or risk interfering with the business if something goes wrong (as can happen with security red teaming). For example, monitor CPU and network utilization, which are common bottlenecks in any IT shop.
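As a rough sketch of what that baseline collection might look like (assuming Python with the third-party psutil library installed; the sample count and interval below are illustrative, not recommendations):

import psutil  # third-party; assumed installed (pip install psutil)

def collect_baseline(samples=12, interval_seconds=5):
    """Sample CPU and network utilization to establish a steady-state baseline.

    Sample count and interval are placeholders; tune them to your environment.
    """
    readings = []
    last_net = psutil.net_io_counters()
    for _ in range(samples):
        # cpu_percent blocks for the interval and returns utilization over that window
        cpu_pct = psutil.cpu_percent(interval=interval_seconds)
        net = psutil.net_io_counters()
        # Bytes moved (sent + received) during this window
        net_bytes = (net.bytes_sent - last_net.bytes_sent) + (net.bytes_recv - last_net.bytes_recv)
        last_net = net
        readings.append({"cpu_pct": cpu_pct, "net_bytes": net_bytes})
    return readings

if __name__ == "__main__":
    baseline = collect_baseline()  # roughly one minute of samples with the defaults
    avg_cpu = sum(r["cpu_pct"] for r in baseline) / len(baseline)
    print(f"Average CPU over the window: {avg_cpu:.1f}%")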

Define Optimal Conditions

There’s how your system generally operates, and then there’s how it should operate; these typically aren’t the same thing. CPU utilization and network latency are always affected by application efficiencies, hardware conditions and a host of other factors. Create a standard that outlines what engineers should expect on a normal day, on an easy day and on a very hard day. These are the control groups, and the extreme day will be the stress test.
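One way to make those expectations concrete is to encode them as named profiles that later checks can reference. A minimal sketch; the thresholds are hypothetical placeholders to be replaced with values from your own baseline:

# Hypothetical control-group profiles: what an easy, normal and very hard day
# are allowed to look like. The numbers are illustrative only.
CONDITION_PROFILES = {
    "easy":   {"cpu_pct_max": 30, "latency_ms_max": 50},
    "normal": {"cpu_pct_max": 60, "latency_ms_max": 120},
    "hard":   {"cpu_pct_max": 85, "latency_ms_max": 300},
}

def within_profile(observed_cpu_pct, observed_latency_ms, profile_name):
    """Return True if the observed metrics stay inside the named profile's bounds."""
    profile = CONDITION_PROFILES[profile_name]
    return (observed_cpu_pct <= profile["cpu_pct_max"]
            and observed_latency_ms <= profile["latency_ms_max"])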

Form a Hypothesis

Where will the system break? If you’re running an application scenario such as doubling the peak traffic that even your worst day so far has seen, will your CPU maintain optimum utilization (or will the container provisioning engine smoothly deploy additional nodes) as in the variable control groups, or will it spike so severely that processes grind to a halt because there isn’t enough memory or network bandwidth left to manage the load?
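The hypothesis is most useful when it is written down as a testable assertion before the experiment runs. A minimal sketch, reusing the hypothetical “hard day” thresholds from the profiles above against an observation your monitoring stack would record during the run:

def hypothesis_holds(observed_cpu_pct, observed_latency_ms,
                     hard_day_cpu_max=85, hard_day_latency_ms_max=300):
    """Hypothesis: at double the worst peak traffic seen so far, the platform
    still behaves like a 'hard' day rather than grinding to a halt.
    The thresholds mirror the illustrative profiles above."""
    return (observed_cpu_pct <= hard_day_cpu_max
            and observed_latency_ms <= hard_day_latency_ms_max)

# Hypothetical observation captured during the 2x-peak-load run
observed = {"cpu_pct": 97.0, "latency_ms": 850}
print("hypothesis held" if hypothesis_holds(observed["cpu_pct"], observed["latency_ms"])
      else "hypothesis falsified")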

Execute a Real-World Event (But Contain the Blast Radius)

Do something extreme, such as taking down a firewall that severs connectivity to one internet service provider. This will confuse the application as it tries to respond to requests with repeated failures, ramping up CPU processes as errors return from a dead network endpoint. Log events will mount, filling the database and saturating the backbone.
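On a Linux host, one way to simulate a severed ISP link is to drop outbound traffic to that provider’s gateway and restore it after a fixed window, which keeps the blast radius bounded in both scope and time. A sketch only: the gateway address is a documentation placeholder, the commands require root, and this should run nowhere you are not prepared to break.

import subprocess
import time

ISP_GATEWAY = "203.0.113.1"    # placeholder address standing in for the ISP gateway
BLAST_WINDOW_SECONDS = 120     # keep the failure window short and bounded

def sever_isp_link():
    # Drop all outbound traffic to the gateway (requires root privileges).
    subprocess.run(["iptables", "-A", "OUTPUT", "-d", ISP_GATEWAY, "-j", "DROP"], check=True)

def restore_isp_link():
    # Delete the rule we added, restoring connectivity.
    subprocess.run(["iptables", "-D", "OUTPUT", "-d", ISP_GATEWAY, "-j", "DROP"], check=True)

if __name__ == "__main__":
    sever_isp_link()
    try:
        time.sleep(BLAST_WINDOW_SECONDS)   # observe how the platform reacts while the link is down
    finally:
        restore_isp_link()                 # always clean up, even if observation fails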

Validate the Hypothesis

What happened? Monitor utilization and network throughput during the test and see where the system fell over. Is it what you expected, or did something never previously considered take place? Did new chaos erupt from the fissures in your infrastructure? Stabilize, document and remediate.
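Validation can be as simple as comparing what was recorded during the experiment against the baseline and flagging anything that blew past the expected envelope. A sketch that assumes the reading format from the baseline collector above, with an illustrative tolerance:

def validate(baseline_readings, experiment_cpu_pct, tolerance_pct=20):
    """Compare the experiment's CPU utilization to the baseline average.

    `baseline_readings` uses the format from the earlier collector sketch;
    the tolerance is illustrative, not a recommendation.
    """
    avg_cpu = sum(r["cpu_pct"] for r in baseline_readings) / len(baseline_readings)
    ceiling = avg_cpu * (1 + tolerance_pct / 100)
    if experiment_cpu_pct > ceiling:
        return [f"CPU {experiment_cpu_pct:.1f}% exceeded the expected ceiling of {ceiling:.1f}%"]
    return ["CPU stayed within the expected envelope"]

# Example with made-up numbers
baseline = [{"cpu_pct": 35.0}, {"cpu_pct": 42.0}, {"cpu_pct": 38.0}]
for finding in validate(baseline, experiment_cpu_pct=91.5):
    print(finding)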

Never Stop Not Being Afraid

Stressing a system to its absolute max—and a little bit further—to see where things go wrong allows you to understand steady-state behavior and error-handling, so you can fix it before something breaks in new and unexpected ways. What do traffic spikes look like? What are real-world events and their impacts on your organization?

CE is not just for DevOps. It should be a systemic practice for load-testing (out of your comfort zone) to the point of failure. It’s a responsibility for more than microservices deployments and applies to all sorts of disciplines within the IT organization.

Michael Fisher

Michael Fisher is product manager at OpsRamp. Michael is a creative and collaborative PM with a passion for observability, performance monitoring and machine learning technologies.
