Over the past decade, chaos engineering has become one of the most popular approaches in DevOps. It’s uniquely adapted to complex cloud-based systems and has the potential to succeed where more conventional approaches may not.
Chaos Engineering Explained
Traditionally, DevOps teams worked with systems hosted on single servers. They would comb through logs line by line, looking for potential anomalies. They implemented automated tests to ensure the system worked to spec. They employed root cause analysis to debug applications.
Modern systems use cloud-based distributed architectures and are far too complex for this approach to succeed; they contain hundreds, and sometimes thousands, of instances. In many cases, their behavior is chaotic and nearly impossible to predict.
Because of this, modern systems need to be designed for resiliency from the outset. Chaos engineering was developed to foster resilient system design by deliberately anticipating failure modes.
Two Steps to Chaos
Here’s how it works. After characterizing a state of “health” for their system, DevOps teams perform a chaos experiment. Just like physics or chemistry experiments, chaos experiments involve setting up two versions of a system: the experimental version and the control.
The experimental system is then subjected to a barrage of virtual kicks and punches, which could range from injecting faults into individual services to disrupting an entire data center. DevOps teams then pore over the resulting log data, attempting to understand how the chaos experiment affected the state of the system.
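To make that concrete, here is a minimal sketch of a chaos experiment in Python. The measure_error_rate helper, the inject_fault stand-in, and the numbers are all illustrative assumptions, not a real monitoring API; in practice you would pull the health metric from your observability tooling and inject the fault with a dedicated chaos tool.

```python
import random
import statistics

def measure_error_rate(group):
    # Illustrative stand-in: a real experiment would query your
    # monitoring system for this group's request error rate.
    base = 0.010 if group == "control" else 0.012
    return [max(0.0, random.gauss(base, 0.002)) for _ in range(50)]

def inject_fault():
    # Stand-in for the "kick": e.g. killing an instance or adding latency.
    print("Injecting fault into the experimental group...")

def run_experiment(tolerance=0.005):
    control = measure_error_rate("control")            # 1. establish steady state
    inject_fault()                                      # 2. apply the fault
    experimental = measure_error_rate("experimental")
    drift = statistics.mean(experimental) - statistics.mean(control)
    print(f"Error-rate drift under fault: {drift:+.4f}")
    return abs(drift) <= tolerance                      # 3. still "healthy"?

if __name__ == "__main__":
    print("System resilient:", run_experiment())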
Chaos engineering differs from conventional DevOps practice in two major ways. First, while conventional forms of testing supply a controlled input and check for a predefined output, chaos experiments are far more open-ended. They're designed from the outset to expect the unexpected, and they require analysis of the whole system to yield novel insights.
Second, they are performed on live systems. This makes chaos engineering controversial, as its experiments have the potential to affect real users. But it is also what makes the approach far more powerful than the alternatives: chaos engineering encourages resilient design by taking genuine risks with systems in the wild.
Why Companies Embrace Chaos Engineering
Chaos engineering has been around for over a decade. In that time, it’s been adopted by several of the biggest names in tech. Netflix, the company that pioneered the approach, remains the best known for their use of fault injection.
In 2008, Netflix was moving from an in-house data center to the cloud. They knew that, as they expanded, their application would undergo massive horizontal scaling. Because any one of its thousands of instances could fail at any moment, it needed to be resilient.
They created the first chaos experiment, Chaos Monkey. It worked by randomly turning off instances in Netflix’s live application to check that it was fault-tolerant. The practice sounds dangerous, but it pushed Netflix engineers to design resilient system architectures.
Chaos Monkey sounds bananas, but it worked. In its first five years of operation, there was only one failure, which was promptly fixed by an on-call engineer. That failure was caused by Chaos Monkey itself.
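For a sense of what this looks like in code, here is a Chaos Monkey-style sketch, not Netflix’s actual implementation. It assumes instances opt in via a hypothetical chaos-opt-in tag and uses the AWS SDK for Python (boto3) to terminate one of them at random.

```python
import random
import boto3  # AWS SDK for Python

ec2 = boto3.client("ec2")

def terminate_random_instance(tag_key="chaos-opt-in", tag_value="true"):
    """Pick one running, opted-in instance at random and terminate it."""
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": f"tag:{tag_key}", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if not instances:
        print("No opted-in instances running; nothing to do.")
        return None
    victim = random.choice(instances)
    ec2.terminate_instances(InstanceIds=[victim])
    print(f"Terminated {victim}. Watch your dashboards.")
    return victim

if __name__ == "__main__":
    terminate_random_instance()
```

Run on a schedule during business hours, a script like this forces every service to tolerate the loss of any single instance while engineers are on hand to respond.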
Facebook is another big company to adopt chaos engineering. They created Project Storm in the wake of 2012’s Hurricane Sandy. Although the hurricane itself didn’t affect Facebook, it served as a wake-up call. A storm hitting one of Facebook’s data centers could wreak havoc.
In 2014, Facebook took the step of turning off traffic to one of their data centers. Although no end users were affected, traffic loading went haywire across all sorts of subsystems.
Once Facebook had successfully identified this weakness in their system’s resilience, they could preemptively design traffic management software that would be resilient in the face of a live data center shutdown.
As more and more companies use cloud computing and hyperscale data centers, chaos engineering is likely to become an essential feature of DevOps strategy.
Applying Chaos Engineering In Your Company
Chaos engineering is the ideal approach for complex, distributed systems. Chances are, your DevOps team either runs their application in the cloud or is planning to migrate to it.
Container orchestration and platform-as-a-service offerings like Kubernetes and Heroku are popular with an increasing number of organizations. It’s now common for systems to contain hundreds or even thousands of instances, hosted in massive data centers.
So far, we’ve only seen chaos engineering from 10,000 feet. It’s time to look at how you can make it work on the ground.
First, we need to understand the roles observability and monitoring play in chaos engineering. To successfully implement chaos experiments, there needs to be a high level of system visibility.
Dealing with Data
Marius Moscovici outlines four strategies for extracting valuable insights from data. He discusses the importance of detecting anomalies in your data stream, something a chaos engineer needs to be good at.
Whether it’s the haywire traffic in Facebook’s Project Storm or the deactivation of Netflix instances, chaos experiments rely on the ability to diagnose unusual patterns in data.
1. Profiling Normal System Behavior
In order to detect anomalies, you need to have a good idea of what your system normally does. In their book on chaos engineering, Netflix used the analogy of human vital signs. Just as doctors know that 98.6°F is a normal human body temperature, their IT counterparts need to characterize a “normal” state of “health” for their DevOps system.
Many log analysis and observability tools on the market offer machine learning capabilities to help achieve this.
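As a rough illustration, the sketch below builds a simple baseline from a window of exported latency readings. The numbers are made up; in practice the data would come from your log analysis or observability tool.

```python
import statistics

def build_baseline(latencies_ms):
    """Summarize steady-state behavior so later readings can be compared."""
    ordered = sorted(latencies_ms)
    return {
        "mean": statistics.mean(latencies_ms),
        "stdev": statistics.stdev(latencies_ms),
        "p95": ordered[int(0.95 * (len(ordered) - 1))],
    }

# Illustrative sample of recent request latencies (ms) pulled from logs.
history = [118, 122, 120, 125, 119, 121, 130, 117, 123, 124]
print(build_baseline(history))
# -> {'mean': 121.9, 'stdev': 3.84..., 'p95': 125}
```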
2. Detecting Anomalies Across The Board
Anomaly detection should be integrated into every step of your data analysis. You need to be able to detect anomalies across the board, not just in the places you happen to be looking. Because chaos engineering, unlike more conventional forms of testing, is designed to reveal new and unexpected information, it’s vitally important to have a completely open data strategy.
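One way to make that concrete is to apply the same simple check to every metric you collect rather than to a hand-picked few. The sketch below runs a z-score test against baselines like the one built in the previous step; all names and values are illustrative.

```python
def find_anomalies(readings, baselines, z_threshold=3.0):
    """Flag any metric whose latest reading strays z_threshold standard
    deviations from its historical mean."""
    anomalies = {}
    for metric, value in readings.items():
        base = baselines[metric]
        if base["stdev"] == 0:
            continue
        z = abs(value - base["mean"]) / base["stdev"]
        if z > z_threshold:
            anomalies[metric] = round(z, 1)
    return anomalies

baselines = {
    "latency_ms": {"mean": 120.0, "stdev": 4.0},
    "error_rate": {"mean": 0.01, "stdev": 0.002},
    "cpu_percent": {"mean": 55.0, "stdev": 8.0},
}
readings = {"latency_ms": 240.0, "error_rate": 0.011, "cpu_percent": 58.0}
print(find_anomalies(readings, baselines))  # -> {'latency_ms': 30.0}
```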
3. Data Scientists On the Lookout
Always be on the lookout for potentially anomalous behavior. Chaos engineering will require your company to take its DevOps strategy to a new level, both in terms of your team’s mindset and the sophistication of its logging tools.
Because chaos engineering requires you to analyze all available data without making assumptions, it’s necessary to leverage machine learning to extract patterns from your logs. Chaos engineering requires a scientific mindset and an appreciation of the power of data.
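As a small illustration of that kind of pattern extraction, the sketch below clusters raw log lines with TF-IDF and k-means (it assumes scikit-learn is installed). In a real pipeline this would typically be handled by your observability tool’s built-in machine learning, but the idea is the same: group similar messages so rare or novel patterns stand out for human review.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative log lines; in practice these would be streamed from your logs.
log_lines = [
    "GET /api/v1/users 200 123ms",
    "GET /api/v1/users 200 119ms",
    "POST /api/v1/orders 201 240ms",
    "connection reset by peer on upstream pool",
    "GET /api/v1/users 200 130ms",
    "connection reset by peer on upstream pool",
]

# Turn each line into a TF-IDF vector, then group similar lines together.
vectors = TfidfVectorizer().fit_transform(log_lines)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for cluster in set(labels):
    members = [line for line, label in zip(log_lines, labels) if label == cluster]
    print(f"cluster {cluster} ({len(members)} lines): {members[0]}")
```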
4. Create Alerts
Finally, make sure that anomalous behavior is being flagged and that the team is aware when it occurs. If possible, use dynamic alerts to avoid the false positives and alert fatigue that are common with static thresholds.
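Here is a minimal sketch of what a dynamic threshold can look like: the alert fires when a reading strays well outside the rolling statistics of a recent window rather than past a fixed cutoff. The window size, multiplier, and sample values are illustrative assumptions.

```python
from collections import deque
import statistics

class DynamicAlert:
    def __init__(self, window=60, multiplier=3.0):
        self.history = deque(maxlen=window)  # rolling window of recent readings
        self.multiplier = multiplier

    def check(self, value):
        fired = False
        if len(self.history) >= 10:  # wait until there is enough context
            mean = statistics.mean(self.history)
            stdev = statistics.pstdev(self.history) or 1e-9
            fired = value > mean + self.multiplier * stdev
        self.history.append(value)
        return fired

alert = DynamicAlert()
for latency in [120, 122, 118, 121, 119, 123, 120, 122, 121, 119, 118, 410]:
    if alert.check(latency):
        print(f"ALERT: latency {latency}ms is far outside recent behavior")
```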
Engineering with Chaos
More and more organizations are migrating to the cloud and building highly distributed application infrastructures. DevOps teams are being forced to confront chaos and complexity on a daily basis.
While the complexity of modern IT systems has necessitated chaos engineering, it has also enabled it. In a chaotic world, chaos engineering brings the order of an engineering approach.