Chaos engineering sounds alluring and exciting—it’s fun to experiment, right? But what some misunderstand about this approach is that it’s not about moving fast and breaking things. It’s about designing and introducing disruptions into the software production process that test the resiliency of the code, much like crash testing in the automotive industry.
If you think about it, this is the logical extension of the way developers like to think anyway: we’re designing software and systems for the real world. Shouldn’t they be able to handle real-world situations?
To design effectively for the real world, developer teams don’t necessarily need to become chaos engineers. This isn’t about training for an entirely new set of skills. Rather, it’s about adopting a chaos mindset—a set of systems and perspectives that lets your team engage in controlled experiments to deliver products that can cope with whatever end-users throw at them.
So, how do you know if your team is ready to adopt chaos engineering principles? And what should you do to help them?
Chaos Engineering: The Basics
This approach is all about using distributed systems as a “safe” experimentation environment, so that dev teams can ensure their products will withstand unexpected turbulence in production. Netflix engineers get the credit for coining the term, based on their experiences developing for Netflix on AWS.
The fun part of chaos engineering is that it lets developers uncover weaknesses in existing systems so they can build in resiliency. By running experiments within existing platforms in a distributed environment, or within a sandbox development environment, teams can supervise and control the experiments without impacting end users.
Some applications, such as transactional applications in finance or trading, should not be disrupted in production; developers should test those in a sandbox environment instead. Chaos engineering is easy to put in place if your team already uses containerization techniques—which should be part of any modern DevOps CI/CD process anyway. It is common to set up a DevOps CI/CD pipeline to run a battery of tests on an application before it reaches the publish step. Within that testing step, you can easily add some disruptions; others may be harder to put in place and require a more elaborate environment.
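As a sketch of what a disruption inside that testing step might look like, here is a minimal fault-injection wrapper in Python. The `flaky` wrapper and `resilient_call` retry helper are hypothetical names, not part of any particular framework; the point is that the pipeline can exercise retry logic under injected failures before the publish step.

```python
import random
import time

def flaky(fn, failure_rate=0.2, max_delay=0.05, seed=None):
    """Wrap fn so calls occasionally fail or slow down, simulating turbulence."""
    rng = random.Random(seed)  # seeded for reproducible experiments
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        time.sleep(rng.uniform(0, max_delay))  # injected latency
        return fn(*args, **kwargs)
    return wrapper

def resilient_call(fn, retries=3):
    """The behavior under test: retry on transient failures."""
    for attempt in range(retries):
        try:
            return fn()
        except ConnectionError:
            if attempt == retries - 1:
                raise  # give up after the last attempt
```

A CI test can then assert that the retrying caller survives a moderate injected failure rate, while a 100% failure rate still surfaces the error.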
Dev teams can use existing environments and systems to show each other how unexpected issues could impact production of a software product, allowing them to collaborate and solve the coding mystery before it slows or stops production entirely.
Applications designed by engineers are subject to common issues, whether they run on-premises or in the cloud. We can group these issues into the following categories: hardware limitations, subsystem failures, external system failures, and software bugs. While we could define more categories, these are the main ones, and the ones that are easiest to test.
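One lightweight way to make those categories concrete in test code is to map each one to a representative injected failure. The `FaultCategory` enum and the example exceptions below are illustrative assumptions, not a standard taxonomy:

```python
from enum import Enum

class FaultCategory(Enum):
    HARDWARE_LIMIT = "hardware limitation"        # e.g. disk full, memory pressure
    SUBSYSTEM_FAILURE = "subsystem failure"       # e.g. database unreachable
    EXTERNAL_FAILURE = "external system failure"  # e.g. third-party API timeout
    SOFTWARE_BUG = "software bug"                 # e.g. unexpected input

def inject(category):
    """Raise a representative exception for each fault category (illustrative)."""
    faults = {
        FaultCategory.HARDWARE_LIMIT: OSError("no space left on device"),
        FaultCategory.SUBSYSTEM_FAILURE: ConnectionRefusedError("db unreachable"),
        FaultCategory.EXTERNAL_FAILURE: TimeoutError("upstream API timed out"),
        FaultCategory.SOFTWARE_BUG: ValueError("unexpected input"),
    }
    raise faults[category]
```

Tests can then iterate over every category and verify the application degrades gracefully under each class of failure, not just the ones it was written to expect.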
Chaos Mindset: The Tao of Chaos
Like Agile, chaos engineering is more than a set of activities and workflows—it’s also a state of mind. Your people and your culture must be ready and able to adopt chaos principles, as well as chaos processes.
For the DevOps leader, adopting a new mindset might sound a little, well, vague. But this shift is based on concrete actions, not just philosophical musings.
Consider an example from the world of cloud infrastructure: a mission-critical application that is hosted within a cloud service could be at risk for failure if, say, that cloud service is centralized in a single location, or within a limited number of microservices within the cloud infrastructure. But if the app is hosted in a distributed way, you can create greater opportunity for application-level availability and resilience, and you can test for that resilience within the existing production environment.
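As a rough sketch of testing that kind of resilience, the helper below tries each regional replica in turn, so a single injected outage should not take the application down. The `call_with_failover` name and the callable endpoints are hypothetical stand-ins for real regional service clients:

```python
def call_with_failover(endpoints, request):
    """Try each replica in turn; one failing region should not take the app down.

    `endpoints` is a list of callables standing in for regional service clients.
    """
    errors = []
    for call in endpoints:
        try:
            return call(request)
        except ConnectionError as exc:  # treat a connection error as a regional outage
            errors.append(exc)
    raise ConnectionError(f"all {len(endpoints)} replicas failed: {errors}")
```

A chaos experiment can then inject an outage into the primary replica and verify the application still answers from the fallback.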
This kind of distributed architecture isn’t brand-new for most enterprises, so the process of developing applications in a way that tests for availability across a variety of infrastructure scenarios shouldn’t be a foreign concept either. As a DevOps leader, you can build a culture of resilience-centric thinking by empowering your teams with the tools they need to adopt chaos-style testing, and then showing them how to build that thinking into every sprint and every standup.
It takes work to train your teams, but you don’t have to do it alone. Netflix’s Simian Army offers plenty of tools and guidance; Facebook Storm and Amazon GameDay, both war-room-style exercises that simulate failures in the cloud, also offer helpful examples and ideas.
Chaos Engineering in Real Life
Deploying a chaos-inspired experiment strategy isn’t easy. Start with a simple set of operating principles as a guide for designing your chaos strategy:
- Hypothesis: Define how you think the system should work, so you know when it doesn’t.
- Scenario planning: What are the possible failure events or crisis points that you know can happen in the real world?
- Set up the experiments: Design the tests that measure the code’s strength, and put them into production.
- Configure monitoring: Monitoring will help test the hypothesis and help you see, in real time, the impact and resolution of the disruption.
- Make the experiment run itself: Automating the tests lets you continuously assess both performance and resilience.
- Contain the experiment: You don’t want software that’s already out in the wild to get disrupted by your lab experiments, so make sure to design with guardrails in place.
- Shut-off switch: Even when the experiment is contained, keep a shut-off switch ready so the disruption can’t keep impacting users.
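The principles above can be sketched as a tiny experiment harness. Everything here (the class name, the callables, the abort threshold) is an illustrative assumption, but it shows how the hypothesis check, containment, and shut-off switch fit together:

```python
import threading

class ChaosExperiment:
    """Minimal harness for the principles above (all names are illustrative)."""

    def __init__(self, hypothesis, inject_fault, rollback, abort_threshold=1):
        self.hypothesis = hypothesis          # callable -> bool: does steady state hold?
        self.inject_fault = inject_fault      # callable: start the disruption
        self.rollback = rollback              # callable: undo the disruption
        self.abort_threshold = abort_threshold
        self.kill_switch = threading.Event()  # manual abort, settable from elsewhere

    def run(self, probes=10):
        failures = 0
        self.inject_fault()
        try:
            for _ in range(probes):
                if self.kill_switch.is_set():  # operator pulled the plug
                    return "aborted"
                if not self.hypothesis():      # steady state violated?
                    failures += 1
                    if failures >= self.abort_threshold:
                        return "hypothesis violated"
            return "hypothesis held"
        finally:
            self.rollback()  # always contain the blast radius, even on abort
```

Automating `run` on a schedule covers the “make the experiment run itself” principle, while the `finally` block and the kill switch cover containment and shut-off.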
The fundamental irony of chaos engineering is that it’s anything but chaotic. It’s a disciplined approach to breaking things that lets your team get smarter, faster, about how their products and applications will work in the real world. It’s a strategy that harmonizes beautifully with Agile practices. And even though it won’t make your apps bulletproof, it can definitely help you dodge the bullets of failure with greater grace and finesse.
You must look at the software architecture as a whole to find the most appropriate chaos engineering testing tool for each portion of the architecture. There isn’t a single tool that can test all the components of your architecture; you will have to use different tools for different parts – and don’t forget monitoring! It’s only through proper monitoring that you can assess how the software is behaving in response to a disruption.
Sometimes, the best approach, and the simplest one, is to build some chaos engineering steps into your software so that, given specific inputs, the software can disrupt itself without obviously exposing a security weakness. This mirrors how cloud applications expose heartbeat and health check endpoints. We like to think of chaos engineering as a great complement to unit testing and integration testing, except that we run it on production systems.
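A minimal sketch of that built-in approach, assuming a hypothetical `CHAOS_MODE` environment flag: the chaos hook is only honored when the flag is explicitly enabled, and the health check reports the injected fault so monitoring can observe it.

```python
import os

# Chaos hooks are only honored when explicitly enabled for this environment,
# so they cannot be triggered where the flag is absent. (CHAOS_MODE is a
# hypothetical flag name, not a standard one.)
CHAOS_ENABLED = os.environ.get("CHAOS_MODE") == "on"

_state = {"inject_latency_ms": 0, "fail_healthcheck": False}

def chaos_control(setting, value):
    """Built-in chaos hook: returns False (ignored) unless chaos mode is on."""
    if not CHAOS_ENABLED:
        return False
    _state[setting] = value
    return True

def healthcheck():
    """Heartbeat endpoint; reports unhealthy when a fault has been injected."""
    if CHAOS_ENABLED and _state["fail_healthcheck"]:
        return {"status": "unhealthy", "reason": "injected fault"}
    return {"status": "ok"}
```

Gating the hook behind an explicit flag keeps the disruption controllable, and surfacing it through the health check is what lets your monitoring verify the system’s response.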