DevOps.com


Improving Resiliency by Creating Chaos

By: Dirghayu Dave on August 5, 2021

In the digital economy, preventing downtime is paramount. When digital systems fail, the consequences for business can be huge. The cost of downtime can run to thousands of dollars per minute for large businesses. That’s without taking into account customer dissatisfaction, reputational damage to the company and the impact on the careers of the IT staff involved.

No matter how you measure it, IT failure is costly. It’s also largely unavoidable due to the increasing complexity and interdependence of today’s distributed IT systems. The combination of cloud computing, microservices architectures and bare-metal infrastructure creates many moving parts and potential points of failure, making those systems anything but predictable.

That IT faults will occur is easy to acknowledge. Understanding how to fix them is harder, especially when interdependencies are not always obvious. Until recently, build testing has been the go-to method for assuring quality and resilience. But this kind of testing does not account for environmental factors that cause unpredictable events, or for faults that lie dormant and unnoticed until they cascade into larger failures.

How Chaos Engineering can Help

Chaos engineering is a relatively new approach to enterprise software development and testing designed to reduce that unpredictability.

Introducing chaos into a system may sound counterintuitive if your end goal is to get clarity and improve resilience. Indeed, if you have heard anything about chaos engineering, you may have been alarmed by some of the terminology: “blast radius,” “random terminations,” “fault injection” and “storms,” to name a few.

In practice, chaos engineering is about performing controlled experiments in a distributed environment so that digital engineering teams can build confidence in the system’s ability to tolerate inevitable future failures.

How it Works

The process of chaos engineering involves stressing applications in testing or production environments by creating disruptive events, such as server outages or API throttling, in a controlled manner. By observing how the system responds, teams can find and fix weaknesses before they affect real customers.
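
As a rough illustration of what a controlled disruptive event can look like at the application level, here is a minimal Python sketch. It is not taken from any particular chaos tool: the names `with_fault_injection`, `InjectedFault`, `fetch_price` and `fetch_price_with_fallback` are hypothetical, and the "downstream service" is a plain function standing in for a real API call.

```python
import random


class InjectedFault(Exception):
    """Raised instead of a real response to simulate an outage or throttling."""


def with_fault_injection(func, failure_rate, rng=None):
    """Wrap func so that roughly failure_rate of its calls raise InjectedFault."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if rng is None:
        rng = random.Random()

    def wrapper(*args, **kwargs):
        # rng.random() returns a float in [0.0, 1.0), so a rate of 1.0
        # always fails and a rate of 0.0 never fails.
        if rng.random() < failure_rate:
            raise InjectedFault("simulated dependency failure")
        return func(*args, **kwargs)

    return wrapper


def fetch_price(item_id):
    # Hypothetical stand-in for a call to a downstream pricing service.
    return {"item": item_id, "price": 9.99}


def fetch_price_with_fallback(fetch, item_id):
    """The behavior being observed: degrade gracefully rather than crash."""
    try:
        return fetch(item_id)
    except InjectedFault:
        return {"item": item_id, "price": None, "degraded": True}
```

Wrapping `fetch_price` with a `failure_rate` of, say, 0.1 and watching how often callers see the degraded response gives a small-scale picture of how the system responds to the fault.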

Experiments are meticulously planned from initial scoping to execution and the insights they deliver are far-reaching.

Chaos planning starts with identifying the target deployment for the experiment. This process requires a comprehensive review of the application architecture and infrastructure components to first define what we call steady-state behavior. In other words, you need to understand what “normal” looks like before you start experimenting.
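
One way to make “normal” concrete is to derive a tolerance band from metrics collected before the experiment. The sketch below assumes a single numeric metric (for example, requests per second) and a band of three standard deviations around the mean. Both the names (`SteadyState`, `baseline_from_samples`) and the three-sigma choice are illustrative assumptions, not a prescribed method.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass(frozen=True)
class SteadyState:
    """An acceptable range for one metric, measured before any fault is injected."""
    metric: str
    low: float
    high: float

    def contains(self, value):
        return self.low <= value <= self.high


def baseline_from_samples(metric, samples, tolerance_sigmas=3.0):
    """Build a steady-state band of mean +/- tolerance_sigmas standard deviations."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate a baseline")
    center = mean(samples)
    spread = stdev(samples) * tolerance_sigmas
    return SteadyState(metric, center - spread, center + spread)


# Hypothetical pre-experiment measurements of requests per second.
normal = baseline_from_samples("requests_per_second", [100, 102, 98, 101, 99])
```

During an experiment, a reading for which `normal.contains(reading)` is false would indicate that the system has left its steady state.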

You can then form a hypothesis about how the system will behave during the disruptive event. You’ll need buy-in from the business areas you are potentially disrupting and you’ll need to plan the parameters of the test carefully, reducing the scope if necessary.

It’s a good idea to start small with chaos experiments. You’ll need to repeat them many times over in any given system to properly test its resiliency.
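
The planning steps above (a hypothesis, a limited scope and repeated runs) can be sketched as a small experiment loop. This is a toy model under stated assumptions: the fleet is an in-memory set of host names, `blast_radius` is treated as a fraction of the fleet, and `pick_targets`, `run_experiment` and the `web-*` hosts are hypothetical names rather than the interface of any real chaos tool.

```python
import random


def pick_targets(hosts, blast_radius, rng):
    """Pick a bounded random subset of hosts; blast_radius is a fraction of the fleet."""
    count = max(1, int(len(hosts) * blast_radius))
    return rng.sample(hosts, count)


def run_experiment(hosts, blast_radius, inject, rollback, hypothesis_holds, runs, seed=None):
    """Repeat a small experiment, stopping at the first run where the hypothesis fails."""
    rng = random.Random(seed)
    results = []
    for _ in range(runs):
        targets = pick_targets(hosts, blast_radius, rng)
        inject(targets)
        try:
            held = hypothesis_holds()
        finally:
            rollback(targets)  # always restore the system, even if the check raises
        results.append(held)
        if not held:
            break  # investigate the weakness before running again
    return results


# Toy system: four hosts, and the hypothesis that service continues
# as long as at least three of them are healthy.
hosts = ["web-1", "web-2", "web-3", "web-4"]
healthy = set(hosts)


def inject(targets):
    healthy.difference_update(targets)  # simulate taking the targets offline


def rollback(targets):
    healthy.update(targets)  # bring the targets back


def hypothesis_holds():
    return len(healthy) >= 3


small = run_experiment(hosts, 0.25, inject, rollback, hypothesis_holds, runs=5, seed=1)
large = run_experiment(hosts, 0.5, inject, rollback, hypothesis_holds, runs=5, seed=1)
```

In this toy setup, taking out one host at a time keeps the hypothesis intact across all five runs, while widening the scope to two hosts violates it on the first run and stops the experiment. Starting small makes that boundary visible without a large disruption.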

Tooling for Experiments

Luckily, many different tools already exist to help organizations implement and manage planned disruptions. Chaos engineering as we know it today originated in 2010 with experiments conducted at Netflix using Chaos Monkey, a tool that is still in use.

Nowadays, there are many more chaos offerings available, including services from Microsoft Azure and AWS as well as Gremlin, Toxiproxy, Litmus and others. Organizations can choose tools tailored to the size of their environment and decide just how automated they want the process to be. Tool selection will also depend on whether experiments are designed to test the system at an infrastructure, network or application level.

Chaos Culture

Chaos engineering is much more than a set of tools and rules. It involves adopting a culture in which teams trust each other and collaborate to build resiliency and advance innovation.

When thinking about this culture shift, it can be helpful to think back to when DevOps was new. Sure, people would say they were using DevOps tools, but that did not necessarily mean they were actually practicing DevOps. DevOps involves breaking down silos between different groups in an organization, creating an atmosphere of trust and enabling collaboration, which are some of the same attributes a chaos engineering culture needs to have.

Why you Need Chaos Engineering

Although chaos engineering sounds like a disruptive or uncontrolled exercise, it is actually the opposite.

Chaos experiments require meticulous planning, with the emphasis firmly on rooting out failures before they become outages. Far from lacking in control, chaos testing is a closely coordinated process, and the organization retains a firm grip on everything from the speed at which testing happens to which components are tested. Chaos engineering doesn’t create problems; it reveals them.

Filed Under: Blogs, Chaos Engineering, Editorial Calendar Tagged With: chaos, chaos engineering, SRE
