Why You Need Chaos Engineered into Your Hybrid Cloud Infrastructure

Nearly a decade ago, I began working at Amazon as an engineer. Cloud technology was just starting to revolutionize how companies launch applications, and early adopters were seeing the benefits of not having to purchase, provision and manage their own hardware. On the flip side, larger companies (especially those that store sensitive information) were still hesitant to fully embrace the black box that was “the cloud,” holding onto the belief that for critical applications, the cloud was neither reliable nor secure.

Fast forward to today, and all of that has changed. We are seeing government institutions, banks, and hospitals all moving their infrastructure over to the cloud. The benefits of being agile, innovative, and highly efficient have mostly overshadowed the concerns around the cloud’s reliability. According to Cloud Technology Partners, cloud adopters see an average reduction in total cost of ownership (TCO) of around 40 percent. And top-performing banks are 88 percent more likely to implement a hybrid cloud solution than their underperforming peers, according to a recent report from the IBM Institute for Business Value.

Long story short, the question is no longer “if” but “which,” and there are three major options: AWS, GCP and Azure. Today, many companies are doing their best to avoid vendor lock-in, opting instead for a distributed strategy in which teams are empowered to choose the best option for what they are working on. Beyond avoiding lock-in, a hybrid approach can also save teams from disaster if one of the cloud providers has a major outage, like the great S3 outage of 2017.

But just because you have a hybrid strategy (running active-passive across different cloud providers) doesn’t mean you’re home free. Unless you’ve actually tested that your passive stack will jump in to save the day in the event of a crisis, you’re more or less closing your eyes and crossing your fingers in the hopes that your active stack never goes down. Think of it like running a fire drill: How can you be sure everything will run smoothly during a disaster if you never actually run the drill?
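To make the fire-drill idea concrete, here is a minimal sketch of a failover drill against a toy active-passive setup. Everything in it is hypothetical: `Stack`, `Router` and `run_failover_drill` are illustrative names, not part of any real chaos engineering tool or cloud SDK, and a real drill would inject failures into actual infrastructure rather than flip a flag in memory.

```python
from dataclasses import dataclass


@dataclass
class Stack:
    """Hypothetical deployment of the application on one cloud provider."""
    name: str
    healthy: bool = True

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unavailable")
        return f"{self.name} served {request}"


class Router:
    """Sends traffic to the active stack and fails over to the passive one."""

    def __init__(self, active: Stack, passive: Stack) -> None:
        self.active = active
        self.passive = passive

    def route(self, request: str) -> str:
        try:
            return self.active.handle(request)
        except ConnectionError:
            # Promote the passive stack and retry the request once.
            self.active, self.passive = self.passive, self.active
            return self.active.handle(request)


def run_failover_drill(router: Router) -> bool:
    """Inject an outage into the active stack and report whether traffic
    was picked up by the other stack."""
    original_active = router.active
    original_active.healthy = False      # the injected failure
    try:
        router.route("drill-request")
        return router.active is not original_active
    except ConnectionError:
        return False                     # the passive stack did not take over
    finally:
        original_active.healthy = True   # always roll the experiment back


if __name__ == "__main__":
    router = Router(Stack("provider-a"), Stack("provider-b"))
    print("failover worked:", run_failover_drill(router))
```

The point of the sketch is the shape of the experiment: inject a controlled failure, check the behavior you are counting on, and always roll the failure back, whether or not the passive stack took over.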

We need to shift how we think about operations to place more of the burden of Ops work upfront. While it’s important to reduce MTTR (mean time to resolution), every resolution still means there has been an incident that affected customers and potentially your bottom line. As someone who has been a call leader for Amazon and Salesforce, I know firsthand that we could save ourselves a lot of headaches by being more proactive about running experiments in a controlled environment and focusing on increasing the MTBF (mean time between failures).
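To put rough numbers on the difference between the two metrics, the sketch below uses the common steady-state approximation availability = MTBF / (MTBF + MTTR). The function names and the figures (500 hours between failures, 2 hours to resolve) are made up for illustration and do not come from any real system.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability under the MTBF / (MTBF + MTTR) approximation."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def incidents_per_year(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected number of failure-and-repair cycles in one year."""
    return HOURS_PER_YEAR / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Illustrative scenarios only: a baseline, faster recovery, rarer failures.
    scenarios = {
        "baseline": (500.0, 2.0),
        "halve MTTR": (500.0, 1.0),
        "double MTBF": (1000.0, 2.0),
    }
    for label, (mtbf, mttr) in scenarios.items():
        print(
            f"{label:12s} availability={availability(mtbf, mttr):.4%} "
            f"incidents/year={incidents_per_year(mtbf, mttr):.1f}"
        )
```

Under these assumed numbers, halving MTTR and doubling MTBF yield the same availability, but only the longer time between failures roughly halves the number of incidents customers actually live through.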

This is why it is important to have a chaos engineering solution that can run experiments across all of your environments, regardless of their build, technology or provider, in one simple-to-use, unified experience. By consistently testing that your “failsafe” actually behaves the way you want it to, you can be much more confident that the operations work you’ve put in to make your architecture more resilient isn’t all for naught when a real-world outage comes calling, and it will.

— Matthew Fornaciari