AWS Outage Exposes Weaknesses in DevOps Resilience

The December 7, 2021, Amazon Web Services (AWS) outage severely disrupted services from a wide range of businesses for more than five hours and highlighted just how reliant businesses have become on internet-delivered services. The outage mostly impacted web services in the eastern U.S., yet the implications are universal: It’s a reminder that many businesses have ignored the old axiom about not putting all your eggs in one basket and are instead relying on a provider with a single point of failure.

Services ranging from airline booking systems to streaming video to e-commerce were disrupted during the outage, causing millions of dollars in lost revenue and countless hours of lost productivity. One of the more interesting aspects of the outage is the impact it had on collaboration services that many development and DevOps teams have come to rely on, such as Slack, Trello, Asana and Smartsheet.

Furthermore, core AWS services, such as the company’s Elastic Compute Cloud (EC2) and DynamoDB, were also impacted, disrupting many third-party services and severely hampering business processes that use those services. While the obvious victims of the outage, like Amazon’s own e-commerce operation, are well known, there is a troubling undercurrent: the disruption to DevOps frameworks and those using them.

AWS has been rather tight-lipped about the root cause of the outage thus far; however, there are still many lessons for the DevOps community to learn and questions that must be asked, such as “Can the DevOps process survive during IaaS/SaaS disruptions?” and “Can multi-cloud failover solutions be baked into the applications DevOps builds?”

There are no simple answers to those questions, but the outage highlights the need to understand the underlying architecture and framework of a deployed DevOps system. Take, for example, how many developers in the DevOps community have embraced SaaS tools to accelerate the development process and to feed CI/CD pipelines. SaaS applications such as code scanners, pipeline orchestration and even IDEs have become common in the world of DevOps. But has anyone bothered to ask what happens if a single one of those tools fails?

What’s more, the reliance on SaaS tools in the development process has created potential liabilities in the applications DevOps teams build. DevOps applications have come to rely on APIs, are often driven by microservices and are frequently deployed into containers that run on SaaS. If any of those elements becomes non-functional, numerous applications could fail, ultimately putting the onus on developers to explain why they are creating applications with a single point of failure.
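To make the multi-cloud failover question concrete, here is a minimal Python sketch of an application call that falls back to a second provider when the first one fails. Everything in it is an assumption made for illustration: the provider names, the `primary_store` and `secondary_store` functions and the simulated error are hypothetical stand-ins, not real cloud SDK calls.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllProvidersFailed(Exception):
    """Raised when every configured provider has failed."""


def call_with_failover(providers: Sequence[Tuple[str, Callable[[], T]]]) -> Tuple[str, T]:
    """Try each provider in order and return (provider name, result) from the first success."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call()
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for calls to two independent providers.
def primary_store() -> str:
    raise ConnectionError("primary region unavailable")  # simulated outage


def secondary_store() -> str:
    return "record-from-secondary"


used, value = call_with_failover([
    ("primary", primary_store),
    ("secondary", secondary_store),
])
print(used, value)  # prints: secondary record-from-secondary
```

The sketch only shows the calling pattern. Keeping data consistent between the two providers is the harder problem, and it is not addressed here.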

Moving forward, the DevOps community needs to take a serious look at the components of its frameworks and determine if there are any single points of failure, including its IaaS/SaaS providers. While it may be impossible to remediate every single one, there is a lesson to be learned about how fragile the development process can become if no one bothers to build an inventory of the tools used and take into account how a failure of any one of those tools could impact workflow. There are numerous examples of an external failure of a single component impacting the functionality of an application, with the discovery of the root cause taking days or even weeks. This can be mostly attributed not only to a lack of knowledge but also to a lack of visibility into the components used.
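A tool inventory of this kind does not need to be elaborate to be useful. The Python sketch below assumes a hand-maintained mapping from workflow stage to the tools that serve it and the provider each tool depends on; all stage, tool and provider names are hypothetical. It reports which providers would halt an entire stage if they went down.

```python
from typing import Dict, List, Tuple

# Hypothetical inventory: workflow stage -> [(tool, provider the tool depends on)]
INVENTORY: Dict[str, List[Tuple[str, str]]] = {
    "source control": [("git-hosting", "provider-a")],
    "ci pipeline": [("pipeline-orchestrator", "provider-a")],
    "code scanning": [("scanner-saas", "provider-b"), ("local-linter", "self-hosted")],
    "team chat": [("chat-saas", "provider-a")],
}


def stages_blocked_by(inventory: Dict[str, List[Tuple[str, str]]], failed_provider: str) -> List[str]:
    """Return the stages in which every listed tool depends on the failed provider."""
    return sorted(
        stage
        for stage, tools in inventory.items()
        if tools and all(provider == failed_provider for _, provider in tools)
    )


def single_points_of_failure(inventory: Dict[str, List[Tuple[str, str]]]) -> Dict[str, List[str]]:
    """Map each provider to the stages that would stop entirely if it failed."""
    providers = {provider for tools in inventory.values() for _, provider in tools}
    report: Dict[str, List[str]] = {}
    for provider in sorted(providers):
        blocked = stages_blocked_by(inventory, provider)
        if blocked:
            report[provider] = blocked
    return report


print(single_points_of_failure(INVENTORY))
# prints: {'provider-a': ['ci pipeline', 'source control', 'team chat']}
```

In this made-up inventory, code scanning survives an outage at either provider because a self-hosted alternative is listed, while three other stages hinge on a single provider.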

Those lessons also apply to the development process itself, where the best practice of rooting out single points of failure can be extended to the applications themselves. Leveraging that intelligence starts with understanding the concept of a software bill of materials (SBoM), a piece of supporting documentation that is becoming increasingly important to the purveyors of applications. A properly defined SBoM reveals all of the components (libraries, APIs, etc.) that are baked into an application and can be used as a map to define where weaknesses may lie.
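As a rough illustration of using an SBoM as a map, the Python sketch below walks a deliberately simplified component list to find everything that would be affected if one component failed. The dictionary layout and the component names are assumptions made for this example. Real SBoMs are usually exchanged in standard formats such as SPDX or CycloneDX, which carry far more detail.

```python
from typing import Dict, List, Set

# Hypothetical, simplified SBoM: component -> its direct dependencies
SBOM: Dict[str, List[str]] = {
    "checkout-app": ["payments-lib", "orders-api"],
    "payments-lib": ["http-client"],
    "orders-api": ["managed-database"],
    "http-client": [],
    "managed-database": [],
}


def transitive_dependencies(sbom: Dict[str, List[str]], root: str) -> Set[str]:
    """Collect every component that root depends on, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(sbom.get(root, []))
    while stack:
        component = stack.pop()
        if component in seen:
            continue  # already visited; also guards against dependency cycles
        seen.add(component)
        stack.extend(sbom.get(component, []))
    return seen


def affected_by(sbom: Dict[str, List[str]], failed: str) -> List[str]:
    """List the components that depend, at any depth, on the failed component."""
    return sorted(name for name in sbom if failed in transitive_dependencies(sbom, name))


print(affected_by(SBOM, "managed-database"))
# prints: ['checkout-app', 'orders-api']
```

With a map like this in hand, tracing an outage back to an external component becomes a lookup rather than an investigation lasting days.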

For the DevOps community, the recent AWS outage has become a clarion call to look inward and discover how the applications they are building may be part of the problem. With continuity and resiliency becoming major topics in the IT and business realm, it’s about time that DevOps practitioners start to look at how they can support both of those business-critical needs. The days of finger-pointing to shift blame must come to an end, and if businesses that rely on software want to grow, someone needs to take responsibility for providing answers when outages occur and learn from those outages to create applications that are more resilient.