AWS Outage and App Resiliency: Did a Roomba Replace the Canary?

Canaries were once sent into coal mines as an early warning of danger. For me, the warning was my Roomba failing to automatically seek out dog hair and clean the floor under my sons’ dining room chairs. Cloud-enabled apps were also acting “weird.” Some were down, while others were just slow, although in my view slow may as well be down. Who has the patience to wait for an application to load?

What caused this headache felt across the internet? An AWS disruption.

In the days since the December 7 outage, I’ve had a number of conversations with CIOs, colleagues and other industry analysts, and many of our shared questions remain unanswered. Here are a few quick thoughts on the latest us-east-1 outage.

Understanding AWS SLAs

It’s important to remember that the 99.99% commitment in AWS’s compute SLA applies only when instances are deployed across two or more Availability Zones within a region; building in that redundancy is the customer’s job. An SLA is not only a way for service consumers to gauge the value they’re getting for their money, it’s also the contractual agreement between a cloud service provider (CSP) and a customer. It outlines the uptime, availability and performance standards to which a cloud service will adhere. It may also specify that the service provider’s helpdesk will respond to an outage within a set amount of time. Amazon’s SLAs are very clear: if a customer receives less than 99.99% uptime in a given month, the company will provide credits to their account.

However, there are three things to keep in mind:

First, a 99.99% uptime agreement still means you are willing to tolerate services being down for about 4.3 minutes per month (roughly 53 minutes per year). Second, you must architect your AWS instances and applications in a way that meets the conditions of the AWS SLAs. Third, in AWS’s view, the outage wasn’t their fault.
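The downtime arithmetic behind a 99.99% commitment is easy to check. The sketch below assumes a 30-day month and a 365-day year, and the function name is purely illustrative; the SLA itself defines how monthly uptime is actually measured.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: float) -> float:
    """Minutes of downtime that a given uptime percentage permits over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


# Assumption: a 30-day month and a 365-day year.
monthly = allowed_downtime_minutes(99.99, 30)
yearly = allowed_downtime_minutes(99.99, 365)

print(f"{monthly:.2f} minutes per 30-day month")  # 4.32
print(f"{yearly:.2f} minutes per year")           # 52.56
```

Each additional "nine" cuts the allowance by a factor of ten, so 99.9% would permit roughly 43 minutes per month under the same assumptions.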

Outages Don’t Just Impact Startups and Organizations With Poor Planning Practices

The cloud allows organizations to regularly make budget-versus-resiliency decisions. However, the recent AWS outage didn’t just impact born-on-the-web startups. Mature businesses like Disney, McDonald’s and Amazon itself were impacted by the outage. These technology-first companies aren’t sacrificing resiliency to save budget. They understand that they are technology companies, they have large IT budgets that reflect this priority, and they were still blindsided.

The recurring question is why all these organizations were crippled by a single Amazon region going down. Was it poor architecture? Did these AWS customers not fully understand AWS’ SLAs? Did they choose budget over resiliency?

Amazon as an Enterprise Platform

Adam Selipsky’s keynote and industry analyst Q&A at re:Invent 2021 made it clear that AWS is focused squarely on the enterprise. Amazon understands the needs of the DevOps community, but it is still developing its approach to the enterprise. Google Cloud recognized the same shortcomings in its own organization and brought in Thomas Kurian from Oracle to build up the company’s enterprise and IT industry persona.

Everyone in the technology industry understands that outages and slow performance happen, but they expect service providers to provide best practices and avoid downtime. In response to Selipsky’s re:Invent keynote, Linda Jojo, executive vice president of technology and chief digital officer for United Airlines, said that the airline settled on a “one cloud” approach. Surely United Airlines can’t tolerate regular downtime.

My View

December 7, 2021, won’t live in infamy for AWS; we have seen plenty of other outages.

Application resiliency is still an emerging area of focus. In the past, the focus was on platform and data center uptime; however, in many cases, developers have no control over the infrastructure. Therefore, teams need to focus on the resiliency of an application and assume that the underlying IaaS (cloud infrastructure) will fail for one reason or another.
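One way to picture "assume the infrastructure will fail" at the application level is a client that tries a second region when the first is unreachable. The following is a minimal sketch only, not a recommended AWS pattern: `call_with_failover`, `AllRegionsFailed` and `fake_request` are hypothetical names, the region list is an assumption, and the outage is simulated rather than detected through any real AWS API.

```python
from typing import Callable, Dict, Sequence, TypeVar

T = TypeVar("T")


class AllRegionsFailed(Exception):
    """Raised when every configured region has failed."""


def call_with_failover(regions: Sequence[str], request: Callable[[str], T]) -> T:
    """Try the request against each region in order; return the first success."""
    errors: Dict[str, Exception] = {}
    for region in regions:
        try:
            return request(region)
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors[region] = exc
    raise AllRegionsFailed(errors)


def fake_request(region: str) -> str:
    """Hypothetical stand-in for a real service call; us-east-1 is simulated as down."""
    if region == "us-east-1":
        raise ConnectionError("simulated regional outage")
    return f"served from {region}"


result = call_with_failover(["us-east-1", "us-west-2"], fake_request)
print(result)  # served from us-west-2
```

A real deployment would also need replicated data, health checks and timeouts in the second region; the point of the sketch is only that the fallback decision lives in the application rather than in the platform.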

In addition, Amazon needs to be more opinionated in what it offers. Disaster recovery and failover should be built into every level of AWS. Software companies shouldn’t be forced to sign up for multi-region deployments, but AWS should at least be prescriptive, with best practices that recommend them.

Dan Kirsch

Daniel (Dan), Managing Director and Co-Founder of Techstrong Research, is a consultant, IT industry analyst and thought leader focused on how emerging technologies such as AI, machine learning and advanced analytics are impacting businesses. Dan is particularly interested in how businesses use these emerging technologies to alter their approaches to information security, governance, risk and ethics. Dan provides advisory services directly to leadership at technology vendors that design and deliver security solutions to the market. He assists them in aligning their solutions with enterprise requirements. Dan is viewed as an expert in understanding security solutions and mapping them to the complex needs of businesses across industries. Prior to co-founding Techstrong Research, Dan was managing director at Hurwitz & Associates, an analyst and consulting firm. At Hurwitz & Associates, Dan led research on a variety of studies and reports in the areas of data and AI, modern software development, security and multi-cloud computing. Dan earned his B.A. in Political Science from Union College in New York and a J.D. from Boston College Law School, where he focused on emerging corporate strategies and intellectual property. As an attorney, Dan represented start-ups, cloud computing ventures and early-stage companies as they sought funding. Dan is a co-author of Augmented Intelligence: The Business Power of Human-Machine Collaboration (CRC Press, 2020), Cloud for Dummies (John Wiley & Sons, 2020), and Hybrid Cloud for Dummies (John Wiley & Sons, 2012).
