AWS Outage and App Resiliency: Did a Roomba Replace the Canary?

Canaries were once sent into coal mines as an early warning sign against danger—for me, it was my Roomba failing to automatically search out dog hair and clean the floor under my sons’ dining room chairs. Cloud-enabled apps also were acting “weird”. Some were down, while others were just slow—although, in my view, slow may as well be down; who has the tolerance to wait for an application to load?

What caused this headache felt across the internet? An AWS disruption.

In the days since the December 7 outage, I’ve had a number of conversations with CIOs, colleagues and other industry analysts, and many of our shared questions remain unanswered. Here are a few quick thoughts on the latest us-east-1 outage.

Understanding AWS SLAs

It’s important to remember that AWS requires instances and apps to be deployed across two or more regions for redundancy and resiliency. An SLA is not only a way for service consumers to gauge the value they’re getting for their money; it’s also the contractual agreement between a cloud service provider (CSP) and a customer. It outlines the uptime, availability and performance standards to which a cloud service will adhere. It may also specify that the service provider’s helpdesk will respond to an outage within a set amount of time. Amazon has very clear SLAs: If a customer receives less than 99.99% uptime in a given month, AWS will provide credits to the customer’s account.

However, there are three things to keep in mind:

First, a 99.99% uptime agreement still means you are willing to tolerate services being down for about 4.3 minutes per month (roughly 53 minutes per year). Second, you must architect your AWS instances and applications in a way that meets the requirements of the AWS SLAs. Third, in AWS’s view, the outage wasn’t their fault.
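For readers who want to check the downtime arithmetic, here is a minimal sketch in Python. The function name `downtime_budget_minutes` is my own, and the 30-day month and 365-day year are assumptions; an actual SLA defines its own measurement period.

```python
def downtime_budget_minutes(uptime_percent: float, period_days: float) -> float:
    """Return the minutes of downtime an uptime target permits over a period."""
    minutes_in_period = period_days * 24 * 60
    # The share of time NOT covered by the uptime target is the downtime budget.
    return minutes_in_period * (1 - uptime_percent / 100)


if __name__ == "__main__":
    # Assumed period lengths: a 30-day month and a 365-day year.
    for label, days in (("30-day month", 30), ("365-day year", 365)):
        minutes = downtime_budget_minutes(99.99, days)
        print(f"{label}: {minutes:.2f} minutes of allowed downtime")
```

Under those assumptions, 99.99% works out to about 4.3 minutes per month and about 52.6 minutes per year, which is where the figures above come from.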

Outages Don’t Just Impact Startups and Organizations With Poor Planning Practices

The cloud allows organizations to regularly make budget versus resiliency decisions. However, the recent AWS outage didn’t just impact born-on-the-web startups. Mature businesses like Disney, McDonald’s and Amazon itself were impacted by the outage. These technology-first companies aren’t sacrificing resiliency to save budget. They understand that they are technology companies; they have large IT budgets that reflect this priority, and they were still blindsided.

The recurring question is why all these organizations were crippled by a single Amazon region going down. Was it poor architecture? Did these AWS customers not fully understand AWS’ SLAs? Did they choose budget over resiliency?

Amazon as an Enterprise Platform

Adam Selipsky’s keynote and industry analyst Q&A at re:Invent 2021 made it clear that AWS is completely focused on the enterprise. Amazon understands the needs of the DevOps community, but they are still working on developing their approach to the enterprise. Google Cloud recognized the same shortcomings in its own organization and brought Thomas Kurian from Oracle to build up the company’s enterprise and IT industry persona.

Everyone in the technology industry understands that outages and slow performance happen, but they expect service providers to provide best practices and avoid downtime. In response to Selipsky’s re:Invent keynote, Linda Jojo, executive vice president of technology and chief digital officer for United Airlines, said that the airline settled on a “one cloud” approach. Surely United Airlines can’t tolerate regular downtime.

My View

December 7, 2021 won’t live in infamy for AWS; we have seen plenty of other outages.

Application resiliency is still an emerging area of focus. In the past, the focus was on platform and data center uptime; however, in many cases, developers have no control over the infrastructure. Therefore, teams need to focus on the resiliency of an application and assume that the underlying IaaS (cloud infrastructure) will fail for one reason or another.
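One way to picture what "assume the infrastructure will fail" means at the application level is a simple region failover wrapper. The sketch below is illustrative only: `call_with_failover`, `AllRegionsFailed` and `fake_request` are hypothetical names, and `fake_request` simulates an outage rather than calling any real AWS API.

```python
from typing import Callable, Dict, Sequence, TypeVar

T = TypeVar("T")


class AllRegionsFailed(Exception):
    """Raised when every configured region fails to serve the request."""


def call_with_failover(regions: Sequence[str], request: Callable[[str], T]) -> T:
    """Try the request against each region in order; return the first success."""
    errors: Dict[str, Exception] = {}
    for region in regions:
        try:
            return request(region)
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors[region] = exc
    raise AllRegionsFailed(errors)


def fake_request(region: str) -> str:
    """Hypothetical stand-in for a real service call; us-east-1 is simulated as down."""
    if region == "us-east-1":
        raise ConnectionError("simulated regional outage")
    return f"served from {region}"


if __name__ == "__main__":
    print(call_with_failover(["us-east-1", "us-west-2"], fake_request))
```

A production version would also need timeouts (a slow region is effectively a down region), health checks and data replicated to the secondary region, none of which this sketch attempts.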

In addition, Amazon needs to provide more opinionated offerings. Disaster recovery and failover should be built into every level of AWS. Software companies shouldn’t be forced to sign up for multi-region deployments, but AWS should at least be prescriptive, with best practices that recommend them.

Dan Kirsch

Daniel (Dan), Managing Director and Co-Founder of Techstrong Research, is a consultant, IT industry analyst and thought leader focused on how emerging technologies such as AI, machine learning and advanced analytics are impacting businesses. Dan is particularly interested in how businesses use these emerging technologies to alter their approaches to information security, governance, risk and ethics. Dan provides advisory services directly to leadership at technology vendors that design and deliver security solutions to the market. He assists them in aligning their solutions with enterprise requirements. Dan is viewed as an expert in understanding security solutions and mapping them to the complex needs of businesses across industries. Prior to co-founding Techstrong Research, Dan was managing director at Hurwitz & Associates, an analyst and consulting firm. At Hurwitz & Associates, Dan led research on a variety of studies and reports in the areas of data and AI, modern software development, security and multi-cloud computing. Dan earned his B.A. in Political Science from Union College in New York and a J.D. from Boston College Law School, where he focused on emerging corporate strategies and intellectual property. As an attorney, Dan represented start-ups, cloud computing ventures and early-stage companies as they sought funding. Dan is a co-author of Augmented Intelligence: The Business Power of Human-Machine Collaboration (CRC Press, 2020), Cloud for Dummies (John Wiley & Sons, 2020), and Hybrid Cloud for Dummies (John Wiley & Sons, 2012).
