Maintaining Control in a Cloud-Dependent Environment

With digital transformation changing the face of businesses worldwide, we have increased our reliance on public cloud providers and their services. Because we host and deliver our software applications and services to our customers on these platforms, a failure like the recent one in Amazon’s S3 storage subsystem can paralyze operations for innumerable companies, large and small, around the country and possibly worldwide.

Amazon presumably also had its own internal operations come to a grinding halt, as so much of its platform and delivery infrastructure is likely built using its own services. In fact, from 12:37 p.m. EST until 3:37 p.m. EST, Amazon could not update its service health dashboard (SHD) because the dashboard itself depended on S3, and it had to resort to Twitter to let customers know what was going on!

The root cause of the outage was a failure in Amazon’s Simple Storage Service (S3), as explained here: https://aws.amazon.com/message/41926/. What many people likely missed, however, is the impact it had on other AWS services that depend on the availability of S3.

The multiple AWS services that were also disrupted during the outage included:

Amazon Athena – used to analyze data in S3 using SQL
Amazon Elastic MapReduce (EMR) – used for big data processing
Amazon Inspector – used to detect vulnerabilities in your applications deployed on AWS
Amazon Kinesis Firehose – used to deliver real-time streaming data to S3, Redshift or Elasticsearch
Amazon Simple Email Service (Amazon SES) – used to provide email service for applications running on AWS
Amazon WorkMail – Amazon’s email and calendaring service for businesses
Amazon Auto Scaling – allows scaling your EC2 capacity up and down based on conditions
AWS CloudFormation – allows creating and managing a collection of related AWS resources

Multiple services that saw degradations to the point that their performance was severely impaired during the outage window included:

Amazon AppStream – streaming service from the cloud for Windows apps
Amazon CloudSearch – managed search service
Amazon Cognito – identity service for mobile and web apps
Amazon EC2 Container Registry (ECR) – managed Docker container registry
Amazon Elastic Compute Cloud (EC2) – virtual servers in the cloud
Amazon Elastic Transcoder – media transcoding
Amazon Glacier – data archiving and backup
Amazon Lightsail – easy-to-manage VPS
Amazon Mobile Analytics – analyze and track mobile usage data
Amazon Pinpoint – push notifications
Amazon Redshift – data warehousing service
Amazon Simple Workflow (Amazon SWF) – workflow service for developers
Amazon WorkDocs – document collaboration
AWS Batch – batch computing jobs
AWS CodeBuild – managed service that compiles code, tests and produces software packages
AWS CodeCommit – managed source control service
AWS CodeDeploy – managed code deployment service
AWS Data Pipeline – data workflow service
AWS Elastic Beanstalk – automatic deployment, load balancing and scaling
AWS Key Management Service – centralized control of encryption keys
AWS Lambda – serverless computing
AWS OpsWorks – automated AWS config management
Stacks – AWS resource collection as a single unit
AWS Storage Gateway – hybrid cloud storage

What Lessons Can We Learn from This Outage?

This kind of outage is not unique to cloud service providers. A service or set of services can just as easily be disrupted when hosted in your own data centers. An outage like this one has such an expansive impact because a principal provider of services to numerous enterprises experienced disruptions in major services that it offers.

Another factor in the enormous impact of outages is the way we build software. We no longer build everything ourselves. We build the essential components and use third-party capabilities when possible. With the breadth of services and products that Amazon offers, it is not outside the realm of possibility that there are entire companies whose software offerings are built and run entirely on AWS.
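One way to reason about this kind of exposure is to write down what each of your components depends on and work out which ones are affected when a single upstream service fails. The sketch below is illustrative only: the component names in DEPENDS_ON and the function affected_by are hypothetical, not taken from any real system. It walks the dependency map backwards from the failed service to find everything that relies on it, directly or indirectly.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical dependency map: each component lists what it depends on directly.
DEPENDS_ON: Dict[str, List[str]] = {
    "web-app": ["auth-service", "image-store"],
    "auth-service": ["user-db"],
    "image-store": ["object-storage"],
    "status-page": ["object-storage"],
    "user-db": [],
    "object-storage": [],
}


def affected_by(failed: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every component that depends on `failed`, directly or indirectly."""
    # Invert the map so we can look up "who depends on X".
    dependents: Dict[str, List[str]] = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(component)

    # Breadth-first walk outward from the failed component.
    seen: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


if __name__ == "__main__":
    print(sorted(affected_by("object-storage", DEPENDS_ON)))
```

In this made-up map, a failure of the storage component reaches the status page as well as the application, which mirrors how Amazon's own health dashboard was caught by its dependency on S3.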

Our customers have applications running on Amazon’s infrastructure, and we all felt the effects of the outage right away. The graph below shows the correlation between when failures were noticed in applications being monitored on our system and when Amazon had the outage.

This outage could have been caused by a hardware failure. It could have been caused by a software update that was not ready for prime time. Or, as Amazon pointed out in this case, it was due to human error. We have all lived through experiences where human error has resulted in hardship to customers. Amazon is presumably now tightening its processes around testing live services: what can be done, who can do it, and what types of fail-safes are built into the tools used, so that large-scale unintended outages like this one can be avoided.

What can you do to help yourself, even with this dire dependency on your cloud provider?

Test frequently and early in dev/test, staging and production to ensure that all the critical pathways through your applications and your services (your own and the ones from third parties) are working correctly.
Do performance acceptance testing before pushing your software to production, because it is a lot more expensive to catch problems in production than it is in dev/test or staging.
Monitor services from your cloud providers so that you know right away when there is an issue.
Understand the policies and procedures that your cloud provider follows when making changes to their production systems. The maturity of their processes, controls and recovery mechanisms should give you confidence that even an outage of this magnitude will be followed by a speedy recovery.
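The testing and monitoring advice above can be made concrete with a small synthetic check that exercises each critical pathway on a schedule and reports which ones are failing. The following is a minimal sketch, not a replacement for a full monitoring product. The names CheckResult, http_probe, run_checks and failing are hypothetical helpers introduced here for illustration. Each probe is simply a function that raises an exception when its pathway is unhealthy.

```python
import time
import urllib.request
from typing import Callable, Dict, List, NamedTuple


class CheckResult(NamedTuple):
    ok: bool
    seconds: float
    detail: str


def http_probe(url: str, timeout: float = 5.0) -> Callable[[], None]:
    """Build a probe that raises if the URL does not answer with a 2xx status."""
    def probe() -> None:
        # urlopen itself raises on connection errors, timeouts and 4xx/5xx replies.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            if not 200 <= response.status < 300:
                raise RuntimeError(f"unexpected status {response.status}")
    return probe


def run_checks(probes: Dict[str, Callable[[], None]]) -> Dict[str, CheckResult]:
    """Run every probe once; a probe that raises is recorded as a failure."""
    results: Dict[str, CheckResult] = {}
    for name, probe in probes.items():
        started = time.monotonic()
        try:
            probe()
            results[name] = CheckResult(True, time.monotonic() - started, "ok")
        except Exception as exc:  # any error means the pathway is unhealthy
            results[name] = CheckResult(False, time.monotonic() - started, str(exc))
    return results


def failing(results: Dict[str, CheckResult]) -> List[str]:
    """Names of the pathways whose probe failed, sorted for stable reporting."""
    return sorted(name for name, result in results.items() if not result.ok)
```

A caller might pass something like run_checks({"storage": http_probe("https://example.com/health")}), where the URL is a placeholder for your own health endpoint, and alert whenever failing(...) returns a non-empty list. Running the same checks in dev/test, staging and production keeps the definition of "working correctly" consistent across environments.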

About the Author / Anand Sundaram

With nearly 25 years of experience building and growing several technology businesses, Anand Sundaram is Vice President of Products, AlertSite at SmartBear Software. He is the co-founder of three startups, including RSW Software, which was valued at more than $240 million with more than 160 employees. He has an extensive background in software quality, security and post-deployment monitoring, having built two load testing products and invented the first post-deployment deep transaction monitoring product.