Anatomy of an Outage: Our AWS Auto Scaling Group's "Helping" Hand Pushed Us off the Cliff

Like many of you, our week was defined by the us-east-1 outage. When the alerts fired, we all piled into the virtual war room. The first thing we noticed was that we were flying blind. The AWS console was barely loading, and core services like CloudTrail were completely unreachable. We had no logs, no telemetry. All we could do was watch our dashboards and the EC2 instance list itself.

We quickly saw a terrifying pattern. Most of our services were degraded, but online. Their instance counts were stable.

But one of our most critical, customer-facing services was completely dark.

The Self-Inflicted Wound

When we finally got the EC2 console to load for that service, we saw a nightmare. The instance list was in a constant state of churn. We weren’t seeing “unhealthy” instances; we were seeing instances that never even had a chance to live. They were stuck in a pending state for ages, and then would flip to terminating.

The Problem Was Our Auto Scaling Policy

The regional network issues were likely causing just enough lag or CPU pressure on our existing instances to trigger our scale-out policy. The Auto Scaling group (ASG) did exactly what we told it to: “We are under load! Add more instances!”

But here’s the fatal part: The broken us-east-1 control plane meant every single one of those “add instance” requests failed. The ASG would try to launch a new instance, it would get stuck in pending, and then the ASG would, after a timeout, terminate it and… try again.

It was a relentless, automated feedback loop of failure.

Our own automation was effectively DDoS-ing our service’s ability to stabilize. The services that survived were the ones where we hadn’t implemented a scaling policy.
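The churn described above can also be spotted from the API rather than by eye. What follows is a minimal sketch, assuming the group description that boto3's describe_auto_scaling_groups returns (a dict with an "Instances" list, each entry carrying a "LifecycleState"). The function names and the 50% threshold are hypothetical, chosen only for illustration.

```python
from collections import Counter


def summarize_lifecycle(group):
    """Count instances per coarse lifecycle state for one ASG description.

    `group` is one entry of the "AutoScalingGroups" list returned by
    describe_auto_scaling_groups.
    """
    counts = Counter()
    for inst in group.get("Instances", []):
        # States such as "Pending:Wait" collapse to their prefix, "Pending".
        counts[inst["LifecycleState"].split(":")[0]] += 1
    return counts


def looks_like_launch_churn(group, threshold=0.5):
    """Hypothetical heuristic: flag a group when at least `threshold` of its
    instances are Pending or Terminating instead of InService."""
    counts = summarize_lifecycle(group)
    total = sum(counts.values())
    if total == 0:
        return False
    churning = counts["Pending"] + counts["Terminating"]
    return churning / total >= threshold
```

With real credentials, the input would come from something like client.describe_auto_scaling_groups(AutoScalingGroupNames=[name])["AutoScalingGroups"][0]. That call goes through the same control plane that was struggling, so treat this as a peacetime diagnostic as much as an incident tool.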

The “Aha!” Moment: The Power of Pinning

That’s when it hit us. The solution wasn’t to “fix” anything. The solution was to do nothing.

If that service was supposed to be running on 10 instances, our immediate “break-glass” procedure should have been to pin the ASG.

We should have immediately set our configuration:

* min: 10

* max: 10

* desired: 10

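By setting min, max, and desired to the same number, we would have left the scaling policy with no room to act: the policy still exists, but the ASG cannot move desired capacity outside the min/max bounds. The ASG would have stopped trying to add new instances, and because it wasn’t trying to add, it wouldn’t be stuck in that pending/terminating loop. It wouldn’t have scaled in, either. (One caveat: pinning alone does not switch off health-check replacement, so an instance marked unhealthy can still be replaced unless those processes are suspended as well.)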

It would have just… stopped.

Our service would have kept functioning on its existing 10 instances. They might have been slow. They might have been degraded. But they would have been running. Instead, we traded a degraded but running service for 0% uptime, all because our automation was trying to “help.”
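The pinning step described above comes down to a single API call. Here is a minimal sketch, assuming boto3's Auto Scaling client and its update_auto_scaling_group operation. The helper name pin_group and the group name in the usage comment are hypothetical.

```python
def pin_group(client, name, size):
    """Set min, max, and desired capacity of one ASG to the same value.

    `client` is expected to be a boto3 "autoscaling" client; it is passed in
    so the function can be exercised against a stub without touching AWS.
    """
    client.update_auto_scaling_group(
        AutoScalingGroupName=name,
        MinSize=size,
        MaxSize=size,
        DesiredCapacity=size,
    )


# Usage (assumes boto3 is installed and credentials are configured;
# "my-critical-service-asg" is a placeholder name):
#   import boto3
#   client = boto3.client("autoscaling", region_name="us-east-1")
#   pin_group(client, "my-critical-service-asg", 10)
```

Passing the client in rather than creating it inside the function keeps the break-glass logic easy to rehearse before you need it.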

Our New Playbook: Automation is for Peacetime

This incident taught us a painful lesson: Automation is for application-level failures, not platform-level meltdowns. When the cloud itself is breaking, your automation’s assumptions are all wrong.

We are now building a “Red Button” script. The moment we confirm a major regional outage, our first action isn’t to fail over. It’s to run a script that identifies all critical ASGs and pins them, setting min, max, and desired to their current values.
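A rough sketch of what such a script could look like, again assuming boto3's Auto Scaling client. Several choices here are assumptions rather than settled design: critical groups are identified by a hypothetical critical=true tag, "current value" is taken to mean the number of InService instances (desired capacity may already be inflated by the scale-out attempts), and the script defaults to a dry run.

```python
def pin_critical_groups(client, tag_key="critical", tag_value="true", dry_run=True):
    """Pin every ASG carrying the (hypothetical) critical tag to its current size.

    Returns a list of (group_name, pinned_size) tuples. With dry_run=True
    nothing is changed; the list only shows what would be pinned.
    """
    pinned = []
    paginator = client.get_paginator("describe_auto_scaling_groups")
    for page in paginator.paginate():
        for group in page["AutoScalingGroups"]:
            tags = {t["Key"]: t["Value"] for t in group.get("Tags", [])}
            if tags.get(tag_key) != tag_value:
                continue
            # Assumption: pin to what is actually serving traffic right now.
            size = sum(
                1 for inst in group.get("Instances", [])
                if inst["LifecycleState"] == "InService"
            )
            if size == 0:
                # Never pin a group to zero; leave that decision to a human.
                continue
            name = group["AutoScalingGroupName"]
            if not dry_run:
                client.update_auto_scaling_group(
                    AutoScalingGroupName=name,
                    MinSize=size,
                    MaxSize=size,
                    DesiredCapacity=size,
                )
            pinned.append((name, size))
    return pinned
```

Running it once with dry_run=True and reading the output before flipping the switch is cheap insurance. It is also worth recording each group's original min, max, and desired values somewhere outside the region so the pin can be undone after the outage.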

In a total platform failure, you have to be the one to stop your own systems from “helping” themselves to death.