Welcome to The Long View—where we peruse the news of the week and strip it to the essentials. Let’s work out what really matters.
This week: All of Microsoft’s cloud services go down, everywhere except China. Redmond’s IaaS, PaaS and SaaS—including GitHub—were dead for several hours, and are still running unreliably—despite Microsoft saying it’s fixed the problem.
Azure’s Single Point of Failure
Analysis: It’s DNS. It’s always DNS (unless it’s BGP)
It’s clear that some hapless Microserf b0rked the internal network with a configuration change. It appears the change didn’t immediately cause problems, but issues slowly rippled across the infrastructure. This has all the hallmarks of a dodgy DNS config or a broken BGP update.
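If it really is DNS, the symptom is easy to probe for from the outside. Here's a minimal sketch of the sort of resolution check a monitoring job might run — the hostnames are illustrative public Azure endpoints, not Microsoft's internal names:

```python
# Minimal DNS health probe (sketch). The hostnames you'd check are
# whatever endpoints your own workloads depend on.
import socket

def resolves(hostname: str) -> bool:
    """Return True if hostname resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        return False

# Usage (illustrative): resolves("login.microsoftonline.com")
```

A probe like this can't tell DNS breakage from BGP breakage on its own — for the latter you'd reach for `traceroute` or a looking-glass service — but a fleet of them going red at once is usually the first visible symptom of either.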
What’s the story? Akriti Sharma reports—“Microsoft cloud outage hits users around the world”:
Microsoft [said] a networking outage took down its cloud platform Azure along with services such as Teams and Outlook used by millions around the globe. Azure’s status page showed services were impacted in [the] Americas, Europe, Asia Pacific, [the] Middle East and Africa.
Azure said most customers should have seen services resume after a full recovery of the Microsoft Wide Area Network. … Earlier, Microsoft said it had determined a network connectivity issue was [impacting] connectivity between clients on the internet to Azure, as well as connectivity between services in data centres.
Businesses have become increasingly dependent on online platforms. … An outage of Azure—which has 15 million corporate customers and over 500 million active users, according to Microsoft data—can impact multiple services and create a domino effect.
Is it still down? The lovely Ben Lovejoy brings joy—“Outage appears to be largely resolved”:
“Followed a bad day for the company”
A wide-scale Microsoft outage had many users unable to access Outlook, Teams, Azure, and more. The company says it is rolling back a network change it believes to be responsible. … Microsoft was initially unable to see the reason for the outages, but later said that it had “isolated the problem to networking configuration issues.”
It followed a bad day for the company, in which it reported its slowest sales growth in six years. Just a week earlier, the company slashed 10,000 jobs.
Oof. oofio2461 alleges an allegation:
They just had a layoff, and this happens. What a coincidence.
Still, all fixed? Excellent news. Not so fast, says reset-password:
I have some Azure services that are not able to consistently make outbound HTTP requests to my heartbeat monitoring service, so I’m getting alert after alert this morning. This is just the nudge I needed: … I’ll be moving the whole thing to Linode later this afternoon.
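For what it's worth, the kind of outbound heartbeat check reset-password describes is simple enough to sketch — a tiny probe that reports whether an HTTP endpoint answered in time (the URL and threshold below are made up for illustration):

```python
# Sketch of an outbound heartbeat probe; URL and timeout are illustrative.
from urllib.request import urlopen
from urllib.error import URLError

def heartbeat_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if url answers with an HTTP success status in time."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except (URLError, OSError):
        return False
```

When checks like this start failing from inside Azure while the target service is healthy, the problem is the platform's outbound network path — exactly the alert storm described above.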
Should we turn to the Bird App to learn the truth? @deestillballin thinks not:
Good Morning. The fact that @Office365 has protected it[s] tweets during this “Outlook Outage” lets me know [that I,] along with other IT professionals, [am] in for a long day.
For example, u/DJ3XO:
A customer I am working with has their core firewall cluster … in Azure, [against which] all IPsec tunnels are terminated. Fun times.
At first it was holding on by a thread, then the network interfaces dropped as they didn’t receive their IPs from the gateway, and then 194+ tunnels dropped. I should have just stayed in bed today.
Speaking of a domino effect, here’s Captain Scarlet:
Yet again a morning of whinging at yet another single point of failure. … I couldn’t access anything (Windows Laptop and Android handset) thanks to the global corp using Microsoft Authenticator: … This wasn’t working either.
Because I couldn’t auth with Microsoft MFA, my VPN would connect but refuse to auth further. I couldn’t access the internet because the Proxy software used couldn’t authenticate.
Aside from China, this is a global outage. sofixa says that should be unheard of:
Azure … is badly designed to such an extent that multiple times there have been global outages. … Azure availability, security (the only major cloud provider with not one but multiple cross-tenant security exploits) and usability are pretty terrible so it shouldn’t be used for anything but saying, “This is how it should not be done.”
GCP had a similar thing once, where a BGP update knocked out their Asian regions. AWS have never had a global outage. (And no, that time S3 in us-east-1 was down wasn’t a global outage: The only customer code impacted was code interacting with S3 that didn’t specify [a] region, [so it] relied on us-east-1 to determine it—and [that] didn’t work anymore.)
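The S3 failure mode sofixa mentions is worth unpacking: clients hitting the global endpoint lean on us-east-1 to discover a bucket's region, while clients that pin the regional endpoint never make that trip. A toy illustration of the two endpoint shapes (real SDKs — e.g. boto3's `region_name` parameter — handle this for you):

```python
# Toy illustration of global vs. region-pinned S3 endpoint URLs.
# The URL patterns are the documented S3 virtual-hosted-style endpoints;
# everything else here is illustrative.
from typing import Optional

def s3_endpoint(bucket: str, region: Optional[str] = None) -> str:
    if region is None:
        # Global endpoint: the service must work out the bucket's region
        # for you -- a lookup that historically depended on us-east-1.
        return f"https://{bucket}.s3.amazonaws.com"
    # Regional endpoint: no cross-region dependency.
    return f"https://{bucket}.s3.{region}.amazonaws.com"
```

Pinning the region is the one-line fix that kept most workloads out of the blast radius that day.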
Meanwhile, at least u/StConvolute is happy:
At work I’ve gone from being labeled as, “Old man who yells at clouds,” to, “The guy who saw it coming.”