DevOps.com

Microsoft Outage Outrage: Was it BGP or DNS?

By: Richi Jennings on January 25, 2023

Welcome to The Long View—where we peruse the news of the week and strip it to the essentials. Let’s work out what really matters.

This week: All of Microsoft’s cloud services go down, everywhere except China. Redmond’s IaaS, PaaS and SaaS—including GitHub—were dead for several hours, and are still running unreliably—despite Microsoft saying it’s fixed the problem.

Azure’s Single Point of Failure

Analysis: It’s DNS. It’s always DNS (unless it’s BGP)

It’s clear that some hapless Microserf b0rked the internal network with a configuration change. It appears the change didn’t immediately cause problems, but issues slowly rippled across the infrastructure. This has all the hallmarks of a dodgy DNS config or a broken BGP update.

What’s the story? Akriti Sharma reports—“Microsoft cloud outage hits users around the world”:

“Domino effect”
Microsoft [said] a networking outage took down its cloud platform Azure along with services such as Teams and Outlook used by millions around the globe. Azure’s status page showed services were impacted in Americas, Europe, Asia Pacific, Middle East and Africa.
…
Azure said most customers should have seen services resume after a full recovery of the Microsoft Wide Area Network. … Earlier, Microsoft said it had determined a network connectivity issue was [impacting] connectivity between clients on the internet to Azure, as well as connectivity between services in data centres.
…
Businesses have become increasingly dependent on online platforms. … An outage of Azure—which has 15 million corporate customers and over 500 million active users, according to Microsoft data—can impact multiple services and create a domino effect.

Is it still down? The lovely Ben Lovejoy brings joy—“Outage appears to be largely resolved”:

“Followed a bad day for the company”
A wide scale Microsoft outage had many users unable to access Outlook, Teams, Azure, and more. The company says it is rolling back a network change it believes to be responsible. … Microsoft was initially unable to see the reason for the outages, but later said that it had “isolated the problem to networking configuration issues.”
…
It followed a bad day for the company, in which it reported its slowest sales growth in six years. Just a week earlier, the company slashed 10,000 jobs.

Oof. oofio2461 alleges an allegation:

They just had a layoff, and this happens. What a coincidence.

Still, all fixed? Excellent news. Not so fast, says reset-password:

I have some Azure services that are not able to consistently make outbound HTTP requests to my heartbeat monitoring service so I’m getting alert after alert this morning. This is just the nudge I needed: … I’ll be moving the whole thing to Linode later this afternoon.

Should we turn to the Bird App to learn the truth? @deestillballin thinks not:

Good Morning. The fact that @Office365 has protected it’s tweets during this “Outlook Outage” lets me know me along with other IT professionals are in for a long day.

For example, u/DJ3XO:

A customer I am working with has their core firewall cluster … in Azure, where all IPsec tunnels are terminated against. Fun times.

At first it was holding on by a thread, then the network interfaces dropped as they didn’t receive their IPs from the gateway, and then 194+ tunnels dropped. I should have just stayed in bed today.

Speaking of a domino effect, here’s Captain Scarlet:

Yet again a morning of whinging at yet another single point of failure. … I couldn’t access anything (Windows Laptop and Android handset) thanks to the global corp using Microsoft Authenticator: … This wasn’t working either.
…
Because I couldn’t auth with Microsoft MFA, my VPN would connect but refuse to auth further. I couldn’t access the internet because the Proxy software used couldn’t authenticate.

Aside from China, this is a global outage. So sofixa says that’s unheard of:

Azure … is badly designed to such an extent that multiple times there have been global outages. … Azure availability, security (the only major cloud provider with not one but multiple cross-tenant security exploits) and usability are pretty terrible so it shouldn’t be used for anything but saying, “This is how it should not be done.”

GCP had a similar thing once, where a BGP update knocked out their Asian regions. AWS have never had a global outage. (And no, that time S3 in us-east-1 was down wasn’t a global outage, the only customer code/workloads that were impacted was code interacting with S3 that didn’t specify the region and had to rely on us-east-1 to determine it, and it didn’t work anymore).

Meanwhile, at least u/StConvolute is happy:

At work I’ve gone from being labeled as, “Old man who yells at clouds,” to, “The guy who saw it coming.”

The Moral of the Story:
Life imposes things on you that you can’t control, but you still have the choice of how you’re going to live through this.

—Celine Dion

You have been reading The Long View by Richi Jennings. You can contact him at @RiCHi.

Image: Brian Smale/Microsoft (cc:by-sa; leveled and cropped)

© 2023 · Techstrong Group, Inc. All rights reserved.