What SREs Can Learn From the Atlassian Outage of 2022

By: Weihan Li on May 19, 2022

What happens when the tools and services you depend on to drive site reliability engineering turn out to be susceptible to reliability failures of their own? That’s the question teams at about 400 businesses presumably asked themselves in the wake of a major outage in Atlassian Cloud. The incident offers a number of insights for SREs about reliability risks within reliability management software itself—as well as how to work through complex outages efficiently and transparently.

What Caused the Atlassian Cloud Outage?

The outage, which began on April 4 and was not fully resolved until April 18, affected about 400 Atlassian Cloud customer accounts. Atlassian Cloud is a hosted suite of popular Atlassian products including Jira and OpsGenie. The outage meant that affected customers could no longer access these tools or the data they managed in them.

According to Atlassian, the problem was triggered by a faulty application migration process. Engineers wrote a script to deactivate an obsolete version of an application. However, due to what Atlassian called a “communication gap” between teams, the script was written in such a way that it deactivated all Atlassian Cloud products, not just the obsolete application.

To make matters worse, the script was apparently configured to delete data permanently, rather than mark it for deletion, which was the intention. As a result, data in affected accounts was removed permanently from production environments.
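
Atlassian has not published the script itself, so any reconstruction is guesswork. As a rough sketch of the guardrails that appear to have been missing, though, a maintenance script like this can be restricted to an explicit allow-list of targets, default to a dry run, and soft-delete (mark for deletion) rather than destroy data outright. Everything below (the IDs, function names and flags) is hypothetical:

```python
# Hypothetical sketch -- not Atlassian's actual script. It illustrates the two
# guardrails the postmortem implies were missing: an explicit, validated scope
# (only the obsolete app, not every product) and soft deletion behind a dry run.
import argparse
import sys

OBSOLETE_APP_IDS = {"legacy-app-v1"}  # hypothetical allow-list of IDs this job may touch

def mark_for_deletion(app_id: str) -> None:
    """Soft delete: flag the app for later cleanup instead of destroying its data."""
    print(f"[soft-delete] {app_id} marked for deletion (recoverable)")

def main() -> int:
    parser = argparse.ArgumentParser(description="Deactivate an obsolete app")
    parser.add_argument("app_ids", nargs="+", help="IDs the operator intends to touch")
    parser.add_argument("--execute", action="store_true",
                        help="Actually apply changes; default is a dry run")
    args = parser.parse_args()

    # Guardrail 1: refuse anything outside the agreed scope.
    out_of_scope = set(args.app_ids) - OBSOLETE_APP_IDS
    if out_of_scope:
        print(f"refusing to touch out-of-scope IDs: {sorted(out_of_scope)}")
        return 1

    # Guardrail 2: dry run by default, soft delete when executing.
    for app_id in args.app_ids:
        if args.execute:
            mark_for_deletion(app_id)
        else:
            print(f"[dry-run] would mark {app_id} for deletion")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

Testing a dry run of exactly this kind in a staging environment is also the first takeaway listed further down.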

The Good, the Bad and the Ugly

The Atlassian Cloud outage may not be the very worst type of incident imaginable—failures like Facebook’s 2021 outage were arguably worse because they affected more people and because service restoration was complicated by physical access issues—but it was still pretty bad. Production data was permanently deleted, and hundreds of enterprise customers experienced total service disruptions, some lasting as long as two weeks.

Given the seriousness of the incident, it’s tempting to point fingers at Atlassian engineers for letting an incident like this happen in the first place. They seem to have written a script with some serious issues, then presumably deployed it without testing it first—which is exactly the opposite of what you might call an SRE best practice.

On the other hand, Atlassian deserves credit for responding to the incident efficiently and transparently. Although the company was silent at first, it ultimately shared details about what happened and why even though those details were a bit embarrassing to its engineers.

Crucially, Atlassian also had backups and failover environments in place, which it used to speed the recovery process. The major reason why the outage lasted so long, the company said, is that restoring data from backups to production requires integrating backup data for individual customers into storage that is shared by multiple customers, a tedious process that Atlassian apparently can’t perform automatically (or doesn’t want to, presumably because it would be too risky to automate).
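
Atlassian has not described its storage layer in any detail, so the following is only a schematic of why per-tenant restores into shared storage are tedious: each customer’s records have to be filtered out of the backup and merged back into a live, multi-tenant store without touching anyone else’s data. The table and column names here are invented purely for illustration:

```python
# Schematic only -- invented table/column names; Atlassian's real restore
# process is not public. The point: restoring one tenant into shared storage
# means filtering the backup by tenant ID and merging row by row, rather than
# simply swapping a whole database back into place.
import sqlite3

def restore_tenant(backup_db: str, live_db: str, tenant_id: str) -> int:
    """Copy one tenant's rows from a backup into the shared live store."""
    backup = sqlite3.connect(backup_db)
    live = sqlite3.connect(live_db)
    rows = backup.execute(
        "SELECT id, tenant_id, payload FROM issues WHERE tenant_id = ?",
        (tenant_id,),
    ).fetchall()
    restored = 0
    for row in rows:
        # Skip rows that already exist so other tenants' live data is untouched.
        exists = live.execute("SELECT 1 FROM issues WHERE id = ?", (row[0],)).fetchone()
        if exists is None:
            live.execute(
                "INSERT INTO issues (id, tenant_id, payload) VALUES (?, ?, ?)", row
            )
            restored += 1
    live.commit()
    backup.close()
    live.close()
    return restored
```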

Unfortunately for impacted customers, it does not appear that any fallback tools or services were made available while they waited for Atlassian to restore operations. We imagine this posed more than minor problems for teams that rely on tools like Jira to manage projects and OpsGenie to handle incidents. Perhaps those teams stood up alternative tools in the meantime, or perhaps they simply crossed their fingers and hoped their project and reliability management tools would come back online ASAP.

Takeaways for SREs From the Atlassian Outage

For SREs, then, the key takeaways from this incident would seem to be:

  • Always perform dry runs of migration processes in testing environments before putting them into production. Presumably, if Atlassian engineers had tested their application migration script first, they would have noticed its flaws before it took out live customer environments.
  • Back up, back up, back up—and make sure you have failover environments where you can rebuild failed services based on backups. As bad as this outage was, it would have been far worse if Atlassian had been unable to restore service from backups and the data had been lost for good.
  • Ideally, each customer’s data should be stored separately. As we noted above, the fact that Atlassian used shared storage seems to have been a factor in delaying recovery. That said, it’s hard to fault Atlassian too much on this point; it’s not always practical to isolate data for each user due to the cost and administrative complexity of doing so.
  • SRE teams would do well to think about how they’ll respond if their reliability management software itself goes offline. For example, it might be worth extracting and backing up data from your reliability management tools so you can still access it if your tool provider experiences an incident like this (a minimal export sketch follows this list).
  • Over-communicate with your customers, early and often. In this case, initial radio silence left customers in the dark, and much of the discussion about the incident ended up happening in public forums instead.
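
On that fourth point, a scheduled export of your own data from hosted reliability tools gives you something to work from if the vendor goes down. Here is a minimal sketch against Jira Cloud’s REST search endpoint; the site URL, email and API token are placeholders, and the JQL should be scoped to whatever projects your team would need to keep moving during an outage:

```python
# Minimal sketch, assuming Jira Cloud's REST search endpoint and an API token.
# The site URL and credentials below are placeholders.
import json
import requests

SITE = "https://your-domain.atlassian.net"  # placeholder
AUTH = ("you@example.com", "api-token")     # placeholder credentials

def export_issues(jql: str = "order by updated DESC", page_size: int = 100) -> list:
    """Page through /rest/api/2/search and collect raw issue JSON."""
    issues, start = [], 0
    while True:
        resp = requests.get(
            f"{SITE}/rest/api/2/search",
            params={"jql": jql, "startAt": start, "maxResults": page_size},
            auth=AUTH,
            timeout=30,
        )
        resp.raise_for_status()
        data = resp.json()
        issues.extend(data["issues"])
        start += len(data["issues"])
        if not data["issues"] or start >= data["total"]:
            break
    return issues

if __name__ == "__main__":
    with open("jira-export.json", "w") as fh:
        json.dump(export_issues(), fh)
    # Run this on a schedule (cron, a CI job) so the export stays reasonably fresh.
```

Even a daily JSON dump of open issues and on-call data is enough to keep a team limping along on a temporary tracker while the primary tool is offline.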

Conclusion

The Atlassian Cloud outage is notable both for its length and for the fact that, ironically, it took out software that teams use to help prevent these types of issues from happening at their own businesses.

The good news is that Atlassian had the necessary resources in place to restore service as quickly as possible. A shared data storage architecture led to slow recovery, which is unfortunate; again, it’s hard to blame Atlassian too much for not setting up dedicated storage for each customer.
