What happens when the tools and services you depend on to drive site reliability engineering turn out to be susceptible to reliability failures of their own? That’s the question teams at about 400 businesses presumably asked themselves in the wake of a major outage in Atlassian Cloud. The incident offers a number of insights for SREs about reliability risks within reliability management software itself—as well as how to work through complex outages efficiently and transparently.
What Caused the Atlassian Cloud Outage?
The outage, which began on April 4 and was finally resolved on April 18, affected about 400 Atlassian Cloud customer accounts. Atlassian Cloud is a hosted suite of popular Atlassian products including Jira and OpsGenie. The outage meant that affected customers could no longer access these tools or the data they managed in them.
According to Atlassian, the problem was triggered by a faulty application migration process. Engineers wrote a script to deactivate an obsolete version of an application. However, due to what Atlassian called a “communication gap” between teams, the script was written in such a way that it deactivated all Atlassian Cloud products, not just the obsolete application.
To make matters worse, the script was apparently configured to delete data permanently, rather than mark it for deletion, which was the intention. As a result, data in affected accounts was removed permanently from production environments.
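The difference between marking data for deletion and deleting it outright is the crux of the incident. A common safeguard is a "soft delete" with a retention window, so that an over-broad script can still be undone. The sketch below is purely illustrative—the field name, retention period, and record shapes are assumptions, not Atlassian's actual implementation:

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # hypothetical retention window

def soft_delete(record: dict) -> dict:
    """Mark a record for deletion instead of removing it immediately."""
    record["deleted_at"] = datetime.now(timezone.utc)
    return record

def purge_expired(records: list[dict]) -> list[dict]:
    """Permanently drop only records whose retention window has elapsed.

    Everything else—including freshly soft-deleted records—survives,
    leaving a window in which a mistaken deletion can be reversed.
    """
    now = datetime.now(timezone.utc)
    return [
        r for r in records
        if r.get("deleted_at") is None or now - r["deleted_at"] < RETENTION
    ]
```

Had the migration script followed a pattern like this, the over-broad deactivation would still have been an outage, but the data itself would have remained recoverable without a multi-week restore from backups.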
The Good, the Bad and the Ugly
The Atlassian Cloud outage may not be the very worst type of incident imaginable—failures like Facebook’s 2021 outage were arguably worse because they affected more people and because service restoration was complicated by physical access issues—but it was still pretty bad. Production data was permanently deleted, and hundreds of enterprise customers experienced total service disruptions lasting as long as two weeks.
Given the seriousness of the incident, it’s tempting to point fingers at Atlassian engineers for letting an incident like this happen in the first place. They seem to have written a script with some serious issues, then presumably deployed it without testing it first—which is exactly the opposite of what you might call an SRE best practice.
On the other hand, Atlassian deserves credit for responding to the incident efficiently and transparently. Although the company was silent at first, it ultimately shared details about what happened and why, even though those details were a bit embarrassing to its engineers.
Crucially, Atlassian also had backups and failover environments in place, which it used to speed the recovery process. The major reason why the outage lasted so long, the company said, is that restoring data from backups to production requires integrating backup data for individual customers into storage that is shared by multiple customers, a tedious process that Atlassian apparently can’t perform automatically (or doesn’t want to, presumably because it would be too risky to automate).
Unfortunately for impacted customers, it does not appear that any fallback tools or services were made available while they waited for Atlassian to restore operations. We imagine this poses more than minor issues for teams that rely on tools like Jira to manage projects and OpsGenie to handle incidents. Perhaps those teams stood up alternative tools in the meantime – or perhaps they crossed their fingers, hoping their project and reliability management tools would come back online ASAP.
Takeaways for SREs From the Atlassian Outage
For SREs, then, the key takeaways from this incident would seem to be:
- Always perform dry runs of migration processes in testing environments before putting them into production. Presumably, if Atlassian engineers had tested their application migration script first, they would have noticed its flaws before it took out live customer environments.
- Back up, back up, back up—and make sure you have failover environments where you can rebuild failed services based on backups. While this outage is bad, it would be 100 times worse if Atlassian couldn’t restore service based on backups and data had been lost permanently.
- Ideally, each customer’s data should be stored separately. As we noted above, the fact that Atlassian used shared storage seems to have been a factor in delaying recovery. That said, it’s hard to fault Atlassian too much on this point; it’s not always practical to isolate data for each user due to the cost and administrative complexity of doing so.
- SRE teams would do well to think about how they’ll respond if their reliability management software itself goes offline. For example, it might be worth extracting and backing up data from your reliability management tools so you can still access it if your tool provider experiences an incident like this.
- Over-communicate with your customers, early and often. In this case, initial radio silence left customers in the dark, and much of the discussion that did occur unfolded in public forums rather than through direct communication from Atlassian.
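The first takeaway—dry-run your migrations—can be baked directly into the tooling rather than left to process. A destructive script can default to a no-op preview mode and refuse obviously over-broad target lists. The sketch below is a hypothetical guardrail pattern, not a reconstruction of Atlassian's script; the function and parameter names are invented for illustration:

```python
def deactivate_apps(targets: list[str], all_known_apps: list[str],
                    dry_run: bool = True) -> list[str]:
    """Deactivate the given app IDs, previewing by default.

    Guardrail: refuse to proceed if the target list covers every known
    application, the kind of over-broad selection behind this incident.
    """
    if set(targets) >= set(all_known_apps):
        raise ValueError("refusing to deactivate every known application")

    actions = []
    for app_id in targets:
        if dry_run:
            actions.append(f"DRY RUN: would deactivate {app_id}")
        else:
            # The real deactivation call would go here.
            actions.append(f"deactivated {app_id}")
    return actions
```

Because `dry_run` defaults to `True`, running the script naively produces a reviewable plan instead of destructive changes; an operator must opt in explicitly to execute it for real.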
Conclusion
The Atlassian Cloud outage is notable both for its length and for the fact that, ironically, it took out software that teams use to help prevent these types of issues from happening at their own businesses.
The good news is that Atlassian had the necessary resources in place to restore service as quickly as possible. A shared data storage architecture led to slow recovery, which is unfortunate; again, it’s hard to blame Atlassian too much for not setting up dedicated storage for each customer.