Lessons Learned From the Salesforce Outage

Everyone has bad days. Every company has been through some kind of outage caused by a buggy database deployment. Even the best of the best, with highly trained staff, world-class best practices and well-thought-out processes, make mistakes. On May 17, Salesforce.com had a bad day.

What Happened?

The company deployed a faulty database change script that broke permission settings in production and gave users read and write access to restricted data. This opened the door for employees without the proper authorization to steal or tamper with their company's data. As a result, Salesforce needed to take large parts of its infrastructure down to find and properly fix the issue. The outage lasted 15 hours and 8 minutes. According to Gartner's cost-of-downtime formula ($5,600 per minute), this outage cost approximately $5 million. Plus, since so many companies rely on Salesforce, it was a very visible and embarrassing outage. (Just take a look at #SalesforceDown and #permissiongeddon on social media.)
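For reference, a quick back-of-the-envelope calculation reproduces that figure from Gartner's widely cited $5,600-per-minute average (a rough industry number; any individual customer's real cost will differ):

    # Rough downtime-cost estimate using Gartner's $5,600/minute average
    outage_minutes = 15 * 60 + 8        # 15 hours, 8 minutes = 908 minutes
    cost_per_minute = 5_600             # Gartner's average cost of downtime (USD)
    estimated_cost = outage_minutes * cost_per_minute
    print(f"Estimated cost: ${estimated_cost:,}")   # Estimated cost: $5,084,800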

Salesforce had to shut everything down because of the way databases work; it's not as easy as taking a single application offline. Who knows how many Salesforce employees worked like mad to take the whole database down, find the offending script and restore everything, all because of one change script. That's not a fun way to spend a weekend.

What Does This Tell Us?

Historically, Salesforce customers have experienced very little disruption in service. On the day of the outage, many loyal customers were tweeting about how rock solid the service had been, and that's impressive.

That being said, this outage should be a wake-up call for users to recognize their dependency on the platform, which has become an integral part of how they conduct business. I've heard anecdotes of entire offices being unable to complete work that Friday.

The customer reactions show that they clearly have their stuff together over at Salesforce. What this outage reveals is less about any shortcoming of this company specifically and more that everyone has blind spots, no matter how robust the testing process is.

Lessons for IT professionals:

  • Don’t forget the database. Unfortunately, this problem is all too common. There are countless cases of a database change script being executed in production and causing unexpected issues. This indicates that the database is often the forgotten part of the software test and release cycle. Unless companies start to seriously consider the database when it comes to change management, things like this will continue to happen.
  • Automate everything. Manual efforts fail. Humans make mistakes. Companies such as Salesforce that handle sensitive customer data should not put it on the internet unless they have automated every aspect of the system, including the database. Companies need a robust DataOps process, one that includes production-like data and automates the validation of database changes (a minimal sketch of such a validation gate follows this list).
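To make that second point concrete, here is one way a pipeline could gate a database change script: apply it to a production-like staging copy inside a transaction, diff the permission grants on restricted tables and fail the build if access widens. This is a minimal sketch under stated assumptions, not Salesforce's actual process; the staging connection string, script path and table names are hypothetical placeholders, and it assumes a PostgreSQL staging database reachable from CI.

    # Minimal sketch: validate a change script against a production-like copy
    # before it ever reaches production. All names below are hypothetical.
    import sys
    import psycopg2  # assumes a PostgreSQL staging copy; adapt for your database

    STAGING_DSN = "dbname=staging user=ci host=staging-db"          # hypothetical
    CHANGE_SCRIPT = "changes/2019-05-17-permissions.sql"            # hypothetical
    RESTRICTED_TABLES = ["accounts", "contacts", "opportunities"]   # hypothetical

    def snapshot_grants(cur):
        """Return the set of (grantee, table, privilege) rows on restricted tables."""
        cur.execute(
            """
            SELECT grantee, table_name, privilege_type
            FROM information_schema.role_table_grants
            WHERE table_name = ANY(%s)
            """,
            (RESTRICTED_TABLES,),
        )
        return set(cur.fetchall())

    def main():
        conn = psycopg2.connect(STAGING_DSN)
        conn.autocommit = False
        cur = conn.cursor()

        before = snapshot_grants(cur)

        # Apply the change script inside the transaction on the staging copy
        # (assumes plain SQL statements; COPY and similar would need special handling).
        with open(CHANGE_SCRIPT) as f:
            cur.execute(f.read())

        after = snapshot_grants(cur)
        widened = after - before  # grants that did not exist before the change

        # Roll back either way; this run exists only to validate the script.
        conn.rollback()
        conn.close()

        if widened:
            print("FAIL: change script widens access on restricted tables:")
            for grantee, table, privilege in sorted(widened):
                print(f"  {grantee} gains {privilege} on {table}")
            sys.exit(1)
        print("OK: no new grants on restricted tables")

    if __name__ == "__main__":
        main()

A real DataOps pipeline would layer on schema diffs, data-quality checks and rollback rehearsal, but even a gate this small would catch a script that silently grants read and write access to restricted data.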

Lessons for end users:

  • Establish a continuity plan. Organizations that depend on systems such as Salesforce need a business continuity plan so they can keep operating when those systems go down. This includes syncing Salesforce tasks and calendars with Office 365, Exchange or Google Apps. However, since Salesforce has become the system of record for most companies that use it, keeping a backup of the data will be difficult; after all, the point of adopting the platform in the first place was to avoid hosting the data yourself.
  • Demand accountability. Users can and should demand more accountability from software vendors. That can take the form of asking for a refund, but customers should also speak to their account manager to understand what steps Salesforce is taking to make sure this does not happen again. It's 2019, and companies update software all the time without taking down their production systems at the end of a quarter. Customers can also demand that vendors not change production systems during business hours, except for necessary security fixes.

The bottom line is that, while the mistakes that led to the Salesforce outage were costly and highly visible to customers, they were also entirely preventable.

Robert Reeves

As Datical's chief technical officer, Robert Reeves advocates for customers and provides technical architecture leadership. Prior to co-founding Datical, Robert was a director at the Austin Technology Incubator. At ATI, he provided real world entrepreneurial expertise to ATI member companies to aid in market validation, product development and fundraising efforts. Robert co-founded Phurnace Software in 2005. He invented and created the flagship product, Phurnace Deliver, which provides middleware infrastructure management to multiple Fortune 500 companies. As chief technology officer, he led technical evangelism efforts, product vision and large account technical sales efforts. After BMC Software acquired Phurnace in 2009, Robert served as chief architect and lead worldwide technical evangelist.
