Everyone has bad days. Every company has been through some kind of outage due to a buggy database deployment. Even the best of the best, with highly trained staff, world-class best practices, and well-thought-out processes, make mistakes. On May 17, 2019, Salesforce.com had a bad day.
What Happened?
The company deployed a faulty database change script that broke permission settings in production and gave users read and write access to restricted data. This opened the door for employees to view or tamper with data they were never authorized to touch. As a result, Salesforce needed to take large parts of its infrastructure down to find and properly fix the issue. The outage lasted 15 hours and 8 minutes. According to Gartner’s cost-of-downtime formula ($5,600/minute), this outage cost approximately $5 million. Plus, since so many companies rely on Salesforce, it was a very visible and embarrassing outage. (Just take a look at #SalesforceDown and #permissiongeddon on social media.)
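If you want to sanity-check that $5 million figure, the back-of-the-envelope math is simple. Keep in mind that $5,600/minute is Gartner’s industry-wide average, not anything Salesforce has confirmed, so treat the result as a rough order of magnitude:

```python
# Rough downtime-cost estimate using Gartner's widely cited average of
# $5,600 per minute of downtime (an industry average, not a Salesforce figure).
COST_PER_MINUTE = 5_600           # USD
outage_minutes = 15 * 60 + 8      # the outage lasted 15 hours, 8 minutes

estimated_cost = COST_PER_MINUTE * outage_minutes
print(f"Estimated cost of the outage: ${estimated_cost:,}")
# Estimated cost of the outage: $5,084,800
```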
Salesforce had to shut everything down because of the way databases work. It’s not as simple as pulling a single application offline. Who knows how many Salesforce employees worked like mad to take the whole database down, find the offending database script, and restore everything, all because of one change script. That’s not a fun way to spend a weekend.
What Does This Tell Us?
Historically, Salesforce customers have experienced very little disruption in service. On the day of the outage, many loyal customers were tweeting about how rock solid the service had been, and that’s impressive.
That being said, this outage should be a wake-up call for users to recognize how dependent they are on the platform, which has become an integral part of how they conduct business. I’ve heard anecdotes of entire offices being unable to complete work that Friday.
The customer reactions show that Salesforce clearly has its act together. What this outage reveals is less about any shortcoming of this company specifically and more about the fact that everyone has blind spots, no matter how robust the testing process is.
Lessons for IT professionals:
- Don’t forget the database. Unfortunately, this problem is all too common. There are countless cases of a database change script being executed in production and causing unexpected issues. This indicates that the database is often the forgotten part of the software test and release cycle. Unless companies start to seriously consider the database when it comes to change management, things like this will continue to happen.
- Automate everything. Manual efforts fail. Humans make mistakes. Companies such as Salesforce that handle sensitive customer data should not put it on the internet unless they have automated every aspect of the system, including the database. Companies need a robust DataOps process, one that includes production-like data and automates the validation of database changes. (A minimal sketch of what such a validation gate could look like follows this list.)
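To make that last point concrete, here is a minimal, hypothetical sketch of the kind of automated guardrail a DataOps pipeline could run before a change script is promoted: apply the change to a staging copy seeded with production-like data, assert that no permission invariant is broken, and roll everything back. The table, profiles, and checks are invented for illustration; this is not how Salesforce validates changes, just the general shape of the idea, using SQLite so the example is self-contained:

```python
import sqlite3
import sys

# Hypothetical statements from the change script under review.
CHANGE_STATEMENTS = [
    "UPDATE profile_permissions SET can_modify_all = 1 WHERE profile = 'Admin'",
]

# Invariants that must still hold after the change; each query must return zero rows.
CHECKS = [
    ("no non-admin profile ends up with modify-all access",
     "SELECT profile FROM profile_permissions "
     "WHERE can_modify_all = 1 AND profile != 'Admin'"),
    ("every profile keeps read access",
     "SELECT profile FROM profile_permissions WHERE can_read = 0"),
]

def validate(conn):
    """Apply the change inside a transaction, run the checks, then roll
    everything back so the staging copy is left exactly as it was."""
    ok = True
    try:
        for stmt in CHANGE_STATEMENTS:
            conn.execute(stmt)
        for description, query in CHECKS:
            violations = conn.execute(query).fetchall()
            if violations:
                ok = False
                print(f"FAIL: {description}: {violations}")
            else:
                print(f"PASS: {description}")
    finally:
        conn.rollback()  # validation only; never persist the change here
    return ok

if __name__ == "__main__":
    # Seed a throwaway staging copy with a few production-like rows.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE profile_permissions "
                 "(profile TEXT, can_read INTEGER, can_modify_all INTEGER)")
    conn.executemany("INSERT INTO profile_permissions VALUES (?, ?, ?)",
                     [("Admin", 1, 1), ("Sales", 1, 0), ("Support", 1, 0)])
    conn.commit()
    sys.exit(0 if validate(conn) else 1)
```

In a real pipeline this would run in CI against an actual staging copy of the production database, and the script would only be promoted to production once every check passes.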
Lessons for end users:
- Establish a continuity plan. Organizations that depend on systems such as Salesforce need a business continuity plan so they can keep operating if those systems go down. This includes syncing Salesforce tasks and calendars with Office365, Exchange, or Google Apps. However, since Salesforce has become the system of record for most companies that use the technology, keeping a backup of the data will be difficult; after all, the point of using the platform in the first place was to avoid hosting your own data. (A rough export sketch follows this list.)
- Demand accountability. Additionally, users can and should demand more accountability from software vendors. This can take the form of demanding a refund, but customers also need to speak to their account managers to understand what steps Salesforce is taking to make sure this does not happen again. It’s 2019, and companies update software all the time without taking down their production systems at the end of a quarter. Moreover, customers can also demand that vendors not change production systems during business hours, outside of necessary security fixes.
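On the backup point, Salesforce offers its own export tooling, but even a crude, scheduled copy of the objects you cannot work without softens the blow of an outage. Below is a minimal sketch of one way to pull records through the Salesforce REST query endpoint and stash them in a local CSV. The instance URL, access token, API version, and the choice of object and fields are all placeholders you would replace with your own:

```python
import csv
import datetime
import requests

# Placeholders: use your own instance URL, a valid OAuth access token,
# and whatever objects/fields your business actually depends on.
INSTANCE_URL = "https://yourInstance.my.salesforce.com"
ACCESS_TOKEN = "REPLACE_WITH_OAUTH_ACCESS_TOKEN"
API_VERSION = "v45.0"
FIELDS = ["Id", "Name", "Phone"]
SOQL = f"SELECT {', '.join(FIELDS)} FROM Account"

def export_accounts(path):
    """Query records via the REST API, follow pagination, write a local CSV."""
    headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}
    url = f"{INSTANCE_URL}/services/data/{API_VERSION}/query"
    params = {"q": SOQL}
    rows = []
    while url:
        resp = requests.get(url, headers=headers, params=params, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        rows.extend(payload["records"])
        # Salesforce returns 'nextRecordsUrl' while more pages remain.
        next_url = payload.get("nextRecordsUrl")
        url = f"{INSTANCE_URL}{next_url}" if next_url else None
        params = None  # the next-records URL already encodes the query

    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for record in rows:
            writer.writerow({field: record.get(field) for field in FIELDS})
    return len(rows)

if __name__ == "__main__":
    stamp = datetime.date.today().isoformat()
    count = export_accounts(f"accounts-{stamp}.csv")
    print(f"Exported {count} account records")
```

A nightly job like this is no substitute for a real continuity plan, but it at least leaves you something to work from while the system of record is unreachable.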
The bottom line is that, while the mistakes that led to the Salesforce outage were very costly and highly visible to customers, they were also entirely preventable.