Lessons Learned From the Salesforce Outage

Everyone has bad days. Every company has been through some kind of outage caused by a buggy database deployment. Even the best of the best, with highly trained staff, world-class best practices and well-thought-out processes, make mistakes. On May 17, Salesforce.com had a bad day.

What Happened?

The company deployed a faulty database change script that broke permission settings in production and gave users read and write access to restricted data. This opened the door for employees without the proper authorization to steal or tamper with their company's data. As a result, Salesforce needed to take large parts of its infrastructure down to find and properly fix the issue. The outage lasted 15 hours and 8 minutes. According to Gartner's cost-of-downtime formula ($5,600 per minute), this outage cost approximately $5 million. Plus, since so many companies rely on Salesforce, it was a very visible and embarrassing outage. (Just take a look at #SalesforceDown and #permissiongeddon on social media.)
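For reference, a quick back-of-the-envelope calculation reproduces that figure from Gartner's widely cited $5,600-per-minute average (a rough industry number; any individual customer's real cost will differ):

    # Rough downtime-cost estimate using Gartner's $5,600/minute average
    outage_minutes = 15 * 60 + 8        # 15 hours, 8 minutes = 908 minutes
    cost_per_minute = 5_600             # Gartner's average cost of downtime (USD)
    estimated_cost = outage_minutes * cost_per_minute
    print(f"Estimated cost: ${estimated_cost:,}")   # Estimated cost: $5,084,800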

Salesforce had to shut everything down because of the way databases work; it's not as easy as taking a single application offline. Who knows how many Salesforce employees worked like mad to take the whole database down, find the offending script and restore everything, all because of one change script. That's not a fun way to spend a weekend.

What Does This Tell Us?

Historically, Salesforce customers have experienced very little disruption in service. On the day of the outage, many loyal customers were tweeting about how rock solid the service had been, and that's impressive.

That being said, this outage should be a wake-up call for users to recognize their dependency on the platform, which has become an integral part of how they conduct business. I've heard anecdotes of entire offices being unable to complete work that Friday.

The customer reactions show that they clearly have their stuff together over at Salesforce. What this outage reveals is less about any shortcoming of this company specifically and more that everyone has blind spots, no matter how robust the testing process is.

Lessons for IT professionals:

  • Don’t forget the database. Unfortunately, this problem is all too common. There are countless cases of a database change script being executed in production and causing unexpected issues. This indicates that the database is often the forgotten part of the software test and release cycle. Unless companies start to seriously consider the database when it comes to change management, things like this will continue to happen.
  • Automate everything. Manual efforts fail. Humans make mistakes. Companies such as Salesforce that handle sensitive customer data should not put it on the internet unless they have automated every aspect of the system, including the database. Companies need a robust DataOps process, one that includes production-like data and automates the validation of database changes (a minimal sketch of such a validation gate follows this list).
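To make that second point concrete, here is one way a pipeline could gate a database change script: apply it to a production-like staging copy inside a transaction, diff the permission grants on restricted tables and fail the build if access widens. This is a minimal sketch under stated assumptions, not Salesforce's actual process; the staging connection string, script path and table names are hypothetical placeholders, and it assumes a PostgreSQL staging database reachable from CI.

    # Minimal sketch: validate a change script against a production-like copy
    # before it ever reaches production. All names below are hypothetical.
    import sys
    import psycopg2  # assumes a PostgreSQL staging copy; adapt for your database

    STAGING_DSN = "dbname=staging user=ci host=staging-db"          # hypothetical
    CHANGE_SCRIPT = "changes/2019-05-17-permissions.sql"            # hypothetical
    RESTRICTED_TABLES = ["accounts", "contacts", "opportunities"]   # hypothetical

    def snapshot_grants(cur):
        """Return the set of (grantee, table, privilege) rows on restricted tables."""
        cur.execute(
            """
            SELECT grantee, table_name, privilege_type
            FROM information_schema.role_table_grants
            WHERE table_name = ANY(%s)
            """,
            (RESTRICTED_TABLES,),
        )
        return set(cur.fetchall())

    def main():
        conn = psycopg2.connect(STAGING_DSN)
        conn.autocommit = False
        cur = conn.cursor()

        before = snapshot_grants(cur)

        # Apply the change script inside the transaction on the staging copy
        # (assumes plain SQL statements; COPY and similar would need special handling).
        with open(CHANGE_SCRIPT) as f:
            cur.execute(f.read())

        after = snapshot_grants(cur)
        widened = after - before  # grants that did not exist before the change

        # Roll back either way; this run exists only to validate the script.
        conn.rollback()
        conn.close()

        if widened:
            print("FAIL: change script widens access on restricted tables:")
            for grantee, table, privilege in sorted(widened):
                print(f"  {grantee} gains {privilege} on {table}")
            sys.exit(1)
        print("OK: no new grants on restricted tables")

    if __name__ == "__main__":
        main()

A real DataOps pipeline would layer on schema diffs, data-quality checks and rollback rehearsal, but even a gate this small would catch a script that silently grants read and write access to restricted data.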

Lessons for end users:

  • Establish a continuity plan. Organizations that depend on systems such as Salesforce need a business continuity plan so they can keep operating when those systems go down. This includes syncing Salesforce tasks and calendars with Office 365, Exchange or Google Apps. However, since Salesforce has become the system of record for most companies that use it, keeping a backup of the data will be difficult; after all, the point of adopting the platform in the first place was to avoid hosting the data yourself.
  • Demand accountability. Users can and should demand more accountability from software vendors. That can take the form of asking for a refund, but customers should also speak to their account manager to understand what steps Salesforce is taking to make sure this does not happen again. It's 2019, and companies update software all the time without taking down their production systems at the end of a quarter. Customers can also demand that vendors not change production systems during business hours, except for necessary security fixes.

The bottom line is that, while the mistakes that led to the Salesforce outage were costly and highly visible to customers, they were also entirely preventable.

Robert Reeves

As Datical's chief technical officer, Robert Reeves advocates for customers and provides technical architecture leadership. Prior to co-founding Datical, Robert was a director at the Austin Technology Incubator. At ATI, he provided real world entrepreneurial expertise to ATI member companies to aid in market validation, product development and fundraising efforts. Robert co-founded Phurnace Software in 2005. He invented and created the flagship product, Phurnace Deliver, which provides middleware infrastructure management to multiple Fortune 500 companies. As chief technology officer, he led technical evangelism efforts, product vision and large account technical sales efforts. After BMC Software acquired Phurnace in 2009, Robert served as chief architect and lead worldwide technical evangelist.
