As a product management professional, I am trained to understand customer pain, especially the pain of downtime caused by poor software quality. Last Friday, though, I experienced that pain firsthand when I arrived at Heathrow Airport early, only to discover that the Microsoft outage had grounded flights around the world. My flight from London was delayed only a couple of hours, but my connecting flight was canceled outright. Since it was the last flight of the day to my hometown, I had to find a hotel room for the night, but all of the rooms within 30 miles of the airport were sold out. All of this because of a bug in one small file, not much larger than this article, that impacted millions of people around the world.
Most software releases contain defects. I have never seen production code that didn't have issues, especially in enterprise-level applications. The defects that make it into production are usually cosmetic, or minor bugs that affect a seldom-used feature. Defects like the one that took down the airlines, however, are another matter entirely. Now it's time to consider what lessons we can learn from this event.
Lesson one: Shift Right is NOT optional. Shift Right is the idea that you should keep testing in production, after the release. You have to wonder whether this was done in the case of last week's outage. Even if testing was performed in pre-release environments, you can't be sure those environments match production. At a minimum, you must smoke-test in production. Better yet, run regression tests there regularly. Even then, testing a single process or component may not be enough, which leads us to the next lesson.
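To make the idea concrete, here is a minimal sketch of what a post-release smoke test against production might look like, written in Python with the requests library. The base URL, endpoints, and response shapes are hypothetical stand-ins, not any particular system's API; the point is simply to hit the handful of paths your business cannot live without, immediately after every deploy.

```python
# Minimal post-release smoke test sketch (hypothetical endpoints and payloads).
# Run against production right after a deploy to confirm the basics still work.
import requests

PROD_BASE_URL = "https://api.example.com"  # assumption: your production base URL

def check(path, expected_status=200, timeout=5):
    resp = requests.get(f"{PROD_BASE_URL}{path}", timeout=timeout)
    assert resp.status_code == expected_status, f"{path} returned {resp.status_code}"
    return resp

def test_health_endpoint():
    check("/health")  # service is up and answering

def test_critical_read_path():
    # Hypothetical example of a business-critical query
    resp = check("/flights/search?origin=LHR&dest=JFK")
    assert resp.json().get("results") is not None  # payload shape is sane

def test_dependency_status():
    # Hypothetical endpoint reporting whether downstream systems are reachable
    resp = check("/health/dependencies")
    assert all(dep["ok"] for dep in resp.json()["dependencies"])
```

Wired into a post-deploy pipeline step or a scheduler, a script like this turns "we think production is fine" into a yes-or-no answer within seconds of a release.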
Lesson two: Integration and end-to-end system testing is a must. Most enterprise-class applications today are built from a complex collection of interconnected services and systems. Each service in isolation can function precisely as expected, but when you test them together with all the other moving pieces, you may find that small delays here and there break the end-to-end process. These timing-related issues are exacerbated by unusually high volumes, which brings us to the next lesson.
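As an illustration, here is a sketch of an end-to-end test that walks a booking through several services in sequence. The reservation and payment endpoints are invented for the example, and the environment URL is an assumption; what matters is that the test crosses every hop in the chain, with explicit timeouts, rather than exercising each service on its own.

```python
# Sketch of an end-to-end test across hypothetical reservation and payment services.
# The goal is to verify the whole chain, including the delays between hops,
# not each service in isolation.
import requests

BASE = "https://staging.example.com"  # assumption: an environment that mirrors production
TIMEOUT = 10                          # fail fast if any hop is unusually slow

def test_booking_flow_end_to_end():
    # Step 1: reservation service holds a seat
    hold = requests.post(f"{BASE}/reservations",
                         json={"flight": "BA123", "seat": "12A"}, timeout=TIMEOUT)
    assert hold.status_code == 201
    booking_id = hold.json()["id"]

    # Step 2: payment service charges against that hold
    pay = requests.post(f"{BASE}/payments",
                        json={"booking": booking_id, "amount": 450}, timeout=TIMEOUT)
    assert pay.status_code == 200

    # Step 3: the reservation should now be confirmed
    status = requests.get(f"{BASE}/reservations/{booking_id}", timeout=TIMEOUT)
    assert status.json()["state"] == "confirmed"
```

The explicit timeout on every call is deliberate: slowness between hops is exactly the kind of failure that isolated unit tests will never surface.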
Lesson three: Performance and load testing are important too. Performance and load testing may not seem important under normal circumstances, but when recovering from an outage, there is typically a bow wave of users trying to catch up on the delayed work. You could call this a cascading effect, and the same kind of spike rears its ugly head at month end and quarter end for some businesses.
In the case of last week's incident, a huge number of canceled flights drove a surge in volume on the airline reservation systems. A system that may not have been touched by the bug itself was overwhelmed by an unusual level of activity. Load testing is not an easy job, and it may not need to be done constantly, but you shouldn't wait until disaster strikes to learn that a volume 20% above average will render a system unusable.
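A full load-testing practice usually involves dedicated tooling, but even a rough script can answer the basic question. The sketch below, in plain Python, pushes a request volume roughly 20% above an assumed baseline at a staging endpoint and reports the error count and tail latency. The endpoint, volumes, and concurrency are illustrative assumptions only.

```python
# Rough load-test sketch: fire ~20% above an assumed normal volume at a staging
# endpoint and report server errors and p95 latency.
import time
from concurrent.futures import ThreadPoolExecutor
import requests

TARGET = "https://staging.example.com/flights/search"  # hypothetical endpoint
NORMAL_VOLUME = 1000                    # requests in a typical peak window (assumed)
TEST_VOLUME = int(NORMAL_VOLUME * 1.2)  # the "20% above average" scenario
CONCURRENCY = 50                        # simultaneous workers (assumed)

def one_request(_):
    start = time.perf_counter()
    try:
        resp = requests.get(TARGET, params={"origin": "LHR", "dest": "JFK"}, timeout=15)
        return resp.status_code, time.perf_counter() - start
    except requests.RequestException:
        return 599, time.perf_counter() - start  # count timeouts/connection errors as failures

def main():
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        results = list(pool.map(one_request, range(TEST_VOLUME)))
    latencies = sorted(elapsed for _, elapsed in results)
    errors = sum(1 for status, _ in results if status >= 500)
    p95 = latencies[int(len(latencies) * 0.95)]
    print(f"{TEST_VOLUME} requests, {errors} errors, p95 latency {p95:.2f}s")

if __name__ == "__main__":
    main()
```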
Lesson four: You don't have to release to everyone all at once. Two practices could have helped prevent a global meltdown. The first is "dogfooding," a.k.a. eating your own dog food: the practice of using your own software before releasing it to others. That way, if there are major issues, you find them fast and contain the suffering within your own organization.
The second practice is the canary release. Just as miners once took canaries into coal mines to warn of deadly conditions, you can release code to a small subset of external users for a short period to make sure it works outside your organization. Once again, you contain the negative impact.
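In practice, canary rollouts are usually handled by a feature-flag or traffic-management layer, but the core idea fits in a few lines. The sketch below, with an assumed rollout percentage and user-ID scheme, buckets users deterministically so the same small slice of traffic always sees the new version.

```python
# Sketch of a simple canary gate: a small, stable percentage of users gets the new
# build; everyone else stays on the current one. Percentage and ID scheme are assumptions.
import hashlib

CANARY_PERCENT = 2  # start tiny; widen only when the canary stays healthy

def in_canary(user_id: str, percent: int = CANARY_PERCENT) -> bool:
    """Deterministically bucket a user so they see the same version on every request."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % 100 < percent

def choose_version(user_id: str) -> str:
    return "v2-canary" if in_canary(user_id) else "v1-stable"

if __name__ == "__main__":
    sample = [f"user-{n}" for n in range(1000)]
    canary_share = sum(in_canary(u) for u in sample) / len(sample)
    print(f"{canary_share:.1%} of sampled users routed to the canary")
```

Hashing the user ID instead of rolling a random number on each request keeps a given user's experience consistent and makes it easy to widen the percentage gradually as confidence grows.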
Lesson five: The cost of system outages has cascading impacts. It's not just the lost productivity during the downtime; it's also the domino effect. For example, downtime in the order entry system can cause problems in the inventory management system, which can delay shipments and hurt both the top line and the bottom line. Lost revenue and profit may cause shareholders and customers to vote with their feet.
Lost productivity is compounded by the need for overtime to catch up. For hourly workers, this is a direct cost to the company. For salaried employees, like developers and testers, this means a negative impact on their job satisfaction and time away from their families. Increased turnover in IT has a cost as well.
The decline in customer satisfaction is another indirect cost of poor quality that is not always tied directly back to an outage. Even if your competitors were hit by the same incident, your customers may not know that. All they know is that they were traveling with you that day.
The next time you are negotiating a budget for testing tools and resources, remember that there are not only significant downtime costs, but many indirect costs as well. Sometimes I think the cost of testing tools and personnel should be classified as insurance because it is. The fact that bugs very seldom lead to downtime means the pain is most often not acute. We live with it. Our customers live with it. Until one day it knocks you off your feet.
Don’t wait until the pain is unbearable. Invest in people, processes and tools to ensure that you will never be put in the position of explaining why a bug in one small file impacted the entire world.