This time last year, as the Internal Revenue Service (IRS) geared up for its biggest day of the year, a firmware bug deep in its storage network was waiting to take down the system. When Tax Day arrived, the onslaught of last-minute filers triggered a systemwide outage that lasted 11 hours and forced the IRS to extend its filing deadline by a day.
The IRS has another chance to get it right next week, but its IT team is far from alone in having dropped the ball on its most important day of the year. In 2013, the government’s Healthcare.gov website was plagued by crashes on the day Obamacare enrollment launched. Last year, Amazon’s website crashed just as Prime Day kicked off, and a few months later, Walmart’s systems fumbled at the start of Black Friday.
These glitches can cost organizations millions, not to mention the hit to their reputations and the inconvenience to their customers. Big days are fraught with risk: sudden traffic spikes collide with code that has never been exercised at that scale. As Tax Day approaches, it’s worth looking at how any organization can avoid an embarrassing or costly glitch when its own big day arrives.
Here are steps organizations can take to minimize the chance of disruption.
Have a Deployment Process You Trust
People often talk about a code freeze as a solution, but this shouldn’t be your first instinct. A code freeze ties your hands from making last-minute changes, and let’s face it, few teams have their code buttoned down months ahead of time. Far better is to have a rock-solid deployment process that gives you confidence that what you’re shipping won’t break. Here are some elements it should include, with a couple of illustrative sketches after the list:
- Trusted testing gate before production. You and your team need confidence that you’re shipping code that works on the devices and browsers your customers use. There are many ways to do this, but weigh three factors: the release throughput you need, the level of risk you’re willing to tolerate and the cost to your business of bugs reaching production.
- A totally automated process with no human bottlenecks. There’s nothing like trying to ship a fix to production, only to realize the engineer with permission to approve the release is on PTO.
- Proper environment promotion. Automation notwithstanding, never ship straight to production! Maintain a staging environment that replicates production closely enough to test your application under realistic conditions.
- Good test data management. This is one of the areas where software companies most often fall down, and one of the most difficult things to get right in testing. Stale data, data that lacks the edge cases and nuances of production, and data that contains personally identifiable information (PII) are all common culprits. The second sketch below shows one way to tackle the PII part.
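To make the "no human bottlenecks" point concrete, here is a minimal sketch of an automated release gate in Python. The endpoints, commands and checks are hypothetical placeholders, not a prescription; the point is that every gate runs without a named approver, and a failure at any stage blocks the release.

```python
import subprocess
import sys
import urllib.request

# Hypothetical endpoints and commands; substitute your own.
STAGING_HEALTH_URL = "https://staging.example.com/health"
TEST_COMMAND = ["pytest", "--maxfail=1", "tests/"]
DEPLOY_COMMAND = ["./deploy.sh", "production"]

def tests_pass() -> bool:
    """Run the automated test suite; any failure blocks the release."""
    return subprocess.run(TEST_COMMAND).returncode == 0

def staging_healthy() -> bool:
    """Confirm the staging replica is up before promoting the build."""
    try:
        with urllib.request.urlopen(STAGING_HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def main() -> None:
    # Each gate is automated: no manual sign-off, no PTO bottleneck.
    if not tests_pass():
        sys.exit("Release blocked: test suite failed.")
    if not staging_healthy():
        sys.exit("Release blocked: staging environment unhealthy.")
    subprocess.run(DEPLOY_COMMAND, check=True)
    print("Release promoted to production.")

if __name__ == "__main__":
    main()
```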
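On test data specifically, one common approach is to mask PII fields deterministically so the data keeps its production shape without carrying real identities. A minimal sketch, with hypothetical field names; note that simple hashing is not true anonymization for guessable values like Social Security numbers, so treat this as a starting point:

```python
import hashlib

# Hypothetical record layout; adapt field names to your schema.
PII_FIELDS = {"name", "email", "ssn"}

def mask(value: str) -> str:
    """Replace a PII value with a stable pseudonym so joins across tables still work."""
    digest = hashlib.sha256(value.encode()).hexdigest()[:12]
    return f"user-{digest}"

def scrub(record: dict) -> dict:
    """Copy a production record, masking PII but preserving shape and edge cases."""
    return {k: mask(v) if k in PII_FIELDS else v for k, v in record.items()}

prod_row = {"name": "Jane Doe", "email": "jane@example.com",
            "ssn": "123-45-6789", "filing_status": "married"}
print(scrub(prod_row))
# {'name': 'user-...', 'email': 'user-...', 'ssn': 'user-...', 'filing_status': 'married'}
```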
Stress Test for Robustness
In high-traffic periods, systems break in ways they never have before. Stress testing helps ensure services keep running under heavy load, but you need a realistic approach grounded in actual user behavior. Can you handle 20,000 users all trying to register at once when your new app goes live? What pathways will customers take through your website or application? What features have caused problems in the past? Identify the common pathways and likely trouble spots and stress-test the hell out of them.
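As a starting point, here is a minimal load-test sketch in Python using only the standard library. The registration endpoint and user count are hypothetical, and dedicated tools such as JMeter or Locust go much further, but the idea is the same: replay a realistic pathway at volume and watch error rates and latency.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

REGISTER_URL = "https://staging.example.com/register"  # hypothetical endpoint
CONCURRENT_USERS = 500  # scale toward your real peak, e.g. 20,000

def register_once(i: int) -> tuple[bool, float]:
    """Simulate one user hitting the registration pathway; return (ok, seconds)."""
    payload = f'{{"user": "loadtest-{i}"}}'.encode()
    req = urllib.request.Request(REGISTER_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status < 400, time.perf_counter() - start
    except OSError:
        return False, time.perf_counter() - start

# Fire all simulated users at once and summarize the results.
with ThreadPoolExecutor(max_workers=CONCURRENT_USERS) as pool:
    results = list(pool.map(register_once, range(CONCURRENT_USERS)))

ok = sum(1 for success, _ in results if success)
worst = max(latency for _, latency in results)
print(f"{ok}/{len(results)} succeeded; worst latency {worst:.2f}s")
```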
Build a Culture of Quality
One reason big events are scary for engineers is the sudden emphasis on accountability and quality that was lacking before. Solving for quality at the eleventh hour is guaranteed to be less effective and less efficient than building a culture of quality from the start. As with most things in management, the only way to do this well is to create accountability for your teams and give them direct ownership of quality. A few quick ideas: add a functionality review to your code review step; provide tooling so engineers can drive testing themselves on pre-production features; and make them feel the pain of bugs reaching production through an engineering support rotation.
Measurement and Metrics
Achieving this culture of quality depends partly on good measurement and metrics. Sadly, this is one of the weakest and least-defined areas of development today. Since you can’t directly observe quality in production (you only know about the bugs that actually get caught), it’s more productive to measure the coverage of your QA process and tie that back to the business. One way to do this is to compare how your customers use your application in production with how you test it pre-production. That comparison shows which areas are under-covered or over-covered, and it’s a useful way to judge whether incremental investments in QA will actually be ROI-positive.
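To make that comparison concrete, here is a minimal sketch assuming you can export per-endpoint request counts from production access logs and from your pre-production test runs. The endpoints and numbers below are invented for illustration.

```python
from collections import Counter

# Hypothetical exports: endpoint -> request count.
production_hits = Counter({"/file-return": 90_000, "/check-status": 30_000,
                           "/amend": 4_000, "/help": 1_000})
test_hits = Counter({"/file-return": 200, "/help": 150, "/admin": 50})

def share(counts: Counter) -> dict:
    """Convert raw hit counts into each endpoint's fraction of total traffic."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

prod, test = share(production_hits), share(test_hits)
for endpoint in sorted(set(prod) | set(test)):
    gap = prod.get(endpoint, 0.0) - test.get(endpoint, 0.0)
    # A large positive gap means customers hit it far more than your tests do.
    label = "under-covered" if gap > 0.05 else "over-covered" if gap < -0.05 else "ok"
    print(f"{endpoint:15} prod {prod.get(endpoint, 0.0):6.1%}  "
          f"test {test.get(endpoint, 0.0):6.1%}  {label}")
```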
If You Can’t Do Any of That … Fine, Resort to a Code Freeze
If you don’t have a bulletproof deployment process, a code freeze is a brute-force option. It ties your hands from making last-minute changes, but at least ensures you won’t add new risks without adequate time to discover them. The duration depends on how long it takes your team to surface production issues, based on past experience. If you don’t have that data, six to eight weeks may be a safe period. But really, if this is your main strategy for avoiding problems on game day, you’re probably in the wrong job.
Review Patch Management Policies
This one might have helped the IRS. Months before the outage, its storage vendor, IBM, had issued a patch for the bug that ultimately brought down the system, but the IRS didn’t apply it because the patch was part of a code bundle that didn’t meet the agency’s production testing requirements. A government report concluded that the IRS needs to vet such decisions more rigorously and document the process. Bottom line: have a structure in place that ensures you’re making informed, well-documented decisions about whether to apply each new patch.
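There’s no single right format for this, but one lightweight approach is to make each patch decision a structured, validated record, so a deferral can’t happen without a written rationale. A minimal sketch in Python, with purely illustrative fields and values:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class PatchDecision:
    """One vetted, documented decision about a vendor patch (fields are illustrative)."""
    patch_id: str
    vendor: str
    fixes: str            # what the patch addresses
    risk_if_skipped: str  # e.g. "storage outage under peak load"
    applied: bool
    rationale: str        # why it was applied or deferred
    review_date: date
    reviewer: str

def validate(decision: PatchDecision) -> None:
    """Refuse undocumented decisions: every deferral needs a written rationale."""
    if not decision.applied and len(decision.rationale.strip()) < 20:
        raise ValueError(f"Patch {decision.patch_id}: deferral requires a documented rationale.")

# Hypothetical example record; validate() passes because the deferral is documented.
validate(PatchDecision(
    patch_id="FW-2017-114", vendor="ExampleVendor", fixes="storage firmware fault",
    risk_if_skipped="systemwide outage at peak load", applied=False,
    rationale="Bundle failed production test requirements; retest scheduled.",
    review_date=date(2017, 11, 1), reviewer="infra-lead",
))
```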
Other issues contributed to the IRS meltdown: its storage system had no automatic failover or built-in redundancy, leaving a single point of failure. Several factors conspired against the agency, but luck isn’t something you should depend on. Executing successfully on a big event isn’t achieved in the few short months beforehand; it comes from building the culture and practices that guide your development and operations year-round. Do that, and you’ll greatly reduce both the risk and the work it takes to be prepared.