Tech Pros: Plan for Failure to Breed Success

Veteran tech pros will tell you: Failure does not always mean failing at something; it often means failing to do something. Failure can be unpredictable and accidental, but the most significant incidents usually stem from a lack of preparedness. Data centers go down, and huge bursts of bot activity generate more requests than a network can handle. Every systems engineer knows that one day, probably sooner rather than later, AWS will go offline. Not preparing for the worst increases the likelihood of the outages, attacks, breaches and downtime that keep tech pros up at night.

While this applies to every aspect of technology, failure planning has recently surged in app development through an approach increasingly known as “plannable catastrophes.” This pragmatic approach considers catastrophes during the application design stage, anticipating disasters up front so an app can still perform under duress. Preparing this way can reduce the time an app is down and give a company the upper hand when an unprepared competitor is hit by a similar issue at the same time.

Plan for Failure

The first step in planning for failure is designing processes to fail gracefully, for example by switching to a degraded service option rather than failing outright. That way, if one service goes offline, the rest of the application continues to work. A real-world example is letting users log in and view documents even if the search process is offline. Another way to plan for failure is to build in request buffering so that the components servicing requests do not become overloaded. Buffering helps prevent bottlenecks, especially during a spike in requests.
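
As a rough illustration of both ideas, the Python sketch below returns a degraded response when search is offline and uses a bounded buffer that rejects new requests once it is full. Every name here (search_backend, search_documents, RequestBuffer, the online flag) is hypothetical and stands in for whatever services a real application would call; treat it as a sketch under those assumptions rather than a reference design.

```python
import queue


class SearchUnavailable(Exception):
    """Raised by the hypothetical search backend when it is offline."""


def search_backend(query, online=True):
    # Stand-in for a real search service; `online` simulates an outage.
    if not online:
        raise SearchUnavailable("search is offline")
    return [f"result for {query}"]


def search_documents(query, online=True):
    """Return search results, or a degraded response if search is down."""
    try:
        return {"degraded": False, "results": search_backend(query, online)}
    except SearchUnavailable:
        # Degrade instead of failing: the rest of the app (login, document
        # viewing) keeps working, and the caller learns search is unavailable.
        return {"degraded": True, "results": []}


class RequestBuffer:
    """Bounded buffer that rejects new work instead of overloading workers."""

    def __init__(self, capacity):
        self._queue = queue.Queue(maxsize=capacity)

    def submit(self, request):
        """Queue a request; return False if the buffer is already full."""
        try:
            self._queue.put_nowait(request)
            return True
        except queue.Full:
            return False

    def next_request(self):
        """Return the oldest buffered request, or None if the buffer is empty."""
        try:
            return self._queue.get_nowait()
        except queue.Empty:
            return None
```

Rejecting a request when the buffer is full is one possible policy; a real system might instead block briefly, retry with backoff, or route the request elsewhere.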

The second step is to detect failure conditions quickly, so you know when to switch over to the degraded service option. Having full visibility is essential to developing early warning signals for system issues.

Business Considerations

Every time an application is down or running slowly, there is a direct impact on the business, either in lost productivity or lost revenue. It’s crucial to bring business context into monitoring performance metrics and to prioritize issues based on business impact, ultimately helping ensure that end users or customers have an optimal experience.

It’s also key to include endpoint monitoring tools so you can quickly see whether all functions are online and load times are within range. Monitoring spikes in load or processing times, as well as resource utilization, is also crucial. Together, these measures allow you to detect failure conditions quickly and know when to switch over to the degraded service option, ideally via automation.
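
One way to automate that switchover is to count consecutive unhealthy checks and flip into degraded mode once a threshold is reached. The Python sketch below assumes some external probe supplies each check result; the class names, the two-second load-time limit and the threshold of three failures are illustrative assumptions, not values taken from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    ok: bool            # did the endpoint respond successfully?
    load_time_s: float  # how long the response took, in seconds


class EndpointMonitor:
    """Tracks endpoint checks and decides when to run in degraded mode."""

    def __init__(self, max_load_time_s=2.0, failure_threshold=3):
        # Both limits are illustrative defaults and should be tuned per service.
        self.max_load_time_s = max_load_time_s
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.degraded = False

    def record(self, result):
        """Record one check and return True if degraded mode is active."""
        healthy = result.ok and result.load_time_s <= self.max_load_time_s
        if healthy:
            # A healthy check resets the counter and restores full service.
            self.consecutive_failures = 0
            self.degraded = False
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.degraded = True
        return self.degraded
```

Requiring several failures in a row avoids switching modes on a single slow response, at the cost of slightly slower detection.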

Creating a Holistic View of a Cloud Environment

How does planning for failure come to fruition in app development and deployment? The answer lies in creating a holistic view of a cloud environment. Implementing this view from the start means accepting that an application will fail (a matter of when, not if) and planning accordingly. Best practices include:

Shift Left: Integrate operations and monitoring considerations in application design and system architecture, coding, testing and preproduction. This helps both development and operations teams identify issues early, cut down on troubleshooting app issues later and react quickly to keep applications up and running even during spikes in demand or service failures.
Adopt an application performance monitoring (APM) tool: Full-stack visibility, from the end user through to the applications and infrastructure, is essential for spotting the early warning signs of a potential service disruption. To create that view, select an APM tool that can monitor the performance of the application holistically and lets you design meaningful performance metrics and alerts. A curated collection of easy-to-use tools that provides full-stack performance management without the cost or complexity of traditional APM solutions is key.
Be pragmatic in app development: Leverage APM tools in the staging or dev environment to measure the effect of variables on performance before the app goes live. Use APM tools to set up baselines at each phase, creating metrics that flag when issues are about to occur and detect failure before an app goes completely offline. Understanding the app’s baseline activity and response time lets tech pros determine whether the app is performing at the level it was designed for. Tech pros can then leverage platform capabilities to meet and exceed initial goals.
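
To make the baseline idea concrete, here is a minimal Python sketch that builds a response-time baseline from staging measurements and flags observations that stray too far above it. The function names, the sample numbers and the three-standard-deviation tolerance are assumptions chosen for illustration; an APM tool would normally compute and store baselines like this for you.

```python
import statistics


def build_baseline(samples_ms):
    """Summarize response times (in ms) collected in staging or dev.

    Needs at least two samples so a standard deviation can be computed.
    """
    return {
        "mean": statistics.mean(samples_ms),
        "stdev": statistics.stdev(samples_ms),
    }


def exceeds_baseline(baseline, observed_ms, tolerance=3.0):
    """Return True if an observation is well above the baseline.

    `tolerance` is the number of standard deviations above the mean that
    counts as abnormal; 3.0 is an illustrative default, not a standard.
    """
    threshold = baseline["mean"] + tolerance * baseline["stdev"]
    return observed_ms > threshold


# Hypothetical staging measurements, in milliseconds.
baseline = build_baseline([100, 110, 90, 105, 95])
slow_request_flagged = exceeds_baseline(baseline, 200)
normal_request_flagged = exceeds_baseline(baseline, 120)
```

With these sample numbers, the 200 ms observation is flagged and the 120 ms observation is not.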

Conclusion

Failure is inevitable. Tech pros must ensure end users continue to have the best possible experience in the application and that business goals and service level agreements (SLAs) are met even amid crises. Preparedness comes in the form of taking preventative measures to help avoid and mitigate failure and having a plan in place to get back online as quickly as possible if preventative measures fail.

Shifting left, building failure mitigation into the design phase and implementing an APM tool help create a holistic view of the cloud environment and achieve these goals. Planning and building for failure help tech pros avoid and mitigate potentially detrimental issues, ultimately using failure to breed success.

— Melanie Achard