According to research from Information Technology Intelligence Consulting (ITIC), 98 percent of organizations report that a single hour of downtime can cost upwards of $100,000. That figure does not account for long-term damage to the brand, lost trust among existing customers, or lost engineering time. This is why resolving high-severity incidents quickly and safely is so important.
Here are three simple yet effective ways to minimize the impact of high-severity incidents within your organization:
Establish an Incident Manager On-Call Role and Rotation
One of the most impactful ways to deal with high-severity incidents (SEVs) is to establish an incident manager on-call (IMOC) role and rotation. The IMOC is responsible for resolving SEVs quickly and safely, and manages each one through its full life cycle: detection, diagnosis, mitigation, prevention and closure. The role also gives everyone in the company a single point of contact. Ultimately, an IMOC rotation leads to a significantly reduced mean time to resolution (MTTR) for SEVs.
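To see whether an IMOC rotation is actually reducing MTTR, it helps to measure it. The following is a minimal Python sketch, assuming each SEV is recorded as a pair of detection and resolution timestamps. The record layout, the function name and the sample data are illustrative assumptions, not part of any particular incident tool.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Return the average (resolved - detected) duration across incidents.

    `incidents` is a list of (detected_at, resolved_at) datetime pairs.
    """
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so durations can be summed
    )
    return total / len(incidents)


# Hypothetical SEV records: (detected_at, resolved_at)
sevs = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 45)),     # 45 minutes
    (datetime(2024, 2, 11, 22, 30), datetime(2024, 2, 12, 0, 0)),  # 90 minutes
    (datetime(2024, 3, 2, 14, 0), datetime(2024, 3, 2, 14, 15)),   # 15 minutes
]

mttr = mean_time_to_resolution(sevs)
print(mttr)  # 0:50:00
```

Comparing this number for the period before and after the rotation was introduced gives a simple, if rough, signal of its effect.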
Identify and Assess Your Critical Services
Does everyone in your company know what your top five most critical services are? Take a proactive approach to improving overall reliability by making it a top priority to identify these services and assess their reliability. It is important to understand your system and recognize how the failure of a critical service could result in a high-severity incident. Reducing downtime is certainly a goal every engineering team should have, but the fact is, things inevitably go wrong.
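One rough way to start that assessment is to ask, for each service, how many other services would be affected if it failed. The sketch below assumes you can write down a dependency map by hand. The service names and the `DEPENDS_ON` structure are hypothetical, and a real assessment would also weigh traffic, revenue and customer impact.

```python
from collections import deque

# Hypothetical dependency map: service -> services it depends on.
DEPENDS_ON = {
    "checkout": ["payments", "inventory"],
    "payments": ["database"],
    "inventory": ["database"],
    "search": ["inventory"],
    "database": [],
}


def affected_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges so we can walk from a service to its dependents.
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    return seen


def rank_by_blast_radius(depends_on):
    """Order services from most to fewest affected dependents (ties by name)."""
    return sorted(
        depends_on,
        key=lambda s: (-len(affected_by(s, depends_on)), s),
    )


print(rank_by_blast_radius(DEPENDS_ON))
# ['database', 'inventory', 'payments', 'checkout', 'search']
```

In this toy map, a database failure reaches every other service, which makes it an obvious candidate for the critical services list.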
Practice Chaos Engineering
Chaos engineering is the practice of performing thoughtful, planned experiments designed to reveal the weaknesses in our systems. It may seem counterintuitive, but like a flu shot, being proactive now helps prevent something more harmful in the future. With a better understanding of your system’s weaknesses, you’ll be able to troubleshoot issues more effectively and minimize the impact on your customers. Chaos engineering will empower you to identify weaknesses before they become SEVs.
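The shape of such an experiment can be illustrated in a few lines of Python: state a hypothesis about steady-state behavior, inject a failure, and check whether the hypothesis holds. This is a toy simulation, not a real chaos engineering tool. The `FlakyDependency` class, the `get_price` function and the failure rates are all invented for illustration.

```python
import random


class FlakyDependency:
    """Simulated downstream service that fails at a configurable rate."""

    def __init__(self, failure_rate, rng):
        self.failure_rate = failure_rate
        self.rng = rng

    def fetch_price(self, item):
        # The injected fault: raise instead of answering some of the time.
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected failure")
        return 9.99


def get_price(dependency, item, cached_price=None):
    """Code under test: fall back to a cached price if the dependency fails."""
    try:
        return dependency.fetch_price(item)
    except ConnectionError:
        return cached_price


def run_experiment(failure_rate, requests=1000, cached_price=None, seed=42):
    """Return the fraction of requests that still produced a price."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    dependency = FlakyDependency(failure_rate, rng)
    served = sum(
        get_price(dependency, "sku-1", cached_price) is not None
        for _ in range(requests)
    )
    return served / requests


# Hypothesis: with 30% of dependency calls failing, at least 99% of
# requests are still served.
without_fallback = run_experiment(0.3)
with_fallback = run_experiment(0.3, cached_price=9.50)
print(without_fallback >= 0.99, with_fallback >= 0.99)
```

Without a cached price the hypothesis fails, which is exactly the kind of weakness the experiment is meant to surface before it turns into a SEV. With the fallback in place, every request is served.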
In Conclusion
Implementing an IMOC role and rotation, identifying and assessing your critical services, and practicing chaos engineering are three ways you can minimize the impact of high-severity incidents, keeping your business running smoothly even when things aren’t.
— Tammy Butow