As a developer or operations team member, there is nothing quite like the dread you feel when you hear the familiar ringtone of your on-call page at 3 a.m. Being on call means that you may be contacted at any time to investigate and fix issues that arise for the system, but that doesn’t mean you can’t get ready beforehand. There are steps your team can take to prepare for incidents and streamline the process of resolving them, leading to fewer 3 a.m. wake-up calls and better-running software and services.
Below is an overview of a proven approach your organization can take before, during and after an IT incident to reduce headaches and increase hours of sleep.
Before: Define the Who, What, When of Incidents
The first step in any incident prevention and orchestration process is determining how your organization defines an incident. Generally, this is done by defining varying degrees of severity, with lower-numbered severities being more urgent. Every operational issue can then be classified at one of these severity levels. As a general rule of thumb, if you are unsure which level an incident is, treat it as the more severe one to ensure it is dealt with promptly.
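One way to make a severity scheme concrete is a small enumeration. The following is a minimal sketch in Python, not a prescribed scheme: the level names (`SEV1` through `SEV4`), the number of levels and the helper `classify_when_unsure` are all hypothetical, and your organization's definitions may differ. It encodes the two rules described above: lower numbers are more urgent, and when you are torn between levels, you pick the more severe one.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity levels; a lower number means more urgent."""
    SEV1 = 1  # assumed label for the most urgent level
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4  # assumed label for the least urgent level


def classify_when_unsure(*candidates: Severity) -> Severity:
    """Return the most severe (lowest-numbered) of the candidate levels."""
    if not candidates:
        raise ValueError("at least one candidate severity is required")
    return min(candidates)
```

For example, `classify_when_unsure(Severity.SEV2, Severity.SEV3)` returns `Severity.SEV2`, so an incident that might be either level gets handled as the more urgent one.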
In addition to aligning your team on how an incident is defined, which communicates the sense of urgency involved, there are several main roles that should be designated ahead of potential IT issues. Certain roles have only one person per incident, while other roles can have multiple people assigned to them. It’s all about coming together as a team, working the problem and getting a solution quickly. Generally speaking, developer, IT Ops and DevOps teams must designate the following roles:
- A point person or “Incident Commander”—someone who will drive the process forward, but not be involved in the actual remediation;
- Someone to document a timeline of an incident as it progresses for future analysis and learning and to act as a backup for the point person;
- One or more subject matter experts who are deeply familiar with the specified component or service and who will take the remediation steps; and
- One person to manage customer support and communication during and after an incident.
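The roles above can also be written down in a form that a script can check before or at the start of an incident. The sketch below is one possible way to do that in Python, under stated assumptions: the role names `SCRIBE` and `CUSTOMER_LIAISON`, the function `staffing_problems` and the choice to treat the scribe as a single-person role are illustrative and not taken from any particular tool.

```python
from collections import Counter
from enum import Enum


class Role(Enum):
    INCIDENT_COMMANDER = "incident commander"
    SCRIBE = "scribe"  # hypothetical name: keeps the timeline, backs up the IC
    SUBJECT_MATTER_EXPERT = "subject matter expert"
    CUSTOMER_LIAISON = "customer liaison"  # hypothetical name


# Assumption: these roles are filled by exactly one person per incident,
# while subject matter experts may number one or more.
SINGLE_PERSON_ROLES = {Role.INCIDENT_COMMANDER, Role.SCRIBE, Role.CUSTOMER_LIAISON}


def staffing_problems(assignments: list[tuple[str, Role]]) -> list[str]:
    """Return human-readable problems with a list of (person, role) pairs."""
    counts = Counter(role for _, role in assignments)
    problems = []
    for role in Role:
        n = counts[role]
        if n == 0:
            problems.append(f"no {role.value} assigned")
        elif role in SINGLE_PERSON_ROLES and n > 1:
            problems.append(f"{n} people assigned as {role.value}; expected one")

    # The incident commander drives the process and should not remediate.
    commanders = {p for p, r in assignments if r is Role.INCIDENT_COMMANDER}
    experts = {p for p, r in assignments if r is Role.SUBJECT_MATTER_EXPERT}
    for person in sorted(commanders & experts):
        problems.append(f"{person} is both incident commander and subject matter expert")
    return problems
```

An empty list means every role is covered. Running a check like this against the on-call roster ahead of time is one way to find gaps before an incident does.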
During: Keep Calm and Communicate
If you are alerted to a major incident, don’t panic. The first step is to join the communication channel your team agreed on beforehand for incidents, so that communication runs smoothly throughout the resolution process. The incident commander (IC) and the IC deputy, the person designated as backup for the IC, should announce the issue in the appropriate communication channels and lead the process. Everyone else should defer to the subject matter experts assigned to the incident and keep non-essential communication to a minimum.
The steps you take to prepare for an incident have a significant impact on how quickly you are able to move when disaster strikes. If incident prep has been done correctly, each member of the incident response team will have a very specific role and set of responsibilities carved out, ranging from someone to provide regular updates in the chat client to someone to modify your company’s status page to keep customers informed. By having these roles defined beforehand, you can spend valuable time during an incident fixing the problem instead of figuring out who does what. In a future article, I will break down the specific role of each member of the incident response team and the steps they should follow during an incident.
After: Conduct a Post-Mortem Without Finger-Pointing
For every major incident, you should follow up with a post-mortem: a blame-free, detailed description of exactly what went wrong to cause the incident, along with a list of steps to take to prevent a similar incident from occurring again. The post-mortem should also review the incident response process itself.
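To keep post-mortems consistent, the three parts described above can be captured in a simple record. The sketch below is only an illustration in Python and is not the template promised for a future article: the class `PostMortem`, its field names and the `missing_sections` helper are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PostMortem:
    """Hypothetical record layout; field names are illustrative."""
    what_went_wrong: str  # blame-free description of the cause
    prevention_steps: list[str] = field(default_factory=list)
    response_process_review: str = ""  # how the response itself went

    def missing_sections(self) -> list[str]:
        """Return the names of sections that have not been filled in yet."""
        missing = []
        if not self.what_went_wrong.strip():
            missing.append("what_went_wrong")
        if not self.prevention_steps:
            missing.append("prevention_steps")
        if not self.response_process_review.strip():
            missing.append("response_process_review")
        return missing
```

A post-mortem could then be treated as finished only once `missing_sections()` returns an empty list, which helps make sure the review of the response process is not skipped.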
As the IT incidents we deal with daily become increasingly tied to larger organizational success and business objectives, streamlining the resolution process is a must. According to a report from IDC, the average cost of an infrastructure failure is $100,000 per hour, and the average cost of a critical application failure is $500,000 to $1 million per hour. In future articles, I will break down the varying roles on an incident response team, specific steps to follow during an incident, a tried-and-true template for conducting a successful post-mortem and more, to ensure effective incident prevention and resolution.
About the Author / Eric Sigler
Eric Sigler is the Head of DevOps at PagerDuty, helping protect its customers from the pains of downtime. Before his current role, Eric led infrastructure teams at Minted, Expensify, and the Missouri University of Science and Technology. Connect with him on Twitter.