In Part One of this series, I provided a high-level overview of what an efficient incident response process entails from start to finish, including a general framework for an incident response team. Now let’s delve a bit deeper into the role each person plays.
In my experience, there are several core people who make up an effective incident response team. Regardless of the severity of an issue, it’s all about coming together, working through the problem and finding a solution quickly before the incident has a severe impact on your end users. Below is a breakdown of the most critical roles.
Incident Response Structure
The Incident Commander
The Incident Commander (IC) runs the entire incident response process and acts as the single source of truth for what’s currently happening—as well as what’s going to happen—during a major incident. His or her primary responsibility is to drive the incident toward resolution.
To start, the IC is responsible for helping teams prepare by establishing agreed-upon communication channels and training team members on best practices and procedures for communicating. When an incident occurs, the IC has to move quickly to jump-start the resolution process, getting everyone on the same communication channel and collecting pertinent information from each team member. After getting a sense of the situation and the plan of action to make the necessary repairs, the IC must delegate all repair actions and continue to stay abreast of updates to drive the team toward resolution and be a reliable authority on the system status for all stakeholders. Throughout the process, the IC is not responsible for taking repair actions; rather, the IC ensures they are assigned to the appropriate people.
Following an incident, the IC assigns the post-mortem to an appropriate owner to help determine what went wrong and how it can be avoided in the future, as well as areas of improvement for the incident response team. This process includes creating a template immediately following the incident for team members to add their thoughts while still fresh, assigning the post-mortem after the event is over, and working with team leads to schedule preventive actions.
So what makes a good IC? At a minimum, ICs should possess the following traits and skills: excellent verbal and written communication; high-level knowledge of how different services interact with each other (and their business impact); and the ability to assess the effectiveness of various tactics/strategies, make rapid decisions on courses of action and modify plans on the fly as necessary. The final, and possibly most important, characteristic ICs should have is gravitas. A good IC is not afraid to take command and get things done. He or she must be willing to kick people off a call to remove distractions—even if it’s an executive.
The Deputy
The Deputy is a direct support role for the IC. Deputies are expected to perform important tasks during an incident and should be trained as ICs, as they may have to take over command at a moment’s notice. The deputy supports the IC in a variety of ways, such as raising issues the IC may not have noticed, keeping an eye on timers that have been started and circling back on missed items. Deputies also act as a “hot standby” IC: being an IC can be cognitively draining, so it’s good to “cycle out” during longer-running incidents.
Deputies’ responsibilities can vary, ranging from paging the right on-call engineers to managing the incident communication channels. They also can be in charge of contacting stakeholders to provide status updates from the IC. Deputies should be deeply familiar with incident response protocol, and possess organizational skills that keep things on track to allow the IC to focus on the incident.
The Scribe
Scribes are responsible for documenting the timeline of an incident as it progresses, making sure all important decisions and data are captured for later review. While the IC focuses on the problem at hand, it is critical for the scribe to capture an accurate timeline of events as they happen. This timeline will later be reviewed and analyzed during the post-mortem to determine how well the incident response team performed, and any additional impact that might not have been noticed at the time of the incident.
In addition to recording any oral communication regarding the incident, the scribe must keep track of important data, events and actions as they happen via whatever channel of communication has been designated by the IC (Slack, email, etc.). Specific details the scribe should note include key actions as they are taken, status reports when one is provided by the IC and any key callouts during the call or at the ending review.
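For teams that keep the scribe's log in a tool rather than free-form notes, the details listed above could be captured with a small structure like the one below. This is only a sketch under assumptions: the names `EntryKind`, `TimelineEntry` and `IncidentTimeline` are hypothetical and do not come from any particular incident response product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional


class EntryKind(Enum):
    """The three kinds of detail the scribe is asked to note (hypothetical labels)."""
    ACTION = "action"    # a key action as it is taken
    STATUS = "status"    # a status report provided by the IC
    CALLOUT = "callout"  # a key callout during the call or the ending review


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime
    kind: EntryKind
    author: str
    note: str


@dataclass
class IncidentTimeline:
    entries: List[TimelineEntry] = field(default_factory=list)

    def record(self, kind: EntryKind, author: str, note: str,
               at: Optional[datetime] = None) -> TimelineEntry:
        # Default to the current UTC time so entries are stamped as they happen.
        entry = TimelineEntry(at or datetime.now(timezone.utc), kind, author, note)
        self.entries.append(entry)
        return entry

    def in_order(self) -> List[TimelineEntry]:
        # Entries may be backfilled late, so sort by timestamp for review.
        return sorted(self.entries, key=lambda e: e.at)

    def render(self) -> str:
        # One line per entry, ready to paste into a post-mortem document.
        return "\n".join(
            f"{e.at:%H:%M:%S} [{e.kind.value}] {e.author}: {e.note}"
            for e in self.in_order()
        )
```

Sorting at review time rather than at write time lets the scribe backfill an entry that was missed in the moment without disturbing the rest of the log.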
Subject Matter Experts
Subject Matter Experts (SMEs), sometimes called “resolvers,” are domain experts or designated owners of a component or service that is part of an organization’s software stack. When there is a problem with a service, an expert is needed to be able to quickly help the IC and deputy identify and fix issues.
Typically the primary on-call engineer at the time, the SME should be able to diagnose common problems with a given service, rapidly fix the issues found during an incident and concisely communicate the following in a report:
Condition: What is the current state of the service? Is it healthy or not?
Actions: What actions need to be taken if the service is not in a healthy state?
Needs: What support does the resolver need to perform an action?
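The three-part report above maps naturally onto a small record type. The following is a minimal sketch, assuming a team wants SMEs to post reports in a consistent shape; `CanReport` and its field names are hypothetical, chosen only to mirror the Condition, Actions and Needs items.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CanReport:
    """Hypothetical container for an SME's Condition / Actions / Needs report."""
    service: str
    healthy: bool
    condition: str                                     # current state of the service
    actions: List[str] = field(default_factory=list)   # actions to take if unhealthy
    needs: List[str] = field(default_factory=list)     # support the resolver needs

    def summary(self) -> str:
        # Render the report as three short lines the IC can read at a glance.
        state = "healthy" if self.healthy else "unhealthy"
        return "\n".join([
            f"Condition ({self.service}, {state}): {self.condition}",
            "Actions: " + ("; ".join(self.actions) or "none"),
            "Needs: " + ("; ".join(self.needs) or "none"),
        ])


# Example with made-up values: an SME reporting on a hypothetical service.
report = CanReport(
    service="checkout-api",
    healthy=False,
    condition="error rate elevated since the last deploy",
    actions=["roll back the last deploy"],
    needs=["approval from the IC to roll back"],
)
print(report.summary())
```

Keeping the report to these three fields encourages the concise communication the role calls for: the IC hears the state, the plan and the blockers, and nothing else.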
The Customer Liaison
Those of us in the technical weeds have a tendency to overlook the impact our work has on customers and our organization’s bottom line. The customer liaison is an important member of the team, as he or she is responsible for interacting with external stakeholders to keep them up to date on any IT incidents that may impact them. The customer liaison is typically a member of your customer support team and handles all outward communication, including one-on-one interactions, publicly facing updates on Twitter or StatusPage and the external message from the completed post-mortem. He or she also should relay any customer feedback to the IC throughout the incident response process.
At a minimum, these are the roles that are vital to swift incident resolution. In future articles, I will break down details of the specific training involved for each of these roles, as well as specific examples of how I’ve seen them work together to resolve IT issues.
About the Author / Eric Sigler
Eric Sigler is the Head of DevOps at PagerDuty, helping protect its customers from the pains of downtime. Before his current role, Eric led infrastructure teams at Minted, Expensify, and the Missouri University of Science and Technology. Connect with him on Twitter.