In the first and second parts of this three-part series, we looked at how organizations can effectively resolve incidents and the role of each person involved in the resolution process. In this final piece, we’ll look at what organizations must do post-incident to better understand what went wrong and learn from their response.
At the time of occurrence, an outage is every IT practitioner’s worst nightmare. However, after the dust settles, each incident presents an opportunity to learn as a team, make an impact on your processes and improve your product. This means that even after you’ve found a resolution, your work isn’t quite done.
Enter the Incident Post-Mortem
Once you’ve successfully fixed a major issue and the chaos of the firefight is winding down, it’s time to conduct an incident post-mortem with your team to investigate the problem and figure out what went wrong. Without a post-mortem, you fail to recognize what you’re doing right, where you could improve and, most importantly, how to avoid making the same mistakes in the future.
Here are my three tried and true steps for a successful post-mortem process:
Designate a Post-Mortem Owner
To kick off the post-mortem, the incident commander should assign an owner at the very end of an incident. This owner is responsible for scheduling the post-mortem meeting, which should take place within five business days of an incident. Owners are also responsible for populating the post-mortem document, investigating the incident, pulling in other people to assist and facilitating the post-mortem meeting. In cases where a public blog post is required, the owner will also be responsible for creating the content and putting it through necessary internal review cycles.
Populate the Post-Mortem Document and Conduct Analysis
Once an owner has been designated, the team should get to work updating the post-mortem document with available information, including the timeline of status changes and key actions taken by incident responders. Every participant in the incident response should be included on this page, so be sure to go back through all communication and add the relevant parties. After you’ve updated the page with the full timeline of actions, you will need to build out details for each event. For each item in the timeline, identify a metric or a third-party page where the data came from. This could be anything from a Datadog graph or Splunk search link to a relevant tweet, as long as it shows the data point you’re trying to illustrate.
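The timeline entries described above can be sketched as a simple data structure. This is a minimal illustration, not a prescribed schema; the field names and sample values are assumptions you would adapt to your own tooling:

```python
from dataclasses import dataclass

@dataclass
class TimelineEntry:
    # All field names are illustrative; adapt them to your own tooling.
    timestamp: str    # when the event occurred (e.g., ISO 8601)
    description: str  # the status change or responder action
    responder: str    # who took the action
    source_link: str  # metric or third-party page backing the entry,
                      # e.g., a Datadog graph, Splunk search, or tweet

# Hypothetical example entry with placeholder values.
entries = [
    TimelineEntry(
        timestamp="2024-03-01T14:05:00Z",
        description="Error rate on checkout service exceeded alert threshold",
        responder="on-call engineer",
        source_link="https://example.com/datadog/graph/123",
    ),
]

# Every item in the timeline should point at the data that backs it up.
assert all(entry.source_link for entry in entries)
```

Keeping a source link on every entry makes the later analysis step easier, since no one has to hunt down the evidence for each event during the debrief.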
With the timeline in place, you can begin analyzing the incident. Capture all available data about what happened, what caused the issue, how many customers were affected, what impact they experienced, etc., to determine the underlying cause of the incident. This information should be detailed, as it will dictate follow-up action tickets your team will use to prevent recurrence of the incident and improve the incident response process. It also will be examined during the post-mortem meeting, so the more context you can include in your analysis, the better.
In the case of severe incidents where external communication is required, this is also the time when the post-mortem owner and team should develop customer communications. Pro tip: Avoid using the word “outage” if possible in this communication. Many people see the word outage and assume that your entire system was down, when in reality that was likely not the case. Be precise here, as it is important for both internal learning and sharing within the larger IT community.
Here is a general template for what to include on the post-mortem page:
- Overview
- What Happened
- Root Cause (or Contributing Factors)
- Resolution
- Impact
- Responders Involved
- Timeline
- How’d We Do?
- Action Items
- Communication (Internal and External)
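As a sketch, the template above can be stamped out as an empty document skeleton so every post-mortem starts from the same structure. The section names come straight from the list; the function itself is hypothetical:

```python
# Section names taken from the post-mortem template above.
SECTIONS = [
    "Overview",
    "What Happened",
    "Root Cause (or Contributing Factors)",
    "Resolution",
    "Impact",
    "Responders Involved",
    "Timeline",
    "How'd We Do?",
    "Action Items",
    "Communication (Internal and External)",
]

def postmortem_skeleton(title: str) -> str:
    """Build an empty Markdown post-mortem page from the template."""
    lines = [f"# Post-Mortem: {title}", ""]
    for section in SECTIONS:
        lines += [f"## {section}", "", "_TBD_", ""]
    return "\n".join(lines)

print(postmortem_skeleton("Checkout latency spike"))
```

Generating the page this way keeps the structure consistent across incidents, so responders always know where to record the timeline, impact, and action items.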
Hold a Post-Mortem Debrief Meeting
The final step in the process, the incident post-mortem meeting, should include the incident commander, service owners, key engineers/responders and the customer liaison (for severe incidents). Use your post-mortem document as an agenda to help things run smoothly. The discussion should center around what happened, what the team could have done better and any follow-up actions that need to be taken. Any disagreements about the facts, analysis or recommended actions should be cleared up during this time, and everyone should walk away from the meeting with a better sense of the problems impacting your service’s reliability and a deeper understanding of the service involved.
The No. 1 rule of running an incident post-mortem is to keep it blameless. The goal of the debriefing process is not to point fingers, but to learn what happened and how you can improve as a team.
Don’t make the mistake of neglecting the post-mortem process after a major incident. A well-designed, blameless post-mortem empowers your team to continuously learn and offers a way to iteratively improve your infrastructure and incident response process. Not only does this help your team, but it also helps create amazing software and services.