When IT Disaster Strikes, Part 3: Conducting a Blameless Post-Mortem

In the first and second parts of this three-part series, we looked at how organizations can effectively resolve incidents and the role of each person involved in the resolution process. In this final piece, we’ll look at what organizations must do post-incident to better understand what went wrong and learn from their response.

At the time of occurrence, an outage is every IT practitioner’s worst nightmare. However, after the dust settles, each incident presents an opportunity to learn as a team, make an impact on your processes and improve your product. This means that even after you’ve found a resolution, your work isn’t quite done.

Enter the Incident Post-Mortem

Once you’ve successfully fixed a major issue and the chaos of the firefight is winding down, it’s time to conduct an incident post-mortem with your team to investigate the problem and figure out what went wrong. Without a post-mortem, you fail to recognize what you’re doing right, where you could improve and, most importantly, how to avoid making the same mistakes in the future.

Here are my three tried-and-true steps for a successful post-mortem process:

Designate a Post-Mortem Owner

To kick off the post-mortem, the incident commander should assign an owner at the very end of an incident. This owner is responsible for scheduling the post-mortem meeting, which should take place within five business days of an incident. Owners are also responsible for populating the post-mortem document, investigating the incident, pulling in other people to assist and facilitating the post-mortem meeting. In cases where a public blog post is required, the owner will also be responsible for creating the content and putting it through necessary internal review cycles.
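
To make the five-business-day window concrete, here is a minimal sketch of how an owner might work out the latest date for the meeting. The function name postmortem_deadline is hypothetical, and the sketch assumes a Monday-to-Friday work week with no public holidays, so adjust it to your own calendar.

```python
from datetime import date, timedelta


def postmortem_deadline(incident_end: date, business_days: int = 5) -> date:
    """Return the date that falls `business_days` working days after `incident_end`.

    Assumption: weekends are Saturday and Sunday, and holidays are ignored.
    """
    current = incident_end
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current


# Example: an incident that ends on Friday, May 3, 2024
print(postmortem_deadline(date(2024, 5, 3)))
```

Counting from the day after the incident ends keeps the rule simple: an incident resolved on a Friday gets the whole following work week.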

Populate the Post-Mortem Document and Conduct Analysis

Once an owner has been designated, the team should get to work updating the post-mortem document with available information, including the timeline of status changes and key actions taken by incident responders. Every participant in the incident response should be included on this page, so be sure to go back through all communication and add relevant parties. After you’ve updated the page with the full timeline of actions, you will need to build out details for each event. For each item in the timeline, identify the metric or third-party page the data came from. This could be anything from a link to a Datadog graph or Splunk search to a relevant tweet, as long as it shows the data point you’re trying to illustrate.
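
If your team keeps the timeline in a structured form, a small script can flag entries that still lack a data source and list everyone who took part. The sketch below is only an illustration: the TimelineEntry fields, the helper names and the sample data (including the example.com link) are all hypothetical and do not come from any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class TimelineEntry:
    """One event in the incident timeline (field names are illustrative)."""
    timestamp: datetime
    actor: str          # responder who took or observed the action
    description: str    # what happened or what was done
    evidence_links: List[str] = field(default_factory=list)  # graphs, searches, tweets


def entries_missing_evidence(timeline: List[TimelineEntry]) -> List[TimelineEntry]:
    """Return entries that do not yet cite any supporting data source."""
    return [entry for entry in timeline if not entry.evidence_links]


def responders_involved(timeline: List[TimelineEntry]) -> List[str]:
    """Return the unique actors named in the timeline, sorted alphabetically."""
    return sorted({entry.actor for entry in timeline})


# Hypothetical sample data for illustration only
timeline = [
    TimelineEntry(
        datetime(2024, 5, 1, 9, 0),
        "alice",
        "Alert fired for elevated error rate",
        ["https://example.com/graphs/error-rate"],
    ),
    TimelineEntry(datetime(2024, 5, 1, 9, 12), "bob", "Rolled back the latest deploy"),
]

for entry in entries_missing_evidence(timeline):
    print(f"Needs a data source: {entry.timestamp:%H:%M} {entry.description}")
print("Responders:", ", ".join(responders_involved(timeline)))
```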

With the timeline in place, you can begin analyzing the incident. Capture all available data about what happened, what caused the issue, how many customers were affected, what impact they experienced, etc., to determine the underlying cause of the incident. This information should be detailed, as it will dictate the follow-up action tickets your team will use to prevent recurrence of the incident and improve the incident response process. It will also be examined during the post-mortem meeting, so the more context you can include in your analysis, the better.

In the case of severe incidents where external communication is required, this is also the time when the post-mortem owner and team should develop customer communications. Pro tip: Avoid using the word “outage” if possible in this communication. Many people see the word outage and assume that your entire system was down, when in reality that was likely not the case. Be precise here, as it is important for both internal learning and sharing within the larger IT community.

Here is a general template for what to include on the post-mortem page:

  • Overview
  • What Happened
  • Root Cause (or Contributing Factors)
  • Resolution
  • Impact
  • Responders Involved
  • Timeline
  • How’d We Do?
  • Action Items
  • Communication (Internal and External)
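
As a rough sketch, the template above can be kept as a list so that a blank page can be generated for each incident and checked for empty sections before the meeting. The section names are taken from the list; the function names and the plain-text layout are my own assumptions, not a prescribed format.

```python
from typing import Dict, List

# Section names from the template above
SECTIONS = [
    "Overview",
    "What Happened",
    "Root Cause (or Contributing Factors)",
    "Resolution",
    "Impact",
    "Responders Involved",
    "Timeline",
    "How’d We Do?",
    "Action Items",
    "Communication (Internal and External)",
]


def blank_document() -> Dict[str, str]:
    """Return a post-mortem page with every section present but empty."""
    return {section: "" for section in SECTIONS}


def unfilled_sections(document: Dict[str, str]) -> List[str]:
    """Return template sections that are missing or contain only whitespace."""
    return [s for s in SECTIONS if not document.get(s, "").strip()]


def render(title: str, document: Dict[str, str]) -> str:
    """Render the page as plain text, one heading per section in template order."""
    lines = [title, ""]
    for section in SECTIONS:
        lines.extend([section, document.get(section, "").strip() or "(to be completed)", ""])
    return "\n".join(lines)


page = blank_document()
page["Overview"] = "Elevated error rate on the checkout service."  # hypothetical content
print(render("Post-Mortem: checkout errors", page))
print("Still empty:", unfilled_sections(page))
```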

Hold a Post-Mortem Debrief Meeting

The final step in the process, the incident post-mortem meeting, should include the incident commander, service owners, key engineers/responders and the customer liaison (for severe incidents). Use your post-mortem document as an agenda to help things run smoothly. The discussion should center on what happened, what the team could have done better and any follow-up actions that need to be taken. Any disagreements about the facts, analysis or recommended actions should be cleared up during this time, and everyone should walk away from the meeting knowing more about the service involved and with a better sense of the problems affecting its reliability.

The No. 1 rule of running an incident post-mortem is to keep it blameless. The goal of the debriefing process is not to point fingers, but to learn what happened and how you can improve as a team.

Don’t make the mistake of neglecting the post-mortem process after a major incident. A well-designed, blameless post-mortem empowers your team to continuously learn and offers a way to iteratively improve your infrastructure and incident response process. Not only does this help your team, but it also helps create amazing software and services.

Eric Sigler

Eric Sigler is the Head of DevOps at PagerDuty, helping protect its customers from the pains of downtime. Before his current role, Eric led infrastructure teams at Minted, Expensify, and the Missouri University of Science and Technology. Connect with him on Twitter.
