Living in a data-rich world is a blessing and a curse. Flexible monitoring systems, open APIs, and easy data visualization resources make it simple to graph anything you want, but too much data quickly becomes noisy and un-actionable. At PagerDuty we’ve thought hard about what you should monitor and why from a systems perspective, but what about monitoring data on your operations performance? We’d like to share some specific metrics and guidelines that help teams measure and improve their operational performance.
- Raw Incident Count
A spike or continuous upward trend in the number of incidents a team receives tells you one of two things: either that team’s infrastructure has a serious problem, or their monitoring tools are misconfigured and need adjustment.
Incident counts may rise as an organization grows, but real incidents per responder should stay constant or move downward as the organization identifies and fixes low-quality alerts, builds runbooks, automates common fixes, and becomes more operationally mature.
When looking at incidents, it’s important to break them down by team or service, and then drill into the underlying incidents to understand what is causing problems. Was that spike on Wednesday due to a failed deploy that caused issues across multiple teams, or just a flapping monitoring system on a low-severity service? Comparing incident counts across services and teams also helps to put your numbers in context, so you understand whether a particular incident load is better or worse than the organization average.
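As a rough illustration of the breakdown described above, here is a minimal Python sketch that groups incidents by service and compares each count to the average across services. The record fields (`service`, `team`) and the sample data are assumptions made up for this example, not an actual PagerDuty export format.

```python
from collections import Counter
from statistics import mean

# Hypothetical incident records; field names are illustrative only.
incidents = [
    {"id": 1, "service": "checkout", "team": "payments"},
    {"id": 2, "service": "checkout", "team": "payments"},
    {"id": 3, "service": "search", "team": "discovery"},
    {"id": 4, "service": "checkout", "team": "payments"},
    {"id": 5, "service": "email", "team": "growth"},
]


def counts_by(records, key):
    """Count incidents grouped by the given field (e.g. 'service' or 'team')."""
    return Counter(r[key] for r in records)


def compare_to_average(counts):
    """Map each group to (count, ratio of that count to the average count)."""
    avg = mean(counts.values())
    return {group: (n, n / avg) for group, n in counts.items()}


by_service = counts_by(incidents, "service")
report = compare_to_average(by_service)

# Print the noisiest services first.
for service, (n, ratio) in sorted(report.items(), key=lambda kv: -kv[1][0]):
    print(f"{service}: {n} incidents ({ratio:.2f}x the average)")
```

The same `counts_by` helper works with `"team"` as the key. Dividing a team's count by its number of responders would give the per-responder figure discussed earlier.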
- Mean Time to Resolution (MTTR)
Time to resolution is the gold standard for operational readiness. When an incident occurs, how long does it take your team to fix it? Downtime not only hurts your revenue but also customer loyalty, so it’s critical to make sure your team can react quickly to all incidents.
While resolution time is important to track, it’s often hard to benchmark: companies will see variances in TTR based on the complexity of their environment, the way teams and infrastructure responsibility are organized, industry, and other factors. However, standardized runbooks, infrastructure automation, and reliable alerting and escalation policies will all help drive this number down.
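For teams that want to compute this number themselves, the sketch below shows one way to do it in Python. It assumes each incident record carries ISO 8601 `created_at` and `resolved_at` timestamps. Those field names and the sample data are hypothetical. Incidents that are still open are skipped rather than counted as zero.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; timestamps are ISO 8601 strings.
incidents = [
    {"service": "checkout", "created_at": "2024-01-03T10:00:00",
     "resolved_at": "2024-01-03T10:30:00"},
    {"service": "checkout", "created_at": "2024-01-04T08:00:00",
     "resolved_at": "2024-01-04T09:30:00"},
    {"service": "search", "created_at": "2024-01-05T12:00:00",
     "resolved_at": None},  # still open, so excluded from the mean
]


def resolution_minutes(incident):
    """Return minutes from creation to resolution, or None if unresolved."""
    if incident["resolved_at"] is None:
        return None
    created = datetime.fromisoformat(incident["created_at"])
    resolved = datetime.fromisoformat(incident["resolved_at"])
    return (resolved - created).total_seconds() / 60


durations = [m for m in map(resolution_minutes, incidents) if m is not None]
mttr_minutes = mean(durations)
print(f"MTTR: {mttr_minutes:.1f} minutes over {len(durations)} resolved incidents")
```

Because a single long outage can dominate a mean, it may also be worth looking at the median or a high percentile alongside it.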
- Time to Acknowledgement / Time to Response
This is the metric most teams forget about: the time it takes a team to acknowledge and start work on an incident.
While an incident responder may not always have control over the root cause of a particular incident, one factor they are 100% responsible for is their time to acknowledgement and response. Operationally mature teams have high expectations for their team members’ time to respond, and hold themselves accountable with internal targets on response time.
If you’re using an incident management system like PagerDuty, an escalation timeout is a great way of enforcing a response time target. For example, if you decide that all incidents should be responded to within 5 minutes, then set your timeout to 5 minutes to make sure the next person in line is alerted. To gauge the team’s performance, and determine whether your target needs to be adjusted, you can track the number of incidents that are escalated.
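To make the 5-minute example concrete, here is a small Python sketch that measures how many incidents were acknowledged within the target and how many were escalated. The record shape (`triggered_at`, `acknowledged_at`, `escalated`) is an assumption for illustration, not a real API response.

```python
from datetime import datetime, timedelta

# Response time target from the example in the text.
RESPONSE_TARGET = timedelta(minutes=5)

# Hypothetical incident records; field names are illustrative only.
incidents = [
    {"id": "A", "triggered_at": "2024-01-03T10:00:00",
     "acknowledged_at": "2024-01-03T10:02:00", "escalated": False},
    {"id": "B", "triggered_at": "2024-01-03T10:00:00",
     "acknowledged_at": "2024-01-03T10:07:00", "escalated": True},
    {"id": "C", "triggered_at": "2024-01-03T11:00:00",
     "acknowledged_at": "2024-01-03T11:04:00", "escalated": False},
    {"id": "D", "triggered_at": "2024-01-03T12:00:00",
     "acknowledged_at": None, "escalated": True},  # never acknowledged
]


def time_to_ack(incident):
    """Return the time from trigger to acknowledgement, or None if never acked."""
    if incident["acknowledged_at"] is None:
        return None
    triggered = datetime.fromisoformat(incident["triggered_at"])
    acknowledged = datetime.fromisoformat(incident["acknowledged_at"])
    return acknowledged - triggered


def ack_summary(records, target):
    """Return the share of incidents acked within target and the share escalated."""
    total = len(records)
    if total == 0:
        return {"within_target": 0.0, "escalation_rate": 0.0}
    within = 0
    for record in records:
        tta = time_to_ack(record)
        if tta is not None and tta <= target:
            within += 1
    escalated = sum(1 for record in records if record["escalated"])
    return {"within_target": within / total, "escalation_rate": escalated / total}


summary = ack_summary(incidents, RESPONSE_TARGET)
print(f"Acknowledged within target: {summary['within_target']:.0%}")
print(f"Escalated: {summary['escalation_rate']:.0%}")
```

A persistently high escalation rate under a given target suggests either that the target is too aggressive or that responders need help meeting it.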
- Escalations
For most organizations using an incident management tool, an escalation is an exception – a sign that either a responder wasn’t able to get to an incident in time, or that he or she didn’t have the tools or skills to work on it. While escalation policies are a necessary and valuable part of incident management, teams should generally be trying to drive the number of escalations down over time.
There are some situations in which an escalation will be part of standard operating practice. For example, you might have a NOC, first-tier support team or even auto-remediation tool that triages or escalates incoming incidents based on their content. In this case, you’ll want to track what types of alerts should be escalated, and what normal numbers should look like for those alerts.
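One possible way to track this is to keep a baseline of expected escalations per alert type and flag anything above it. The Python sketch below assumes a hypothetical `alert_type` field and made-up weekly baseline numbers. Each team would need to choose its own categories and thresholds.

```python
from collections import Counter

# Hypothetical baseline: alert types the first tier is expected to escalate,
# with the number of escalations considered normal per week.
expected_weekly = {"database": 4, "network": 2}

# Hypothetical escalations observed this week.
escalations = [
    {"alert_type": "database"},
    {"alert_type": "database"},
    {"alert_type": "database"},
    {"alert_type": "network"},
    {"alert_type": "network"},
    {"alert_type": "network"},
    {"alert_type": "disk"},
]


def unexpected_escalations(records, baseline):
    """Return {alert_type: (observed, expected)} for types above their baseline.

    Alert types missing from the baseline are treated as expecting zero
    escalations, so any escalation of that type is flagged.
    """
    observed = Counter(r["alert_type"] for r in records)
    return {
        alert_type: (count, baseline.get(alert_type, 0))
        for alert_type, count in observed.items()
        if count > baseline.get(alert_type, 0)
    }


flagged = unexpected_escalations(escalations, expected_weekly)
for alert_type, (count, expected) in flagged.items():
    print(f"{alert_type}: {count} escalations (expected at most {expected})")
```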
About the Author
David Shackelford is a product manager at PagerDuty, the leader in operations performance management. David works with teams across the company to plan, build, and ship features that improve operations teams’ quality of life, decrease time to incident resolution, and ultimately improve uptime for PagerDuty customers. Prior to PagerDuty, David worked in education technology, creating integrations between school information systems and digital content, and as a Teach for America corps member, teaching Mathematics in San Francisco public schools. You can follow David on LinkedIn: https://www.linkedin.com/in/dshackelford and Twitter: https://twitter.com/dshack