SREs Say There's Plenty of Room to Improve Incident Management

A global survey of 423 site reliability engineers (SREs) found diagnosing issues is the most difficult aspect of incident management for more than half of respondents (53%).

Conducted by Catchpoint, a provider of a network monitoring platform, the survey found SREs respond to hundreds of ticketed (84%) and non-ticketed incidents (71%) per month. However, 42% also admitted their organization is not spending as much time as it should learning from major incidents. Nearly half (47%) said learning from incidents has the most room for improvement among overall incident management activities. More than half of respondents (55%) are included in on-call rotations, but only 42% of SREs actually lead post-incident efforts.

Leo Vasiliou, director of product marketing for Catchpoint, said the survey makes it clear that SREs still don’t have enough visibility into their IT environments. In fact, 64% of respondents said they should be monitoring endpoints that, despite being beyond their control, could impact application environments.

Overall, two-thirds of respondents (66%) are using between two and five monitoring or observability tools to manage their IT environments. The most widely tracked metrics are uptime/availability (78%), followed by performance/response times (71%), latency (64%) and error rates (64%). A full 81% of respondents had two or more types of telemetry feeding their observability frameworks, with 43% having four or more.

Just under a quarter (24%), however, also acknowledged their organization breached a contractual service level agreement (SLA) in the last 12 months.

On the plus side, just over half (53%) anticipated advances in artificial intelligence (AI) would make their work easier, with only 4% believing AI will replace them.

There is no doubt organizations of all sizes are more dependent on IT than ever, but as application environments become increasingly complex, the probability an IT issue will disrupt business processes only increases. Hopefully, AI will soon make it simpler to respond to incidents by, for example, providing incident summaries that make it easier to onboard new members to the response team as an incident escalates.

In the meantime, Murphy’s Law will continue to dictate that if something can go wrong, it will. At this point, it’s not possible to prevent every incident, so the focus needs to be on discovery and containment. The only way to achieve that goal is to provide IT teams with the tools needed to mitigate issues before they become catastrophic events. That also requires having an incident management plan in place that enables IT teams to respond adroitly rather than, for example, wasting time trying to figure out who on the team might be needed to address an incident.

Clearly, SREs can play a bigger role in incident management. Most of their focus is rightly on preventing issues from occurring in the first place. After all, an ounce of prevention is still worth more than a pound of cure. The challenge, as always, is finding the time needed to prevent issues when so much of it is consumed resolving incidents that, unfortunately, often continue to occur at a pace too fast and furious for IT teams to manage effectively.