A survey of 1,000 IT operations, DevOps, site reliability engineering (SRE) and platform engineering professionals in the U.S. conducted by Transposit, a provider of an incident management platform, found more than two-thirds (67%) have seen an increase in the frequency of service incidents that have affected their customers over the past 12 months.
Announced at the KubeCon + CloudNativeCon conference today, the survey found 62% of respondents have also seen an increase in the amount of time it takes to resolve incidents over the course of the last year, with 80% reporting it takes up to six hours on average to resolve an incident, from the first alert to mitigation of the issue.
Paradoxically, 71% of respondents also claim to have automated incident management to their satisfaction, with 59% having a defined incident management process in place. One-third (33%) said only 11-25% of their incident management tasks or workflows are automated.
Transposit CEO Divanny Lamas said that despite that level of comfort, it’s clear the number of incidents that need to be addressed is continuing to increase as IT environments become more complex. In fact, 72% of respondents expect their organization to expand its IT stack in the next 12 months.
At the same time, costs per incident are rising, with 63% of respondents saying their organization is at risk of losing up to $499,999 per hour on average. Nearly half (47%) said downtime can cost anywhere from $100,000 to $2 million.
A full 85% of respondents are hopeful that generative artificial intelligence (AI) will help further streamline incident management processes, with 80% having already embraced AI to varying degrees. More than half (51%) feel AI is making their job better, with 65% using it to improve the accuracy and quality of data. Just over half (51%) reported faster time to incident resolution, with 50% using it to more quickly and easily identify the root cause of issues, potential threats and vulnerabilities. Just under half (48%) use it to automate repetitive tasks or processes, streamlining their operations effectively.
A full 90% of respondents said integrating generative AI capabilities into incident management tools or platforms decreased the time it takes to create new automations.
Overall, 43% reported current incident management processes are not effective or are only being used by some team members due to confusing documentation (41%), limited access to tools (40%) and reliance on institutional knowledge (40%). More than a third (37%) reported that only select team members have a comprehensive understanding of the defined incident management process and adhere to it consistently.
A full 96% said it would be beneficial if all the tools their organization uses to manage incidents were integrated through one tool or platform.
Top challenges cited included not enough buy-in from leadership or management (57%), not enough sharing of knowledge (54%), inadequate documentation of institutional knowledge and existing processes (54%) and a lack of clarity about what to automate (52%).
Respondents reported they expected to embrace technologies including AI or tools that employ machine learning algorithms (60%), automation tools or applications (53%) and communication/collaboration tools or applications (48%).
In addition, over the past year, 62% of respondents have increased their focus on SRE practices, with 58% planning to increase adoption of platform engineering as a methodology for centralizing the management of DevOps workflows.
Ultimately, automation should reduce incident management stress for all concerned, said Lamas. The challenge, as always, is making it simpler to achieve that goal in a way that doesn’t rely on a complex mix of custom scripts that don’t scale, she added.
One way or another, it’s clear that when it comes to incident management, there is still plenty of room for improvement.