What Does AIOps Mean for SREs?

It seems SREs are of two minds when it comes to AIOps. On one hand, AIOps’ potential is pretty exciting. By automating complex workflows and troubleshooting processes, AIOps could make your life as an SRE much easier.

But on the other hand, some SREs may choose to view AIOps with disdain and distrust. They might think of AIOps as just another trendy buzzword that doesn’t live up to the hype and that can become a distraction from the SRE tools that really matter.

Which perspective is right? Should SREs embrace AIOps with open arms, or should they resist marketers’ efforts to position AIOps as the latest, greatest tooling innovation in the IT industry?

Those are subjective questions that we can’t answer definitively, but let’s at least gain some perspective by examining what AIOps means for SREs.

What is AIOps?

If you keep up with IT buzzwords, you’ve probably at least heard the term AIOps. It’s short for artificial intelligence for IT operations, and it refers to the use of AI and machine learning (ML) to help automate IT operations workflows.

The big idea behind AIOps is that, by using AI and ML to perform advanced analysis of large volumes of data from IT systems, IT and SRE teams can solve complex problems more efficiently than they could using a traditional, manual approach.

AIOps can, for example, help to surface the root cause of a performance issue in a complex, multilayered environment like Kubernetes. Or it could make recommendations about how best to resolve an incident.

AIOps entered the IT lexicon in 2016, when Gartner coined the term. Now, six years later, it’s a relatively well-established category of tooling.

How SREs View AIOps

Although AIOps has been around for some time, few SREs appear to have bought into the AIOps revolution. A 2021 Catchpoint survey found that just 7.5% of SREs reported that AIOps tools delivered high value to their organizations.

It’s unclear exactly why enthusiasm for AIOps is so low, but we’d speculate that a few key factors are at play:

  • AIOps is a new term for an old idea: Many monitoring and observability tools have included at least basic AI and ML analytics features for a long time, even before tool vendors slapped the AIOps label on their products. SREs probably realize this and, to some extent, view AIOps as an effort by marketers to rebrand functionality that is not actually new.
  • AIOps is hard to implement: Setting up an AIOps tool requires integrating it with diverse data sources and customizing it to fit your workflows and environment. It’s possible some SREs view this setup work as requiring more effort than it’s worth.
  • AIOps can’t replace human insight: While AIOps tools may be useful to a point, it would be unwise to place blind trust in AIOps-based analyses or recommendations. For this reason, some SREs may believe that AIOps encourages organizations to rely too heavily on automated tools at the cost of the expert analysis and perspective that only SREs can provide. (Similarly, it sometimes makes sense to prioritize human intuition and expertise over playbooks.)

From an SRE’s perspective, then, AIOps may appear over-hyped, overly complicated and underperforming compared to traditional approaches to SRE.

What SREs Can Gain From AIOps

SREs’ wariness toward AIOps is valid—but only to a point. It’s important not to let suspicions about the limitations of AIOps turn into excuses not to use AIOps at all. AIOps has some value to offer to SREs, even if it’s not perfect.

For example, AIOps can play a role in reducing toil. To the extent that AIOps tools can recognize complex patterns or discover interrelated data sets more quickly than human engineers, AIOps reduces the time SREs have to spend manually troubleshooting problems or poring over complicated information.

AIOps also helps to enable a more proactive approach to monitoring and incident management. If AIOps tools can alert SREs to emerging issues earlier than the SREs would otherwise notice them, the SREs can get ahead of those problems before they turn into true incidents. That’s better for SREs and end users alike.
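To make the idea of catching an emerging issue a little more concrete, here is a minimal sketch of one simple statistical technique: flagging a metric sample that strays far from its recent baseline (a rolling z-score). This is only an illustration, not how any particular AIOps product works. The function name `detect_anomalies`, the `window` and `threshold` parameters, and the latency figures are all made up for the example.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(samples, window=5, threshold=3.0):
    """Return the indexes of samples that deviate sharply from the recent baseline.

    Hypothetical helper for illustration only: each sample is compared with the
    mean and standard deviation of the `window` samples that came before it.
    """
    recent = deque(maxlen=window)
    flagged = []
    for index, value in enumerate(samples):
        # Only score a sample once a full window of history is available.
        if len(recent) == window:
            baseline = mean(recent)
            spread = stdev(recent)
            # Skip scoring when the baseline is perfectly flat (spread of zero).
            if spread > 0 and abs(value - baseline) / spread > threshold:
                flagged.append(index)
        recent.append(value)
    return flagged


# Made-up request latencies in milliseconds, with one obvious spike.
latency_ms = [100, 102, 98, 101, 99, 100, 180, 101, 99]
print(detect_anomalies(latency_ms))  # prints [6]
```

Note that the spike itself joins the baseline after it is scored, which temporarily widens the spread. Real tools would likely handle that, along with seasonality and noisy metrics, far more carefully.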

Arguably, AIOps can also help SREs do more with fewer engineering resources. If you can use AI to automate some aspects of monitoring and incident response, you can maintain the same levels of availability and performance with fewer human engineers on hand.

Conclusion: AI Won’t Replace SREs, But It Can Help

None of the above is to say that AIOps can replace SREs, or that it magically solves every problem SREs face. Anyone who believes AIOps is a silver bullet has bought into the marketing hype to an unhealthy degree.

Nonetheless, AIOps tools do offer value to SREs by making their jobs easier in some respects and improving reliability outcomes.

So, while it’s wise to maintain a healthy perspective about the limitations of AIOps, SREs shouldn’t rule out AIOps tools as one way to improve reliability engineering.

JJ Tang

JJ is the co-founder of Rootly (YC S21), a Slack-native incident management solution. He is based in Toronto, Canada, and previously led product at Instacart and IBM. He is obsessed with developer productivity, F1, and his adopted dog.
