What can you expect when investing in artificial intelligence for IT operations (AIOps)? Real-time visibility across huge volumes of information. Lightning-fast event correlation and anomaly detection. Automated remediation and self-healing, without Ops personnel having to lift a finger.
It all sounds amazing—at least, if you believe the marketing hype. But do those kinds of AIOps benefits actually translate to the real world? According to a recent survey of nearly 300 site reliability engineers (SREs), not so much.
Among SREs surveyed—including engineers for some of the world’s largest, most innovative digital companies—just 12% use AIOps as a day-to-day part of their monitoring toolkit. Nearly 40% never use it at all.
Why aren’t current AIOps solutions living up to the hype? And what can this gap tell us about the state of IT monitoring in general? Let’s take a closer look.
Sprawling Complexity
In theory, AIOps should be a perfect fit for SREs. These are, after all, the people with one foot in the world of operations, one in development, whose job is to make distributed software systems ultra-scalable and reliable. A solution that can turn mountains of data about those systems into actionable insights should be exactly what SREs need. And, in fact, 53% of SREs cite lack of unified visibility across the stack as their biggest cloud application monitoring challenge.
And yet, if we gauge by utilization, today’s AIOps tools don’t seem to be delivering on that promise. Asked about the value of AIOps, just 7.5% of SREs, on average, view it as “high-value,” fewer even than the 9% who perceive its value as low.
Why is AIOps so slow to catch on? Ultimately, the barrier facing these tools is the same one facing human engineers: massive and growing complexity in IT environments. As digital products become more dependent on third-party cloud services, and as the number of things businesses want to track grows (from infrastructure to application to experience), the sheer volume, velocity and variety of monitoring data have exploded.
Diverse Data Sources
Compounding the problem, enterprises increasingly rely on multiple “same-service” providers for IT services. That is, they use multiple cloud providers, multiple DNS providers, multiple API providers, etc. There are sound business reasons for doing so, such as adding resiliency and drawing on different vendors’ strengths in different areas. But even when two providers are doing basically the same thing, they use different interfaces and instrumentation, and their data sources often employ different metrics, data structures, and taxonomies.
Whether you’re asking a human being or an AI-driven tool to solve this problem, this heterogeneity makes it extremely difficult to visualize the complete picture across the infrastructure. It also creates gray areas around how best to take advantage of each vendor’s different rules and toolsets. All of which makes automating anything beyond the most superficial operations tasks hugely challenging, if it’s possible at all.
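To make the heterogeneity problem concrete, here is a minimal sketch of what normalizing two “same-service” providers can look like. Everything in it is an assumption for illustration: the two DNS providers, their field names (resp_time_s, latencyMicros and so on) and the common DnsLatencySample schema are hypothetical, not any real vendor’s API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DnsLatencySample:
    """Hypothetical common schema: one DNS latency observation (ms, UTC)."""
    provider: str
    zone: str
    latency_ms: float
    observed_at: datetime


def from_provider_a(record: dict) -> DnsLatencySample:
    # Assumed shape for provider A: latency in seconds, epoch timestamp,
    # zone under the key "domain" (possibly with a trailing dot).
    return DnsLatencySample(
        provider="provider_a",
        zone=record["domain"].rstrip(".").lower(),
        latency_ms=record["resp_time_s"] * 1000.0,
        observed_at=datetime.fromtimestamp(record["ts"], tz=timezone.utc),
    )


def from_provider_b(record: dict) -> DnsLatencySample:
    # Assumed shape for provider B: latency in microseconds, ISO 8601
    # timestamp with an explicit offset, zone nested under "zone".
    return DnsLatencySample(
        provider="provider_b",
        zone=record["zone"]["name"].rstrip(".").lower(),
        latency_ms=record["latencyMicros"] / 1000.0,
        observed_at=datetime.fromisoformat(record["observedAt"]),
    )
```

Two adapters are trivial. The difficulty described above comes from maintaining dozens of them, each tracking a vendor’s changing units, names and taxonomies, before any analysis can even begin.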
Optimizing AIOps
AIOps really is the solution to this problem, or at least an important part of it. Machine learning is, after all, tailor-made to analyze large volumes of observational and engagement data and react in real time, as well as to continually identify opportunities for improvement. But before we can get there, we need to change the way we think about AIOps.
Don’t try to do everything at once.
It would be great if we could deploy AIOps, let it learn about our environment and watch it magically start remediating issues on its own. But we’re not there yet. (And if a vendor tries to convince you we are, you should definitely be skeptical.) If we break down our expectations into smaller, more manageable chunks, however, AIOps can accomplish quite a lot.
Event correlation, topology analytics and smart alerting, for example, are all areas where AIOps can provide significant value right now. For these more limited objectives, AIOps can absolutely reduce tedious manual effort for human operators, reduce mean time to repair, improve visibility and more, and digital businesses should be leaning on AIOps to do that.
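As a rough illustration of why event correlation is a tractable starting point, the sketch below groups alerts that hit the same resource within a short time window, so an operator sees one incident instead of several pages. It is a deliberately simple stand-in, not how any particular AIOps product works; the Alert fields and the 120-second window are assumptions chosen for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Alert:
    source: str       # monitoring tool that raised the alert (hypothetical)
    resource: str     # affected component, e.g. a host or service name
    timestamp: float  # seconds since epoch


def correlate(alerts: list[Alert], window_s: float = 120.0) -> list[list[Alert]]:
    """Group alerts on the same resource when each one arrives within
    window_s seconds of the previous alert in that group."""
    groups: list[list[Alert]] = []
    open_group: dict[str, list[Alert]] = {}  # latest group per resource
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = open_group.get(alert.resource)
        if current and alert.timestamp - current[-1].timestamp <= window_s:
            current.append(alert)
        else:
            # Gap too large (or first alert for this resource): new incident.
            current = [alert]
            open_group[alert.resource] = current
            groups.append(current)
    return groups
```

Real tools add topology, text similarity and learned models on top of this kind of grouping, but even the naive version shows how a narrow objective can cut alert noise without requiring full automation.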
Invest in AI and ML training.
Adoption of AIOps tools remains sluggish partly because many teams still don’t fully understand how they can use them. The more SREs are exposed to AI/ML training, the more opportunities they will identify for concrete, meaningful applications of AIOps. Yet only 27% of SREs said that “experimenting or receiving training to expand knowledge or skills” was a major activity in their job.
Embrace “Platform Ops.”
The biggest step we can take is to rethink how we use AIOps within an organization. Today, AIOps tools, like SRE resources themselves, are often dedicated to a specific digital product or service. As a result, enterprises end up trying to apply their AIOps tools, and the SREs using them, to support everything associated with that product (DNS, API, CDN, cloud monitoring, etc.). As IT environments continue to get even more complex, though, businesses will likely find that shifting to shared platform tools and teams yields better results.
Instead of trying to apply AIOps across every part of a digital service, imagine applying it to one part of that service (for instance, focusing on just DNS, or just APIs). In this way, the AIOps tools, as well as the teams using them, gain in-depth expertise across all the different interfaces and instruments used for that function and get continually better at normalizing multi-provider data sources across them. At that point, this platform-specific AIOps capability can then be scaled and reused across the company—providing more advanced DNS AIOps (or API, or CDN, etc.) for multiple digital products and teams.
Now, you’ve got an advanced, highly specialized toolset for optimizing an important platform function, regardless of which digital product happens to be using it. You’ve narrowed the scope of how you’re using a given AIOps implementation while going much deeper into that function than a more generic, broad-based implementation could. All of a sudden, even more advanced automation and self-healing become possible. And the loftiest marketing claims of AIOps start to look less like hype and more like viable real-world outcomes for your business.