Call it the GenAI gold rush…or land rush. Industries from healthcare to banking and beyond have been rushing to integrate generative AI (GenAI). And the observability space is no exception. There has been a ton of noise around how large language models (LLMs) are set to transform the observability market. The status quo is less than ideal. With alarms, warnings and a confusing mixture of signals coming in from monitoring software (whether commercial vendors or open-source stacks), site reliability engineers are overwhelmed and suffer from alert fatigue.
If you manage a modern, distributed production system, you might already be looking into ways LLMs can simplify your team’s work. Issue diagnosis and remediation are likely to demand more time and energy from your SREs and DevOps pros in the coming years, a result of the increasingly intricate web of interdependencies across software systems and infrastructure.
LLMs will undoubtedly play a role in this space. But as an industry, we are still learning where they fit best.
This blog specifically explores why LLMs aren’t a strong fit for a critical element of production troubleshooting: root cause analysis. It then suggests how LLMs could be integrated to enhance your overall observability strategy.
Capturing Context — Text vs. Structured Data
LLMs excel at analyzing unstructured text. They learn from vast amounts of text-based training data to identify patterns and make predictions.
So, in theory, if you could supply an LLM with the right text inputs, it would be able to synthesize a vast amount of information about your environment to create high-quality insights. (An example: this microservice is failing — and it is likely because of issue X).
But now consider: What would the right text inputs look like to generate that kind of insight? Broadly, they would fall into the bucket of ‘context’:
- Relational Heuristics: A model of the connections between different layers of your environment (e.g., a particular microservice calls a specific set of APIs)
- Tribal Knowledge: Institutional memory of previous cause-and-effect relationships (e.g., user activity always spikes at certain times of the year).
This type of context lives naturally in a graph representation of your environment. A graph (with nodes and edges) is a compact and useful way of representing structured relationships between the application, API, network and infrastructure layers. It highlights dependencies and captures how the structure of your environment evolves over time.
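To make that concrete, here is a minimal sketch of the kind of structured context a graph captures natively. It uses the networkx library, and the service names, layers and attributes are hypothetical placeholders, not a prescription for how any particular platform models its data:

```python
# A minimal sketch (using networkx) of a dependency graph spanning layers.
# Service names, layers and edge attributes here are hypothetical.
import networkx as nx

env = nx.DiGraph()

# Nodes carry the layer they belong to (application, API, infrastructure, ...).
env.add_node("checkout-service", layer="application")
env.add_node("cache-service", layer="application")
env.add_node("payments-api", layer="api")
env.add_node("web-pod-1", layer="infrastructure")

# Edges capture relational heuristics: who calls whom, and what we know about it.
env.add_edge("checkout-service", "payments-api", relation="calls", p99_latency_ms=120)
env.add_edge("checkout-service", "cache-service", relation="reads_from")
env.add_edge("web-pod-1", "checkout-service", relation="hosts")

# Dependencies of the checkout service are a simple traversal away.
print(list(env.successors("checkout-service")))  # ['payments-api', 'cache-service']
```

Because the relationships are explicit in the structure itself, questions like “what does this service depend on?” are cheap graph traversals rather than facts that have to be spelled out in prose.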
However, the process of converting structured, time-series data into meaningful text inputs for an LLM is far from trivial. It is the biggest bottleneck in using LLMs here: transforming a map of your environment into specific, relevant input that GenAI can use to generate insights beyond the generic (i.e., what you would get if you Googled “why might a microservice fail?”).
For example, let’s imagine that you experienced an outage. (For simplicity, let’s assume this scenario: Overloaded web pods leading to service issues on the checkout page.) The non-obvious root cause here turned out to be a degradation of the cache service. But to get an LLM to reproduce this insight, you need to feed it exhaustive, explicit and up-to-date information on the relationship between your web pods and cache service, and the underlying network configuration governing the cache service.
For complex environments with a high degree of interdependence between APIs, applications, network and infrastructure layers, it is not practical to continuously convert all these highly structured relationships to text input in a way that an LLM can analyze for root cause analysis. (This is exactly the area where graph machine learning shines.)
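Continuing the hypothetical sketch above, here is roughly what that conversion step might look like for a single degraded service. Every fact the model needs must be spelled out explicitly, and the text must be regenerated whenever the environment changes; this illustrative helper only flattens direct neighbors, and real environments would need far more than that:

```python
# A hypothetical sketch of the conversion described above: flattening part of the
# dependency graph (the `env` graph from the earlier sketch) into prose an LLM
# can consume. Only immediate neighbors are covered; a real prompt would need
# network configuration, recent changes, metrics and more, kept continuously fresh.
def graph_to_prompt(env, incident_node):
    lines = [f"Service '{incident_node}' is degraded."]
    for upstream, _, attrs in env.in_edges(incident_node, data=True):
        lines.append(f"'{upstream}' {attrs.get('relation', 'depends on')} '{incident_node}'.")
    for _, downstream, attrs in env.out_edges(incident_node, data=True):
        lines.append(f"'{incident_node}' {attrs.get('relation', 'depends on')} '{downstream}'.")
    return " ".join(lines)

print(graph_to_prompt(env, "checkout-service"))
# "Service 'checkout-service' is degraded. 'web-pod-1' hosts 'checkout-service'. ..."
```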
Analyzing Real-Time Continuous Data Streams
Production systems generate continuous, structured time-series data streams that require real-time visibility during root cause analysis. By design, models such as GPT respond to discrete queries; they are not built to continuously ingest dynamic, real-time network, system and application data. Furthermore, as your systems evolve, LLMs will always be a step behind, aware only of the version of your environment on which they were last trained.
For effective root cause analysis, SREs and DevOps professionals need clear, intuitive data visualizations such as environment topologies and impact maps alongside a history of recent changes and deployments. GenAI models interface primarily through chat, which makes it difficult to visualize their insights.
Explainability Gaps at Scale
Explainability matters in all domains, but the stakes are particularly high in root cause analysis.
If a model can’t properly explain its reasoning for suggesting a particular mitigation strategy (such as upgrading a specific database that is causing a payment service bottleneck), SREs can’t act with confidence. And they may be influenced to perform misguided actions that compound existing issues or create new ones.
The fact that LLMs perform ‘black box’ operations isn’t a problem in and of itself. (After all, any neural network involves explainability challenges due to the number of parameters and the volume of training data.)
However, a major explainability challenge arises from the fact that the underlying data — a structured, time-series graph of your environment — must be transformed into an unstructured textual format that LLMs can analyze.
Where a graph representation of your environment makes the relationships across layers visible, the flattened text input of an LLM makes it almost impossible to reason about the output. This can make it particularly hard to debug potential model failures:
- Is the model analyzing the most up-to-date representation of your environment?
- Was inadequate context provided to generate an accurate recommendation?
- Why was a particular root cause identified across multiple ‘hops in the chain’?
In short, the additional layer required to make LLMs useful for root cause analysis — transforming and annotating graph input into textual training data — introduces explainability challenges that only compound at scale.
How to Incorporate LLMs — The Right Way
We have highlighted some of the shortcomings of LLMs for investigation and root cause analysis of production issues.
But it would be a mistake to discount GenAI for observability altogether. LLMs have a clear and powerful role to play in the troubleshooting process. Specifically, they can complement other forms of AI used in root cause analysis (such as graph machine learning) by providing an intuitive, flexible and shared user interface for investigation.
Using LLMs as the ‘interface layer’ during an incident investigation — a chatbot that enables truly conversational troubleshooting — offers various benefits:
- It makes complex insights accessible to different cross-functional members of the team (SREs, DevOps, developers) who may have varying levels of familiarity with the technical environment
- It accelerates the troubleshooting process by enabling questions to be asked and answered iteratively — the way that SREs investigate issues in the real world
- It promotes alignment by providing a common language for issues and root causes across the team.
What does this look like in practice? Imagine if, during a service degradation (like an outage on the checkout page for an e-commerce retailer), various teams had access to a chat-based interface for asking questions and exploring hypotheses — receiving answers tailored to their specific environment rather than generic ones. For example: “Why might the web pods be overloaded? What are the potential root causes of our cache service overload?”
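As a minimal sketch of this ‘interface layer’ pattern, findings produced by purpose-built analysis (for example, graph ML) can be passed to the LLM as grounding context, so the conversational answer reflects the actual environment rather than generic advice. The findings, service names and prompt wording below are hypothetical, and the resulting prompt would be sent to whichever LLM chat API your platform integrates with:

```python
# A minimal, hypothetical sketch: ground the LLM's conversational answer in
# findings produced by the root cause analysis layer, rather than letting it
# answer from general knowledge alone.
def build_incident_prompt(question: str, findings: list[str]) -> str:
    context = "\n".join(f"- {f}" for f in findings)
    return (
        "You are assisting an incident investigation.\n"
        "Findings from the root cause analysis engine:\n"
        f"{context}\n\n"
        f"Question from the on-call engineer: {question}\n"
        "Answer using only the findings above, and say so if they are insufficient."
    )

# Example findings surfaced by the analysis layer (illustrative values only).
findings = [
    "cache-service p99 latency rose from 4 ms to 210 ms at 14:02 UTC",
    "checkout-service web pods hit CPU limits within 3 minutes of the cache degradation",
]

prompt = build_incident_prompt("Why might the web pods be overloaded?", findings)
print(prompt)  # this prompt would then go to your LLM provider's chat API
```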
In short, a conversational interface for troubleshooting can make root cause insights understandable and actionable for the human teams tasked with investigation and remediation.
What’s Next? The Future of Intelligent Observability
There is no question that GenAI models such as GPT will have a groundbreaking impact on observability. As we have seen, LLMs can vastly accelerate the troubleshooting process by translating root cause insights for human SREs.
But truly intelligent observability is not as simple as bolting a chatbot onto an existing platform. It requires a complete system designed by domain experts to collect, structure and analyze data through the lens of end users and business impact. Generic AI models fall short in the specialized and demanding task of root cause analysis for distributed production systems.
At Senser, we have been building that system from the ground up since day one.
How Senser Helps
Senser’s zero-instrumentation AIOps platform uses eBPF-based data collection to provide immediate, low-overhead visibility into your production environment.
Senser automatically creates a topology of your environment, dynamically mapping dependencies across layers (application, APIs, network, infrastructure) to provide critical context for troubleshooting. Our graph ML-based approach helps you quickly pinpoint the root cause of service issues in even the most complex environment.
Bringing together the power of LLMs (for conversational troubleshooting) with ML purpose-built for root cause analysis gives your team the best of both worlds — the right tools to vastly reduce mean time to detect (MTTD) and mean time to remediate (MTTR).