In today’s fast-paced digital landscape, keeping your services up and performant, particularly cloud-based and online services, is crucial for staying competitive. Downtime or latency issues can drive customers away, especially when all it takes is a click to switch to a competing SaaS solution. DevOps and site reliability engineering (SRE) teams face the constant challenge of minimizing mean time to remediation (MTTR) when dealing with errors and issues. As valuable as search engines like Google are, the sheer amount of information available can make finding the right solution a time-consuming task. What if there were a way to automate and streamline this process, making error investigation more intelligent, focused and efficient?
This article walks you through the journey of error resolution, from the first obscure log line to identifying the problem hidden within it. We’ll resolve the issue using several tactics, ultimately turning to generative AI to cut MTTR and provide precise IT recommendations. Along the way, we’ll cover measures and principles DevOps teams can apply to their own workflows, and explore where generative AI fits in.
The Current Experience of DevOps/SRE
Picture this: A log line pops up with an obscure error message, and the first instinct is to search for that message on Google for potential solutions. After all, someone else must have encountered a similar problem, right? However, the abundance of resources and search results can be overwhelming. Results are often ranked by website relevance, not necessarily by how relevant they are to the error itself. Consequently, precious time is wasted sifting through countless search results, lengthening the time it takes to understand and remediate the error. For DevOps and SRE teams responsible for maintaining system stability, MTTR is a crucial key performance indicator (KPI), and reducing it is a constant priority. This raises the question: How can we leverage automation and artificial intelligence to improve the search process and accelerate error resolution?
Our First Step: Cognitive Insights
In our organization’s initial attempt to tackle this challenge, we focused on crowdsourcing techniques to get more relevant results than a brute-force Google search. We also wanted to automate the process and run it offline, so that when an incident occurs we can offer useful insights immediately rather than starting the search while the system is down or malfunctioning.
The approach involves an offline phase and an online phase. In the offline phase, we analyzed all of our ingested logs and identified common log patterns, which also let us count how often each pattern occurred in our system and how prevalent it was. We then crawled technology forums such as StackOverflow, Google Groups and Bing Groups for discussions around these log patterns that could point to solutions. The final step was to rank the search results by relevance and keep the top five most relevant links.
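To make the pattern-extraction idea concrete, here is a minimal sketch: variable tokens such as IP addresses and numbers are masked so that similar log lines collapse into one template whose occurrences can be counted. The function names and regexes are ours for illustration; real log-clustering pipelines are considerably more sophisticated.

```python
import re
from collections import Counter

def extract_pattern(line: str) -> str:
    """Mask variable tokens (IPs, hex IDs, numbers) so similar logs collapse
    into a single template. Illustrative only; production pipelines do far more."""
    line = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", "<IP>", line)
    line = re.sub(r"\b0x[0-9a-fA-F]+\b", "<HEX>", line)
    line = re.sub(r"\b\d+\b", "<NUM>", line)
    return line.strip()

def build_pattern_library(log_lines):
    """Count how often each template occurs, i.e., how prevalent the pattern is."""
    return Counter(extract_pattern(l) for l in log_lines)

if __name__ == "__main__":
    logs = [
        "connection to 10.0.0.12 timed out after 30 seconds",
        "connection to 10.0.0.57 timed out after 12 seconds",
    ]
    for pattern, count in build_pattern_library(logs).items():
        print(count, pattern)  # both lines collapse into one template
```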
The offline phase resulted in a library of known log patterns and, for each one, a cognitive insight containing those links, along with metadata such as the severity level, how many times the pattern occurred in the logs, the date of first occurrence and tags for the technologies, tools and domains involved.
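For illustration, a per-pattern insight could be modeled as a simple record holding the fields described above. The field names and example values are assumptions, not our production schema.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class CognitiveInsight:
    pattern: str                      # the normalized log template
    links: list[str]                  # top-ranked forum threads for this pattern
    severity: str                     # e.g. "low" / "medium" / "high"
    occurrences: int                  # how many times the pattern appeared in our logs
    first_seen: date                  # date of first occurrence
    tags: list[str] = field(default_factory=list)  # involved technologies and domains

insight = CognitiveInsight(
    pattern="connection to <IP> timed out after <NUM> seconds",
    links=["https://stackoverflow.com/questions/..."],  # placeholder link
    severity="high",
    occurrences=42,
    first_seen=date(2023, 1, 15),
    tags=["networking", "kafka"],
)
```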
The online phase happens in real time as new logs come in. Each ingested log is automatically matched against all known patterns; if it matches one, it is enriched with the cognitive insight for that pattern. This means that as soon as the problematic log arrives, the DevOps engineer already has focused, ranked search results and context to start from, which accelerates the investigation.
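The online lookup can be as simple as normalizing the incoming line the same way the offline phase did and checking the pattern library. Again, this is a sketch under those assumptions rather than the actual implementation.

```python
import re

def normalize(line: str) -> str:
    """Apply the same masking as the offline phase so lookups hit the same templates."""
    line = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", "<IP>", line)
    return re.sub(r"\b\d+\b", "<NUM>", line).strip()

# Maps normalized templates to their cognitive insights, built offline
# (a placeholder dict here for illustration).
insight_library = {
    "connection to <IP> timed out after <NUM> seconds": {
        "severity": "high",
        "links": ["..."],
    },
}

def match_insight(log_line: str, library: dict):
    """Return the cognitive insight for this log's pattern, if one is known."""
    return library.get(normalize(log_line))  # None means an unknown pattern

hit = match_insight("connection to 10.0.0.99 timed out after 7 seconds", insight_library)
print(hit)
```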
The Next Step: Why Don’t We Ask Generative AI?
After reflecting on our initial approach, we had an epiphany. Large language models (LLMs) like ChatGPT had already crawled the web and absorbed a vast amount of information. So, why not leverage their capabilities and ask them directly for insights? The idea was simple: Let the AI “read the posts” for us and provide recommendations. We started by formulating specific questions such as “What could the possible errors be?” and went a step further by asking for investigation steps. However, implementing this seemingly straightforward approach came with its own set of challenges. We needed to run preprocessing before querying the generative AI, as well as post-processing on the answers it returned, to get what we expected. Let’s see what that process entails.
How We Did It: Analyzing, Sanitizing and Validating
Before asking ChatGPT (or any other generative AI tool) for assistance, we had to complete several preparatory steps. First, we analyzed incoming logs, identified patterns and scored them by severity, which allowed us to prioritize the most critical issues.
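As a rough illustration of that prioritization, severity can be scored by weighting the log level by how often the pattern occurs, so frequent, serious errors surface first. The weights below are assumptions, not our production formula.

```python
# Illustrative severity scoring: weight the log level by pattern frequency.
LEVEL_WEIGHT = {"FATAL": 100, "ERROR": 50, "WARN": 10, "INFO": 1}

def severity_score(level: str, occurrences: int) -> int:
    return LEVEL_WEIGHT.get(level.upper(), 1) * occurrences

candidates = [("ERROR", 42), ("WARN", 120), ("FATAL", 3)]
prioritized = sorted(candidates, key=lambda c: severity_score(*c), reverse=True)
print(prioritized)  # ERROR x42 first (2100), then WARN x120 (1200), then FATAL x3 (300)
```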
Second, we had to carefully design our prompts using prompt engineering, ensuring that we framed our questions in a way that yielded precise and relevant responses. Prompt engineering is the practice of strategically designing prompts or instructions for a language model, carefully specifying the input format, context or constraints to guide its generation process toward more accurate and desired results. This applies to any use of generative AI, and to our case in particular. Beyond accuracy, prompt engineering helped us tune the requested format and length of the answer to fit the “executive summary” paragraph we pop up for the user, ensuring it was neither too lengthy nor too short to provide meaningful insights.
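Below is an illustrative prompt template showing how the question, the context and the length constraint can be pinned down. The wording and field names are assumptions meant to convey the idea, not the exact prompt we use.

```python
# Hypothetical prompt template: constrains scope, format and length up front.
PROMPT_TEMPLATE = """You are assisting a DevOps engineer investigating a production log.

Log pattern (variables masked):
{pattern}

Context: the pattern occurred {occurrences} times; severity is {severity}.

Answer in at most 120 words, as a short executive summary:
1. The most likely causes of this error.
2. Two or three concrete next investigation steps.
Do not speculate beyond the information given."""

prompt = PROMPT_TEMPLATE.format(
    pattern="connection to <IP> timed out after <NUM> seconds",
    occurrences=42,
    severity="high",
)
print(prompt)
```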
Additionally, we took great care in sanitizing the queries and removing any sensitive data to maintain privacy and security. This step was especially crucial for those working with public services, highlighting the importance of protecting users’ personally identifiable information (PII), such as names, emails, phone numbers, IP addresses or GPS coordinates.
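A minimal sanitization sketch might apply regex-based redaction rules for a few common PII shapes before the prompt leaves our environment. A production sanitizer would cover many more formats and edge cases than the assumptions shown here.

```python
import re

# Redaction rules for a few common PII shapes; order matters (IPs before phone-like digits).
PII_RULES = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "<EMAIL>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "<PHONE>"),
]

def sanitize(text: str) -> str:
    for pattern, replacement in PII_RULES:
        text = pattern.sub(replacement, text)
    return text

print(sanitize("user jane.doe@example.com from 192.168.1.10 reported a timeout"))
# -> "user <EMAIL> from <IP> reported a timeout"
```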
With all this preprocessing done, we sent the prompt to the generative AI. It is important to note that public services may not always be available (we occasionally received a “service is busy” message when working with ChatGPT and other services), so the automation should tolerate delays and employ retries as needed.
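A simple way to handle such transient failures is exponential backoff with jitter, sketched below. Here `call_model` is a hypothetical stand-in for whatever client call the generative AI service exposes, and is assumed to raise an exception when the service is busy.

```python
import random
import time

def ask_with_retries(call_model, prompt: str, max_attempts: int = 5):
    """Retry with exponential backoff plus jitter when the service is busy.
    `call_model` is a placeholder for the actual generative AI client call."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call_model(prompt)
        except Exception:
            if attempt == max_attempts:
                raise
            # Back off 1s, 2s, 4s, ... with a little jitter to avoid thundering herds.
            time.sleep(2 ** (attempt - 1) + random.random())
```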
Finally, as responses came back from the AI, it was essential to validate them. We analyzed and filtered out non-relevant answers from other domains (after all, a log is also a product of a tree, and Kafka is also a novelist), and made sure there was no offensive content or misleading information. The semantic integrity of the responses was also carefully evaluated.
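As a toy example of that filtering, a relevance check can look for expected technical terms and reject answers that drift into unrelated domains. The keyword lists and length bound here are assumptions purely for illustration.

```python
# Toy relevance filter: keep answers mentioning the expected technical domain,
# drop ones that wandered off (logs as timber, Kafka the novelist).
EXPECTED_TERMS = {"log", "error", "timeout", "broker", "connection", "retry"}
OFF_DOMAIN_TERMS = {"novelist", "timber", "lumber", "forest"}

def is_relevant(answer: str, max_words: int = 200) -> bool:
    words = [w.strip(".,;:").lower() for w in answer.split()]
    if len(words) > max_words:
        return False
    hits = sum(1 for w in words if w in EXPECTED_TERMS)
    misses = sum(1 for w in words if w in OFF_DOMAIN_TERMS)
    return hits > 0 and misses == 0

print(is_relevant("Check broker connection settings and the timeout value."))  # True
print(is_relevant("Kafka was a novelist born in Prague."))                      # False
```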
These AI insights have proven to be a powerful troubleshooting tool, and we’ve made it an integral part of our service.
This journey is not unique. In fact, DevOps teams can employ similar measures for their own telemetry data to serve their own operational workflows. These principles, such as data sanitization and validation, even apply to other types of data and use cases.