Any form of artificial intelligence (AI) is only as good as the data used to train it. If organizations expect to apply AI effectively to IT operations (ITOps), they need to collect as much telemetry data as possible.
IT teams frequently discover that their AIOps platform has been trained on a narrow base of telemetry data. It might have been collected, for example, from a DevOps platform that lacks complete visibility into the distributed computing environment where their application runs. Without the synthetic telemetry data that an Internet performance monitoring (IPM) platform collects, it's simply unlikely that the machine learning algorithms at the core of any AIOps platform will surface the best recommendations for optimizing application experiences.
The challenge is the probabilistic nature of AI. The relevance of the recommendations that surface is determined by the quality of the data the AI model has been exposed to. Real user data, for example, may be sparse or non-existent. And if telemetry data was never shared with the AI model in the first place, it's extremely unlikely that the AI's recommendations will lead to better application experiences.
Gaining Visibility Using Synthetic Data
IT teams need to be certain the data used to train the AI model reflects the production environments where applications are deployed. Otherwise, no matter how advanced an AI model may be, garbage in still equals garbage out. Before any IT team adopts an AIOps platform, it needs to know the data provenance of the underlying AI model. If the pool of AI training data is limited, the recommendations generated will be, too. IT teams are not going to put their faith in AIOps platforms that are advising them to take specific actions based on partial or incomplete data. Nor should they.
Instead, the teams will assume that every output needs to be verified before the next step in a process is allowed to proceed. After all, the only thing worse than being wrong when it comes to IT and AI is to be wrong at catastrophic scale. Of course, continuing to manage IT sequentially arguably defeats the purpose of investing in an AIOps platform that is supposed to manage tasks in parallel.
Given the dependency modern applications have on Internet services, any effort to apply AI to IT management that doesn't include synthetic Internet telemetry data will lead to a suboptimal outcome. With this type of telemetry included, the insights surfaced to DevOps teams will enable them to ensure key performance indicators (KPIs) are attained and maintained.
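To make the KPI idea concrete, here is a minimal sketch of how latency samples from synthetic probes might be evaluated against a performance target. All names and thresholds here are illustrative assumptions, not part of any specific IPM or AIOps product.

```python
# Hypothetical sketch: checking synthetic probe latencies against a KPI.
# The function names and the 200 ms threshold are illustrative assumptions.

def p95(samples):
    """Return the 95th-percentile value of a list of latency samples."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return ordered[index]

def check_kpi(latencies_ms, threshold_ms=200.0):
    """Flag a KPI breach when the p95 latency exceeds the threshold."""
    observed = p95(latencies_ms)
    return {"p95_ms": observed, "breached": observed > threshold_ms}

# Ten synthetic measurements, two of which show intermittent slowness.
result = check_kpi([120, 130, 150, 480, 140, 135, 125, 510, 145, 138])
```

A tail percentile rather than an average is used here because intermittent Internet-path slowdowns, the kind synthetic probes are meant to catch, tend to vanish in a mean.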
Multiple AI Models
There’s not likely to be one AI model to rule them all. In many cases, networking, security, and other IT service management (ITSM) platforms will have already applied AI to the telemetry data they collect in real time. The output from those AI models will then be shared with AIOps platforms to automate a series of tasks on an end-to-end basis that previously would require IT teams to orchestrate workflows across multiple islands of automation.
DevOps teams, as a result, need to evaluate the efficacy of what will soon be a network of AI models. Each of these models is, or will be, designed to automate a specific task, such as analyzing Internet traffic to identify the source of bottlenecks that might only intermittently impact an application. Armed with those insights, the AIOps platform can consistently generate useful recommendations that DevOps teams can trust. Then they can let the tools automatically apply a suggestion, for example, to reroute Internet traffic to maintain service level objectives (SLOs).
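The rerouting decision described above can be sketched in a few lines. This is a simplified, hypothetical policy, assuming per-path latency measurements are already available; the path names and SLO value are invented for illustration.

```python
# Hypothetical sketch of an SLO-driven rerouting decision.
# Path names and latency figures are illustrative assumptions.

def choose_path(paths, current, slo_ms):
    """Pick a network path given recent latency measurements.

    paths: dict mapping path name -> recent latency in milliseconds.
    Returns the current path while it meets the SLO (to avoid churn),
    otherwise the lowest-latency alternative.
    """
    if paths[current] <= slo_ms:
        return current
    return min(paths, key=paths.get)

paths = {"transit-a": 340.0, "transit-b": 95.0, "peering-c": 120.0}
decision = choose_path(paths, current="transit-a", slo_ms=150.0)
```

Keeping the current path while it is within the SLO is a deliberate choice: rerouting on every small latency delta would itself destabilize the application experience.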
Realizing the Promise of AI
As AIOps platforms improve, they can substantially reduce much of the toil that DevOps teams regularly encounter. Teams can spend weeks trying to determine the root cause of an issue that, once discovered, might take only a few minutes to fix. The challenge is that the source of the issue often has little to do with anything the DevOps team has immediate control over, as is the case when, for example, latency created by an Internet service adversely impacts application performance. However, those insights should enable DevOps teams to send support requests that pinpoint the exact source of an Internet service issue, which their provider should then be able to resolve faster. Just as importantly, the DevOps team can move on to issues that they have more direct control over.
Much of the stress that any DevOps team experiences stems from not knowing the real cause of an issue that, despite their best efforts, continues to generate ongoing alerts. AIOps promises to reduce that stress by making it simpler to first establish causation and then automate remediation. That promise, however, will never be fully realized if the data used to train the AI model doesn't provide enough of the picture required to make a truly informed decision.
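The correlation step mentioned above can be illustrated with a toy example: group alerts that fire in the same time window and share an upstream dependency, since a dependency implicated by several services at once is a likely common cause. The alert tuple format, window size, and dependency names are all assumptions made for this sketch.

```python
# Hypothetical sketch of time-window alert correlation.
# Alert format (timestamp_s, service, dependency) is an illustrative assumption.
from collections import defaultdict

def correlate_alerts(alerts, window_s=60):
    """Group alerts by (dependency, time window) and surface dependencies
    implicated by more than one service, as candidate common causes."""
    groups = defaultdict(list)
    for ts, service, dependency in alerts:
        groups[(dependency, ts // window_s)].append(service)
    return {dep: svcs for (dep, _), svcs in groups.items() if len(svcs) > 1}

alerts = [
    (10, "checkout", "dns"),   # two services blame DNS within one minute
    (20, "search", "dns"),
    (400, "billing", "db"),    # a lone alert, not correlated with anything
]
candidates = correlate_alerts(alerts)
```

A real AIOps platform would use far richer signals than shared dependency names, but the shape of the problem, collapsing a flood of alerts into a short list of probable causes, is the same.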