Artificial Intelligence: Coming to the Rescue of ITOps

According to a 2018 McKinsey Global Institute report, artificial intelligence (AI) has the potential to create $3.5 trillion to $5.8 trillion in value annually across industry sectors. Today, AI in finance and IT alone accounts for about $100 billion; hence, it is becoming quite the game changer in the IT world.

With the onset of cloud adoption, the world of IT and DevOps has changed dramatically. The focus of ITOps is shifting to an integrated, service-centric approach that maximizes the availability of business services. AI can help ITOps detect outages early, predict potential root causes, find systems and nodes that are susceptible to outages, estimate average resolution time and more. This article highlights a few use cases where AI can be integrated with ITOps, simplifying day-to-day operations and making remediation more robust.

Predictive Analytics of Outages

False positives can cause alert fatigue for ITOps teams. One survey indicates that about 52 percent of security alerts are false positives. This puts a lot of pressure on the teams, as they have to review each of these alerts manually. In such a scenario, deep neural networks can predict whether an alert will result in an outage.

[Figure: alerts feed into the layers of a neural network, which outputs a yes/no outage prediction]

A feed-forward network with two hidden layers, trained with backpropagation, should yield good results in predicting outages, as illustrated above. All alert types raised within a stipulated time window act as inputs, and the outage (yes or no) is the output. Historical data should be used to train the model. Every enterprise has its own fault lines and weaknesses, and it is only through historical data that these latent features surface. Hence, every enterprise should build its own customized model, as a "one size fits all" model is less likely to deliver the expected outcomes.
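To make the shape of such a network concrete, here is a minimal sketch of the forward pass only, in plain Python. Everything in it is an assumption made for illustration: the four alert types, the layer sizes, the sigmoid activations and the random, untrained weights. Training with backpropagation on an enterprise's own historical data is deliberately left out.

```python
import math
import random


def sigmoid(z):
    """Squash a real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """Create one dense layer: a weight row and a bias per output unit."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, layer):
    """Apply one dense layer followed by a sigmoid activation."""
    weights, biases = layer
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def predict_outage_probability(alert_counts, layers):
    """Push alert counts through every layer; the last layer has one unit."""
    activations = alert_counts
    for layer in layers:
        activations = dense(activations, layer)
    return activations[0]


# Hypothetical alert types; a real model would use the enterprise's own.
ALERT_TYPES = ["cpu", "disk", "network", "memory"]

rng = random.Random(0)  # fixed seed so the untrained weights are reproducible
layers = [
    init_layer(len(ALERT_TYPES), 6, rng),  # hidden layer 1
    init_layer(6, 4, rng),                 # hidden layer 2
    init_layer(4, 1, rng),                 # output: outage yes/no
]

# Alert counts per type observed in one time window (made-up numbers).
probability = predict_outage_probability([3, 0, 1, 2], layers)
print(f"untrained outage score: {probability:.3f}")
```

Because the weights are random, the printed score carries no meaning until the network has been trained on labeled historical windows.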

An alternative method is logistic regression, where all alert types are the input variables and the binary outage flag is the output.

Logistic regression measures the relationship between a categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function, which is the cumulative logistic distribution. Thus, it treats the same set of problems as probit regression using similar techniques, with the latter using a cumulative normal distribution curve instead.
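The logistic regression approach can be sketched in a few lines of plain Python. This is an illustrative sketch, not a production model: the three alert types, the sample windows, the learning rate and the epoch count are all assumptions, and the fitting routine is ordinary batch gradient descent on the log loss.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(rows, labels, lr=0.1, epochs=5000):
    """Fit weights and bias with batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss w.r.t. the linear score
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def outage_probability(x, weights, bias):
    """Estimated probability that a window of alerts ends in an outage."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Hypothetical history: counts of [cpu, disk, network] alerts per window,
# and whether that window was followed by an outage (1) or not (0).
windows = [[0, 0, 1], [1, 0, 0], [0, 1, 0], [5, 2, 3],
           [4, 3, 1], [6, 1, 4], [1, 1, 0], [3, 4, 2]]
outages = [0, 0, 0, 1, 1, 1, 0, 1]

weights, bias = train_logistic_regression(windows, outages)
quiet = outage_probability([0, 1, 0], weights, bias)
noisy = outage_probability([5, 3, 3], weights, bias)
print(f"quiet window: {quiet:.2f}, noisy window: {noisy:.2f}")
```

In practice a library implementation would usually replace the hand-written training loop; the point here is only that each alert type gets a coefficient and the output is a probability between 0 and 1.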

Root Cause Classification and Prediction

This is a two-step process. In the first step, root causes are classified by keyword search: natural language processing (NLP) is used to extract key values from free-text root cause analysis fields and classify them into predefined root causes. This can be either supervised or unsupervised.
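A minimal sketch of the keyword-based first step might look like the following. The category names and keyword lists are hypothetical, and a real deployment would likely add stemming, synonyms or a trained text classifier on top of plain keyword matching.

```python
import re
from collections import Counter

# Hypothetical predefined root causes and the keywords that signal them.
ROOT_CAUSE_KEYWORDS = {
    "network": {"dns", "latency", "packet", "switch", "firewall"},
    "storage": {"disk", "volume", "iops", "filesystem"},
    "deployment": {"release", "rollback", "config", "deploy"},
}


def classify_root_cause(rca_text):
    """Pick the predefined root cause whose keywords occur most often."""
    tokens = re.findall(r"[a-z0-9]+", rca_text.lower())
    hits = Counter()
    for cause, keywords in ROOT_CAUSE_KEYWORDS.items():
        hits[cause] = sum(1 for token in tokens if token in keywords)
    cause, count = hits.most_common(1)[0]
    # No keyword matched at all: leave the ticket for manual review.
    return cause if count > 0 else "unclassified"


print(classify_root_cause("Disk volume filled up; filesystem went read-only"))
print(classify_root_cause("Bad config pushed in last release, rollback fixed it"))
print(classify_root_cause("Cause unknown"))
```

The resulting labels can then serve as the target variable for the prediction model in the second step.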

In the second step, a random forest or a multi-class neural network can be used to predict the root cause, with the other ticket attributes acting as inputs. The right classification model depends on the data volume and data type. In general, a random forest is more accurate, but it needs structured data and correct labeling, and it is less tolerant of poor data quality. A multi-class neural network needs a large volume of data to train; it is more fault-tolerant but slightly less accurate.

Prediction of Average Time to Close a Ticket

A simple weighted average formula can be used to predict time taken for ticket resolution:

Avg time (t) = (a1·T1 + a2·T2 + a3·T3) / (count of T1 + T2 + T3)

where T1, T2 and T3 are ticket types and a1, a2 and a3 are the weights applied to them.
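The formula leaves some room for interpretation, so the sketch below fixes one possible reading as an explicit assumption: each ticket type contributes its mean historical resolution time, weighted by how many tickets of that type are expected, and the sum is divided by the total ticket count. The ticket types and numbers are made up for illustration.

```python
def weighted_average_resolution_time(stats):
    """stats maps ticket type -> (weight, mean resolution time in hours).

    Assumption: the weight is the expected number of tickets of that type,
    so the denominator is the total ticket count.
    """
    total_weight = sum(weight for weight, _ in stats.values())
    if total_weight == 0:
        raise ValueError("at least one ticket type needs a non-zero weight")
    weighted_sum = sum(weight * hours for weight, hours in stats.values())
    return weighted_sum / total_weight


# Hypothetical ticket types: (expected ticket count, mean hours to close).
ticket_stats = {
    "password_reset": (50, 0.5),
    "access_request": (30, 4.0),
    "server_outage": (20, 12.0),
}

expected_hours = weighted_average_resolution_time(ticket_stats)
print(f"expected time to close: {expected_hours:.2f} hours")  # 3.85 hours
```

If the weights are instead tuned heuristically or learned from data, only the first element of each tuple changes; the calculation stays the same.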

Other attributes can be used to segment tickets into the right cohorts, which makes resolution time more predictable. This helps in better resource planning and utilization. Feature weights can be set heuristically or empirically.

Unusual Load on System

Simple anomaly detection algorithms can indicate whether a system is under normal load or is showing high variance. A large deviation from the average in a time series can point to unusual activity or to resources that are not being freed. However, the algorithm should account for seasonality, as system load is a function of time and season.
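One simple way to respect seasonality is to compare each reading with the history for the same hour of the day rather than with a global average. The sketch below does this with a z-score per hour; the bucketing by hour, the threshold of three standard deviations and the sample loads are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev


def build_hourly_baseline(history):
    """history: iterable of (hour_of_day, load) pairs from normal operation."""
    by_hour = defaultdict(list)
    for hour, load in history:
        by_hour[hour].append(load)
    # stdev needs at least two samples, so skip hours with too little data.
    return {
        hour: (mean(loads), stdev(loads))
        for hour, loads in by_hour.items()
        if len(loads) >= 2
    }


def is_unusual(hour, load, baseline, threshold=3.0):
    """Flag a reading that deviates strongly from the norm for that hour."""
    if hour not in baseline:
        return False  # no baseline for this hour, so nothing to compare with
    avg, spread = baseline[hour]
    if spread == 0:
        return load != avg
    return abs(load - avg) / spread > threshold


# Hypothetical CPU load (%) seen on past days at 03:00 and at 14:00.
history = [(3, 10), (3, 12), (3, 11), (3, 9),
           (14, 70), (14, 75), (14, 72), (14, 68)]
baseline = build_hourly_baseline(history)

print(is_unusual(14, 70, baseline))  # normal for mid-afternoon
print(is_unusual(3, 70, baseline))   # same load is unusual at night
```

The same load value is treated differently depending on when it occurs, which is the behavior a seasonality-aware detector needs. Weekly or monthly cycles could be handled the same way with a different bucketing key.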

Given the above scenarios, it is clear that AI has a tremendous opportunity to serve IT operations. It can be used for several ITOps tasks, including prediction, event correlation, detection of unusual loads on a system (e.g., a cyberattack) and remediation based on root cause analysis.

— Vivek Singh