AIOps: The Path to Greater Resilience and Uptime

Large organizations with complex IT infrastructures face a significant challenge when it comes to data. Many IT operations teams struggle to detect, or predict, the failure conditions and circumstances that can adversely impact resilience and uptime.

Monitoring a multitude of systems is a complex task, as the data generated by any system can be vast and far too overwhelming for manual IT operations processes, particularly in relation to analyzing, correlating and prioritizing alerts and remediation.

Companies have much to gain from a better understanding of the data created by monitoring tools and practices: increasing the resilience of systems and the speed of remediation cycles delivers significant cost and time savings. Harnessing advanced technologies to deliver more intelligence to the teams in charge of performance monitoring is becoming increasingly important, which is where AIOps can deliver significant value on top of existing monitoring and automation tools.

Cutting Through the Noise

Telemetry tools deliver staggering amounts of data. The Microsoft Azure telemetry platform, for example, records 10 petabytes of data per day. When dealing with this volume, the ability to intelligently decipher the severity of alerts becomes vital, especially as many of them may be duplicates or not meaningful in any way. Yet it is IT operations teams that are tasked with understanding how systems are performing, and where resources should be diverted to ensure services are always up and running. To do this effectively, the root causes of alerts must be identified quickly and correctly.

AIOps is increasingly becoming the go-to solution to the problems inherent in dealing with big data. By applying data collection, data modeling and data analytics techniques, and using machine learning algorithms to establish patterns, AIOps reduces the noise in the data and gives operations teams more intelligent insights into alerts. The goal is to provide more context for the issues flagged within systems. Eventually, somebody has to take action when an alert is created, so the more intelligence engineers have, the easier it will be to remediate.

The less context delivered with alerts, the more time engineers will need to resolve the issue, as the manual remediation process will likely involve interaction with other teams. Take application performance monitoring as an example. In my experience, operations teams do not have an end-to-end understanding of the applications that are live in the production environment. If an alert is created, the ops team will typically seek assistance from a member of the developer team who built it, or another colleague with a deeper understanding of the application, which of course extends the lead time of the remediation process.
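As a rough illustration of what reducing alert noise can mean in practice, the Python sketch below collapses duplicate alerts into groups and ranks the groups by severity. It is a minimal sketch, not how any particular AIOps product works: the Alert fields, the five-minute window and the (source, check) grouping key are all assumptions made for the example, and a real platform would typically learn correlations rather than rely on a fixed key.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # All field names are hypothetical, chosen for illustration only.
    source: str       # host or service that raised the alert
    check: str        # name of the failing check
    severity: int     # assumed scale: 1 (low) to 5 (critical)
    timestamp: float  # seconds since some reference point


def group_duplicates(alerts, window_seconds=300.0):
    """Group alerts sharing (source, check) that arrive within
    window_seconds of the first alert in their group."""
    groups = []
    open_groups = {}  # (source, check) -> index of the latest group for that key
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.source, alert.check)
        idx = open_groups.get(key)
        if idx is not None and alert.timestamp - groups[idx][0].timestamp <= window_seconds:
            groups[idx].append(alert)
        else:
            open_groups[key] = len(groups)
            groups.append([alert])
    return groups


def summarize(groups):
    """One summary per group, most severe and most frequent first."""
    summaries = [
        {
            "source": g[0].source,
            "check": g[0].check,
            "count": len(g),
            "severity": max(a.severity for a in g),
            "first_seen": g[0].timestamp,
        }
        for g in groups
    ]
    return sorted(summaries, key=lambda s: (-s["severity"], -s["count"]))
```

Even this simple grouping gives the engineer one line per underlying issue, with a count and a worst-case severity, instead of a stream of repeated alerts.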

Key Considerations for Deploying AIOps

By establishing data patterns through machine learning and other advanced analytics technologies, the following key drivers for alerts assessment using AIOps can be established:

Frequency: Discovering how regularly alerts are created and whether they warrant automation for analysis, correlation, prioritization and remediation.

Coverage: Telemetry is of course vital to AIOps, as the machine learning algorithms will only be as good as the data they ingest. Putting in place the right telemetry tools is therefore essential to ensure the right data is being gathered. Failing to do so increases the risk of missing critical issues.

Impact: Not all problems require equal energy and resources. Certain issues might only occur every few months and may not be critical. In such situations, it is not prudent to invest significant amounts of time and money into developing the machine learning algorithms that will identify where and when an alert will be created. Again, cutting through the noise of the telemetry will help to identify high-priority alerts.

Probability: What is the likelihood of certain issues recurring? Can automation tools, informed by the machine learning algorithms, be implemented to deal with these, i.e. learn where and when they are likely to occur so they can be dealt with by the AIOps platform autonomously?

Take, for example, the issue of CPU or memory going beyond a certain threshold. Rather than manually contacting customers when an issue is flagged to inform them that services are about to go down before being spun up again, an AIOps platform can automate this process, so remediation happens seamlessly.

Latency: Automation of the remediation process for an alert deemed critical requires increased investment in AIOps. Getting good data from the early stages of AIOps processes will reduce the amount of time taken to both flag and deal with a problem. Containerized microservices, for example, take very little time to recover, so very little investment is needed when applying AIOps tools that automate remediation. Database recovery, however, is a more complex process, the automation of which would require significant investment.
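The CPU or memory threshold example mentioned above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than a real AIOps platform API: the class name ThresholdRemediator, the rule of requiring several consecutive breaches before acting, and the remediate callback are all hypothetical. In practice the callback might restart a container or notify customers, and the threshold itself might be learned from historical data instead of fixed.

```python
from collections import deque
from typing import Callable, Deque


class ThresholdRemediator:
    """Calls `remediate` once when a metric stays above `threshold`
    for `sustained_samples` consecutive observations."""

    def __init__(self, threshold: float, sustained_samples: int,
                 remediate: Callable[[float], None]) -> None:
        self.threshold = threshold
        self.sustained_samples = sustained_samples
        self.remediate = remediate  # hypothetical hook, e.g. restart a service
        self._recent: Deque[float] = deque(maxlen=sustained_samples)
        self._triggered = False

    def observe(self, value: float) -> bool:
        """Record one sample; return True if remediation fired on this sample."""
        self._recent.append(value)
        breached = (
            len(self._recent) == self.sustained_samples
            and all(v > self.threshold for v in self._recent)
        )
        if breached and not self._triggered:
            self._triggered = True
            self.remediate(value)
            return True
        if not breached:
            # The metric dipped back under the threshold, so re-arm the trigger.
            self._triggered = False
        return False
```

Requiring a sustained breach is one simple way to avoid reacting to a single spike, and firing only once per breach keeps the automation from repeatedly restarting a service that is already recovering.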

What’s the Customer Value?

For service providers, AIOps is essential if they are to attain the highest levels of resilience, the measure of which all comes down to the Service Level Agreement (SLA). The SLA is measured by a percentage that corresponds to the time systems are up and running over a year. So, 99.99% uptime is good, 99.999% is great, but 99.9999% is where service providers want to be. To put this into context, downtime for an entire year in the latter scenario amounts to just 32 seconds.
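The arithmetic behind these uptime figures is easy to check. The short Python sketch below converts an uptime percentage into the downtime it allows per year, assuming a 365-day year; the function names are made up for this example.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds, ignoring leap years


def allowed_downtime_seconds(uptime_percent: float) -> float:
    """Seconds of downtime per year permitted by a given uptime percentage."""
    return SECONDS_PER_YEAR * (1 - uptime_percent / 100)


def describe(uptime_percent: float) -> str:
    """Human-readable downtime budget for one SLA level."""
    seconds = allowed_downtime_seconds(uptime_percent)
    return f"{uptime_percent}% uptime allows about {seconds:.1f} seconds of downtime per year"


if __name__ == "__main__":
    for sla in (99.99, 99.999, 99.9999):
        print(describe(sla))
```

Under that assumption, 99.99% allows roughly 53 minutes of downtime a year, 99.999% roughly 5 minutes, and 99.9999% about 31.5 seconds, which matches the figure of around 32 seconds quoted above.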

Increased automation is the only way to achieve an optimal SLA. To do this, service providers are increasingly turning to AIOps solutions.

Operations teams are in charge of the IT strategies that align with business objectives. For service providers, a key part of this is, of course, ensuring maximum availability and optimal performance of the customer-facing environment. Understanding when and where faults appear, and improving resilience against them, is the best way to ensure this. That is where AIOps truly adds value.

— Prashant Jain