AIOps: The Path to Greater Resilience and Uptime

Large organizations with complex IT infrastructures face a significant challenge when it comes to data. The complexity is such that many IT operations teams struggle to detect, or predict, the failure conditions and circumstances that can adversely impact resilience and uptime.

Monitoring a multitude of systems is a complex task: the data any single system generates can be vast and overwhelming for manual IT operations processes, particularly when it comes to analyzing, correlating and prioritizing alerts and remediation.

Companies have much to gain from a better understanding of the data created by their monitoring tools and practices: increasing the resilience of systems and the speed of remediation cycles delivers significant cost and time savings. Harnessing advanced technologies to deliver more intelligence to the teams in charge of performance monitoring is becoming increasingly important, and this is where AIOps can deliver significant value on top of existing monitoring and automation tools.

Cutting Through the Noise

Telemetry tools deliver staggering amounts of data. The Microsoft Azure telemetry platform, for example, records 10 petabytes of data per day. At this volume, the ability to intelligently decipher the severity of alerts becomes vital, especially as many of them may be duplicates or simply not meaningful. Yet it is IT operations teams that are tasked with understanding how systems are performing and where resources should be diverted to ensure services are always up and running. To do this effectively, the root causes of alerts must be identified quickly and correctly.
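Duplicate suppression is usually the first layer of that deciphering. As a minimal sketch, assuming a generic alert format with illustrative field names (source, check, severity) rather than any particular vendor's schema, repeated alerts can be collapsed by fingerprint before anyone has to triage them:

```python
from collections import defaultdict

def fingerprint(alert):
    # Field names here are illustrative assumptions, not a specific tool's schema.
    return (alert["source"], alert["check"], alert["severity"])

def deduplicate(alerts):
    # Collapse repeated alerts into one entry per fingerprint, keeping a
    # count so responders still see the volume behind each alert.
    grouped = defaultdict(list)
    for alert in alerts:
        grouped[fingerprint(alert)].append(alert)
    return [{**hits[0], "count": len(hits)} for hits in grouped.values()]

alerts = [
    {"source": "web-01", "check": "cpu_high", "severity": "warning"},
    {"source": "web-01", "check": "cpu_high", "severity": "warning"},
    {"source": "db-02", "check": "disk_full", "severity": "critical"},
]
print(deduplicate(alerts))  # two entries instead of three
```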

AIOps is increasingly becoming the go-to solution to these big data problems. By applying data collection, data modeling and data analytics techniques, and by using machine learning algorithms to establish patterns, it reduces the noise in the data and gives operations teams more intelligent insight into alerts. The goal is to provide more context for the issues flagged within systems: eventually somebody has to take action when an alert is created, and the more intelligence engineers have, the easier remediation becomes.
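What that extra context looks like in practice is correlation: related alerts grouped into a single incident. A real AIOps platform would learn these groupings from historical patterns; the hard-coded, time-window version below is only a sketch of the output responders would see (the service, check and time fields and the five-minute window are assumptions for illustration):

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)  # illustrative correlation window

def correlate(alerts):
    # Group alerts that fire on the same service within a short window,
    # so responders see one incident with context instead of many alerts.
    incidents = []
    for alert in sorted(alerts, key=lambda a: a["time"]):
        for incident in incidents:
            if (incident["service"] == alert["service"]
                    and alert["time"] - incident["last_seen"] <= WINDOW):
                incident["alerts"].append(alert)
                incident["last_seen"] = alert["time"]
                break
        else:
            incidents.append({
                "service": alert["service"],
                "alerts": [alert],
                "last_seen": alert["time"],
            })
    return incidents

now = datetime(2024, 1, 1, 12, 0)
alerts = [
    {"service": "checkout", "check": "latency_high", "time": now},
    {"service": "checkout", "check": "error_rate", "time": now + timedelta(minutes=2)},
    {"service": "search", "check": "cpu_high", "time": now + timedelta(minutes=30)},
]
print(len(correlate(alerts)))  # 2 incidents from 3 alerts
```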

The less context delivered with an alert, the more time engineers will need to resolve the issue, as the manual remediation process will likely involve interaction with other teams. Take application performance monitoring as an example. In my experience, operations teams do not have an end-to-end understanding of the applications that are live in the production environment. If an alert is created, the ops team will typically seek assistance from a member of the development team that built the application, or another colleague with a deeper understanding of it, which of course extends the lead time of the remediation process.

Key Considerations for Deploying AIOps

By establishing data patterns through machine learning and other advanced analytics technologies, the following key drivers for alert assessment using AIOps can be identified:

Frequency: Discovering how regularly alerts are created and whether that frequency warrants automation for analysis, correlation, prioritization and remediation.

Coverage: Telemetry is of course vital to AIOps, as the machine learning algorithms will only be as good as the data they ingest. Putting in place the right telemetry tools is therefore essential to ensure the right data is being gathered. Failing to do so increases the risk of missing critical issues.

Impact: Not all problems require equal energy and resources. Certain issues might only occur every few months and may not be critical. In such situations, it is not prudent to invest significant amounts of time and money into developing the machine learning algorithms that will identify where and when an alert will be created. Again, cutting through the noise of the telemetry will help to identify high-priority alerts.

Probability: What is the likelihood of certain issues recurring? Can automation tools, informed by the machine learning algorithms, be implemented to deal with them, i.e., learning where and when they are likely to occur so the AIOps platform can handle them autonomously?

Take, for example, the issue of CPU or memory usage going beyond a certain threshold. Rather than manually contacting customers when an issue is flagged to inform them that services are about to go down before being spun up again, an AIOps platform can automate this process so remediation happens seamlessly (a simple version of this is sketched after this list).

Latency: Automation of the remediation process for an alert deemed critical requires increased investment in AIOps. Getting good data from the early stages of AIOps processes will reduce the amount of time taken to both flag and deal with a problem. Containerized microservices, for example, take very little time to recover, so very little investment is needed when applying AIOps tools that automate remediation. Database recovery, however, is a more complex process, the automation of which would require significant investment.
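To make the CPU/memory example above concrete, here is a minimal sketch of threshold-driven remediation. The thresholds, the read_utilization() stub and the service name are all assumptions for illustration; in a real AIOps deployment the thresholds would be learned from historical patterns and the telemetry would come from the monitoring platform's API.

```python
CPU_THRESHOLD = 90.0     # percent; illustrative, not a recommended value
MEMORY_THRESHOLD = 90.0  # percent; illustrative, not a recommended value

def read_utilization():
    # Stand-in for the monitoring platform's API; hard-coded so the sketch runs on its own.
    return {"cpu": 93.5, "memory": 71.2}

def remediate(service_name):
    # In a real deployment this would ask the orchestrator to restart or rescale
    # the service and queue an automated customer notification.
    print(f"Remediating {service_name}: restart requested, customers notified")

def check_and_remediate(service_name):
    usage = read_utilization()
    if usage["cpu"] > CPU_THRESHOLD or usage["memory"] > MEMORY_THRESHOLD:
        remediate(service_name)
    else:
        print(f"{service_name}: utilization within thresholds, no action taken")

check_and_remediate("example-service")
```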

What’s the Customer Value?

For service providers, AIOps is essential if they are to attain the highest levels of resilience, the measure of which comes down to the Service Level Agreement (SLA). The SLA is measured as the percentage of time systems are up and running over a year. So, 99.99% uptime is good, 99.999% is great, but 99.9999% is where service providers want to be. To put this into context, total downtime for an entire year in the latter scenario amounts to just 32 seconds.
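The downtime budgets behind those percentages are easy to check. A quick calculation, ignoring leap years and any maintenance carve-outs an SLA might allow:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000

for uptime in ("99.99", "99.999", "99.9999"):
    downtime = SECONDS_PER_YEAR * (1 - float(uptime) / 100)
    print(f"{uptime}% uptime allows about {downtime:,.0f} seconds of downtime per year")

# 99.99%   -> ~3,154 seconds (roughly 53 minutes)
# 99.999%  -> ~315 seconds   (roughly 5 minutes)
# 99.9999% -> ~32 seconds
```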

Increased automation is the only way to achieve an optimal SLA, and to get there, service providers are increasingly turning to AIOps solutions.

Operations teams are in charge of the IT strategies that align with business objectives. For service providers, a key part of this is, of course, ensuring maximum availability and optimal performance of the customer-facing environment. Understanding when and where faults appear, and improving resilience against them, is the best way to ensure this. That is where AIOps truly adds value.

Prashant Jain

With more than 20 years’ experience in business IT, Prashant is a seasoned product development, architecture and engineering leader. With a focus on cloud-based digital transformation and API management for financial services, Prashant leverages Safe Agile, DevOps, AI and Machine Learning practices to implement new processes and tools into the product development lifecycle.
