DevOps.com

AIOps: The Path to Greater Resilience and Uptime

By: Prashant Jain on November 4, 2019

Large organizations with complex IT infrastructures face a significant challenge when it comes to data. Their environments are intricate enough that many IT operations teams struggle to detect, let alone predict, the failure conditions that can adversely impact resilience and uptime.

Monitoring a multitude of systems is a complex task, as the data generated by any system can be vast and far too overwhelming for manual IT operations processes, particularly in relation to analyzing, correlating and prioritizing alerts and remediation.

Companies have much to gain from a better understanding of the data their monitoring tools and practices create: increasing the resilience of systems and the speed of remediation cycles delivers significant cost and time savings. Harnessing advanced technologies to give performance monitoring teams more intelligence is becoming increasingly important, and this is where AIOps can deliver significant value on top of existing monitoring and automation tools.

Cutting Through the Noise

Telemetry tools deliver staggering amounts of data. The Microsoft Azure telemetry platform, for example, records 10 petabytes of data per day. When dealing with this volume, the ability to intelligently decipher the severity of alerts becomes vital, especially as many of them may be duplicates or not meaningful in any way. Yet it is IT operations teams that are tasked with understanding how systems are performing, and where resources should be diverted to ensure services are always up and running. To do this effectively, the root causes of alerts must be identified quickly and correctly.

AIOps is increasingly becoming the go-to solution to the problems inherent in dealing with big data. By applying data collection, data modeling and data analytics techniques, and by using machine learning algorithms to establish patterns, AIOps reduces the noise in the data and gives operations teams more intelligent insights into alerts. The goal is to provide more context for the issues flagged within systems. Eventually, somebody has to take action when an alert is created, so the more intelligence engineers have, the easier it will be to remediate.
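As a rough illustration of what noise reduction can mean in practice, the sketch below collapses repeated alerts from the same source into a single incident. It is a simple rule-based stand-in, not the machine learning approach an AIOps platform would use, and the `Alert` fields, the `group_alerts` function and the five-minute window are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source: str     # host or service that raised the alert (hypothetical field)
    check: str      # name of the failing check, e.g. "cpu_high" (hypothetical)
    timestamp: int  # seconds since epoch


def group_alerts(alerts, window_seconds=300):
    """Collapse alerts that share a source and check into one incident
    when each arrives within window_seconds of the previous one."""
    incidents = []
    open_incident = {}  # (source, check) -> index of the latest incident

    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.source, alert.check)
        idx = open_incident.get(key)
        if idx is not None and alert.timestamp - incidents[idx]["last_seen"] <= window_seconds:
            # Same problem still firing: count it instead of paging again.
            incidents[idx]["count"] += 1
            incidents[idx]["last_seen"] = alert.timestamp
        else:
            # First occurrence, or the previous incident has gone quiet.
            incidents.append({
                "source": alert.source,
                "check": alert.check,
                "first_seen": alert.timestamp,
                "last_seen": alert.timestamp,
                "count": 1,
            })
            open_incident[key] = len(incidents) - 1
    return incidents


raw_alerts = [
    Alert("db1", "cpu_high", 0),
    Alert("db1", "cpu_high", 60),
    Alert("web1", "disk_full", 100),
    Alert("db1", "cpu_high", 1000),
]
incidents = group_alerts(raw_alerts)
for incident in incidents:
    print(incident)
```

Here four raw alerts become three incidents, because the two `cpu_high` alerts raised by `db1` a minute apart are treated as one problem. A real platform would learn which alerts belong together rather than rely on a fixed key and window.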

The less context delivered with alerts, the more time engineers will need to resolve the issue, as the manual remediation process will likely involve interaction with other teams. Take application performance monitoring as an example. In my experience, operations teams do not have an end-to-end understanding of the applications that are live in the production environment. If an alert is created, the ops team will typically seek assistance from a member of the development team that built the application, or from another colleague with a deeper understanding of it, which of course extends the lead time of the remediation process.

Key Considerations for Deploying AIOps

By establishing data patterns through machine learning and other advanced analytics technologies, the following key drivers for alerts assessment using AIOps can be established:

Frequency: Discovering how regularly alerts are created and whether that frequency warrants automating their analysis, correlation, prioritization and remediation.

Coverage: Telemetry is of course vital to AIOps, as the machine learning algorithms will only be as good as the data they ingest. Putting in place the right telemetry tools is therefore vital to ensure the right data is being gathered. Failing to do so increases the risk of missing critical issues.

Impact: Not all problems require equal energy and resources. Certain issues might only occur every few months and may not be critical. In such situations, it is not prudent to invest significant amounts of time and money into developing the machine learning algorithms that will identify where and when an alert will be created. Again, cutting through the noise of the telemetry will help to identify high-priority alerts.

Probability: What is the likelihood of certain issues recurring? Can automation tools, informed by the machine learning algorithms, be implemented to deal with these, i.e. learn where and when they are likely to occur so they can be dealt with by the AIOps platform autonomously?

Take, for example, the issue of CPU or memory going beyond a certain threshold. Rather than manually contacting customers when an issue is flagged to inform them that services are about to go down before being spun up again, an AIOps platform can automate this process, so remediation happens seamlessly.

Latency: Automation of the remediation process for an alert deemed critical requires increased investment in AIOps. Getting good data from the early stages of AIOps processes will reduce the amount of time taken to both flag and deal with a problem. Containerized microservices, for example, take very little time to recover, so very little investment is needed when applying AIOps tools that automate remediation. Database recovery, however, is a more complex process, the automation of which would require significant investment.
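One way to weigh these drivers against each other is a simple score per alert type. The sketch below is only an illustration: the `AlertType` fields, the `automation_score` formula and the sample figures are assumptions, not part of any AIOps product. Coverage is left out because it concerns whether the data exists at all, not how to rank alerts once it does.

```python
from dataclasses import dataclass


@dataclass
class AlertType:
    name: str
    occurrences_per_month: float   # Frequency
    impact: float                  # Impact, from 0 (negligible) to 1 (critical)
    recurrence_probability: float  # Probability, from 0 to 1
    automation_cost: float         # Latency: relative effort to automate, > 0


def automation_score(alert_type):
    """Expected monthly harm divided by the effort needed to automate it.
    A higher score suggests a better candidate for automated remediation."""
    expected_harm = (
        alert_type.occurrences_per_month
        * alert_type.impact
        * alert_type.recurrence_probability
    )
    return expected_harm / alert_type.automation_cost


def rank_for_automation(alert_types):
    """Return alert types ordered from best to worst automation candidate."""
    return sorted(alert_types, key=automation_score, reverse=True)


# Illustrative numbers only.
candidates = [
    AlertType("database recovery", 0.25, 1.0, 0.5, 20.0),
    AlertType("container restart", 40.0, 0.5, 0.9, 1.0),
]
ranked = rank_for_automation(candidates)
for alert_type in ranked:
    print(alert_type.name, round(automation_score(alert_type), 3))
```

With these made-up figures the frequent, cheap-to-fix container restart ranks well ahead of the rare, expensive database recovery, which matches the reasoning under Impact and Latency above.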

What’s the Customer Value?

For service providers, AIOps is essential if they are to attain the highest levels of resilience, the measure of which all comes down to the Service Level Agreement (SLA). The SLA is expressed as a percentage that corresponds to the time systems are up and running over a year. So, 99.99% uptime is good, 99.999% is great, but 99.9999% is where service providers want to be. To put this into context, the downtime allowed over an entire year in the latter scenario amounts to roughly 32 seconds.
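The arithmetic behind these figures is easy to check. The short sketch below assumes a 365-day year; the function name `allowed_downtime_seconds` is chosen for the example.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds, ignoring leap years


def allowed_downtime_seconds(availability_percent):
    """Yearly downtime budget, in seconds, for a given uptime percentage."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


for percent in (99.99, 99.999, 99.9999):
    seconds = allowed_downtime_seconds(percent)
    print(f"{percent}% uptime allows about {seconds:.1f} seconds "
          f"({seconds / 60:.1f} minutes) of downtime per year")
```

This works out to about 52.6 minutes per year at 99.99%, about 5.3 minutes at 99.999%, and about 31.5 seconds at 99.9999%.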

Increased automation is the only way to achieve optimal SLA. To do this, service providers are increasingly turning to AIOps solutions.

Operations teams are in charge of the IT strategies that align with business objectives. For service providers, a key part of this is, of course, ensuring maximum availability and optimal performance of the customer-facing environment. Understanding when and where faults appear, and improving resilience against them, is the best way to ensure this. That is where AIOps truly adds value.

— Prashant Jain

Filed Under: Blogs, DevOps Culture, DevOps Practice, Enterprise DevOps Tagged With: AIOps, big data, IT operations, machine learning, performance monitoring, service level agreement, SLA

Powered by Techstrong Group, Inc.

© 2023 · Techstrong Group, Inc. All rights reserved.