Staffing levels within IT operations (ITOps) departments are flat or declining, enterprise IT environments get more complex by the day and the transition to the cloud is accelerating. Meanwhile, the volume of data generated by monitoring and alerting systems is skyrocketing, and Ops teams are under pressure to respond to incidents more quickly.
Faced with these challenges, companies are increasingly turning to AIOps – the use of machine learning and artificial intelligence to analyze large volumes of IT operations data – to help automate and optimize IT operations. But before investing in a new technology, leaders want assurances that it will bring value to end users, customers and the business at large.
Leaders looking to measure the benefits of AIOps and build KPIs for both IT and business audiences should focus on key factors such as uptime, incident response and remediation times, and predictive maintenance that prevents outages affecting employees and customers.
Business-facing KPIs for AIOps include employee productivity, customer satisfaction and web site metrics such as conversion rate or lead generation. AIOps can help companies cut IT operations costs through automation and rapid analysis, and it can support revenue growth by enabling business processes to run smoothly and deliver excellent user experiences.
Specific IT and Business Benefits
AIOps can digest operational data and spit out actionable recommendations for keeping important systems running at peak efficiency. These are the most-cited benefits of the technology:
- Alert management: In most cases, the first problem AIOps systems will address is reducing the volume of noise: the torrent of alerts that inundate IT operations. AIOps uses clustering and pattern matching algorithms to eliminate as much as 90% of false alarms and other redundant or irrelevant alerts, making it far easier for staffers to focus on what really matters.
- Incident prioritization and routing: AIOps systems can learn, over time, which types of alerts should be sent to which teams, reducing redundancy and confusion when, say, networking and database teams both get the same alert related to an incident.
- Event correlation: AIOps can correlate alerts and event data to identify the root cause of an outage or application slowdown so that IT teams can respond faster.
- Advanced anomaly detection: AIOps systems can proactively detect abnormal conditions and relate them to business impact. For example, AIOps can predict whether a system will run out of disk space based on projected growth or seasonal patterns, even if the growth is non-linear. Or, if there’s a sudden increase in the number of failed server requests, the technology can determine whether the server in question is handling a mission-critical task or simply performing routine backups.
- Automation: AIOps can be used to handle routine tasks such as backups, server restarts and other low-risk maintenance activities which otherwise involve heavy manual effort.
- Predictive analytics: A more advanced use case could involve predicting events before they happen – such as detecting when network bandwidth is reaching its limit or storage capacity is nearing threshold.
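To make the anomaly detection idea more concrete, here is a minimal sketch of a dynamic threshold applied to a count of failed server requests. Each new value is compared against a rolling baseline (the mean of the preceding samples plus a multiple of their standard deviation) rather than a fixed limit. The function name, the window size and the multiplier `k` are assumptions chosen for illustration, not part of any particular AIOps product, and real tools may use more sophisticated models that account for seasonality.

```python
from statistics import mean, stdev


def dynamic_threshold_anomalies(values, window=5, k=3.0):
    """Return the indices of samples that exceed a rolling dynamic threshold.

    The threshold for each sample is mean + k * stdev of the `window`
    samples immediately before it. `window` and `k` are illustrative
    defaults (assumptions), not recommended settings.
    """
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        threshold = mean(baseline) + k * stdev(baseline)
        if values[i] > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical failed-request counts per minute; index 7 is a sudden spike.
failed_requests = [4, 5, 3, 6, 5, 4, 5, 40, 5, 4]
print(dynamic_threshold_anomalies(failed_requests))  # [7]
```

Because the baseline moves with the data, the same code adapts to a system whose normal load is 5 requests per minute or 5,000. Deciding whether a flagged spike matters to the business still requires context about what the affected server does.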
Seven KPIs for AIOps
These common KPIs can measure the impact of AIOps on business processes:
- Mean time to detect (MTTD): This KPI refers to how long it takes for an issue to be identified. AIOps can help companies drive down MTTD through the use of machine learning to detect patterns, block out noise and identify outages. Amid an avalanche of alerts, ITOps can better understand the importance and scope of an issue, which leads to faster identification of an incident, reduced downtime and better performance of business processes.
- Mean time to acknowledge (MTTA): Once an issue has been detected, IT teams need to acknowledge the issue and determine who will address it. AIOps can use machine learning to automate that decision-making process and quickly make sure that the right teams are working on the problem.
- Mean time to restore/resolve (MTTR): When a key business process or application goes down, speedy restoration of service is key. AIOps plays an important role here by using machine learning to understand whether the issue has been seen previously and, based on past experience, to recommend the most effective way to get the service back up and running.
- Service availability: This KPI is often expressed as a percentage of uptime over a period of time, or as outage minutes per period. AIOps can help boost service availability through the application of predictive maintenance.
- Percentage of automated versus manual resolution: Increasingly, organizations are leveraging intelligent automation to resolve issues without manual intervention. Machine learning systems can be trained to identify patterns, apply solutions, such as previous scripts that were executed to remedy a problem, and take the place of a human operator.
- User-reported versus monitoring-detected: IT operations should be able to detect and remediate a problem before the end user is even aware of it. For example, if application performance or web site performance is slowing down, ITOps should get an alert when the issue can be measured in milliseconds; they can then fix the issue before the slowdown worsens and affects users. AIOps enables the use of dynamic thresholds to ensure that alerts are generated automatically and routed to the correct team for investigation, or automatically remediated when policies dictate.
- Time savings and associated cost savings: The use of AIOps, whether it’s to perform automation or more quickly identify and resolve issues, will result in both operator time savings and business time-to-value improvements. These have a direct impact on the bottom line.
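Several of the KPIs above reduce to simple arithmetic over incident timestamps. The sketch below shows one way to compute them, assuming a hypothetical `Incident` record with start, detection, acknowledgment and resolution times. The field names and the `compute_kpis` helper are illustrative, not taken from any specific tool. Definitions also vary between organizations: here MTTR is measured from the start of the incident, and availability assumes that every incident is a full outage and that incidents do not overlap.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:  # hypothetical record layout, for illustration only
    started: datetime       # when the fault began
    detected: datetime      # when monitoring or a user flagged it
    acknowledged: datetime  # when a team took ownership
    resolved: datetime      # when service was restored
    auto_resolved: bool     # True if fixed without manual intervention
    user_reported: bool     # True if a user noticed before monitoring did


def _mean_minutes(deltas):
    deltas = list(deltas)
    return sum(d.total_seconds() for d in deltas) / 60 / len(deltas)


def compute_kpis(incidents, period):
    """Compute KPI values for a non-empty list of incidents over `period`."""
    n = len(incidents)
    downtime = sum((i.resolved - i.started for i in incidents), timedelta())
    return {
        "mttd_min": _mean_minutes(i.detected - i.started for i in incidents),
        "mtta_min": _mean_minutes(i.acknowledged - i.detected for i in incidents),
        "mttr_min": _mean_minutes(i.resolved - i.started for i in incidents),
        "availability_pct": 100 * (1 - downtime / period),
        "automated_pct": 100 * sum(i.auto_resolved for i in incidents) / n,
        "user_reported_pct": 100 * sum(i.user_reported for i in incidents) / n,
    }


# Two made-up incidents in a one-week reporting period.
t0 = datetime(2024, 1, 1)
m = timedelta(minutes=1)
sample = [
    Incident(t0, t0 + 4 * m, t0 + 6 * m, t0 + 30 * m, False, True),
    Incident(t0 + timedelta(days=1), t0 + timedelta(days=1) + 2 * m,
             t0 + timedelta(days=1) + 4 * m, t0 + timedelta(days=1) + 10 * m,
             True, False),
]
kpis = compute_kpis(sample, period=timedelta(days=7))
print(kpis)
```

With these sample values, MTTD is 3 minutes, MTTA is 2 minutes and MTTR is 20 minutes, and half of the incidents were resolved automatically. Tracking the same figures before and after an AIOps rollout gives a baseline for the time and cost savings described above.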
These seven AIOps KPIs can be correlated to business KPIs around user experience, application performance, customer satisfaction, improved e-commerce sales, employee productivity and increased revenue. ITOps teams need the ability to quickly connect the dots between infrastructure and business metrics so that IT is prioritizing spend and effort on real business needs. In the future, as machine learning matures, AIOps tools may even recommend ways to improve business outcomes or provide insights as to why digital programs did or did not succeed.