Incident management for important applications carries many connotations, most of them negative and most well-deserved: stress, panic, urgency, drop-everything, all-hands-on-deck emergencies. Typically, there is a critical problem, ranging from complete failure to the loss of certain functionality or performance. The problem or outage potentially means lost revenue, lost customers, negative social media commentary, bad reviews and more. Time is of the essence, so the primary objective is fixing the incident as quickly as possible. Often, this means that symptoms or partial causes are addressed rather than the underlying root cause.
Without fully addressing the underlying issue, companies virtually guarantee that the same problem – or a similar one – will recur. Failing to identify the root cause often prevents a durable fix. In addition, companies lose the opportunity to proactively improve application code or infrastructure based on real-world experiences and issues. Postmortems may result only in reviews of monitoring and observability solutions and the inevitable updates to alert rules. Most DevOps professionals not only understand these frustrations but live through them on an ongoing basis. Management, meanwhile, wonders why its systems are so unstable.
Changing the model for incident management has been limited by overriding urgency combined with short-staffed, overworked teams. Although AI and machine learning have been positioned as the panacea for nearly every kind of technical ill, this is a clear case where “machines” could fundamentally enhance human efforts to improve a situation. The best troubleshooters exhibit a combination of instinct, experience and patience, carefully sifting through reams of data to spot unusual events and their correlation with bad outcomes. That turns out to be a perfect application for machine learning.
By ingesting all logs and metrics, machine learning can quickly and comprehensively sift through the data to find meaningful problems or anomalies and gain a thorough understanding of the real problem. This approach significantly speeds up root cause analysis. Such an applied solution introduces three game-changers for incident management.
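To make the kind of sifting involved concrete, the sketch below flags metric samples that deviate sharply from their recent baseline. It is a minimal, hypothetical illustration — a simple rolling z-score test on a latency series — not a description of any particular product's algorithm; real systems combine many such detectors across logs and metrics.

```python
from statistics import mean, stdev

def detect_anomalies(samples, window=20, threshold=3.0):
    """Flag points that deviate sharply from the trailing window (z-score test)."""
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Hypothetical data: steady latency with one spike at index 30.
latency_ms = [100 + (i % 5) for i in range(60)]
latency_ms[30] = 400
print(detect_anomalies(latency_ms))  # only the spike is flagged
```

The same idea scales from one metric to thousands: the machine does the exhaustive comparison that a human troubleshooter performs by instinct on a handful of dashboards.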
Three Game-Changers for Incident Management
First, incident remediation times can be reduced dramatically, and with less of an all-hands-on-deck, drop-everything frenzy. Rather than SREs and engineers spending hours digging through dashboards, traces and millions of log lines, an automated approach can do the same work more quickly, more thoroughly and with less impact on human professionals, allowing them to focus on other responsibilities. This relieves stress and pressure because incidents can be remedied better and faster, restoring the application to its expected service level. The value is, of course, two-fold: faster time to resolution and greater efficiency and productivity from the teams who would otherwise be involved. Ending or mitigating the frustration these professionals deal with is an additional bonus.
Second, beyond faster MTTR, an automated, machine-learning-driven approach is more comprehensive and gets to the root problem, which means future incidents may be avoided altogether. It also means a potential end to the alert-rule “hamster wheel”: the knee-jerk tendency to continually adjust alert rules and settings in pursuit of a better “early warning system” becomes far less necessary. Instead of trying to gain an early indication of a similar failure in the future, the cause of such failures can be found and addressed proactively. Eliminating problems in the first place is far better than evolving alerts to catch unresolved causes. Alerts are still important, but machine learning can shift their basis to root-level issues, and they can be constructed automatically by the system.
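To illustrate the shift away from hand-tuned alert rules, here is a minimal sketch of how an alert threshold could be derived from observed behavior rather than picked by hand. The interquartile-range rule and the sample data are illustrative assumptions; a production system would learn baselines per metric and refresh them continuously.

```python
from statistics import quantiles

def derive_alert_threshold(history, k=1.5):
    """Set the alert line from the data itself: flag values beyond the
    third quartile plus k times the interquartile range (IQR rule)."""
    q1, _, q3 = quantiles(history, n=4)
    return q3 + k * (q3 - q1)

# Hypothetical per-minute error counts from a healthy period of traffic.
error_counts = [2, 3, 1, 4, 2, 3, 5, 2, 4, 3, 2, 6, 3, 4, 2]
threshold = derive_alert_threshold(error_counts)
print(threshold)
```

An alert fires only when a new reading exceeds the learned threshold, so the rule adapts as the system's normal behavior changes instead of requiring constant manual tuning.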
Third, automated, machine-learning-based incident management can proactively catch silent bugs that are not yet part of an incident, and it can inform developers earlier in the cycle. Not long ago, new releases were tested extensively before deploying to production: teams created specific test plans, conducted stress tests and caught bugs that might have had terrible consequences later. Now, with faster, non-stop development cycles, much testing effectively occurs in production. Machine learning can surface new or unusual errors, event patterns and shifts in metrics, exposing subtle bugs early, before they impact users or cause widespread failure.
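One simple way to surface “new or unusual errors” is to reduce each log line to a pattern and flag patterns that never appeared during a baseline period. The sketch below, using hypothetical log lines, masks variable fields such as numbers and hex identifiers; real log-template mining is considerably more sophisticated, but the principle is the same.

```python
import re

def template(line):
    """Reduce a log line to its pattern by masking variable fields
    (decimal numbers and hex identifiers)."""
    return re.sub(r"0x[0-9a-f]+|\d+", "<*>", line.lower())

def novel_lines(baseline_logs, new_logs):
    """Return lines from new_logs whose pattern never appeared in the baseline."""
    known = {template(line) for line in baseline_logs}
    return [line for line in new_logs if template(line) not in known]

baseline = [
    "GET /api/users/42 200 in 13ms",
    "GET /api/users/17 200 in 9ms",
]
fresh = [
    "GET /api/users/7 200 in 11ms",   # same pattern as baseline: not novel
    "deadlock detected on worker 3",  # never-seen pattern: surfaced
]
print(novel_lines(baseline, fresh))
```

A line like the deadlock message stands out immediately even though no alert rule was ever written for it — exactly the kind of silent bug that would otherwise surface only as a future incident.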
Changing the DevOps Life Cycle
Each of these three results significantly changes not only the incident management life cycle, but DevOps as a whole. Poring over logs and metrics is a perfect application for machine learning and offers the opportunity to fundamentally change incident management, transforming it from a scourge and necessary burden into a competency. An automated system of machine learning applied to incident management advances DevOps rather than encumbering it.