The new world of IT requires a different approach from what worked in the past. As usual, the technology is the easy part, but without people and processes being aligned, it won’t deliver value. When evaluating how data science and machine learning can help manage highly dynamic IT environments, old reflexes may be actively harmful.
By their nature and due to their experiences, Ops people tend to be a pretty (small-C) conservative bunch. Rule No. 1 of Ops is “If it ain’t broke, DON’T TOUCH IT!” and if we’re honest with ourselves, most of the rest of the rulebook is just variations and commentaries on that theme.
One of the ways in which Ops people try to control their world is through automation. DevOps mostly looks to the moment of release and deployment, as that is where the handover to Ops typically occurs—and this is a big focus of the drive to automate. However, there is another part of Ops where automation is key, an area in which Dev is often far less involved: day-to-day operations. Which alerts should be sent at all, whom they should be routed to, what the valid responses are, and so on—all these have been automated over the years, to a greater or lesser extent.
Ops people have laboriously documented their architecture and spent a long time in meetings planning which information is relevant to share and who should respond to alarms. Much of this is implemented in various pieces of software, whether commercial, open-source, home-grown or that special grey area, where something that started out as a standard component has been so customized that it is effectively bespoke. Rules, filters and thresholds determine what action to take in which case.
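In code, that traditional setup amounts to hand-written filters, thresholds, and routing tables. Here is a minimal sketch of that deterministic style; every field name (`severity`, `metric`, `source`), threshold value, and routing target is hypothetical, invented purely for illustration:

```python
# Deterministic alert handling: fixed filters, thresholds, and a static
# routing table decide what happens to each alert. Everything here is
# hypothetical example data, not any particular product's schema.

THRESHOLDS = {"cpu_pct": 90, "disk_pct": 85}  # hand-tuned limits

def route_alert(alert):
    """Return a routing decision for a single alert dict."""
    # Filter: drop anything below WARNING outright.
    if alert.get("severity") not in ("WARNING", "CRITICAL"):
        return "discard"
    # Threshold rule: page on-call only when a metric crosses its limit.
    metric = alert.get("metric")
    if metric in THRESHOLDS and alert.get("value", 0) >= THRESHOLDS[metric]:
        return "page-oncall"
    # Static routing table: everything else goes to the owning team's queue.
    routing = {"db01": "dba-queue", "web01": "webops-queue"}
    return routing.get(alert.get("source"), "default-queue")

print(route_alert({"severity": "CRITICAL", "metric": "cpu_pct",
                   "value": 97, "source": "db01"}))  # page-oncall
```

Every line of this is knowledge someone had to encode by hand, and every architecture change means revisiting it—which is exactly the maintenance burden that stops scaling.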
These approaches have been in place for years or even decades, but now they are starting to fail. Simply put, IT’s increased complexity and the ever-accelerating rate of change are outpacing administrators’ ability to keep up and reconfigure their management systems.
The Coming ‘New Normal’
New approaches are emerging to deal with the new normal, but as ever, the technology is the easy part. For new technology to be a success, people and processes need to be in sync.
Right now, the emerging approach is to use data science and machine learning to process events, instead of the old deterministic rulesets and databases. The results speak for themselves, especially at scale and in highly dynamic environments. People building container-based or SDN architectures no longer assume that there will be a database of configurations that will always be up to date; quite the opposite.
The new way is to throw out all the filters and rules, feed all the alerts into a single place, and then use data science and machine learning techniques to sift them and identify relationships among them. The result is a shift in focus from individual alerts to the business problems those alerts are symptoms of.
Algorithms are great at these sorts of repetitive, high-volume tasks. They can look at massive volumes of events and figure out which are even relevant and worth taking a second look at, and then identify how those events relate to each other and what they really mean. This is where the humans come in, with their strengths in low-volume and unpredictable situations. Because they are not constantly drowning in irrelevant noise, they can work together effectively to understand the actual problem and get it fixed quickly.
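To make the contrast concrete, here is a deliberately simplified sketch of that correlation idea. Real products use far more sophisticated learning; here, grouping by time proximity plus message similarity stands in for it, so that a flood of raw alerts collapses into a handful of candidate "situations" a human can reason about. All alert data and parameter values are invented for illustration:

```python
# Sketch of algorithmic alert correlation: instead of per-alert rules, pour
# the raw stream into one place and cluster alerts that are close in time
# and textually similar. This is a toy stand-in for real ML techniques.
from difflib import SequenceMatcher

def correlate(alerts, window=60, similarity=0.6):
    """Group alerts that occur within `window` seconds of the previous
    cluster member and whose messages are at least `similarity` alike."""
    clusters = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        for cluster in clusters:
            last = cluster[-1]
            close_in_time = alert["ts"] - last["ts"] <= window
            alike = SequenceMatcher(
                None, alert["msg"], last["msg"]).ratio() >= similarity
            if close_in_time and alike:
                cluster.append(alert)
                break
        else:
            clusters.append([alert])  # no match: start a new situation
    return clusters

alerts = [
    {"ts": 0,   "msg": "db01 connection pool exhausted"},
    {"ts": 5,   "msg": "db02 connection pool exhausted"},
    {"ts": 9,   "msg": "checkout latency above SLO"},
    {"ts": 500, "msg": "backup job finished"},
]
for cluster in correlate(alerts):
    print([a["msg"] for a in cluster])
```

The two database alerts collapse into one situation while the unrelated events stay separate—and nobody had to write a rule anticipating that specific failure in advance.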
Where’s the Catch?
The difficulty is that for many Ops people, there is a big cultural change required to adopt this model. The old deterministic systems may have required laborious maintenance, but at least it was possible to model them and understand why a specific result was achieved. New algorithmic techniques take a radically different approach, and while the results are impressive, they come from a black box. This does not inspire confidence for people who are very comfortable with the old approaches.
Often, the response of Ops people coming from legacy tools is to get stuck into the minutiae of what is going on inside the black box and to look for traditional failure modes. The problem is that these approaches are no longer valid in the new world, as I discussed in my last post about getting to SRE.
For instance, while it is indeed important to make sure that no issues are missed due to over-aggressive filtering, this does not mean that we need to see every single alert. As long as we know about the actual business problem, we don’t care about the individual alerts, except insofar as they help us debug the underlying issue and resolve it.
The other direction is even more insidious, where people spend a lot of time chasing down “extraneous” events that they think they should not be seeing. The risk is, of course, that the Ops team ends up right back in the same place they were trying to escape: able to identify only expected problems, while getting blindsided on a regular basis by the “unknown unknowns,” the conditions that simply should not happen.
Ultimately, Ops needs to let go of reflexes that were good in the old world, and focus on the goal: ensuring the uptime and stability of business-supporting systems. The how is less relevant than the what and the why; the algorithmic black box might as well be full of magic leprechauns, as long as they are telling Ops what they need to know, when they need to know it. Call it the “Chinese Room” theory of Ops: as long as the results are useful to Ops, it does not matter what is inside the room.
The benefit for Ops is precisely that they no longer need to focus on the minutiae of individual alerts, but can concentrate on solving actual business problems. That is how IT can make itself a value generator and avoid being viewed as another undifferentiated cost center.
Pay no attention to the algorithm behind the curtain.