The DevOps movement has often been accused of focusing too much on the first half (Development) and not enough on the second half (Operations). Certainly more attention has been paid to deploying payloads than to operating running systems, leading to the handover between Dev and Ops being dismissed as "throwing it over the wall."
Lately, we have seen the emergence of a new focus on the stability and reliability of the systems that are the targets of those deployments, with the creation of the new practice of site reliability engineering (SRE). This has been a welcome addition to the IT toolbox, but it can still seem to put all the onus on Ops teams to catch whatever comes over the wall.
The Problem with Site Reliability Engineering = IT Infrastructure
In SRE, the IT infrastructure is expected to be highly automated and self-healing in the face of events. Here is the problem: That approach works well for foreseeable events, but less well for unforeseen ones. For that reason, it works best in environments that run only a few types of workloads, but run them at very large scale. If you're thinking that sounds like Google and Facebook, you'd be right.
What this amounts to is massive engineering of the infrastructure to withstand known or foreseeable problems. However, typical enterprise IT environments are not like that. A large bank might have thousands of applications, each of which must accommodate changes at a pace dictated from outside the IT environment. Applying assumptions from one environment to the other is asking for trouble.
To illustrate this, let's look at an example from outside IT. The anniversary of the sinking of the Titanic, on April 15, 1912, passed recently. The RMS Titanic itself embodied the principles of SRE: it was engineered to prevent or survive all manner of emergencies, and equipped with state-of-the-art everything to transport its passengers in comfort and safety across the Atlantic.
As we all know, that plan did not quite work out. A combination of unexpected changes in the environment (more icebergs than normal), business imperatives to maintain speed, and mishandled warnings about ice from other ships led the Titanic to disaster.
While the consequences of an IT failure are rarely quite as dramatic as the sinking of an ocean liner, many of the same factors apply. IT administrators may believe that in the worst case their backups and disaster recovery plans will be sufficient to handle any problems, but these plans are designed around known and foreseeable problems. Unexpected circumstances can easily lead to cascading failures, which is why it is critical to be ready for the unexpected when it inevitably occurs.
The Problem with Traditional IT Ops = Static Models
The key flaw in traditional IT Ops is its reliance on static models of the world. These models come in many shapes: the CMDB is the "model" that many think of first, but static rulesets are also a model. The most dangerous models, though, are the invisible ones: the filters that are put in place to determine which alerts and events are even worth considering.
This is the flaw that ultimately sank the Titanic: The crew were unable to correctly prioritize events and react to them in time to avoid failure. It is also the flaw that causes countless far more minor issues every day in data centers everywhere: the filesystem that filled up with logs, taking down the database that the critical business application relied on, because the filter had not forwarded that alert since it usually was harmless. Or the loss of one leg of a redundant network link, which was ignored because the other leg was still up, so there was still a link, until that one failed too and an entire site went dark.
The key factor is that the filters were not wrong to suppress those alerts. By definition, informational or warning alerts are not the same as major or critical ones, and most of the time they can be safely ignored, to be dealt with later.
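As a concrete illustration, here is a minimal sketch of the kind of static filter described above (hypothetical names and thresholds, not any particular monitoring product): each alert is judged in isolation against a fixed severity rule, so a stream of individually harmless warnings never reaches an operator, no matter how many of them arrive.

```python
# Minimal sketch of a static, severity-threshold alert filter (hypothetical
# example, not any specific monitoring product). Each alert is evaluated in
# isolation against a fixed rule, so repeated low-severity warnings -- such as
# "disk nearly full" messages from a filling log volume -- are silently dropped.

SEVERITY_ORDER = ["info", "warning", "major", "critical"]

def passes_static_filter(alert: dict, threshold: str = "major") -> bool:
    """Forward the alert only if its severity meets the fixed threshold."""
    return SEVERITY_ORDER.index(alert["severity"]) >= SEVERITY_ORDER.index(threshold)

alerts = [
    {"host": "db01", "severity": "warning", "message": "/var/log 82% full"},
    {"host": "db01", "severity": "warning", "message": "/var/log 91% full"},
    {"host": "db01", "severity": "warning", "message": "/var/log 97% full"},
]

forwarded = [a for a in alerts if passes_static_filter(a)]
print(forwarded)  # [] -- the developing disk-full problem never reaches Ops
```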
Every now and then, though, a pattern of those alerts, if understood correctly, can identify a future problem in the making. These developing issues could be nipped in the bud, if only Ops teams had enough hours in the day to review them. But, of course, they don't: every Ops team I have ever met is drowning in issues, and behind those there is a long and lengthening to-do list.
What Can DevOps Do to Make Operations Better?
New approaches are emerging to make the principles behind SRE more widely accessible and applicable. In particular, more dynamic noise reduction and correlation is now possible, to sift the important alerts from the constant background noise and put them together into a picture of what is really happening. The key is to be able to do this in real time, without a human having to laboriously plan out every possible scenario they might need to know about in the future.
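One way to picture that shift, as a rough sketch rather than any particular product's algorithm: instead of judging each alert against a fixed threshold, group related alerts that arrive close together in time and escalate the cluster when a pattern emerges. The window size and threshold below are illustrative assumptions.

```python
# Rough sketch of dynamic correlation (illustrative only): alerts are grouped
# per host within a sliding time window, and a cluster of individually
# "harmless" warnings is escalated as a single developing incident.

from collections import defaultdict

WINDOW_SECONDS = 600      # group alerts that arrive within 10 minutes of each other
CLUSTER_THRESHOLD = 3     # escalate once this many related alerts accumulate

def correlate(alerts):
    """Group alerts per host into time-windowed clusters and flag dense ones."""
    clusters = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a["time"]):
        current = clusters[alert["host"]]
        if current and alert["time"] - current[-1]["time"] > WINDOW_SECONDS:
            current.clear()            # window expired; start a new cluster
        current.append(alert)
    return {host: group for host, group in clusters.items()
            if len(group) >= CLUSTER_THRESHOLD}

alerts = [
    {"host": "db01", "time": 0,   "message": "/var/log 82% full"},
    {"host": "db01", "time": 120, "message": "/var/log 91% full"},
    {"host": "db01", "time": 300, "message": "/var/log 97% full"},
]

for host, group in correlate(alerts).items():
    print(f"Escalate: {len(group)} related warnings on {host}")
```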
Gartner has called this new approach Algorithmic IT Operations, or AIOps. The idea is to bring together all possible sources of events, whether those are alerts from the compute or network infrastructure, transaction slowdowns reported by an APM tool, automated deployments being run from a CI/CD toolchain, or anything else that might conceivably be relevant. All of this information can then be sifted by algorithms to determine what is actually important to Ops, and brought to the attention of the right specialists, who can work on the issues and get them resolved fast. Part of that process is also integration with systems of record (which generally means IT service management), and with automation and orchestration tools that can accelerate remediation activities.
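Sketched very loosely (hypothetical field names, not Gartner's definition or any vendor's schema), the ingestion side of such a pipeline amounts to normalizing heterogeneous events into one shape, so that a single set of algorithms, and the downstream ITSM or automation integrations, can operate on all of them.

```python
# Loose sketch of the ingestion step of an AIOps-style pipeline (hypothetical
# schema): events from different tools are normalized into one record type so
# the same correlation logic and ITSM/automation hooks can act on all of them.

from dataclasses import dataclass

@dataclass
class Event:
    source: str       # e.g. "network", "apm", "ci_cd"
    entity: str       # host, service, or pipeline the event refers to
    severity: str     # normalized severity scale
    description: str
    timestamp: float

def from_apm(raw: dict) -> Event:
    """Map a (hypothetical) APM transaction-slowdown payload onto Event."""
    return Event("apm", raw["service"], "warning",
                 f"latency {raw['p95_ms']} ms", raw["ts"])

def from_ci_cd(raw: dict) -> Event:
    """Map a (hypothetical) deployment notification onto Event."""
    return Event("ci_cd", raw["app"], "info",
                 f"deployed {raw['version']}", raw["ts"])

events = [
    from_apm({"service": "payments", "p95_ms": 2400, "ts": 1000.0}),
    from_ci_cd({"app": "payments", "version": "v2.3.1", "ts": 980.0}),
]

# A correlation step could now notice that the slowdown follows a deployment
# on the same entity and route a single enriched incident to the right team.
print(sorted(events, key=lambda e: e.timestamp))
```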
This is how we can get the Ops side of DevOps to where it needs to be to accommodate the ever-accelerating pace of change, whether that comes from Dev teams wanting to run ever more frequent deployments, from business users needing help with their own goals, or from the next unexpected issue to come down the line.