Or, why IT systems keep breaking
The tension between the two halves of DevOps is long-standing, going back long before the term was even invented. What underlies the split between Development and Operations is a fundamental difference, like an actual physical law. Let’s explore that analogy together.
IT Ops is fundamentally a constant fight against entropy, meaning disorder. Sysadmins expend significant energy trying to keep everything up and running, in a process which can sometimes feel like trying to shore up a sandcastle while the tide is coming in. Systems are constantly breaking down; users need changes made that introduce risk; and, as if that wasn’t enough, the whole shebang is constantly under attack from malicious outsiders. It’s no wonder that sysadmins have a reputation for being a bit grumpy!
In contrast, Dev actually increases the level of entropy. Every new release, even one as small as a one-line fix for an errant semicolon, is a change that reduces the order Ops fights so hard to maintain. Dev is entropogenic, embracing change as a positive force.
The interaction between Dev and Ops is analogous to the second law of thermodynamics: the entropy of an isolated system never decreases over time, and order can be restored only by introducing energy from outside. In other words, unmanaged change will eventually cause something to break, unless people put in continuous effort to prevent those breaks.
Thermodynamics: Lots of Heat, Not Much Light
These conflicting goals, taken to extremes, can get unhealthy, with Ops actively fighting change or Dev going around Ops to increase their velocity of change. Left unchecked, this tension can turn into a vicious cycle, as frustration with the “Department of No” drives developers into unmanaged “shadow IT.”
The good news is that the analogy with hard laws of physics is not perfect. In IT, rising entropy is not irreversible, and holding it back does not have to demand constant inputs of energy. New agile approaches to IT Ops are emerging that are light and flexible, taking their structure from that very tension. In the physical world, this is called “tensegrity,” and was originally formulated by Buckminster Fuller, one of those annoying polymaths who seem to make breakthroughs in every field they touch.
IT Ops needs its own equivalent to tensegrity, a flexible, holistic structure that is able to embrace change while maintaining its shape. Structures that try to resist change through rigidity are brittle and ultimately fragile.
Today, this flexible strength is enabled by AIOps, a term coined by Gartner to describe a new way of operating IT, based on augmenting human skills with automated analysis enabled by modern algorithmic and machine-learning approaches.
Complexity Breeds Complexity
Modern IT is too complex for any single person or even team to keep track of, instead requiring multiple specializations. In turn, each of these functions requires its own specific support tools that enable that particular role. Beyond a certain point, it becomes very difficult to combine these partial views to gain a holistic understanding of what is actually happening in the environment.
Because of the scale of modern IT environments and the sheer rate of change that businesses require, old-style rules-based “Managers of Managers” can no longer keep up, and they take more effort to build and maintain than they are worth. Especially as infrastructure becomes more and more self-modifying in response to changing conditions and demand, IT Operations needs to accelerate just to keep pace.
AIOps brings developing issues to the attention of sysadmins before they become visible to customers and end users, and gives them the tools to collaborate effectively between different teams and specialist roles to avoid outages or restore service quickly and efficiently. In this world, automation is used to augment the capabilities of the human operators to make them more effective, letting people concentrate on adding value, not just fighting fires.
This is how Ops can get out of the business of fighting Dev, and instead become a supportive partner. AIOps can catch issues that are affecting services and flag them to the right people early on, including the DevOps engineers that are responsible for production support.
For developers, one of the fears around DevOps is that in “picking up the pager,” they are opening themselves up to a constant stream of irrelevant alerts. Old Ops hands nod wryly at this, or maybe try to share their personal approach to filtering the noise. AIOps offers a better answer: it ensures that only valid, actionable alerts get as far as being shown to a human being and taking up their valuable time.
Eyes on the Prize
It’s not all roses. As the AIOps market begins to mature, there is a certain amount of “algorithm-washing” going on, with more and more vendors trying to claim that either their legacy tool was AIOps all along, or that their one-trick pony is actually a whole three-ring circus. What gets lost in the excitement is what all of those algorithms are actually supposed to achieve.
In considering a shift toward AIOps, the goal should not be simply to build a better mousetrap, meaning a smarter filter for monitoring events. The success of the project is measured not so much by the mathematical results of the algorithms as by the smoother exchange between the human users. This easier collaboration is especially important across team boundaries, whether these be different specialist units within the Ops function or developers and software engineers who need to work with Ops to understand the performance of their code in production.
The benefit of transforming operations this way (moving from sequential, waterfall Ops to agile, collaborative Ops) is to turn DevOps-enabled teams into a sort of perpetual-motion machine, able to continue operating without requiring constant massive inputs of energy from outside the system. Modern applications and the complex infrastructures that support them cannot be managed with the sorts of rigid, cobbled-together, industrial approaches that were appropriate in the mainframe days. AIOps is the agile, flexible approach that can not only accommodate that complexity, but thrive on it.