The idea of “full-stack developers” is becoming a fairly common one, although it is still widely contested. But what about full-stack operations?
There are cycles in the IT industry, and if you wait long enough, you can see the same ideas swing in and out of favor. Business applications used to be fairly monolithic, and so it was expected that their programmers would have a good understanding not just of the application’s code and business logic itself, but of its underpinnings as well.
Over time, different parts of those underpinnings were separated, breaking out data silos in a process of modularization. With the increasing complexity of architectures, the expectation became that specialists would be in charge of each module and simply ensure that standard interfaces were available to teams whose modules needed to interact with each other.
While modularization at the architecture level is here to stay, there is now some pushback against extending it too aggressively to the human developers in charge of building the system. This takes the form of a quest for “full-stack developers,” who are able to understand all the nuances of the entire application stack.
There is a similar need on the IT operations side of the house. Different specialized support teams used to be able to focus on one particular area (server, network, database, middleware and so on) and use their own specific tools to analyze and understand the performance of one particular module of a complex enterprise architecture. The assumption was that if each module was up, then ipso facto, the entire stack was up, and therefore the application as a whole was also up.
As it turns out, it’s not that simple.
Let’s Do Some Root-Cause Analysis
These days, failures of business services are rarely caused by a catastrophic outage of one particular component. Hardware is cheap, resiliency is built in, and experience from web-scale environments has trickled down to become common best practice. What actually causes problems, the kind that earn you embarrassing news coverage or angry customers demanding refunds, is the unexpected: many seemingly minor and isolated issues that suddenly add up to a big problem.
When this sort of cascading failure occurs, modular operations teams and processes are not able to cope. Multiple tickets are opened for each affected module and routed to that module’s support queue. Each of those tickets contains only partial information, by definition, and may even describe only a symptom of a failure in another module. Because of this, precious time is wasted diagnosing, reassigning and escalating those tickets, until eventually a complete picture of the actual problem is laboriously assembled.
At this point, representatives of all of the various affected support teams assemble, either in person or on a conference call, and try to determine what to do to restore service as quickly as possible. All too often, these efforts are further delayed by unproductive blamestorming and finger-pointing, but eventually, the issue does get resolved.
Given the complexity of modern-day enterprise architectures, it is impractical for any single IT operations specialist to understand the entire stack in depth, including all the interactions between different modular components. Even if it were possible to do so for a snapshot in time, these days most environments are self-modifying to a greater or lesser degree, provisioning or decommissioning capacity in response to demand, balancing load across available resources and performing routine maintenance tasks. This sort of dynamism is impossible for humans to model accurately in real time.
How Is It Possible To Do Full-Stack Ops?
No single person can be a full-stack ops specialist, but that does not mean that the whole team cannot work to deliver full-stack support. The key is to break down the walls between the different siloed teams supporting each module, and make it easier to build an overall picture of, yes, the full stack that is supporting a particular business service.
This bringing-together must happen in real time, because that is the pace of business today. An outage that has a technical duration measured in a handful of minutes may cause hours-long business repercussions. Ops teams need tools that can relate signals from different modules into a comprehensive picture of what’s going on in the IT environment, proactively inform the right people, and give them the support they need to pool knowledge and work together effectively to resolve the outage and restore service to affected users.
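To make that concrete, here is a minimal sketch of what relating signals across silos can look like: alerts from different monitoring tools are grouped into a single candidate incident when they touch the same business service within a short time window. The Alert shape, the service tag and the five-minute window are illustrative assumptions, not any particular product's data model.

    # Correlating alerts from separate monitoring silos into candidate incidents.
    # The Alert fields and the five-minute window are assumptions for illustration.
    from dataclasses import dataclass
    from datetime import datetime, timedelta

    @dataclass
    class Alert:
        source: str       # which silo raised it: "network", "database", ...
        service: str      # the business service the component supports
        message: str
        timestamp: datetime

    WINDOW = timedelta(minutes=5)

    def correlate(alerts):
        """Group alerts that touch the same business service within a short
        time window, instead of leaving each silo to open its own ticket."""
        incidents = []  # each incident is a list of related alerts
        for alert in sorted(alerts, key=lambda a: a.timestamp):
            for incident in incidents:
                if (incident[-1].service == alert.service
                        and alert.timestamp - incident[-1].timestamp <= WINDOW):
                    incident.append(alert)
                    break
            else:
                incidents.append([alert])
        return incidents

Fed with alerts from the network, database and middleware monitors for the same service, a rule like this yields one incident to work on rather than three partial tickets routed to three different queues.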
These tools cannot be simple incremental improvements in the monitoring of one particular module, important as that is. The responsiveness that the organization as a whole expects from its IT operations requires that silos be broken down and information and expertise be pooled. Real-time collaboration between specialists in different domains is the only way to deliver truly full-stack operations.
The rapidly maturing discipline of AIOps aims to do just that. The beauty of this approach is that, unlike the rip-and-replace migrations of the past, where moving to the new thing meant tearing out the old, AIOps builds on what is already in place, joining together what were previously disconnected islands of knowledge and making information available when it is needed.
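Continuing the sketch above, joining those islands of knowledge can be as simple as enriching a correlated incident with context that already exists in other tools. The lookup tables below stand in for whatever CMDB, on-call schedule or runbook repository an organization already maintains; their names and contents are hypothetical.

    # Enriching a correlated incident with context from existing systems of
    # record. SERVICE_OWNERS and RUNBOOKS are hypothetical stand-ins for data
    # that already lives in other tools.
    SERVICE_OWNERS = {"payments": "payments-oncall@example.com"}
    RUNBOOKS = {"payments": "https://wiki.example.com/runbooks/payments"}

    def enrich(incident):
        """Attach ownership and runbook links to a correlated incident so the
        right people are notified along with the information they need."""
        service = incident[0].service
        return {
            "service": service,
            "sources": sorted({alert.source for alert in incident}),
            "owner": SERVICE_OWNERS.get(service),
            "runbook": RUNBOOKS.get(service),
            "alerts": incident,  # the original signals are preserved, not replaced
        }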
Take a look at some of the players building out these capabilities in the AIOps Market Survey from Gartner. Normally this document costs $1,295, but you can get it for free courtesy of Moogsoft at this link.