What can Ops learn from Dev in the transition from sequential, reactive IT support to real-time DevOps?
One of the signature ideas of DevOps is right there in the name: the blurring of the boundary between the historically separate worlds of Development and Operations.
On the face of it, this seems like a great idea. Developers have always grumbled about how Operations is preventing them from deploying as rapidly or as frequently as they would like. Meanwhile, Operations people have harbored a dark suspicion that developers might pay more attention to the quality of their code if they were the ones getting paged at 3 a.m.
Another of the great things about DevOps is that it is pretty easy to adopt in one area or project without needing to get the entire company on board before you can get started. This means you can quickly produce some interesting results: faster development and more frequent releases, quicker resolution of problems, and more time spent innovating instead of firefighting.
Armed with these results, DevOps evangelists spread through the organization, telling their story and recruiting other departments and teams to join them in the bright future.
This is Where the Problems Start
Complex enterprise IT environments are, well, complex. Different teams have adopted different tools and methodologies to do their jobs, and are justifiably resistant to throwing out something that works for them. Specialists in different areas need their own specialist tools to be able to do those jobs. Because of this, when an incident occurs, all of that agility and collaboration that DevOps brings goes out of the window in favor of something that looks much more like a classic “waterfall” approach (sketched in code after the list below):
- When an event is received, it is routed to one particular team
- The event is placed in a queue
- Eventually, someone from the relevant team takes a look at the event and makes a determination of whether it falls in their area or not
- If it’s in their area, they begin working to resolve the issue within their team
- If it’s not (believed to be) in their area, they reassign it, either back to the front-line team, or to whichever team they believe to be responsible (often, this is the network team)
- Repeat until either someone takes ownership and fixes the issue, or it becomes a Major Incident and everyone gets paged at 3 a.m.
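If you wrote that process down as code, it would look something like the deliberately naive Python sketch below. The team names, the `triage` stub and the ten-hop escalation limit are all made up for illustration; the point is that no step ever consults the team that actually owns the problem, so the event just ping-pongs until it escalates.

```python
from collections import deque

def triage(event: dict, team: str) -> bool:
    """Stub: does this team believe the event falls in their area?"""
    return event.get("owning_team") == team

def route(event: dict) -> None:
    # Events start with the front-line team and are reassigned hop by hop.
    queue = deque([("front-line", event)])
    hops = 0
    while queue:
        team, ev = queue.popleft()
        hops += 1
        if triage(ev, team):
            print(f"{team} takes ownership after {hops} hop(s)")
            return
        if hops >= 10:
            # In real life: it becomes a Major Incident and everyone gets paged.
            print(f"Escalated to Major Incident after {hops} hops")
            return
        # "Not ours" -- reassign, usually to the network team or back to front-line.
        next_team = "network" if team != "network" else "front-line"
        queue.append((next_team, ev))

# The owning team is never asked directly, so this never converges.
route({"source": "apm-alert", "owning_team": "database"})
```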
Just to complicate matters, there may be multiple instances of this process going on at any one time, as two or more different teams grapple with what eventually turn out to be different aspects of the same problem.
Why Does This Happen?
In health care, there is only so much that can be done by way of prevention. There is always a need to react to patients’ issues, starting from their symptoms. In the same way, the starting point for Operations is monitoring—gathering the symptoms of the IT infrastructure to be able to return it to health as quickly as possible.
To understand the problems with the Ops part of DevOps, we needed to start at the beginning, with those symptoms. We wanted to do this the scientific way, so we polled attendees of Velocity and Monitorama about their top monitoring challenges.
The interesting finding was that across both populations, the main challenge listed by attendees was “alert noise/fatigue/volume.” Simply put, operators were drowning in alerts, to the point that they had completely lost the ability to tell what was actually important, let alone to keep an overall picture of how those alerts related to the service that IT was supposed to deliver.
This is another parallel between IT and health care, where alert fatigue has long been a concern, and with potentially far more serious consequences than in IT.
A big part of the cause of alert fatigue on the IT side is the proliferation of specialist monitoring tools. In the poll, 66.67 percent of the people we surveyed had five to 10 monitoring tools in place, and yet 61.90 percent of them were still struggling with that problem of alert noise/fatigue/volume.
To be clear, those different tools are each very important. Trying to monitor the network with an APM tool is probably not going to go well for you. They only become a problem when it gets too difficult to build up an overall picture of what is going on and to share that picture between different people so problems can be solved faster.
So What Can Be Done?
Dedicated Operations organizations have long worked to assemble tools and techniques that can give them a complete view of the environment they are supporting, and of the services that run in it. The accelerated pace of release that comes with DevOps is causing the old models to break down, but fortunately, it also carries within it the seeds of a new approach to the problem.
Development has already moved from the sequential waterfall model to a more rapid, collaborative and agile way of working; operations teams can learn from that transition to build a similarly agile approach to IT Ops.
What would this look like? The first thing to do is to get rid of the blamestorming and buck-passing: “It’s the network!” “No, it’s the database!” I don’t know about you, but personally, I’ve been in enough of those meetings to last me for the rest of my life, thank you very much.
Let’s All Go Fishing in the Data Lake
Instead, a common perspective is needed, so that everyone can quickly agree on what is going on and what needs to be done.
This does not mean throwing out all of those specialist monitoring tools! They are there for a reason. Rather, gather all of those symptoms in one place where the whole team can collaborate to make sense of them more easily. These days, it should be easy to get information in and out of the various components of your toolchain (and if it’s not, that’s a glaring warning sign that maybe it’s time to migrate off that particular tool).
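What might “one place” look like in practice? Here is a minimal Python sketch of the idea: adapters that normalize alert payloads from two hypothetical monitoring tools into a single, shared event format. The tool names, payload fields and `Event` schema are all assumptions for illustration, not any real product’s API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# One common shape for every symptom, regardless of which tool produced it.
@dataclass
class Event:
    source: str        # which monitoring tool sent it
    host: str
    check: str         # what was being measured
    severity: str      # normalized: "critical" | "warning" | "info"
    description: str
    received_at: datetime

# Per-tool adapters; the input payloads are hypothetical, not real schemas.
def from_network_monitor(payload: dict) -> Event:
    return Event(
        source="network-monitor",
        host=payload["device"],
        check=payload["interface"],
        severity={"1": "critical", "2": "warning"}.get(payload["level"], "info"),
        description=payload["message"],
        received_at=datetime.now(timezone.utc),
    )

def from_apm_tool(payload: dict) -> Event:
    return Event(
        source="apm",
        host=payload["hostname"],
        check=payload["transaction"],
        severity=payload["alert_severity"].lower(),
        description=payload["title"],
        received_at=datetime.now(timezone.utc),
    )

# Everything lands in one shared collection -- the "one place" the whole
# team (and the correlation step shown later) can work from.
events: list[Event] = [
    from_network_monitor({"device": "edge-router-1", "interface": "eth0",
                          "level": "1", "message": "Interface flapping"}),
    from_apm_tool({"hostname": "web-03", "transaction": "/checkout",
                   "alert_severity": "Critical", "title": "Latency above threshold"}),
]
```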
This will, of course, produce a huge pile of data, so it’s not sufficient to stop here with the creation of a “data lake.” Things disappear into lakes and are never seen again—in which case, why did you bother gathering them in the first place?
What you need is a way to make sense of what is going on in the lake, and show that to the right human operators. In the same way that developers resent sitting around waiting for the Change Advisory Board to meet, operations people get frustrated and bored just staring at a screen waiting for something to happen. If instead you can use automatic techniques to assemble all the related symptoms of a single issue together in one place, that will give you a good starting point to resolve that underlying issue quickly.
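Continuing the sketch above, here is one deliberately simple way to “assemble related symptoms”: group events that share a host and arrive within a few minutes of each other. Real correlation engines use much richer signals (topology, service maps, text similarity), so treat this only as an illustration of the idea, reusing the `events` list from the previous sketch.

```python
from datetime import timedelta

def correlate(events, window=timedelta(minutes=5)):
    """Group events that share a host and arrive within `window` of each other."""
    incidents = []  # each incident is a list of related events
    for ev in sorted(events, key=lambda e: e.received_at):
        for incident in incidents:
            if any(ev.host == other.host and
                   abs(ev.received_at - other.received_at) <= window
                   for other in incident):
                incident.append(ev)
                break
        else:
            incidents.append([ev])
    return incidents

# Each incident is now one starting point: a bundle of related symptoms
# rather than a pile of individual alerts for a human to stare at.
for i, incident in enumerate(correlate(events), start=1):
    print(f"Incident {i} on {incident[0].host}: {[e.check for e in incident]}")
```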
Brainstorming, not Blamestorming
This is where the collaboration comes in. Again, people are doing this already, but they are doing it in a disconnected way. The tools have evolved, from email and IRC, to Slack and Google Hangouts, or even (heaven help us all) WhatsApp. What has not evolved is the usage model; it’s still ad-hoc and reactive, and requires lots of copying and pasting.
Operations people can learn from what their developer colleagues have built or adopted. Kanban boards are a great technique for tracking open issues and who is working on them. ChatOps is a fantastic way to integrate human-to-human communication with human-to-machine communication. Integrated collaboration tools capture knowledge that otherwise is stuck in people’s heads or inboxes, unavailable to their colleagues.
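To make the ChatOps idea a little more concrete, here is a minimal, hypothetical sketch: a command handler that turns chat messages like `!ack INC-101` into actions, so the action and the conversation about it live in the same channel. The command names, the `ALERTS` store and the `acknowledge_alert` helper are all made up; a real setup would wire this into your chat platform’s bot framework and your alerting tool’s API.

```python
# Minimal ChatOps-style command handler (illustrative only).
ALERTS = {  # stand-in for the alerting tool's state
    "INC-101": {"status": "open", "summary": "Checkout latency high"},
}

def acknowledge_alert(alert_id: str, user: str) -> str:
    """Hypothetical helper; a real one would call your alerting tool's API."""
    alert = ALERTS.get(alert_id)
    if alert is None:
        return f"Unknown alert {alert_id}"
    alert["status"] = f"acknowledged by {user}"
    return f"{alert_id} acknowledged by {user}"

def handle_chat_message(user: str, text: str) -> str | None:
    """Turn chat commands into actions everyone in the channel can see."""
    parts = text.split()
    if not parts or not parts[0].startswith("!"):
        return None  # ordinary human-to-human chat; leave it alone
    command, args = parts[0], parts[1:]
    if command == "!ack" and args:
        return acknowledge_alert(args[0], user)
    if command == "!status" and args:
        alert = ALERTS.get(args[0])
        return f"{args[0]}: {alert['status']}" if alert else f"Unknown alert {args[0]}"
    return f"Sorry {user}, I don't know {command}"

# The bot's replies land in the channel, so the knowledge isn't stuck
# in one person's head or inbox.
print(handle_chat_message("alice", "!ack INC-101"))
print(handle_chat_message("bob", "!status INC-101"))
```

The point is not this particular bot; it is that the record of who acknowledged what, and when, is captured where the whole team can see it.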
I’m an Ops guy at heart, but I can recognize a good idea when I see it. Our Dev colleagues have come up with quite a few that we could stand to adopt in Ops. Let’s learn from what they already know works, instead of trying to reinvent the wheel.