One aspect of the DevOps movement I’ve seen adopted at numerous companies is the idea that everyone supports their products by being on-call for any incidents that occur in the production environment. This responsibility often leads to participation in the outage war room. For those of you who may be new to this experience, I’m offering this primer from a grey-hair ops guy who’s served countless hours in every manner of war room you could imagine, some that stretch the bounds of that imagination, and a few that I just still can’t talk about. Been there. Done that. Still doing it.
In all that time, I’ve observed four distinct war room dynamics that can cause conflict. Understanding how that conflict arises is key to mitigating it.
The war room is a stressful place
This statement sounds obvious, but it’s fundamental to understanding the tensions that can arise in a war room. The clock is ticking. In fact, the clock has already been ticking for some time when you join the room. These things don’t happen instantaneously. When you get the call to join a war room, the problem has already manifested and is causing an impact. Even with automation, it takes time to notify people and have them connect to the war room, in whatever form that takes. In environments with a very high SLA target, the setup time for the war room alone can exceed the monthly allowance for downtime. So once you get into the war room, it’s serious business. The clock keeps ticking.
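To put rough numbers on that claim, here is a minimal sketch. It assumes a 30-day month and a hypothetical 10 minutes to page people and get them connected; neither figure comes from any particular SLA, and the function names are made up for illustration.

```python
def monthly_downtime_allowance_minutes(sla_percent, days_in_month=30):
    """Minutes of downtime a given availability target permits per month.

    Assumes a 30-day month by default (an illustrative simplification).
    """
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


def assembly_exceeds_allowance(sla_percent, assembly_minutes=10):
    """True if just assembling the war room uses up the whole monthly budget.

    The 10-minute default is a hypothetical figure, not a measured one.
    """
    return assembly_minutes > monthly_downtime_allowance_minutes(sla_percent)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        allowance = monthly_downtime_allowance_minutes(target)
        blown = assembly_exceeds_allowance(target)
        print(f"{target}% -> {allowance:.2f} min allowed, "
              f"budget gone before anyone speaks: {blown}")
```

Under these assumptions, a 99.9% target allows about 43.2 minutes a month, 99.99% allows about 4.32 minutes, and 99.999% allows well under a minute. At the two higher targets, a 10-minute assembly has already spent the month's budget.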
The war room is often a stressful place as a result. Tensions, and even tempers, can run high. But each and every person in the war room has an obligation to keep themselves under control. Tension and stress feed on each other, and if you succumb, you make it worse for everyone. Even more important, if you get angry, emotional, or display some other decidedly not cool way of coping with the situation, you are likely to lose the confidence of your peers. That really sucks, so it’s important to force yourself to project calm confidence and be deliberate and measured in your actions. If you have the opportunity (and the ability) to help disperse some of that tension, do it. Those are the kinds of moments that build credibility.
And there is another, very personal connection to the outage that can be stressful: the professional worst-case scenario, in which you are actually responsible for the whole mess. When that happens, you are going to be in your own special hell in the war room. But you should know exactly what is expected of you, the one thing you can do that will really make a difference. You need to tell everyone exactly what you did, without skipping anything, and without regard to how stupid it might sound in hindsight. As in, “I continued to run through all the commands in the procedure even though they were failing with errors and even though I had guidance to stop and seek help if they failed.” To help prop you up in this most difficult of tasks, consider this: everyone else in the room has done the same thing. We’ve all been there. If we haven’t, it’s probably just a matter of time. So you have to swallow your pride, tell the whole story and then get actively involved in fixing things, because that’s what the war room is all about.
The goal of the war room is to restore service as quickly as possible
A war room is bad. If you are in one, it probably means that your customers are impacted. You have an obligation, working with the rest of the team, to recover the service in the most effective and least risky way. In a perfect world, that method is also fast, but life doesn’t always work out that way. If there is no obvious and simple solution, though, things can slow down a bit and we see different approaches forming across groups of people in the war room. Some, traditionally the operations staff, want to restore service quickly with a reboot, reset or other “big red button” kind of operation. Others feel it’s important to fully understand the nature of the problem before taking action. These goals are often in direct opposition.
One way this manifests in war rooms is when, in order to further analyze the problem, someone asks to run some experiments, or capture data, or otherwise use the production environment to debug the problem. If there is no method available to recover service with a reasonable level of risk, such experiments may be justified. Otherwise, the team has an obligation to execute whatever procedure they have to bring the system back up, even if that destroys information that might be helpful to the root cause investigation.
Another thing I see happen frequently is someone jumping the gun in trying to fix things. It plays out like this:
Jane: Do you think we should reboot the server?
Dick: I’ll get set up for it.
Jane: I think we should do it.
A good war room manager will easily spot and work through these kinds of conflicts. If there is such a person in your war room, listen to them and obey them. The war rooms I’ve participated in are not democracies. They are run by a strong leader who is empowered to make extremely critical decisions. It’s crucial to respect that authority in the heat of the moment rather than try to be the hero. Not so coincidentally, this is my next point.
We need solutions, not heroes
In a war room, you need to separate your ego from your idea. The war room does not need heroes. The war room needs solutions. Be advised that no matter what part of the organization you come from, your idea is probably going to go up against some alternative ideas. They will not be judged, generally speaking, on how clever or elegant they are. They will be judged on how effectively they can recover the service and how much risk they carry. And if yours is not the chosen idea, it does not mean that you suck. The only people who really suck in a war room are those who interfere with the process of service recovery. You know, the people who interrupt the flow of a critical conversation to ask for a status update, or who won’t stop debating a relatively minor point when the team really needs to focus on a larger issue. That kind of thing. But I digress.
Sometimes, the right path forward does not actually address the root cause of the problem.
Sometimes, the right path forward precludes ever figuring out what the root cause is.
Heroes have a hard time coping with that.
You must accept that you may never know why
The incident and problem management processes in most reasonably mature organizations are designed to recover the service, find the root cause and then ensure that it never happens again. But reality often falls short of these goals. Incidents sometimes last a lot longer than you would like, as when you discover that something has been broken for a long time and you did not know about it. And we should all know by this point that sometimes you never really find out why it happened. You just can’t figure it out. It sucks, but again, it doesn’t mean that you suck. It means that the cost of determining the root cause has grown past the point where continuing is justified, given the impact of the problem when it manifests. That’s just common business acumen, but it’s worth stating. Because when you realize that this is a possible outcome, it can give you a better perspective on evaluating your options. Suddenly, restarting the system and walking away from any run-time state information that might be helpful in root cause determination seems more viable. And those are the kinds of trade-offs that everyone in the war room has to come to terms with.
It sucks being called into a war room due to an outage. It is the very definition of failure in the SaaS world. But you can make a bad situation into a great and positive experience if you handle it properly. You can take an event that should mean that you suck and turn it into something that demonstrates your excellence. Being an effective member of the war room is a way for you, as an individual contributor, to demonstrate that awesomeness. You do it through calm, reasoned action and sound leadership. That builds confidence and trust. I guarantee you that your peers in the war room will notice and respect you for it and your customers will benefit from it. Let’s not let any of them down.