It is pretty much standard DevOps practice that when a server instance starts having problems, you simply kill it and start another. This is in line with the idea that servers are cattle, and there isn’t a ton of difference between them.
But it creates a problem that no amount of CI/CD or automated provisioning can overcome: the blindness problem. CI and CD miss some bugs, simply because of the wild variation possible in inputs once humans are involved, or because of quirks of hardware or platform that DevOps tries to ignore as much as possible. Our tooling can never be so over-arching that we have data points on everything and can trace every problem back to the routine causing it, so a service degrading in performance or outright crashing is indicative of one of those blind spots.
If you’ve been in a highly dynamic DevOps environment, you know this is no simple problem. But it is one we have to resolve, because killing and restarting simply masks the problem. Indeed, it is this very process that turned Cloudbleed from a simple, understandable programmer error into front-page news. Cloudflare knew their servers would occasionally overrun memory, sometimes even crashing; they would just spawn another instance and keep moving along. The ease of creating another instance reduced their desire, and their perceived need, to fix the underlying problem. But the problem was bigger than they knew.
The problem we face is, of course, how much time and work to invest in monitoring, and how to know what’s really important. It gets worse when an instance stops responding or starts hogging resources. Getting it offline is imperative, but re-creating the issues that sent it off the rails may not be so easy. In fact, often it is not easy at all.
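Before that misbehaving instance gets terminated, it is worth grabbing whatever evidence it can still give up. Below is a minimal sketch of that idea, assuming a Linux host; the log paths, scratch directory and helper name are placeholders, not anything prescribed by a particular platform. It bundles recent logs and a process snapshot so there is something left to investigate once the instance itself is gone.

```python
# Minimal sketch: capture a diagnostic bundle from a misbehaving instance
# before terminating it, so the failure state is not lost with the host.
# Log paths and the output directory are hypothetical -- adjust for your stack.
import subprocess
import tarfile
import time
from pathlib import Path

LOG_PATHS = ["/var/log/myapp/app.log", "/var/log/syslog"]  # placeholder paths
BUNDLE_DIR = Path("/tmp/diagnostics")                      # placeholder target


def capture_diagnostics(instance_id: str) -> Path:
    """Archive logs plus a process snapshot; return the bundle path."""
    BUNDLE_DIR.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S")

    # Snapshot the process table so we can see what was hogging resources.
    ps_output = subprocess.run(
        ["ps", "aux", "--sort=-%mem"], capture_output=True, text=True
    ).stdout
    ps_file = BUNDLE_DIR / f"{instance_id}-{stamp}-ps.txt"
    ps_file.write_text(ps_output)

    # Tar up the snapshot and any of the listed logs that exist on this box.
    bundle = BUNDLE_DIR / f"{instance_id}-{stamp}.tar.gz"
    with tarfile.open(bundle, "w:gz") as tar:
        tar.add(ps_file, arcname=ps_file.name)
        for log in LOG_PATHS:
            if Path(log).exists():
                tar.add(log, arcname=Path(log).name)
    return bundle  # ship this off the box before the instance is killed
```

Ship that bundle somewhere durable first; only then does killing the instance stop being the same thing as destroying the evidence.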
Long term, we need a reliable way to pinpoint the exact problem area, create a bug report and get it back into the Dev/CI/CD system. Short term, don’t ignore errors in production unless you know exactly what is going wrong.
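Getting it back into the pipeline can be as simple as turning that captured bundle into a ticket the dev team will actually see. Here is a minimal sketch, assuming a GitHub-hosted repo and a token in a GITHUB_TOKEN environment variable; the repo name, labels and bundle URL are made up for illustration, and any issue tracker with an HTTP API works the same way.

```python
# Minimal sketch: file a bug with a link to the captured diagnostics,
# so the production failure re-enters the Dev/CI/CD loop instead of
# being erased by the restart. REPO and labels are hypothetical.
import os

import requests

REPO = "example-org/example-app"  # hypothetical repository


def file_bug(instance_id: str, summary: str, bundle_url: str) -> int:
    """Open a GitHub issue and return its number."""
    resp = requests.post(
        f"https://api.github.com/repos/{REPO}/issues",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={
            "title": f"Instance {instance_id} degraded before restart: {summary}",
            "body": f"Diagnostics captured before termination: {bundle_url}",
            "labels": ["production", "needs-reproduction"],
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["number"]
```

The point is not the particular tracker; it is that the restart no longer erases the trail.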
Sure, it’s easy to kill and/or restart an instance to “recover” from an error. It’s easy to chuck your app on the public internet without any security, too. Just because it’s easy doesn’t make it a good idea.
So watch those logs, don’t mask problems and stay on top of it all, even if the complexity sometimes has you reeling a bit. Because application reliability might be impacted—and application availability is, in the end, the whole point.