One thing Agile and DevOps definitely brought IT was a more accepting view of the whole “mistakes happen” mantra. Crashing systems is no longer a guarantee of a free ticket out the door. Indeed, off the top of my head, I can think of at least two cases where a CIO themselves checked in sloppy code and it trashed the system—CIOs at orgs big enough that they probably should not have been coding in the first place.
And that brings us to today’s blog. The standard response to “But what if there is a bad update?” is “We’ll just roll out a new version!” That works in some instances. In a lot of instances, including most that involve GitOps as part of DevOps, it doesn’t. Implosions can be huge and take out chunks of infrastructure. So you need a better plan than just assuming you can fix it with a forward update. Do you have rollback capability? Are you making use of it, ensuring everything is set to roll back if needed? Are you testing to make sure rollback still works alongside the other changes made across the system in this update?
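To make “rollback capability” concrete, here is a minimal sketch, assuming a Kubernetes shop where kubectl can reach the cluster; the deployment names are hypothetical placeholders. The idea is simply to confirm, before you ever need it, that each deployment has a prior revision to fall back on.

```python
# Sketch: verify each deployment actually has something to roll back to.
# Assumes kubectl is installed and pointed at the right cluster; the
# deployment names below are hypothetical placeholders for your own.
import subprocess

DEPLOYMENTS = ["checkout", "catalog", "payments"]  # hypothetical names

def has_rollback_target(deployment: str) -> bool:
    """Return True if the rollout history shows more than one revision."""
    result = subprocess.run(
        ["kubectl", "rollout", "history", f"deployment/{deployment}"],
        capture_output=True, text=True, check=True,
    )
    # History output is a header plus one line per revision; revision
    # lines start with the revision number.
    revisions = [
        line for line in result.stdout.splitlines()
        if line.strip() and line.strip()[0].isdigit()
    ]
    return len(revisions) > 1

if __name__ == "__main__":
    for dep in DEPLOYMENTS:
        status = "ok" if has_rollback_target(dep) else "NO ROLLBACK TARGET"
        print(f"{dep}: {status}")
```

If you aren’t on Kubernetes, swap in whatever your platform’s rollback primitive is; the point is that the check runs routinely, not during the outage.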
That is part of the problem. Fixing forward with a new update is usually the preferred option, simply because in the age of massively distributed, microservices-based solutions, rolling back comes with a ton of baggage, often enough that it isn’t viable for you. Okay. So you can’t quickly roll forward. You can’t quickly roll back. Quick! What do you do?!
And I don’t have the answer to that question, because I’m not in your organization working on your systems. At some point, all IT is personal, and this is one of those points. You need to know what your best options are if an update sends everything into a tailspin, and you need a plan to implement them. But your best options are not going to be the same as the next org’s.
One option is to maintain the ability to rebuild the entire system from scratch. That’s a big undertaking, but every minute systems are down hurts the company, and that is what you need to plan for. Think of it as disaster planning for DevOps: a man-made disaster that destroys systems and infrastructure.
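A minimal sketch of what that might look like, with paths, image names and the backup location all purely hypothetical: an inventory check that asserts every ingredient of a full rebuild (infrastructure code, known-good images, a recent data backup) still exists and is reachable.

```python
# Sketch: a "can we rebuild from nothing?" inventory check.
# Every path, image name and backup location here is a hypothetical
# placeholder; the point is to assert that each ingredient of a full
# rebuild still exists, not to prescribe specific tooling.
import subprocess
from pathlib import Path

REQUIRED_REPOS = [Path("infra/terraform"), Path("deploy/manifests")]  # IaC and manifests
REQUIRED_IMAGES = ["registry.example.com/checkout:stable"]            # known-good images
BACKUP_MARKER = Path("/mnt/backups/latest/COMPLETE")                  # last good data backup

def checks() -> dict:
    results = {}
    for repo in REQUIRED_REPOS:
        results[f"IaC present: {repo}"] = repo.is_dir()
    for image in REQUIRED_IMAGES:
        # `docker manifest inspect` exits non-zero if the image is not in the registry.
        proc = subprocess.run(["docker", "manifest", "inspect", image],
                              capture_output=True)
        results[f"image in registry: {image}"] = proc.returncode == 0
    results["recent data backup"] = BACKUP_MARKER.exists()
    return results

if __name__ == "__main__":
    for name, ok in checks().items():
        print(f"{'PASS' if ok else 'FAIL'}: {name}")
```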
So, like any other disaster planning scenario, walk through the chain that makes the system work, identify the weaknesses, and list ways to address them. Then test those mitigations to make certain they do what you hope they will. Then set up an automated system to keep this whole plan up to date.
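Here is one sketch of what that automated system might look like: a small drill runner meant to be scheduled by cron or your CI system, where each check command is a hypothetical stand-in for whatever verifies your own rollback, rebuild and restore paths. The only real trick is the nonzero exit code, so the scheduled job itself goes red when the plan has drifted.

```python
# Sketch: keep the recovery plan honest by re-running its checks on a
# schedule, so a plan that has quietly gone stale fails loudly now
# instead of failing during the outage. The check commands below are
# hypothetical stand-ins for your own verification scripts.
import subprocess
import sys

# Each entry: a human-readable name and a command that must keep succeeding.
DRILL_CHECKS = [
    ("rollback targets exist", ["python", "check_rollback_targets.py"]),      # hypothetical
    ("rebuild inventory intact", ["python", "check_rebuild_inventory.py"]),   # hypothetical
    ("restore drill on staging", ["python", "restore_to_staging.py", "--verify-only"]),  # hypothetical
]

def run_drill() -> int:
    """Run every check, print a PASS/FAIL line for each, return the failure count."""
    failures = 0
    for name, cmd in DRILL_CHECKS:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        print(f"{'PASS' if proc.returncode == 0 else 'FAIL'}: {name}")
        if proc.returncode != 0:
            failures += 1
    return failures

if __name__ == "__main__":
    # A nonzero exit makes the scheduled job itself go red, which is the alert.
    sys.exit(1 if run_drill() else 0)
```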
In short, we’re in a new automated landscape, and we need new automated tools to pull our rears out of the fire when the inevitable happens. Be it dev error or malicious attacker, it is a safe bet that, sooner or later, you will have a massive systems outage that your DevOps toolchain can’t adequately address. Know what you are going to do. Or at least think it through ahead of time, so you aren’t starting from scratch while coworkers and customers can’t access systems.
And keep rocking it. This is just another layer of protection for all the hard work you’re doing. Take the extra step. Like insurance, if you ever need it, you will be absolutely glad you took it.