A big tenet of DevOps is, “Automate all of the things.” It’s generally a good policy, and it definitely helps us meet the delivery goals of DevOps. In the full-on, all-new-technology case, an org could run full automation on the dev side and deploy with an application release automation tool to a fully automated container management system: fast, efficient, programmable, even portable. This level of automation absolutely enables DevOps.
Until things break.
Set aside the well-known and easily handled cases such as, “That container is no longer responding,” and focus on the larger infrastructure environment. Do you have a plan for getting builds out if DevOps has taken you to many builds a day and your entire Jenkins environment just crashed? The easy answer is, “Switch to another Jenkins server,” but if you are big enough to have more than one, capacity may be an issue, as may the configuration of the Jenkins environments themselves. The common setup of one Jenkins environment for client builds and a separate one for server builds illustrates the problem: if the client environment drops, you can’t simply move its jobs to the server build environment. Test tools and other client-specific tools will have to be moved over, and the build environment itself will almost certainly have to be modified.
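One lightweight way to make that contingency rehearsable rather than improvised is to script the failover decision itself. The sketch below is a minimal example under stated assumptions: the primary and standby Jenkins URLs are hypothetical placeholders, and the probe simply checks that the login page answers. It is an illustration of the idea, not a prescription.

```python
# Minimal sketch of a CI failover check. The URLs below are illustrative
# assumptions -- substitute your real primary and standby controllers.
import urllib.request
import urllib.error

PRIMARY = "https://jenkins-client.example.com/login"   # hypothetical endpoint
STANDBY = "https://jenkins-standby.example.com/login"  # hypothetical endpoint

def is_alive(url, timeout=5):
    """Return True if the Jenkins login page answers with HTTP 2xx/3xx."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        return False

def pick_build_target(primary_ok, standby_ok):
    """Encode the failover decision so it can be rehearsed, not improvised."""
    if primary_ok:
        return PRIMARY
    if standby_ok:
        return STANDBY
    return None  # both environments down: page a human

# Typical use: pick_build_target(is_alive(PRIMARY), is_alive(STANDBY))
```

Running something like this on a schedule, or as a pre-build step, turns “switch to another Jenkins server” from a hope into a tested procedure. The remaining work, replicating jobs, test tools, and credentials onto the standby, still has to be planned, but at least detection and routing are automatic.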
Along those same lines: What if, instead of a few containers dropping, your container management system fails? (It’s software; it will happen.) What’s the plan then? Depending on the type of failure, it could take your entire application, or your whole application portfolio, down with it. In most organizations, downtime is not taken lightly, so what is the plan to return to normal function? Waiting until you’re in the middle of a disaster caused by a single bad disk, a failed memory chip, or an attack exploiting the infrastructure is a bad plan. We’ve been doing this for decades, and we know software fails, sometimes spectacularly.
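Part of that plan can itself be automated. As one hedged sketch (the namespace name and the `kubectl` invocation are illustrative assumptions, and the command runner is injectable so the drill can be exercised without a live cluster), you might regularly export the declarative state of each namespace so the application portfolio can be re-created on a fresh cluster if the management system itself is lost:

```python
# Hypothetical recovery drill: export a namespace's declarative state to a
# YAML file so the apps can be re-applied to a replacement cluster.
# The resource list and namespace are examples, not a complete inventory.
import subprocess
from pathlib import Path

def export_namespace(namespace, out_dir, run=subprocess.run):
    """Dump core resources in `namespace` to <out_dir>/<namespace>.yaml.

    `run` defaults to subprocess.run but is injectable, so the drill
    itself can be tested without a cluster.
    """
    out = Path(out_dir) / f"{namespace}.yaml"
    result = run(
        ["kubectl", "get", "all,configmap,secret", "-n", namespace, "-o", "yaml"],
        capture_output=True, text=True, check=True,
    )
    out.write_text(result.stdout)
    return out

# Typical use: export_namespace("client-builds", "/backups/2024-01-15")
```

Dated exports like these, kept outside the cluster they describe, are only one piece of a return-to-normal plan, but they are the piece you cannot create after the failure has happened.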
Once the first rush of DevOps is through, processes are streamlined, and automation is basically in place, it’s fair to take a deep breath and plan how you will handle the breakage or total failure of each tool in your toolchain. Yes, those tools will change, and you’ll have to do a little work each time you add or change one, but which would you prefer: a slightly slower adoption of DevOps after the low-hanging fruit is gathered, or confronting a failure of infrastructure proportions without a plan?
Long before computers, Benjamin Franklin is credited with having said, “By failing to prepare, you are preparing to fail.” That credo has been repeated often since, and it holds for any form of automation: that which is automated eventually goes awry. You can plan for it, or you can suffer the stress of figuring out what to do in the middle of an emergency. I suggest having a break plan.