I’m going to take a side trek today into home computing with applicability to IT. Bear with me.
We maintain a pretty complex home network, as home networks go, with three NAS servers, a small one-box SAN, some web servers and connections to several hosted environments. We both work from home, and my business is contained in one of the NAS servers and some of those hosted environments, so it’s all necessary. Scheduling downtime with several people using the network darned near 24/7 is difficult. Not medium-enterprise difficult, but still an issue.
Two of our NAS servers each have a disk out. It happens; that’s why they’re NAS servers. In general, the only day that replacing the disks makes sense is Sunday, when things are a little slower than normal, both in schedules and network usage. One of our web servers is just plain dead, but there’s info on there we have to get off. The code is mostly being replaced, but the data was collected over years and we need to see how much we can salvage.
None of these things are shiny DevOps things. Yet they still have to get done. Importantly, they have to get done when automated tools aren’t running across the network or counting on the NAS to be responsive. The rush isn’t huge, because there are off-site backups, but restoring a remote backup is a long-term proposition (weeks, not days, in our current environment), so the repairs must happen as soon as practicable.
Don’t forget that in the drive to automate everything. If you own the hardware, it’ll need maintenance. If you own the software, same thing. If you need data to move platforms, it will take time. These are all things we kind of hand-wave in the DevOps world. They’re seen as bottlenecks, if seen at all. Generally, we don’t even consider maintenance issues until they impact DevOps, but we know they are going to happen, so we should have a plan for how to deal with them.
Those few of you whose entire environment is hosted with a cloud provider have an easier job of it, but eventually you will be asked to move cloud providers. As sure as the sun will rise tomorrow, something will drive a move away from your chosen platform. Normally it’s TCO, but other factors play in, too. When that happens, the work will be more than “Find time for the NAS to rebuild without overly impacting everything else.”
So, spend some time thinking about it. Do you have plans to make room for maintenance? Are you scheduling maintenance into release cycles? Are you coordinating releases with updates that are not required by this specific release?
This is where dev (typically Agile/Scrum) meets ops (typically system/network maintenance), and it is every bit as important as what is normally seen as DevOps. Get it right and you can have fun and keep cranking stuff out. Get it wrong and you’ll struggle with the problems infrastructure issues can inflict upon your cool new update. Development on one of our new websites outright stopped when the web server went down. There were a variety of reasons, but the catalyst was the loss of the original. We’re working out what to do about that.
Meanwhile, we didn’t make time yesterday, so I’ll be replacing disks and tearing apart a web server in the much smaller window available today. See you when I poke my head back up. Hopefully, you’ll have better planning than we did.