I had a rough week last week. We lost a server completely loaded with files and, of the three verified backups, precisely zero were valid. If I hadn’t set up the backup and verification process, I would swear that verification didn’t happen or that results were ignored. But I was involved in this one, so I know everything was done right. We have not yet finished a root cause analysis, but we are recovering all that we can to see what needs to be recreated.
This server did not have anything business-destroying like financial data on it, but it did have products, and all of the source code for those products. Restores are one of those things that never happen until they do. And when they do, most of us hope that all goes well. Some of you have that “Our backup processes are perfect, and we don’t have to worry” look on your faces. We thought that, too.
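One way to replace hoping with knowing is to run a trial restore into a scratch folder and compare checksums against the live data. What follows is only a minimal sketch, not the process we had in place: it assumes the backup is a zip archive whose member paths are relative to the source folder, and `sha256_of`, `checksums` and `restore_test` are hypothetical names made up for illustration.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksums(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def restore_test(archive: Path, source_root: Path) -> list:
    """Return relative paths that are missing or differ after a trial restore.

    Assumption: the archive is a zip file whose members are stored
    relative to source_root. An empty list means every source file
    came back byte-for-byte identical.
    """
    expected = checksums(source_root)
    with tempfile.TemporaryDirectory() as scratch:
        with zipfile.ZipFile(archive) as backup:
            backup.extractall(scratch)
        restored = checksums(Path(scratch))
    return sorted(
        name for name, digest in expected.items()
        if restored.get(name) != digest
    )
```

Something like `restore_test(Path("nightly.zip"), Path("/srv/products"))` (both paths are placeholders) could run on a schedule, with any non-empty result treated as a failed backup. It only proves the archive matched the source at the moment of the test, so it complements a real restore drill rather than replacing one.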
This was only one of several “Oh yeah, forgot that even could be an issue …” items that all landed on us last week. The fact is that, in the age of DevOps, we have automated the things that happen all the time or are mission-critical (like backups), but there are a ton of one-off things we’ve nearly forgotten about (like restores). That script that changes the permissions on that one folder so it’s right for the application and is only used when the data folder changes? Yeah, those types of things.
We have saved a ton of time in the entire development/deploy/secure/management process; let’s reinvest just a tiny bit of it in cleaning up those one-off items. I believe we will always have those items, because small but critical tasks get handled this way to fit our organization’s needs. I’m not advocating eliminating these things; I’m advocating hunting them out, gathering them together and getting them into repositories. In fact, for big things, I am even suggesting that a container that can run, perform the task and shut down should be created and then put in the repository. And an inventory needs to be available.
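The inventory does not have to be fancy. As a rough sketch of the idea, assuming the one-off scripts already live in a single repository folder and each carries a one-line comment near the top saying what it is for, something like the following could generate a listing. The file suffixes, the `INVENTORY.json` name and the function names are all assumptions for illustration.

```python
import json
from pathlib import Path

# Assumption: these suffixes cover the one-off scripts worth listing.
SCRIPT_SUFFIXES = {".sh", ".py", ".ps1"}


def describe(script: Path) -> str:
    """Use the first comment line after any shebang as the description."""
    with script.open(encoding="utf-8", errors="replace") as handle:
        for line in handle:
            text = line.strip()
            if not text or text.startswith("#!"):
                continue  # skip blank lines and the shebang
            if text.startswith("#"):
                return text.lstrip("#").strip()
            break  # first real line of code reached without a comment
    return "(no description)"


def build_inventory(repo_root: Path) -> list:
    """List every script under repo_root with its path and description."""
    return [
        {"path": p.relative_to(repo_root).as_posix(), "description": describe(p)}
        for p in sorted(repo_root.rglob("*"))
        if p.is_file() and p.suffix in SCRIPT_SUFFIXES
    ]


def write_inventory(repo_root: Path, out_name: str = "INVENTORY.json") -> Path:
    """Write the inventory next to the scripts so it is versioned with them."""
    out = repo_root / out_name
    out.write_text(
        json.dumps(build_inventory(repo_root), indent=2) + "\n",
        encoding="utf-8",
    )
    return out
```

Scripts that show up as “(no description)” make a handy to-do list: those are the ones nobody will be able to identify after the next round of turnover.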
Have you ever stopped and wondered how many hours we have wasted reinventing the wheel because of turnover? I mean, seriously, that complex bit we needed years ago, we need again. Can we grab a script and run it, or do we have to spend staff resources recreating it? Most of the time, we recreate it. Spend that time inventorying instead.
You all are owning the whole DevOps thing, and you should know it. Just about every organization is more streamlined and turning out better apps faster, and that’s all you. Let’s just pick up the corners. I lost my entire week last week over one-off things like that server, and while restores are going to happen, some of the other issues that cropped up could have been solved by a script saved in the past. Of course, if the script was saved on the lost disk …
Keep rocking it, we’re all grateful.
P.S.: For the record, the 2000s-era Dell enterprise servers we have (six of them) just keep clicking along, doing what they do. The SME servers we’ve bought in the last 10 years? They lose motherboards or RAID cards about every four to five years (we have replaced three of them, and we only own two). Shop carefully, my friends.