Imagine if you backed out of your driveway, tapped the gas and BAM! You are at work. Nothing to worry about along the way: no traffic congestion, no other cars, no detours.
That is, unfortunately, how many of us approach automation. “Fire and forget” is real, until a problem crops up. I could talk about some complex environments, such as ARA in a containerized environment with multiple languages, accessing databases that cannot easily be moved and targeting several client platforms, for example.
But we’ll keep it simple, and I’ll make an admission: Backup was one area where we at Ingrained Tech fell into the fire-and-forget trap. One part of our backup system backs desktops up to a server, the server (and other servers) up to a NAS and the NAS out to the cloud. I don’t think our security peeps will freak out if I mention we use duplicity for this backup and Backblaze for our backup cloud. Backblaze is a dream to use compared to others we’ve looked at, and the pricing is competitive, so it suits our needs well.
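For the curious, the cloud hop of that chain is little more than a scheduled duplicity run against a B2 bucket. Here is a minimal sketch in Python; the paths, bucket, key names and environment variable are placeholders, not our real configuration:

```python
#!/usr/bin/env python3
"""Cloud hop only: push a NAS share to a Backblaze B2 bucket with duplicity.
All names below are placeholders, not our real configuration."""
import os
import subprocess

SOURCE = "/mnt/nas/backups"                         # hypothetical NAS path
TARGET = "b2://KEY_ID:APP_KEY@example-bucket/nas"   # duplicity's B2 URL form

# duplicity reads its symmetric-encryption passphrase from the environment.
env = dict(os.environ, PASSPHRASE=os.environ.get("BACKUP_PASSPHRASE", ""))

subprocess.run(
    ["duplicity",
     "--full-if-older-than", "1M",   # start a fresh full chain monthly
     SOURCE, TARGET],
    check=True,
    env=env,
)
```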
A few weeks ago, we realized that while backups were completing successfully and verification was fine, our weeklies were taking nearly the entire week to finish. That’s bad news. We back up terabytes of data, but this is 2020, and our drive I/O and network can handle it. We started by looking at both of those things, and sure enough, it was drive I/O causing our problems.
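Confirming that kind of suspicion doesn’t take much. Something like the sketch below, sampling per-disk throughput during the backup window, is enough to show whether the drives are saturated; psutil is the only dependency, and the interval and sample count are arbitrary:

```python
"""Per-disk throughput sampler: logs read/write rates so you can see
whether the drives are pegged during the backup window."""
import time
import psutil

INTERVAL = 60   # seconds between samples
SAMPLES = 120   # roughly two hours of data

prev = psutil.disk_io_counters(perdisk=True)
for _ in range(SAMPLES):
    time.sleep(INTERVAL)
    cur = psutil.disk_io_counters(perdisk=True)
    for disk, stats in cur.items():
        if disk not in prev:
            continue  # disk appeared mid-run; skip until next sample
        read_mb = (stats.read_bytes - prev[disk].read_bytes) / 1e6 / INTERVAL
        write_mb = (stats.write_bytes - prev[disk].write_bytes) / 1e6 / INTERVAL
        print(f"{disk}: {read_mb:6.1f} MB/s read, {write_mb:6.1f} MB/s write")
    prev = cur
```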
Unsurprisingly, reading the disks and writing a collection of encrypted tarballs (a super-simplification of what duplicity does, but it gets the point across) back onto those same disks was causing problems. When we dug in, the one build that runs on that server turned out to be suffering from the stretched-out backups, too. We had seen occasional issues with the build, but nothing jumped out as the source, and it was only one night every month or two. It was on our list. You know, the same prioritized list you have. And this build wasn’t nightly-critical. You get the idea.
This was a relatively urgent problem, because soon last week’s backup would not be complete when this week’s tried to start. Not good.
We took our DevOps mentality to the problem and landed on a simple solution. The backup NAS is RAID 50, so it can handle a lot more abuse than the server that was slowing down backups. So we took the simple expedient of rsyncing the source server to the NAS, then using duplicity to tar and encrypt before sending the result on to the cloud buckets. Internally, we do not believe we need to encrypt backups, because if someone is in our network, all of that data is lying around anyway. By moving duplicity, and with it both the tar step and the encryption step, onto the backup NAS, we shifted that load to a far less busy system. Backup times dropped from days to hours instantly, and our unrelated build problems went away.
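The reworked flow, run from the NAS, looks roughly like this. Host names, paths and the bucket are placeholders; the point is that the busy server only pays for reads and network on the internal hop, while the tar-and-encrypt work happens on the NAS:

```python
#!/usr/bin/env python3
"""Sketch of the reworked flow, run on the NAS: rsync the source server
locally (no encryption on the internal hop), then let duplicity tar and
encrypt on the NAS before pushing to the cloud bucket."""
import os
import subprocess

SOURCE_HOST = "buildserver.example.internal"    # hypothetical source server
STAGING = "/volume1/staging/buildserver"        # local copy on the NAS
TARGET = "b2://KEY_ID:APP_KEY@example-bucket/buildserver"

# Hop 1: plain rsync over SSH; the source server just reads and ships bytes.
subprocess.run(
    ["rsync", "-a", "--delete", f"{SOURCE_HOST}:/srv/data/", STAGING],
    check=True,
)

# Hop 2: duplicity does the tar + encrypt work here, on the far less busy NAS.
env = dict(os.environ, PASSPHRASE=os.environ["BACKUP_PASSPHRASE"])
subprocess.run(
    ["duplicity", "--full-if-older-than", "1M", STAGING, TARGET],
    check=True,
    env=env,
)
```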
We still have to rework the automation somewhat to make sure validation is, well, actually validating that the data is correct. But we’re well on our way to having a stable, faster system.
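The likely shape of that validation step, sketched with the same placeholder names as above: a `duplicity verify` run that compares what landed in the bucket against the staging copy on the NAS and fails loudly on any mismatch.

```python
"""Validation sketch: duplicity verify compares the cloud archive against
the local staging copy and exits non-zero on mismatch."""
import os
import subprocess

STAGING = "/volume1/staging/buildserver"
TARGET = "b2://KEY_ID:APP_KEY@example-bucket/buildserver"

env = dict(os.environ, PASSPHRASE=os.environ["BACKUP_PASSPHRASE"])
result = subprocess.run(["duplicity", "verify", TARGET, STAGING], env=env)
if result.returncode != 0:
    # In the real automation this would page someone or fail the backup job.
    raise SystemExit("backup verification failed")
```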
How does this fit with DevOps? We see the same thing happening often in the rush to automate. Yesterday’s best solution is not necessarily tomorrow’s best solution. “Automate and watch” should be the catchphrase. Improvement comes from both DevOps culture and paying attention to changes in the tools market.
As Alan pointed out, someone will eventually win the end-to-end DevOps rush. Since I feel the urge to inject my opinion on who stands the best chance, expect a blog on this topic soon. Until then, knowing what your options are and how best to achieve your goals, even on problems that were “solved” last year, is your best path to continued improvement. Yes, it means rework, but just like reworking your code, the point is to streamline things, get closer to your organization’s goals and keep improving after the low-hanging fruit of DevOps has been gathered. The rate of improvement in several areas of DevOps is stunning, so you may find you actually like a new tool better than the one you chose a couple of years ago. And remember, knowledge is power.