With all the buzz around DevOps, we tend to assume that everyone is doing it. But the reality is that most companies are not. That’s because organizations running legacy, on-premises applications assume DevOps can’t work for them, and, according to IDC, the vast majority of enterprise apps still reside on-premises. Even in 2018, less than 28 percent of the worldwide enterprise applications market is expected to be software-as-a-service (SaaS)-based. But even if your organization is weighed down by legacy apps in your data center, that doesn’t mean you can’t do DevOps in the cloud.
DevOps Requires Cattle
A good reference point for DevOps is the “cattle versus pets” analogy: You should treat your servers as cattle, not pets. They should all be fed and treated identically, so that all of your test labs end up the same. The problem is, when it comes to test labs for on-premises apps, it seems as though everyone has pets, while only the new breed of SaaS apps are treated like cattle. That’s because these labs are very complex and have a way of accumulating their own unique settings, prompting conversations such as:
“What is this 5000-line XML?!?”
“Oh, that was configured by Adam in 1999 and we’ve copied it verbatim ever since.”
“Who’s Adam?”
“Oh, he hasn’t worked here for five years.”
Good DevOps practices require an organization to have multiple identical labs, available on-demand. You don’t get that with pet labs. Therein lies the problem.
The Dreaded Configuration Drift
Pet labs also lead to other bad phenomena, such as configuration drift. Configuration drift results from the fact that it’s really hard to maintain even one lab correctly, so deploying and managing two or three or 20 labs becomes a daunting and very error-prone task. Because it’s easier, people often simply copy images of other labs. But over time, those labs keep getting manually reconfigured to account for various version needs and bug fixes. Hence, the “drift.”
One cure for drift is starting from scratch every time. But for those “pet” labs (like the one configured single-handedly by Adam, who no longer works here and was the last person who knew how and why it was set up), starting from scratch every time is intimidating and impractical, so they just keep drifting along. That means even though your test labs are supposed to be exact duplicates, they aren’t.
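One lightweight way to at least keep drift visible is to treat the lab’s expected configuration as data kept in version control and regularly check the live lab against it. Below is a minimal sketch in Python; the baseline file name and its format are assumptions made for illustration, not any particular tool’s convention:

```python
# Minimal drift-detection sketch: hash the config files a lab depends on and
# compare them against a baseline manifest kept in version control, so drift
# is caught early instead of accumulating silently.
import hashlib
import json
import pathlib
import sys

# Hypothetical manifest: {"/etc/app/settings.xml": "<sha256>", ...}
BASELINE = "lab_baseline.json"

def sha256(path: str) -> str:
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()

def main() -> int:
    baseline = json.loads(pathlib.Path(BASELINE).read_text())
    drifted = [p for p, digest in baseline.items() if sha256(p) != digest]
    for path in drifted:
        print(f"DRIFT: {path} no longer matches the baseline")
    return 1 if drifted else 0

if __name__ == "__main__":
    sys.exit(main())
```

Run from a nightly job, a non-zero exit code flags a lab that has started to drift, turning Adam’s 5,000-line XML from folklore into a tracked artifact.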
‘It Works for Me’ and ‘Cannot Reproduce’ Syndromes
Inconsistencies between labs also lead to some common syndromes in bug tracking systems known as “works for me” and “cannot reproduce.”
Here’s how these play out: The QA team has its own set of labs (hopefully they aren’t individual snowflakes), as does each dev team, and then there are customer environments, another type of beast entirely. Now all hell breaks loose: some bugs appear only in QA and not in the customer or dev environments, while others show up only at the customer site. Miraculously, everything always works in the developer lab, because those labs haven’t been reinstalled since the era of vacuum tubes and contain every registry hack, debug symbol and kernel patch known to mankind.
The “cannot reproduce” syndrome wastes a lot of time. When a QA engineer finds a bug, they talk to a developer, but the developers aren’t in a bug-fixing sprint at the moment and say they’ll get to it in a couple of weeks. So the labs move on. Two weeks later, a developer finally puts in a few hours, can’t find the bug, and closes it out as “doesn’t reproduce.” Then QA spends another three hours trying to find it again.
The Serious Issue of Lab Scarcity
Last, but not least, physical in-house labs cause lab scarcity. Many companies find it too expensive to build and maintain the number of labs they would need to let all teams work in parallel. With a limited number of labs, developers have to wait for others to finish their tasks before they can claim a lab for their own work. From my own experience, we had developers who wouldn’t leave the lab to take lunch breaks because they didn’t want to risk losing their place in the queue; they actually lost weight from skipping lunch all week (maybe a good thing for some?).
Lab scarcity is a DevOps killer, not to mention a source of wasted developer time. (One can argue that lab wait time spent on Reddit isn’t truly wasted, but that’s another story.) The point is, you want a lab for every developer, every tester and every CI run. Disney World-level wait lines are not a DevOps enabler.
DevOps in the Cloud
The ultimate goal is identical, reproducible test environments for every test, tester and run: environments that are elastic and burstable, with short feedback loops. This is what brings you to the cloud.
Using the cloud to replicate and share identical labs that can easily be started over and saved at a specific state is the way to avoid these problems. But not all public cloud providers are created equal, especially when it comes to re-creating complex, on-premises environments in the cloud.
Commodity cloud providers such as AWS, Google and Azure are one option. However, they offer little to no support for uploading your own images, no complex L2/L3 networking, little to no nested virtualization and no promiscuous-mode network adapters. After all, they are built for “cloud-native” apps.
Another option is to build your own data center. But this is costly in terms of hardware (which must be refreshed every two to four years), fairly complicated to do properly and carries lots of overhead to create and maintain. Eventually, you’ll probably realize you’re not an infrastructure-as-a-service (IaaS) company and that you’d prefer to focus on your core business.
The best option is to look for something as convenient and agile as commodity cloud, but which is specialized for your on-premises challenges.
In many cases, you’ll find that specialized virtual IT lab providers let you replicate production server images in the cloud without modification, and save templates of production labs regardless of their complexity or number of nodes. These providers should enable you to quickly spin up as many identical environments as you need, then easily connect tools and scripts to automate processes across systems.
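To make that concrete, here is a rough sketch of what driving such a provider programmatically might look like. Everything specific here (the endpoint, the token variable and the blueprint name) is hypothetical rather than any particular vendor’s API:

```python
# Hypothetical REST calls to a virtual IT lab provider: spin up N identical
# environments from one "golden" template so teams can work in parallel.
import os
import requests

API = "https://lab-provider.example.com/api/v1"   # hypothetical endpoint
HEADERS = {"Authorization": f"Bearer {os.environ['LAB_API_TOKEN']}"}

def spin_up(blueprint: str, count: int) -> list[str]:
    """Create `count` identical environments from the same blueprint."""
    env_ids = []
    for i in range(count):
        resp = requests.post(
            f"{API}/environments",
            headers=HEADERS,
            json={"blueprint": blueprint, "name": f"ci-lab-{i}"},
            timeout=30,
        )
        resp.raise_for_status()
        env_ids.append(resp.json()["id"])
    return env_ids

if __name__ == "__main__":
    # One lab per developer, tester or CI run -- no queue, no skipped lunches.
    print(spin_up("golden-trading-stack", count=3))
```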
When it comes to debugging, make sure you can easily capture snapshots of issues so you can get back to the same spot later. Then look for automation capabilities that save time and control costs, such as environment policies, auto-suspend, Zapier integration, organizational structure modeling and more.
A Sample Use Case
Smaller startups often run into challenges because they want to run fast with very big systems. Take, for example, the case of a financial trading company that started out by investing in production instances and tweaking them manually. The system comprised more than a half-dozen subsystems, each with its own programming languages, technology stacks and deployment methodologies. Developers working on one subsystem had virtually no knowledge about the others, except for how the interfaces behaved. The setup worked in production (most of the time …), but then quality issues mounted.
The company first invested in a waterfall-like methodology with long QA cycles. But much of the QA happened in production because there was no complete staging system. And when you are a financial trading organization dealing with people’s money, doing QA on your production environment becomes costly in multiple ways.
The lack of a safety net in the form of a complete staging environment also slowed the deployment of new features. Even if a change was working nicely internally within the subsystem where it was made, it still would have to be orchestrated with the larger system—sort of like a composer who changes a note for one instrument then has to call all the players to hear it with the entire symphony.
The company eventually solved its issues by uploading complete images of its existing production machines to the cloud, then dividing them into subsystems with different blueprints. Once the company established a “golden blueprint” for each subsystem, testing became very easy: just mix and match servers from different blueprints. To continue the composer analogy, the company could now change the notes in one subsystem and hear how they sound with the whole symphony by playing them against recordings of all the other players.
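As a hedged illustration of that mix-and-match step (reusing the hypothetical provider API from the earlier sketch; the subsystem and blueprint names are invented), a test environment could combine a dev build of one subsystem with golden images of all the others:

```python
# Hypothetical composition request: one subsystem from a dev blueprint, the
# rest from their "golden" blueprints, assembled into a single test environment.
import os
import requests

API = "https://lab-provider.example.com/api/v1"   # hypothetical endpoint
HEADERS = {"Authorization": f"Bearer {os.environ['LAB_API_TOKEN']}"}

SUBSYSTEMS = {
    "pricing-engine": "pricing-engine-dev-build-42",  # the change under test
    "order-router": "order-router-golden",
    "risk-checker": "risk-checker-golden",
}

resp = requests.post(
    f"{API}/environments",
    headers=HEADERS,
    json={"name": "mix-and-match-test", "blueprints": list(SUBSYSTEMS.values())},
    timeout=30,
)
resp.raise_for_status()
print(f"Environment {resp.json()['id']} is ready for integration testing")
```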
The company ended up achieving a dramatic decline in the number of production issues, and significantly increased the speed at which it could develop new features (no doubt one runs faster on the high wire when a safety net exists).
Special Bonus for Docker Users
Snapshots also come in particularly handy when using Docker. In a Docker-based environment, current practice is to run CI tests on ephemeral Jenkins slaves whose state disappears between consecutive runs, meaning you only have logs of the end results. But what happens when there is a failure and you want to debug? The bug may appear non-deterministic when it is actually deterministic, because something changed between tests. And even if the bug is deterministic, you still have to re-create it. Imagine if, instead of running tests on ephemeral Jenkins slaves, you spin up an ephemeral cloud-based Docker machine and, in case of test failure, set your CI to take a snapshot. You can then go back to that snapshot later and employ traditional debugging techniques, saving precious developer time.
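As a sketch of that failure hook, here is what the CI job might call when tests fail. The snapshot endpoint and environment variables are assumptions, not a real product’s API:

```python
# Sketch of a CI failure hook: instead of letting an ephemeral Docker machine
# vanish, snapshot it when tests fail so the failing state can be debugged later.
import os
import sys
import requests

API = "https://lab-provider.example.com/api/v1"   # hypothetical endpoint
HEADERS = {"Authorization": f"Bearer {os.environ['LAB_API_TOKEN']}"}

def snapshot_on_failure(env_id: str, build_id: str) -> None:
    resp = requests.post(
        f"{API}/environments/{env_id}/snapshots",
        headers=HEADERS,
        json={"label": f"test-failure-{build_id}"},
        timeout=60,
    )
    resp.raise_for_status()
    print(f"Saved snapshot {resp.json()['id']} of environment {env_id}")

if __name__ == "__main__":
    # Wired into the CI job's failure step, e.g.:
    #   python snapshot_on_failure.py "$ENV_ID" "$BUILD_NUMBER"
    snapshot_on_failure(sys.argv[1], sys.argv[2])
```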
DevOps for All
Doing DevOps in the cloud can make all of this possible. No matter how old and complicated your architecture may have become, you can still get into the game. Just remember that cattle are easier to herd than cats.