I’ve received thousands of emails, ads and sales pitches for tools covering every stage of the application life cycle, from automating infrastructure buildout to release planning, building and testing, and delivering applications. If you’re building or rearchitecting an application, you have a surfeit of DevOps tools to choose from.
Ironically, when an organization invests in a variety of DevOps tools, it can end up with disconnected teams that are siloed by their tooling. Gartner calls this “disconnected islands of automation.” Beyond the culture problem, it can lead to major incidents and downtime.
You can’t tool yourself into DevOps. But here’s how to make sure you’re not hurting your team with the wrong tools.
Step 1: Process Before Tools
Figure out what problems you need to solve, then find tools that fit. It sounds simple, but many companies do the opposite: they pick a tool first, then use it to attack every problem.
“If all you have is a hammer, everything looks like a nail.” If all you have is one or two main tools, every problem gets fixed by those tools. Our team has been guilty of this, too. When you get really good at one tool, such as Puppet or Chef, the answer to every question is inevitably Puppet or Chef.
For our team, the fix has been to create process and architecture diagrams without tool names. Instead, we describe what needs to happen at each stage. If I know I’m building a long-running virtual machine, I know I need a configuration management component. Once we agree on the process, then we discuss tools.
Engineers are practical people who are trained to be constrained by the tools we use. To break out of this mindset, write out your desired code pipeline without tool names, even if you “know” the DevOps tools you’re going to use.
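As a minimal sketch of this exercise, you can describe a pipeline as a list of stages and required capabilities, deliberately leaving out tool names. The stage and capability names below are hypothetical examples, not a standard vocabulary:

```python
# A tool-agnostic pipeline description: each stage names a capability,
# not a product. Tool selection happens only after this is agreed upon.
PIPELINE = [
    {"stage": "source",    "capability": "version control"},
    {"stage": "build",     "capability": "artifact creation"},
    {"stage": "test",      "capability": "automated testing with gating"},
    {"stage": "provision", "capability": "infrastructure provisioning"},
    {"stage": "configure", "capability": "configuration management"},
    {"stage": "deploy",    "capability": "release orchestration"},
]

def describe(pipeline):
    """Render the pipeline as capabilities, with no tool names anywhere."""
    return [f"{p['stage']}: needs {p['capability']}" for p in pipeline]

for line in describe(PIPELINE):
    print(line)
```

Only once everyone agrees on these stages do you map each capability to a concrete tool, which keeps the tool from dictating the process.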
Step 2: Give Equal Priority to Infrastructure and Application Tools
A software company recently told me an unhappy story: The company spent years perfecting its continuous delivery pipeline and everything worked together like clockwork. Developers could test and run code in minutes.
Then the cloud region hosting the application went down. The development team knew how to push code, but nothing was reachable in that region. Some of the company’s delivery tooling ran in the same region, so even though other regions were up, the tools the team needed to perform failover weren’t available. And when the team tried to fail over to its disaster recovery (DR) environment, the DR configuration had not been kept up to date with production, so instance sizes were insufficient to run the application.
The moral of the story is that your infrastructure and your application need the same level of automation and agility. Even if you have a very mature CI/CD pipeline with gating, approvals, linting and so on, if you don’t have the same discipline and tooling around your infrastructure stack, you’re just as down when the infrastructure fails. You should be able to move your application from one region to another (and, if you’re really ambitious, from one cloud to another). And if infrastructure development isn’t incorporated into your application development process, your infrastructure will grow out of date over time.
You should be asking questions such as:
- What steps can we take to automate infrastructure provisioning?
- How can we continually test failure?
- What steps can we take to survive a platform failure?
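One way to make the “continually test failure” question concrete is a recurring game-day drill: pick a region, declare it dead, and verify the application can still be served from somewhere else. The sketch below is a simplified simulation under assumed region names, not a real cloud API call:

```python
import random

# Hypothetical multi-region deployment: region name -> healthy?
REGIONS = {"us-east-1": True, "us-west-2": True, "eu-west-1": True}

def drill(regions, failed_region):
    """Simulate losing one region; return True if any healthy region remains
    to take traffic. A real drill would also verify capacity and DR config."""
    surviving = {r: up for r, up in regions.items() if r != failed_region and up}
    return len(surviving) > 0

# Run the drill against a randomly chosen "victim" region.
victim = random.choice(list(REGIONS))
assert drill(REGIONS, victim), f"no failover target after losing {victim}"
```

Running a drill like this on a schedule, rather than once, is what exposes drift such as the out-of-date DR configuration in the story above.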
Step 3: Integrate Infrastructure and Application Tools
The best way to survive a platform failure is to continually “practice” building out entirely new infrastructure and application stacks—as one continuous process, not as two different activities.
Here’s one extreme example. The team at Logicworks recently worked with a large enterprise software company that was launching an internal product with a custom deployment pipeline. The pipeline included dozens of parallel automated and manual tests, each in a separate development environment. When a new deployment to the development environment occurs, a new set of 45+ dev instances is created with the new version of the code; when testing finishes, the instances are terminated. As a result, instances in this environment rarely last longer than 24 hours, and over the course of a single week, hundreds of instances are terminated and rebuilt.
As you can imagine, this company has never experienced an infrastructure outage it couldn’t recover from, since it already recovers from “failure” hundreds of times a week. This is the true meaning of immutable infrastructure: virtual instances are disposable, and once you instantiate the infrastructure and code, you never change the instance. The infrastructure never strays from its initial, known-good state, operations are simplified, and the system is so good at replacing itself that failure is a non-event.
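The immutable pattern can be sketched in a few lines: deployment never modifies a running instance, it replaces the whole fleet from a known-good image. This is a toy simulation with made-up function names, not any particular cloud SDK:

```python
import itertools

_ids = itertools.count(1)  # stand-in for cloud-assigned instance IDs

def launch(image_version):
    """Create a fresh instance from an immutable image."""
    return {"id": next(_ids), "image": image_version}

def roll(fleet, new_version):
    """Deploy by replacement: terminate every old instance and launch new
    ones with the new image, rather than updating instances in place."""
    for old in fleet:
        pass  # a real system would terminate old["id"] here
    return [launch(new_version) for _ in fleet]

fleet = [launch("v1") for _ in range(3)]
fleet = roll(fleet, "v2")  # every instance is now a brand-new "v2" instance
```

Because every deployment exercises the same replace-everything path that a failure would, recovering from an outage is just another deployment.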
Ideally, infrastructure automation and application automation are not separate stacks owned by separate teams, but a single stack with the same level of destructive testing, so that every change to your application is tested against the current state of your infrastructure.
Summary
The constant release of new DevOps tools and features can be overwhelming. But the biggest challenges for your DevOps team probably have nothing to do with CI/CD tools. I have found that it’s easy to overemphasize tools and deprioritize crucial (albeit less glamorous) automation components such as infrastructure templates and testing.
Your developers are constantly evaluating new libraries and experimenting with new languages and tools for your applications. Apply the same level of attention and testing across your full infrastructure and application stack. This will help you choose better application-level tools and prepare you for failure.