5 Ways to Reduce DevOps Toil

Over the last several years, DevOps has become a bit of a buzzword. It has become simultaneously a practice, a culture, a team, a job title and a vendor product. You can hire some DevOps, buy some DevOps, adopt DevOps and sprinkle a little bit of DevOps on top for good measure.

But, at its core, DevOps is about service ownership: having a single team with end-to-end ownership of software running in production. You should not have one team building software and another team deploying and operating it. That’s slow. DevOps is literally combining development and operations into a single consolidated team. DevOps means embracing the mantra of “you build it, you own it.”

The primary benefit of DevOps is increased speed and agility of engineering teams. With DevOps, exactly one team is responsible for a given piece of production software. There are no handoffs or knowledge silos. Practices like continuous delivery can emerge. Features can be put in front of your customers’ eyes with a shorter turnaround time. A secondary benefit is increased reliability and security. With “you build it, you own it,” it is no longer someone else’s job to ensure that software is correct or performant or reliable or secure. These responsibilities are “shifted left” onto exactly one team.

So, where do things go off the rails?

Service ownership and DevOps mean giving teams the agency to make whatever changes are necessary to their software. That can and will come at the expense of product functionality at times. So, the first place things go off the rails is when there is not a healthy balance between product needs and engineering needs. If teams are constantly working on new features but do not have the capacity to fix operational issues that are affecting them, product quality will eventually suffer. This product-engineering imbalance generally comes from misaligned incentives or poorly defined goals from management.

The second place where things go off the rails is insufficient training. Organizations early in their journey to adopting DevOps, which may still have separate development and operations teams, need a lot of training on new technologies and processes. Developers need to learn about observability, responding to incidents and debugging production applications. Operators and sysadmins need to learn about coding and the principles of software design. These are new skills for both camps, and investment in training and mentorship is required.

What are some best practices for keeping DevOps on track?

There are many such practices, and which ones matter most depends on your starting point as an engineering organization. Here are five important ones:

1. Track Service Ownership

If DevOps is about “you build it, you own it,” you need an authoritative list of a) what’s running in production and b) which team owns it. If a piece of software goes down, you do not want to find out that the last person to work on it was the intern who left three years ago. Many folks start tracking this kind of information in spreadsheets or wikis, but keeping the data up-to-date becomes challenging. SaaS service catalogs have emerged as the easiest way to automatically keep this data accurate, complete and consistent.
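As a rough illustration of what "authoritative" means here, the sketch below compares a set of running services against a catalog and reports the gaps. The `CatalogEntry` type, the `find_ownership_gaps` function and the service and team names are all hypothetical; a real catalog would pull both lists from your own infrastructure rather than hard-coding them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """One row of a hypothetical service catalog."""
    service: str
    owner_team: str  # an empty string means no owner has been recorded


def find_ownership_gaps(running, catalog):
    """Return (unlisted, unowned) service names, each sorted alphabetically.

    unlisted: running in production but missing from the catalog.
    unowned:  present in the catalog but with no owning team recorded.
    """
    by_name = {entry.service: entry for entry in catalog}
    unlisted = sorted(name for name in running if name not in by_name)
    unowned = sorted(
        name for name in running
        if name in by_name and not by_name[name].owner_team.strip()
    )
    return unlisted, unowned


if __name__ == "__main__":
    # Made-up data for illustration only.
    catalog = [
        CatalogEntry("checkout", "payments"),
        CatalogEntry("search", ""),
    ]
    running = {"checkout", "search", "legacy-cron"}
    unlisted, unowned = find_ownership_gaps(running, catalog)
    print("Running but not in the catalog:", unlisted)
    print("In the catalog but without an owner:", unowned)
```

Running a check like this on a schedule is one way to notice ownership drift before an incident does.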

2. Adopt Continuous Integration, Continuous Delivery and Feature Flags

Committing small changes multiple times per day is superior to release trains where one large change is committed on an infrequent basis. CI/CD allows for new features and improvements to be tested faster and put in front of customers more quickly. Feature flag tools let you easily test new functionality and product hypotheses for different segments of customers.
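To make the feature flag idea concrete, here is a minimal sketch of a percentage rollout, assuming users are identified by a string ID. It is not how any particular feature flag product works; the function names and the flag name are invented for illustration. Hashing the flag name together with the user ID gives each user a stable bucket, so the same user sees the same behavior on every request.

```python
import hashlib


def bucket_for(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Return True if the user falls inside the rollout percentage."""
    return bucket_for(flag_name, user_id) < rollout_percent


if __name__ == "__main__":
    # Hypothetical flag and user IDs; the share should land near 10%.
    users = [f"user-{i}" for i in range(1000)]
    enabled = sum(is_enabled("new-checkout", user, 10) for user in users)
    print(f"{enabled} of {len(users)} users see the new checkout")
```

Because a user's bucket never changes, raising the percentage only adds users to the rollout; nobody who already has the feature loses it.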

3. Embrace Incident Management

Software breaks sometimes. Having a robust process for how to triage and respond to both alerts and incidents is important. Within the context of DevOps, the team that owns a piece of software should be responsible for being on-call to respond to associated incidents. If the software breaks or behaves in some unexpected way, the team is expected to fix it. The team is also responsible for identifying what happened and investing in ways to ensure the same error does not occur again.

4. Treat Everything as Code, Then Automate All the Things

Applications and microservices? That’s code. Infrastructure? That’s code, too. Security policies? Also code. Deployments? You guessed it: code.

Treating everything as code allows processes like code review to extend beyond just your applications. You can have multiple eyes on every change, which catches bugs sooner and spreads knowledge to others. In addition, version control systems (like Git) are a natural audit log, allowing anyone to see what was changed, when, and by whom.

With everything defined as code, the next step is to invest in automation. How many steps does it take for a developer to deploy new code to production? Or to spin up new infrastructure? Or to hook up monitoring or error tracking tools?

Your goal should be “one step”: run a command. It can be a shell script, a Slack bot or a button in a nice web UI. But it should always be one step. If there are more, you’re now in “manual steps” territory, and that’s brittle. You have no guarantee that different people will run the same steps in the same order, which can lead to divergence in how various parts of production are set up.
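A minimal sketch of the "one step" idea, assuming a pipeline of test, build and deploy: a single script owns the order of the steps and stops at the first failure. The step names and commands in `STEPS` are placeholders that only print a message; they stand in for whatever your real pipeline runs.

```python
import subprocess
import sys

# Hypothetical pipeline. Each placeholder command just prints a message;
# replace them with your real test, build and deploy commands.
STEPS = [
    ("test", [sys.executable, "-c", "print('running tests')"]),
    ("build", [sys.executable, "-c", "print('building artifact')"]),
    ("deploy", [sys.executable, "-c", "print('deploying')"]),
]


def run_steps(steps):
    """Run each step in order and stop at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed = []
    for name, command in steps:
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"step '{name}' failed with exit code {result.returncode}",
                  file=sys.stderr)
            break
        completed.append(name)
    return completed


def main() -> int:
    """Return 0 if every step succeeded, 1 otherwise."""
    return 0 if len(run_steps(STEPS)) == len(STEPS) else 1
```

Saved as a script that ends with `raise SystemExit(main())`, this becomes the single command everyone runs. The ordering lives in version control rather than in someone's memory, so it gets reviewed like any other change.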

Another example is maintaining a list of a) what’s running in production and b) who owns it. Manually keeping that list up-to-date is impossible as the size of your production infrastructure or engineering team scales. You’re better off investing in service catalog tooling that can automatically integrate and capture this information.

5. Invest in Reducing Toil

Service ownership and DevOps are two-way streets between management and their engineering teams. On one hand, consolidated ownership will lead to faster delivery and more reliable software. That’s good for management. But, on the other hand, teams require actual empowerment and agency to make decisions about their software. Part of that is having capacity on their roadmaps to fix the issues that are causing toil.

For example, if a particular service is paging a lot or experiencing load issues, service ownership means giving the owning team the time to make any necessary changes. That will come at the short-term expense of new product functionality.
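One way to find out which service is "paging a lot" is simply to count pages per service over some period. The sketch below assumes each page is a record with a `service` field; that shape, the `noisiest_services` function and the sample data are all hypothetical, and in practice the records would come from an export of your paging tool.

```python
from collections import Counter


def noisiest_services(pages, top_n=3):
    """Count pages per service and return the top_n as (service, count) pairs."""
    counts = Counter(page["service"] for page in pages)
    return counts.most_common(top_n)


if __name__ == "__main__":
    # Made-up page records for illustration only.
    pages = [
        {"service": "checkout", "summary": "high latency"},
        {"service": "search", "summary": "disk almost full"},
        {"service": "checkout", "summary": "error rate above threshold"},
        {"service": "checkout", "summary": "high latency"},
    ]
    for service, count in noisiest_services(pages):
        print(f"{service}: {count} pages")
```

A ranking like this gives the owning team and management a shared, concrete starting point for deciding where roadmap capacity should go.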

There’s no free lunch on this last point. Engineering management can’t just combine dev and ops responsibilities and magically expect results. Service ownership is a two-way street. Otherwise, toil will kill your teams in the long term.