DevOps.com

Shedding Light On Toil: Ways Engineers Can Reduce Toil

By: Bill Doerrfeld on May 2, 2022

It comes in many shapes and sizes and is embraced as a necessary evil. It lurks in the shadows, emerging now and again to stealthily creep into our workflows, where it feasts on our perceived shame. If not pruned, it grows and wraps its tendrils around the workforce, draining the energy out of every digital worker across departments. We all have it and most of us believe we suffer alone.

I’m talking, of course, about toil.

Dramatics aside, toil is common throughout most programming and DevOps positions, whether we like it or not. And when it comes to platform engineering, the chances of encountering toil are high. You can think of toil as those tedious workarounds that should be automated but aren’t. This could be due to a lack of standard configurations for deployments. Or, perhaps engineers must copy and paste data from one module to another because the integration has not yet been automated.

One of the hallmarks of the SRE role is to spot and reduce such toil—hopefully before it gets out of hand. But, what exactly is toil? And, how can SREs reduce it?

I recently met with Jake Englund, senior site reliability engineer, Blameless, and Matt Davis, intuition engineer, Blameless, to uncover some tips on spotting and eliminating toil in the daily life of an SRE. Below, we’ll attempt to define toil and explore some techniques to defeat this ghoulish foe.

First, What is Toil?

Toil is tedious. Englund defines toil as repetitive manual tasks that are, ideally, automatable. “It’s not gratifying work; it lacks enduring value.” James Curtis similarly described toil as “tedious actions that really have no enduring value.” Toil is similar to technical debt in that, if left unchecked, it can hinder productivity and contribute to burnout.

Toil is invisible. Toil is the invisible work that goes on behind the scenes, says Davis. For example, at your team meeting or stand-up you might say something like, “I’m deploying this fix.” There’s usually no need to explain the nitty-gritty details of your exact step-by-step process, as it would be too far in the weeds. As a result, manual and repetitive tasks are often not given the spotlight.

Toil is persistent. Toil might be readily apparent, but that doesn’t make it simple to eliminate; attempts at fixing toil could potentially create new toil. Tedious tasks may also be essential to keeping core infrastructure pieces up-to-date. According to Englund, what often ends up happening is that only one specific engineer knows the niche mechanics of what they’re maintaining, which makes them irreplaceable, a power that some employees might abuse. “If you’re the only one who knows how to do something and you’re hoarding that information, it leads to inertia against resolving toil,” said Englund.

Toil is pervasive. Toil is not limited to SRE or engineering. From UX to management or marketing, within most positions, you’ll encounter a degree of toil. This is because the ‘work as prescribed’ is often much different than ‘work as done,’ said Davis.

Four Tips To Reduce Toil

So, what can organizations do to conquer the nightmarish reality that is toil? Especially those that don’t have a dedicated engineering team to reduce drudgery?

1. Find It!

To reduce toil, you first have to find it. Sounds simple enough, right? Well, discovering toil might not be as obvious as it sounds. “You get into these toil situations where you might not even know you’re in it. How do you know that toil is encompassing what you’re doing?” asked Davis.

To find toil, the Blameless team has a weekly session open to the whole company to examine ‘work as done.’ According to Davis, this has been very beneficial for discovering where toil occurs. It also helps garner perspectives from roles throughout the entire company.

2. Standardize and Automate

Next, a proven method to reduce toil is standardizing your engineering processes. Standardization, said Davis, is “another word for removing ambiguity.” For example, SREs can benefit from having a common series of steps to follow when responding to alerts. Runbooks that aggregate data to respond to issues can reduce toil in the data collection process while also helping improve mean-time-to-discovery and recovery rates.
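As a rough illustration of a runbook that aggregates data for the responder, here is a minimal Python sketch. Everything in it is hypothetical: the collector functions, the `RUNBOOK` list and the shape of the `alert` dictionary are assumptions made for the example, not anything Blameless described. The point is only that a fixed, ordered set of collection steps replaces ad hoc digging each time an alert fires.

```python
from datetime import datetime, timezone

# Hypothetical collectors. Real ones would query a deploy log or a
# metrics backend; these stubs only mark where that lookup would go.
def recent_deploys(alert):
    return ["(stub) no deploy data source configured"]

def error_rate(alert):
    return "(stub) no metrics source configured"

# The standardized, ordered list of data-collection steps.
RUNBOOK = [
    ("recent_deploys", recent_deploys),
    ("error_rate", error_rate),
]

def collect_context(alert, steps=RUNBOOK):
    """Run every collector in order and return one report for the responder."""
    report = {
        "service": alert["service"],
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "findings": {},
    }
    for name, collector in steps:
        try:
            report["findings"][name] = collector(alert)
        except Exception as exc:
            # One failing collector should not hide the other findings.
            report["findings"][name] = f"collection failed: {exc}"
    return report
```

Calling `collect_context({"service": "checkout"})` returns a single dictionary with one entry per step, so every responder starts from the same set of facts.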

Standardizing change management is another way to reduce toil. If engineers are approaching change management in various ways, this could cause additional review and slow down the release process. Removing ambiguity from the change lifecycle can alleviate toil down the line.

When you find yourself having to retain mental models for configurations, it’s a good sign that standardization is required, explained Englund. For example, perhaps some configurations are needed in some environments but are commented out in others. Documenting the specific configurations for different development environments (and automatically populating configurations) is one way to reduce toil and the chance for human error.
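One way to picture automatically populated configurations is a shared base plus a small set of documented per-environment overrides, so nothing has to be commented in or out by hand. The sketch below is an assumption-laden example, not a method from Englund: the setting names, the environment names and the `config_for` helper are all invented for illustration.

```python
from copy import deepcopy

# Hypothetical settings shared by every environment.
BASE_CONFIG = {
    "log_level": "INFO",
    "replicas": 1,
    "feature_flags": {"debug_toolbar": False},
}

# Hypothetical per-environment overrides; only the differences are listed.
OVERRIDES = {
    "dev": {"log_level": "DEBUG", "feature_flags": {"debug_toolbar": True}},
    "staging": {"replicas": 2},
    "prod": {"replicas": 4, "log_level": "WARNING"},
}

def merge(base, override):
    """Return a new dict with override applied recursively on top of base."""
    result = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result

def config_for(env):
    """Build the full configuration for one named environment."""
    if env not in OVERRIDES:
        raise KeyError(f"unknown environment: {env}")
    return merge(BASE_CONFIG, OVERRIDES[env])
```

Because each environment's differences live in one place, the override table doubles as the documentation, and an unknown environment name fails loudly instead of silently falling back to defaults.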

3. Proactively Monitor

More proactive monitoring is another way to reduce toil, according to Englund and Davis. “Responding to a crash loop is responding too late,” added Davis. Instead, he advocated that SREs look toward leading indicators that suggest the potential for failure so that teams can make adjustments well before anything drastic occurs.

If SLIs like error rate and latency are already degrading, you must take reactive measures to fix them, causing more toil. Instead, proactive monitoring is best to see the cresting wave before the flood. Leading indicators could arise from following things like data queue operations connected to servers or the saturation of a particular resource. “If you can figure out when you’re about to fail, you can be prepared to adapt,” said Davis.
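To make the idea of a leading indicator concrete, the sketch below estimates how long a resource has before it saturates by fitting a straight line to recent utilization samples. This is a simplified assumption, not a technique Davis or Englund named: the function name, the sample format and the linear-growth model are all illustrative, and real workloads rarely grow in a perfectly straight line.

```python
def time_to_saturation(samples, limit=1.0):
    """Estimate seconds until utilization reaches `limit`.

    `samples` is a list of (timestamp_seconds, utilization) pairs, oldest
    first. The growth rate is the least-squares slope over all samples and
    is projected forward from the most recent reading. Returns None when
    there is too little data or utilization is not trending upward.
    """
    if len(samples) < 2:
        return None
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    spread = sum((t - mean_t) ** 2 for t, _ in samples)
    if spread == 0:
        return None  # all samples share one timestamp
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / spread
    if slope <= 0:
        return None  # flat or shrinking: no saturation forecast
    _, last_u = samples[-1]
    if last_u >= limit:
        return 0.0  # already saturated
    return (limit - last_u) / slope

def should_warn(samples, horizon_seconds, limit=1.0):
    """True when the forecast says saturation falls inside the horizon."""
    eta = time_to_saturation(samples, limit)
    return eta is not None and eta <= horizon_seconds
```

A check such as `should_warn(disk_samples, horizon_seconds=4 * 3600)` could then alert hours ahead of a full disk rather than after the crash loop has started.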

4. Practice How to Improvise

One major caveat of standardization is that you’re inevitably going to encounter edge cases that require flexibility. And when an outage or issue does arise, the remediation process often differs from case to case. As a result, not every investment in standardization pays off.

By contrast, teams that know how to improvise together are better equipped for unforeseen incidents: one study on Arctic Sea emergencies found collective improvisation to be imperative for emergency response situations. Similarly, collaboration is vital for software teams. Davis calls this the “practice of practice.” If a team doesn’t practice backup restores, for example, it may flounder in the case of a database outage. Yet, since every remediation effort is unique, it really boils down to learning how to collaborate effectively as a team. “We can gain a lot from getting together to practice,” added Davis.

All the Small Things

Toil is like manually typing an email signature every time you are about to hit “Send.” Maybe you’ve just been putting off configuring your signature in your email client. Or, perhaps you like to write a custom sign-off for each recipient. Regardless of the reason, the longer this repetitive motion persists, the more it becomes a stick in the spokes of productivity.

In the grand scope of things, this tedium can be a drain on time and resources. For every extra keystroke or click, every un-automated workflow or every copied-and-pasted code block, a small degree of toil is piled onto the daily workload of an engineer, dragging them away from focusing on what really matters.

Standardization, automation, proactive monitoring and collaboration are all helpful elements to consider on your journey toward reducing toil. But of course, there is a trap in automating more and more of software development: the time saved by automation must outweigh the upfront cost of building it (and the ongoing maintenance effort). Otherwise, you might end up creating even more toil.
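That trade-off can be checked with simple arithmetic before any automation is written. The sketch below is a back-of-the-envelope model under stated assumptions: the function names and parameters are invented for this example, and it treats both the manual effort and the maintenance burden as constant from week to week.

```python
def weekly_net_saving(minutes_per_run, runs_per_week, maintenance_hours_per_week):
    """Hours saved each week once the automation exists, net of upkeep."""
    manual_hours = minutes_per_run * runs_per_week / 60
    return manual_hours - maintenance_hours_per_week

def break_even_weeks(minutes_per_run, runs_per_week, build_hours,
                     maintenance_hours_per_week=0.0):
    """Weeks until the build cost is repaid, or None if it never is."""
    net = weekly_net_saving(minutes_per_run, runs_per_week,
                            maintenance_hours_per_week)
    if net <= 0:
        return None  # upkeep eats the savings: automating adds toil
    return build_hours / net

# A 10-minute task done 30 times a week costs 5 hours of manual work.
# With 20 hours to build and 1 hour of weekly upkeep, the net saving is
# 4 hours a week, so the automation pays for itself after 5 weeks.
print(break_even_weeks(10, 30, build_hours=20, maintenance_hours_per_week=1))
```

When `break_even_weeks` returns `None`, or a number longer than the task is expected to exist, the manual process is probably the cheaper option for now.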
