DevOps.com


Shedding Light On Toil: Ways Engineers Can Reduce Toil

By: Bill Doerrfeld on May 2, 2022

It comes in many shapes and sizes and is embraced as a necessary evil. It lurks in the shadows, emerging now and again to stealthily creep into our workflows, where it feasts on our perceived shame. If not pruned, it grows and wraps its tendrils around the workforce, draining the energy out of every digital worker across departments. We all have it and most of us believe we suffer alone.

I’m talking, of course, about toil.

Dramatics aside, toil is common throughout most programming and DevOps positions, whether we like it or not. And when it comes to platform engineering, the chances of encountering toil are high. You can think of toil as those tedious workarounds that should be automated but aren’t. This could be due to a lack of standard configurations for deployments. Or perhaps engineers must copy and paste data from one module to another because the integration has not yet been automated.

One of the hallmarks of the SRE role is to spot and reduce such toil—hopefully before it gets out of hand. But, what exactly is toil? And, how can SREs reduce it?

I recently met with Jake Englund, senior site reliability engineer, Blameless, and Matt Davis, intuition engineer, Blameless, to uncover some tips on spotting and eliminating toil in the daily life of an SRE. Below, we’ll attempt to define toil and explore some techniques to defeat this ghoulish foe.

First, What is Toil?

Toil is tedious. Englund defines toil as repetitive manual tasks that are, ideally, automatable. “It’s not gratifying work; it lacks enduring value.” James Curtis similarly described toil as “tedious actions that really have no enduring value.” Toil is similar to technical debt in that, if left unchecked, it can hinder productivity and contribute to burnout.

Toil is invisible. Toil is the invisible work that goes on behind the scenes, says Davis. For example, at your team meeting or stand-up you might say something like, “I’m deploying this fix.” There’s usually no need to explain the nitty-gritty details of your exact step-by-step process, as it would be too far in the weeds. As a result, manual and repetitive tasks are often not given the spotlight.

Toil is persistent. Toil might be readily apparent, but it is not simple to eliminate; attempts at fixing toil could potentially create new toil. Tedious tasks may also be essential to keeping core infrastructure pieces up-to-date. According to Englund, what often ends up happening is that only one specific engineer knows the niche mechanics of what they’re maintaining, making them irreplaceable, a power that some employees might abuse. “If you’re the only one who knows how to do something and you’re hoarding that information, it leads to inertia against resolving toil,” said Englund.

Toil is pervasive. Toil is not limited to SRE or engineering. From UX to management or marketing, within most positions, you’ll encounter a degree of toil. This is because the ‘work as prescribed’ is often much different than the ‘work as done,’ said Davis.

Four Tips To Reduce Toil

So, what can organizations do to conquer the nightmarish reality that is toil? Especially those that don’t have a dedicated engineering team to reduce drudgery?

1. Find It!

To reduce toil, you first have to find it. Sounds simple enough, right? Well, discovering toil might not be as obvious as it sounds. “You get into these toil situations where you might not even know you’re in it. How do you know that toil is encompassing what you’re doing?” asked Davis.

To find toil, the Blameless team has a weekly session open to the whole company to examine ‘work as done.’ According to Davis, this has been very beneficial for discovering where toil occurs. It also helps garner perspectives from roles throughout the entire company.

2. Standardize and Automate

Next, a proven method to reduce toil is standardizing your engineering processes. Standardization, said Davis, is “another word for removing ambiguity.” For example, SREs can benefit from having a common series of steps to follow when responding to alerts. Runbooks that aggregate data to respond to issues can reduce toil in the data collection process while also helping improve mean-time-to-discovery and recovery rates.

Standardizing change management is another way to reduce toil. If engineers are approaching change management in various ways, this could cause additional review and slow down the release process. Removing ambiguity from the change lifecycle can alleviate toil down the line.

When you find yourself having to retain mental models for configurations, it’s a good sign that standardization is required, explained Englund. For example, perhaps some configurations are needed in some environments but are commented out in others. Documenting the specific configurations for different development environments (and automatically populating configurations) is one way to reduce toil and the chance for human error.
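The idea of documenting per-environment configurations and populating them automatically can be sketched roughly as follows. This is a minimal illustration, not a description of how Blameless does it: the keys, the environment names and the helpers `deep_merge` and `render_config` are all hypothetical. The point is that the differences between environments live in one documented place instead of in someone's head or in commented-out lines.

```python
from copy import deepcopy

# Hypothetical shared defaults; none of these keys come from the article.
BASE_CONFIG = {
    "log_level": "info",
    "replicas": 2,
    "feature_flags": {"debug_toolbar": False},
}

# Hypothetical per-environment overrides: the only place where
# environments are allowed to differ from the defaults.
ENV_OVERRIDES = {
    "dev": {"log_level": "debug", "replicas": 1,
            "feature_flags": {"debug_toolbar": True}},
    "staging": {"replicas": 1},
    "prod": {"replicas": 4},
}


def deep_merge(base, override):
    """Return a new dict with override applied on top of base.

    Nested dicts are merged key by key; any other value is replaced.
    Neither input is modified.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def render_config(env):
    """Build the full configuration for one named environment."""
    if env not in ENV_OVERRIDES:
        raise KeyError(f"unknown environment: {env}")
    return deep_merge(BASE_CONFIG, ENV_OVERRIDES[env])
```

Calling `render_config("prod")` yields the defaults with only the documented production differences applied, so nobody has to remember which lines to uncomment for which environment.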

3. Proactively Monitor

More proactive monitoring is another way to reduce toil, according to Englund and Davis. “Responding to a crash loop is responding too late,” added Davis. Instead, he advocated that SREs look toward leading indicators that suggest the potential for failure so that teams can make adjustments well before anything drastic occurs.

If SLIs like error rate and latency are already degrading, you are forced into reactive measures to fix them, which causes more toil. Proactive monitoring instead aims to spot the cresting wave before the flood. Leading indicators could come from tracking things like data queue operations connected to servers or the saturation of a particular resource. “If you can figure out when you’re about to fail, you can be prepared to adapt,” said Davis.
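One simple way to turn resource saturation into a leading indicator is to fit a straight line through recent samples and estimate when it will cross a limit. The sketch below assumes samples arrive as (timestamp in seconds, utilization) pairs; the function names, the threshold and the warning horizon are illustrative choices, not something Englund or Davis prescribed, and real workloads are rarely this linear.

```python
def time_to_threshold(samples, threshold):
    """Estimate seconds from the last sample until the trend hits threshold.

    samples is a list of (timestamp_seconds, value) pairs. A least-squares
    line is fitted through them. Returns None when there are too few
    samples or the trend is flat or falling.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    spread = sum((t - mean_t) ** 2 for t, _ in samples)
    if spread == 0:
        return None  # all samples share one timestamp
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / spread
    if slope <= 0:
        return None  # not trending toward the threshold
    intercept = mean_v - slope * mean_t
    crossing_time = (threshold - intercept) / slope
    last_time = samples[-1][0]
    # Clamp to zero if the fitted line has already crossed the threshold.
    return max(0.0, crossing_time - last_time)


def should_warn(samples, threshold, horizon_seconds):
    """True when the fitted trend reaches threshold within the horizon."""
    eta = time_to_threshold(samples, threshold)
    return eta is not None and eta <= horizon_seconds
```

For example, disk usage samples of 50, 55 and 60 percent taken a minute apart put a 90 percent limit about six minutes away, which leaves time to adapt before anything crash-loops.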

4. Practice How to Improvise

One major caveat of standardization is that you’re inevitably going to encounter edge cases that require flexibility. And when an outage or issue does arise, the remediation process is often unique from case to case. As a result, not all investment into standardization pays off.

Alternatively, teams that know how to improvise together are proven to be better equipped for unforeseen incidents—one study on Arctic Sea emergencies found collective improvisation to be imperative for emergency response situations. Similarly, collaboration is vital for software teams. Davis calls this the “practice of practice.” If you don’t practice backup restores, for example, teams may flounder in the case of a database outage. Yet, since every remediation effort is unique, it really boils down to learning how to collaborate effectively as a team. “We can gain a lot from getting together to practice,” added Davis.

All the Small Things

Toil is like manually typing an email signature every time you are about to hit “Send.” Maybe you’ve just been putting off configuring your signature in your email client. Or, perhaps you like to write a custom sign-off for each recipient. Regardless of the reason, the longer this repetitive motion is required, the more it becomes a stick in the spokes of productivity.

In the grand scope of things, this tedium can be a drain on time and resources. For every extra keystroke or click, every un-automated workflow or every copied-and-pasted code block, a small degree of toil is piled onto the daily workload of an engineer, dragging them away from focusing on what really matters.

Standardization, automation, proactive monitoring and collaboration are all helpful elements to consider on your journey toward reducing toil. But of course, there is a trap in adding more automation: the time the automation saves must outweigh its upfront development cost (and ongoing maintenance effort). Otherwise, you might end up creating even more toil.
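That trade-off can be checked with back-of-the-envelope arithmetic before any automation is built. The sketch below is one hypothetical way to frame it; the function names and the example figures are assumptions for illustration, and honest estimates of build and maintenance effort matter more than the formula.

```python
def automation_payoff_hours(minutes_per_run, runs_per_week, build_hours,
                            maintenance_hours_per_week, horizon_weeks):
    """Net hours saved over the horizon; negative means more toil was created."""
    saved = minutes_per_run * runs_per_week / 60 * horizon_weeks
    cost = build_hours + maintenance_hours_per_week * horizon_weeks
    return saved - cost


def break_even_weeks(minutes_per_run, runs_per_week, build_hours,
                     maintenance_hours_per_week):
    """Weeks until the build cost is repaid, or None if it never is."""
    weekly_net = (minutes_per_run * runs_per_week / 60
                  - maintenance_hours_per_week)
    if weekly_net <= 0:
        return None  # upkeep eats all of the time saved
    return build_hours / weekly_net
```

With a 15-minute task done 20 times a week, 40 hours to automate and one hour of upkeep per week, the automation pays for itself after 10 weeks. A 5-minute task done twice a week with half an hour of weekly upkeep never breaks even.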


Filed Under: DevOps Culture, DevOps Practice, Features Tagged With: automation, devops culture, SRE, technical debt, toil
