It comes in many shapes and sizes and is embraced as a necessary evil. It lurks in the shadows, emerging now and again to stealthily creep into our workflows, where it feasts on our perceived shame. If not pruned, it grows and wraps its tendrils around the workforce, draining the energy out of every digital worker across departments. We all have it and most of us believe we suffer alone.
I’m talking, of course, about toil.
Dramatics aside, toil is common throughout most programming and DevOps positions, whether we like it or not. And when it comes to platform engineering, the chances of encountering toil are high. You can think of toil as those tedious workarounds that should be automated but aren’t. This could be due to a lack of standard configurations for deployments. Or perhaps engineers must copy and paste data from one module to another, an integration that has not yet been automated.
One of the hallmarks of the SRE role is to spot and reduce such toil, hopefully before it gets out of hand. But what exactly is toil? And how can SREs reduce it?
I recently met with Jake Englund, senior site reliability engineer, Blameless, and Matt Davis, intuition engineer, Blameless, to uncover some tips on spotting and eliminating toil in the daily life of an SRE. Below, we’ll attempt to define toil and explore some techniques to defeat this ghoulish foe.
First, What is Toil?
Toil is tedious. Englund defines toil as repetitive manual tasks that are, ideally, automatable. “It’s not gratifying work; it lacks enduring value.” James Curtis similarly described toil as “tedious actions that really have no enduring value.” Toil is similar to technical debt in that, if left unchecked, it can hinder productivity and contribute to burnout.
Toil is invisible. Toil is the invisible work that goes on behind the scenes, says Davis. For example, at your team meeting or stand-up you might say something like, “I’m deploying this fix.” There’s usually no need to explain the nitty-gritty details of your exact step-by-step process, as it would be too far in the weeds. As a result, manual and repetitive tasks are often not given the spotlight.
Toil is persistent. Even when toil is readily apparent, it is not simple to eliminate; attempts at fixing toil can create new toil. Tedious tasks may also be essential to keeping core infrastructure pieces up-to-date. According to Englund, what often ends up happening is that only one specific engineer knows the niche mechanics of what they’re maintaining, making them irreplaceable, a power that some employees might abuse. “If you’re the only one who knows how to do something and you’re hoarding that information, it leads to inertia against resolving toil,” said Englund.
Toil is pervasive. Toil is not limited to SRE or engineering. From UX to management or marketing, within most positions, you’ll encounter a degree of toil. This is because ‘work as prescribed’ is often much different from ‘work as done,’ said Davis.
Four Tips To Reduce Toil
So, what can organizations do to conquer the nightmarish reality that is toil? Especially those that don’t have a dedicated engineering team to reduce drudgery?
1. Find It!
To reduce toil, you first have to find it. Sounds simple enough, right? Well, discovering toil might not be as obvious as it sounds. “You get into these toil situations where you might not even know you’re in it. How do you know that toil is encompassing what you’re doing?” asked Davis.
To find toil, the Blameless team has a weekly session open to the whole company to examine ‘work as done.’ According to Davis, this has been very beneficial for discovering where toil occurs. It also helps garner perspectives from roles throughout the entire company.
2. Standardize and Automate
Next, a proven method to reduce toil is standardizing your engineering processes. Standardization, said Davis, is “another word for removing ambiguity.” For example, SREs can benefit from having a common series of steps to follow when responding to alerts. Runbooks that aggregate data to respond to issues can reduce toil in the data collection process while also helping improve mean-time-to-discovery and recovery rates.
Standardizing change management is another way to reduce toil. If engineers are approaching change management in various ways, this could cause additional review and slow down the release process. Removing ambiguity from the change lifecycle can alleviate toil down the line.
When you find yourself having to retain mental models for configurations, it’s a good sign that standardization is required, explained Englund. For example, perhaps some configurations are needed in some environments but are commented out in others. Documenting the specific configurations for different development environments (and automatically populating configurations) is one way to reduce toil and the chance for human error.
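As a rough illustration of automatically populating configurations, the sketch below keeps one shared set of defaults and a small, explicit set of overrides per environment, then builds each environment's full configuration from the two. Every name and value here (BASE_CONFIG, ENV_OVERRIDES, the keys, the environment names) is a hypothetical placeholder, not something described by Englund or Davis.

```python
from copy import deepcopy

# Hypothetical shared defaults; every key and value here is illustrative.
BASE_CONFIG = {
    "log_level": "INFO",
    "replicas": 2,
    "feature_flags": {"new_checkout": False},
}

# Hypothetical per-environment overrides, kept small and explicit so that
# differences between environments are documented in one place.
ENV_OVERRIDES = {
    "dev": {"log_level": "DEBUG", "replicas": 1},
    "staging": {"feature_flags": {"new_checkout": True}},
    "prod": {"replicas": 6},
}


def merge(base, override):
    """Return a new dict with override applied on top of base.

    Nested dicts are merged recursively; any other value is replaced.
    """
    result = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


def config_for(env):
    """Build the full configuration for one environment."""
    if env not in ENV_OVERRIDES:
        raise ValueError(f"unknown environment: {env!r}")
    return merge(BASE_CONFIG, ENV_OVERRIDES[env])
```

With this layout, nobody has to remember which lines are commented out where: calling config_for("prod") yields the defaults with only the production overrides applied, and an unknown environment name fails loudly instead of silently falling back.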
3. Proactively Monitor
More proactive monitoring is another way to reduce toil, according to Englund and Davis. “Responding to a crash loop is responding too late,” added Davis. Instead, he advocated that SREs look toward leading indicators that suggest the potential for failure so that teams can make adjustments well before anything drastic occurs.
If SLIs like error rate and latency are already degrading, you must take reactive measures to fix them, which causes more toil. Proactive monitoring instead aims to spot the cresting wave before the flood. Leading indicators could come from tracking things like data queue operations connected to servers or the saturation of a particular resource. “If you can figure out when you’re about to fail, you can be prepared to adapt,” said Davis.
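One way to turn resource saturation into a leading indicator is to extrapolate its recent trend and warn while there is still time to react. The sketch below fits a straight line to recent usage samples and estimates how long until capacity is reached. It is a minimal illustration under stated assumptions: the function names, the sample format and the linear-growth model are all hypothetical, and real workloads are often bursty enough to need something more robust.

```python
def seconds_until_full(samples, capacity):
    """Estimate seconds until usage reaches capacity.

    samples: list of (timestamp_seconds, usage) pairs, oldest first.
    Uses a least-squares straight-line fit for the growth rate, then
    extrapolates from the most recent sample.
    Returns None when there are fewer than two samples, when all
    timestamps coincide, or when usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    _, last_usage = samples[-1]
    if last_usage >= capacity:
        return 0.0
    return (capacity - last_usage) / slope


def should_warn(samples, capacity, horizon_seconds):
    """True when the projected time to exhaustion is within the horizon."""
    eta = seconds_until_full(samples, capacity)
    return eta is not None and eta <= horizon_seconds
```

For example, a disk that grows from 50 to 60 units over two minutes with a capacity of 100 units projects to fill in roughly eight minutes, so should_warn with a ten-minute horizon would fire well before any error-rate SLI moves.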
4. Practice How to Improvise
One major caveat of standardization is that you’re inevitably going to encounter edge cases that require flexibility. And when an outage or issue does arise, the remediation process is often unique to each case. As a result, not all investment in standardization pays off.
Alternatively, teams that know how to improvise together are proven to be better equipped for unforeseen incidents—one study on Arctic Sea emergencies found collective improvisation to be imperative for emergency response situations. Similarly, collaboration is vital for software teams. Davis calls this the “practice of practice.” If you don’t practice backup restores, for example, teams may flounder in the case of a database outage. Yet, since every remediation effort is unique, it really boils down to learning how to collaborate effectively as a team. “We can gain a lot from getting together to practice,” added Davis.
All the Small Things
Toil is like manually typing an email signature every time you are about to hit “Send.” Maybe you’ve just been putting off configuring your signature in your email client. Or, perhaps you like to write a custom sign-off for each recipient. Regardless of the reason, the longer this repetitive motion is required, the more it becomes a stick in the spokes of productivity.
In the grand scope of things, this tedium can be a drain on time and resources. For every extra keystroke or click, every un-automated workflow or every copied-and-pasted code block, a small degree of toil is piled onto the daily workload of an engineer, dragging them away from focusing on what really matters.
Standardization, automation, proactive monitoring and collaboration are all helpful elements to consider on your journey toward reducing toil. But of course, increased automation carries a trap: the time saved by automation must outweigh the upfront development cost (and ongoing maintenance effort). Otherwise, you might end up creating even more toil.
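The trade-off above can be estimated with simple arithmetic before any automation is built. The sketch below is a back-of-the-envelope calculator, not a formula from Englund or Davis; the function and parameter names are hypothetical, and it assumes the automation fully replaces the manual task and that maintenance effort is roughly constant per week.

```python
def net_hours_saved(minutes_per_run, runs_per_week, weeks,
                    build_hours, maintenance_hours_per_week=0.0):
    """Hours of manual work avoided minus hours spent on the automation."""
    manual_hours = minutes_per_run * runs_per_week * weeks / 60
    automation_hours = build_hours + maintenance_hours_per_week * weeks
    return manual_hours - automation_hours


def break_even_weeks(minutes_per_run, runs_per_week,
                     build_hours, maintenance_hours_per_week=0.0):
    """Weeks until the automation has paid back its build cost.

    Returns None when weekly maintenance costs at least as much as the
    manual work it replaces, meaning the automation never pays off.
    """
    weekly_saving = (minutes_per_run * runs_per_week / 60
                     - maintenance_hours_per_week)
    if weekly_saving <= 0:
        return None
    return build_hours / weekly_saving
```

As a worked example, a 10-minute task performed 15 times a week costs 2.5 hours weekly. If automating it takes 40 hours to build and half an hour a week to maintain, the net saving is 2 hours a week and the break-even point is 20 weeks. A task that runs only once a week under the same costs would never break even, which is exactly the case where automation creates more toil than it removes.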