Toil

Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.

By addressing this type of work and removing it, SREs can:

Improve productivity for developers
Reduce the time it takes for code to reach production
Reduce the time it takes to resolve an incident
Provide self-service tools to teams, among many other possible improvements

The focus is on identifying toil in your workload, determining how impactful it is versus the amount of time it would take to remove it, and getting it into the toil backlog for prioritization. Reducing toil and enabling efficiencies across the teams whenever possible will lead to higher performing teams working on more interesting problems.

Benefits of Toil Reduction

By reducing toil, you can achieve the following benefits:

Remove issues caused by human error
Reduce mean time to recover for incidents
Free skilled engineers from low-value activities
Reduce burnout, increase morale
More engineering time for project work and learning new skills
Reduced context switching
Less need for tribal knowledge – improved standards and process clarity
- Easier onboarding for new team members

Could the response be automated? Could a self-service portal be provided?
Think of troubleshooting steps for password resets, or validating that a failure was isolated rather than widespread.

Are you taking troubleshooting steps that could go into a script to get you information faster?
Are you frequently taking the same steps, to the point that it’s becoming monotonous?
First move the steps into a script you can run, then review the alert and the use case for running the script automatically when the alert is triggered.

If you’re deploying the same code to multiple environments, you are the gatekeeper and the bottleneck for that task. Review the steps required and begin automating pieces of it as you can, even if at first it’s just shutting down services or modifying load balancers. Over time, string these tasks together to remove as much of your manual effort as you can, with the ultimate goal being full infrastructure as code that can be hooked up to a job that any team can run for themselves. Now other teams are empowered to own their process, and you’re not a bottleneck!
If the release certification requires a certain amount of triage and testing to be completed, it should be automated. Faster feedback loops provide you with more time to resolve an issue or roll back if needed.

Measuring Toil

Not only do we need to perform some calculations to determine how much toil a single process represents, but it’s also important that we’re able to track the requests coming in.

For example, clearing a specific cache may take only a few minutes, and you’re really quick with the system that performs this process. Over time, other teams recognize your value and start reaching out to you directly for help clearing the cache after they make changes or when they are troubleshooting something.

You’re performing manual, tactical work that’s really not that interesting for you, so you log a toil Jira to try to get something created that empowers other teams to handle this themselves. When it comes time to review the toil backlog, the group prioritizing the effort is only able to find 5 requests logged in the last 6 months for this type of work, and deems it not impactful enough to be addressed.

You speak up, letting them know that you’re actually doing it 2-3 times each week, and that teams either ping you in chat or just walk over to your desk for assistance.

This is an anti-pattern! When it comes time to select what is worth spending limited engineering hours on for automation and toil reduction, a task that is backed by Jira tickets, service requests, recurring incidents, etc. is much easier to justify investing in. The best way to identify and reduce toil in your daily role is to ensure that all tasks have accompanying requests or documentation demonstrating the rate at which they are being requested.
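
To make the gap concrete, the two request rates from the cache-clearing example can be annualized and compared. This is only a rough sketch: the function name `annualize` and the 52-week working year are assumptions made for illustration.

```python
WEEKS_PER_YEAR = 52  # assumption: requests arrive every week of the year


def annualize(count: float, period_months: float) -> float:
    """Scale a request count observed over `period_months` to a 12-month rate."""
    return count * 12 / period_months


# Requests visible in the backlog review: 5 logged in the last 6 months.
logged_per_year = annualize(5, 6)

# Requests actually handled: 2-3 per week via chat or desk visits.
actual_low = 2 * WEEKS_PER_YEAR
actual_high = 3 * WEEKS_PER_YEAR

print(f"Logged: {logged_per_year:.0f} requests/year")
print(f"Actual: {actual_low}-{actual_high} requests/year")
```

Under these assumptions, the tickets show about 10 requests a year while the real rate is somewhere between 104 and 156, which is why untracked requests make the task look far less impactful than it is.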

Toil Impact Calculation

Once you have identified toil in your processes and logged a story into Jira, it’s important to try to quantify the level of effort required to perform the toil over a year. This value should help immediately frame how urgent the toil is for prioritization: for example, something costing teams 10 hours over a year will be less urgent than something costing teams 100 hours. Once you understand what the effort is costing you, it’s easier to evaluate possible solutions by weighing the time you would need to invest against the time you could save.

Toil Impact = [ (hours to perform task once) * (yearly frequency task required) * (number of teams performing this task) ]

To determine the impact of toil, answer these questions:

How much time am I spending to complete this task once?
How often does this task come up for me/my team on a yearly basis?
1. Can this be objectively measured, for example by tickets received for this ask?
Does this affect my team only, or would multiple teams benefit? How many?
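
The Toil Impact formula above can be expressed as a small helper. This is a minimal sketch: the function name, parameter names, and the sample inputs are hypothetical and chosen only to illustrate the calculation.

```python
def toil_impact(hours_per_task: float, yearly_frequency: float, teams: int = 1) -> float:
    """Return yearly toil in hours: hours per run * runs per year * teams."""
    if hours_per_task < 0 or yearly_frequency < 0 or teams < 0:
        raise ValueError("inputs must be non-negative")
    return hours_per_task * yearly_frequency * teams


# Hypothetical inputs: a 15-minute task, needed 100 times a year, by 3 teams.
print(toil_impact(0.25, 100, 3))  # 75.0 hours per year
```

Where possible, take `yearly_frequency` from ticket counts rather than memory, so the result can be defended when the backlog is prioritized.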

Time to Address Toil = [ hours spent automating and testing ]

When you begin reviewing options on how to reduce or remove toil in your system, focus on:

How much time would it take to implement and test this approach?
Are there multiple options with varying amounts of time invested?
1. NOTE: You may not have any idea how to address the toil yet, and that’s ok! As long as the Jira has been logged, the team can begin to better understand the issue and work together to review it.

Updated Toil Impact = [ (hours to perform task following automation) * (yearly frequency task required) * (number of teams performing this task) ]

Often toil reduction means you still have some work left to do. For example, rather than performing 6 manual steps on each of 20 hosts to apply a security patch, you might now be running a single script one host at a time. This is valuable because:

You can be confident the exact same modifications are made on each host
Running a script for all 6 steps may take you 5 minutes per host instead of 30 minutes, saving you 25 minutes * 20 hosts = over 8 hours! If you are performing this quarterly, then you’re saving over 33 hours/year, so if it takes only a few hours to write the script then there is value to be gained.
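
The arithmetic in the security patch example can be checked with a few lines of Python. The figures come from the example itself; the variable names are only illustrative.

```python
MINUTES_PER_HOUR = 60

manual_minutes_per_host = 30   # 6 manual steps per host
scripted_minutes_per_host = 5  # one script run per host
hosts = 20
runs_per_year = 4              # patching is performed quarterly

# Time saved each patch cycle and over a full year.
saved_minutes_per_cycle = (manual_minutes_per_host - scripted_minutes_per_host) * hosts
saved_hours_per_cycle = saved_minutes_per_cycle / MINUTES_PER_HOUR
saved_hours_per_year = saved_hours_per_cycle * runs_per_year

# Toil that remains after automation (the Updated Toil Impact).
remaining_hours_per_year = (
    scripted_minutes_per_host * hosts * runs_per_year / MINUTES_PER_HOUR
)

print(f"Saved per cycle: {saved_hours_per_cycle:.1f} h")        # 8.3 h
print(f"Saved per year: {saved_hours_per_year:.1f} h")          # 33.3 h
print(f"Remaining per year: {remaining_hours_per_year:.1f} h")  # 6.7 h
```

Note that roughly 6.7 hours of toil per year still remain, since the script is run one host at a time.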

Is it worth it?

Determining whether or not addressing a specific unit of toil is worth it comes down to several factors: the amount of time the task takes to perform once, how many times it’s performed throughout a year, how many other teams are performing the same task, and how much time it will save you. There is no fixed rule that says you must save at least 10 hours, or that reducing toil impact by only 25% isn’t worth it. On a case-by-case basis you have to evaluate:

Size of the toil impact (a combination of hours for a task and teams affected)
Level of effort required to reduce the toil
How much toil/work remains for that task when you are done?
How much time do you have available that can be spent on toil reduction right now?
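
One way to lay these factors side by side is to compute the hours saved, the percentage reduction, and how long the automation takes to pay for itself. The sketch below is an assumption-laden illustration rather than a decision rule: `toil_payoff` and its field names are hypothetical, and the 3 hours of scripting effort stands in for the "few hours" mentioned in the security patch example.

```python
def toil_payoff(current_impact_hours: float,
                updated_impact_hours: float,
                hours_to_address: float) -> dict:
    """Summarize the yearly trade-off for one toil item."""
    saved = current_impact_hours - updated_impact_hours
    return {
        "hours_saved_per_year": saved,
        "percent_reduction": 100 * saved / current_impact_hours if current_impact_hours else 0.0,
        # Years until the automation effort is paid back; None if nothing is saved.
        "payback_years": hours_to_address / saved if saved > 0 else None,
    }


# Security patch example: 30 min * 20 hosts * 4 runs = 40 h/year manually,
# 5 min * 20 hosts * 4 runs = 400 min/year once scripted.
# The 3 hours to write and test the script is an assumed figure.
result = toil_payoff(current_impact_hours=40,
                     updated_impact_hours=400 / 60,
                     hours_to_address=3)

for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

The output does not answer the question by itself. It still has to be weighed against how much time the team has available for toil reduction right now.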