Toil is work tied to running a production service that tends to be manual, repetitive, automatable, tactical, and devoid of enduring value, and that scales linearly as a service grows.
By addressing this type of work and removing it, SREs can:
The focus is on identifying toil in your workload, weighing its impact against the time it would take to remove, and getting it into the toil backlog for prioritization. Reducing toil and enabling efficiencies across teams wherever possible leads to higher-performing teams working on more interesting problems.
Once you have identified toil in your processes and logged a story in Jira, try to quantify the effort required to perform the toil over a year. This value immediately frames how urgent the toil is for prioritization: something costing teams 10 hours a year is less urgent than something costing teams 100 hours. Once you understand what the effort is costing you, it’s easier to evaluate possible solutions by weighing the time you would need to invest against the time you could save.
Toil Impact = [ (hours to perform task once) * (yearly frequency task required) * (number of teams performing this task) ]
To determine the impact of toil, answer these questions:
How much time am I spending to complete this task once?
How often does this task come up for me/my team on a yearly basis?
Can this be objectively measured, for example by tickets received for this ask?
Does this affect my team only, or would multiple teams benefit? How many?
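The Toil Impact formula above can be sketched as a small helper. This is a minimal illustration, not a prescribed tool: the function name `toil_impact`, its parameter names, and the example numbers are all assumptions made for this sketch.

```python
def toil_impact(hours_per_run: float, runs_per_year: float, teams: int = 1) -> float:
    """Yearly hours spent on a task across every team that performs it.

    Mirrors: (hours to perform task once)
             * (yearly frequency task required)
             * (number of teams performing this task)
    """
    if hours_per_run < 0 or runs_per_year < 0 or teams < 0:
        raise ValueError("inputs must be non-negative")
    return hours_per_run * runs_per_year * teams


# Hypothetical example: a 30-minute task, done weekly, by 3 teams.
weekly_task = toil_impact(hours_per_run=0.5, runs_per_year=52, teams=3)
print(f"{weekly_task:.1f} hours/year")  # 78.0 hours/year
```

If the frequency can be measured objectively, for example from ticket counts, that number is a better input for `runs_per_year` than an estimate.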
Time to Address Toil = [ hours spent automating and testing ]
When you begin reviewing options on how to reduce or remove toil in your system, focus on:
How much time would it take to implement and test this approach?
Are there multiple options with varying amounts of time invested?
NOTE: You may not yet have any idea how to address the toil, and that’s OK! As long as the Jira story has been logged, the team can begin to understand the issue and review it together.
Updated Toil Impact = [ (hours to perform task following automation) * (yearly frequency task required) * (number of teams performing this task) ]
Toil reduction often leaves some work still to do. For example, rather than performing 6 manual steps to apply a security patch on each of 20 hosts, you might now run a single script on one host at a time. This is valuable because:
You can be confident the exact same modifications are made on each host
Running a script for all 6 steps may take 5 minutes per host instead of 30 minutes, saving 25 minutes * 20 hosts = 500 minutes, or over 8 hours, each time you patch. If you patch quarterly, you’re saving over 33 hours/year, so if the script takes only a few hours to write, there is value to be gained.
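The patching example can be worked through with the Toil Impact and Updated Toil Impact formulas. The sketch below uses the figures from the text. The single team and the 3 hours for "a few hours to write the script" are assumptions added here for illustration.

```python
HOSTS = 20
RUNS_PER_YEAR = 4          # quarterly patching
MANUAL_MIN_PER_HOST = 30   # six manual steps
SCRIPT_MIN_PER_HOST = 5    # one script run per host
TEAMS = 1                  # assumption: a single team patches these hosts
HOURS_TO_AUTOMATE = 3      # assumption: stands in for "a few hours"


def yearly_hours(minutes_per_host: float) -> float:
    """Hours per year spent patching all hosts at the given per-host time."""
    hours_per_run = minutes_per_host * HOSTS / 60
    return hours_per_run * RUNS_PER_YEAR * TEAMS


before = yearly_hours(MANUAL_MIN_PER_HOST)   # Toil Impact
after = yearly_hours(SCRIPT_MIN_PER_HOST)    # Updated Toil Impact
saved = before - after
net_first_year = saved - HOURS_TO_AUTOMATE

print(f"Toil impact:         {before:.1f} h/year")          # 40.0
print(f"Updated toil impact: {after:.1f} h/year")           # 6.7
print(f"Saved per year:      {saved:.1f} h")                # 33.3
print(f"Net in first year:   {net_first_year:.1f} h")       # 30.3
```

Under these assumptions the script pays for itself during the first patch cycle, since one cycle alone saves more than 8 hours.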
Whether a specific unit of toil is worth addressing comes down to several factors: how long the task takes to perform once, how many times it’s performed in a year, how many other teams perform the same task, and how much time the fix will save. No fixed rule says you must save at least 10 hours, or that reducing toil impact by only 25% isn’t worth it. On a case-by-case basis you have to evaluate:
Size of the toil impact (a combination of hours for a task and teams affected)
Level of effort required to reduce the toil
How much toil/work remains for that task when you are done?
How much time do you have available that can be spent on toil reduction right now?
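One possible way to line these factors up side by side is sketched below. It is only an aid to the case-by-case judgment described above, not a rule. The `ToilItem` class, the `affordable_by_net_savings` helper, and the backlog entries are all hypothetical names and numbers invented for this example.

```python
from dataclasses import dataclass


@dataclass
class ToilItem:
    name: str
    impact_hours: float          # Toil Impact (hours/year)
    updated_impact_hours: float  # Updated Toil Impact (hours/year)
    hours_to_address: float      # Time to Address Toil

    @property
    def yearly_savings(self) -> float:
        return self.impact_hours - self.updated_impact_hours

    @property
    def net_first_year(self) -> float:
        return self.yearly_savings - self.hours_to_address


def affordable_by_net_savings(items: list[ToilItem], hours_available: float) -> list[ToilItem]:
    """Keep items that fit in the time available now, largest net saving first."""
    candidates = [i for i in items if i.hours_to_address <= hours_available]
    return sorted(candidates, key=lambda i: i.net_first_year, reverse=True)


# Hypothetical backlog entries, for illustration only.
backlog = [
    ToilItem("security patching", 40, 6.7, 3),
    ToilItem("certificate rotation", 100, 10, 60),
    ToilItem("log cleanup", 10, 0, 8),
]

ranked = affordable_by_net_savings(backlog, hours_available=20)
print([item.name for item in ranked])  # ['security patching', 'log cleanup']
```

Note that the largest item, certificate rotation, drops out only because it does not fit in the 20 hours assumed to be available right now. It would still be a strong candidate once more time can be set aside.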