Your team is receiving six incidents a month, and after reviewing the on-call handoffs and postmortems, it’s clear that four of them were resolved by manually restarting the services on the hosts. Reviewing these incidents, you see that on average it took the on-call engineer about 30 minutes from the time the page was sent until the services were restarted and the incident was resolved. Talking with other engineers, you find that a second team has the same pattern.
So what do we know about this task?
It’s manual and repetitive, tactical, devoid of long-term value, and definitely could be automated.
Let’s do an impact calculation:
Toil Impact = [ (hours to perform task once) * (yearly frequency task required) * (number of teams performing this task) ]
Toil Impact = [ (0.5 hrs) * (4/month*12) * (2 teams) ]
Toil Impact = 48 hours / year
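The impact formula above can be sketched as a small function. This is a minimal illustration, not an established tool; the function and parameter names are made up for this example.

```python
def toil_impact(hours_per_task: float, times_per_month: float, teams: int) -> float:
    """Yearly hours of toil: hours per occurrence * yearly frequency * teams."""
    yearly_frequency = times_per_month * 12
    return hours_per_task * yearly_frequency * teams


# Scenario from the text: 0.5 hours per restart, 4 restarts a month, 2 teams.
baseline = toil_impact(hours_per_task=0.5, times_per_month=4, teams=2)
print(f"Toil Impact = {baseline:g} hours / year")
```

Keeping the three inputs separate makes it easy to rerun the estimate if a third team reports the same pattern or the incident rate changes.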
With this information, you log a Jira ticket labeled toil for the SRE team to review.
Now that the ticket is created, the team begins evaluating options for addressing this task.
Option 1 – a script we can run against any server as needed to restart a service
It takes 8 hours to code and test. Running the script is still manual, but it may reduce the response time from 30 minutes to 15 minutes.
Time to Address Toil = [ hours spent automating and testing ]
Time to Address Toil = 8 hours
Updated Toil Impact = [ (hours to perform task following automation) * (yearly frequency task required) * (number of teams performing this task) ]
Updated Toil Impact = [ (0.25 hrs) * (4/month*12) * (2 teams) ]
Updated Toil Impact = 24 hours / year
Time Savings = Toil Impact - Updated Toil Impact
Time Savings = 48 - 24
Time Savings = 24 hours / year
Net of the 8 hours invested, the first-year savings are 16 hours.
Option 2 – you review the incidents and confirm they were signaled by failed synthetic monitors. You spend time testing the validity of these alerts to ensure the full response can be automated. Once confirmed, you write and test a script that receives two parameters, hostname and service name, and restarts that service on that host. An additional check fires after the restart to confirm health, and the script records the restarted service and host in an informational email to the support team. This approach is estimated to take closer to 1 week for full analysis, testing, scripting, and review.
Time to Address Toil = [ hours spent automating and testing ]
Time to Address Toil = 40 hours
Updated Toil Impact = [ (hours to perform task following automation) * (yearly frequency task required) * (number of teams performing this task) ]
Updated Toil Impact = [ (0) * (4/month *12) * (2 teams) ]
Updated Toil Impact = 0 hours / year
Time Savings = Toil Impact - Updated Toil Impact
Time Savings = 48 - 0
Time Savings = 48 hours / year
Net of the 40 hours invested, the first-year savings are 8 hours.
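The script described in option 2 might look roughly like the sketch below. It rests on assumptions the scenario does not state: that the hosts are reachable over SSH, that the services are systemd units, and that the summary line it returns would be handed to whatever sends the informational email. The function names are hypothetical.

```python
import subprocess
from typing import Callable, Sequence

# Assumption: hosts accept SSH and services are systemd units, so
# `systemctl` can both restart a service and report whether it is active.
Runner = Callable[[Sequence[str]], int]


def run_command(command: Sequence[str]) -> int:
    """Run a command and return its exit code (0 means success)."""
    return subprocess.run(list(command), check=False).returncode


def restart_and_verify(host: str, service: str, run: Runner = run_command) -> str:
    """Restart `service` on `host`, check its health, and return a summary line."""
    restart = ["ssh", host, "sudo", "systemctl", "restart", service]
    health = ["ssh", host, "systemctl", "is-active", "--quiet", service]

    if run(restart) != 0:
        return f"FAILED: could not restart {service} on {host}"
    # Follow-up health check, as the option calls for.
    if run(health) != 0:
        return f"UNHEALTHY: {service} on {host} is not active after restart"
    return f"OK: restarted {service} on {host}"
```

The `run` parameter exists so the logic can be tested with a fake runner before it touches a real host. In practice the health check would likely be the same synthetic monitor that raised the alert, rather than a bare `is-active` probe.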
Option 3 – fixing the root cause of the incidents. Invest 2 days reviewing with the development team using APM and tracing tools to identify any potential root cause of why the services are failing and needing to be restarted. You may uncover a commonly re-used pattern causing resource exhaustion - the symptom being addressed via restarts. Rather than automating the response, the development team spends 1.5 weeks updating the code and testing to ensure the root cause is addressed so it will not appear again.
Time to Address Toil = [ hours spent investigating, fixing, and testing ]
Time to Address Toil = 80 hours (2 days plus 1.5 weeks, rounded up to 2 weeks)
Updated Toil Impact = [ (hours to perform task following automation) * (yearly frequency task required) * (number of teams performing this task) ]
Updated Toil Impact = [ (0) * (4/month *12) * (2 teams) ]
Updated Toil Impact = 0 hours / year
Time Savings = Toil Impact - Updated Toil Impact
Time Savings = 48 - 0
Time Savings = 48 hours / year
Net of the 80 hours invested, the first-year savings are -32 hours.
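To compare the three options side by side, the sketch below applies the Time Savings formula as written (Toil Impact minus Updated Toil Impact) and, separately, subtracts the hours invested to get a first-year net. The option labels and the `summarize` helper are illustrative names, not part of any tool.

```python
YEARLY_FREQUENCY = 4 * 12  # 4 occurrences a month
TEAMS = 2
BASELINE = 0.5 * YEARLY_FREQUENCY * TEAMS  # 48 hours / year

# (label, hours to address, hours per occurrence afterwards)
options = [
    ("Option 1: manual script", 8, 0.25),
    ("Option 2: automated response", 40, 0.0),
    ("Option 3: root-cause fix", 80, 0.0),
]


def summarize(hours_to_address: float, hours_per_task_after: float):
    """Return (updated toil impact, yearly time savings, first-year net)."""
    updated = hours_per_task_after * YEARLY_FREQUENCY * TEAMS
    savings = BASELINE - updated
    net_first_year = savings - hours_to_address
    return updated, savings, net_first_year


for label, cost, after in options:
    updated, savings, net = summarize(cost, after)
    print(f"{label}: saves {savings:g} h/yr, costs {cost} h, first-year net {net:+g} h")
```

Separating yearly savings from first-year net shows why options 2 and 3 look identical on savings (48 hours a year each) but very different once the investment is counted.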
You’ve come up with 3 possible approaches to reducing or removing the toil in the process:
Option 1 is the quickest but still leaves the teams with work to do, work that could still pop up in the middle of the night. While it would reduce the amount of toil by half, engineers may feel indifferent about saving 15 minutes at a time if they are still being interrupted.
Option 2 takes 5 times longer to address than option 1, but completely removes the human involvement in the toil! This is a big win for the engineers on-call, but it only addresses the symptom rather than looking into the root cause. Without research into that cause, the underlying problem may build up until it’s a much bigger one.
Option 3 is by far the longest, investing more hours (80) than the toil itself consumes in a year (48). The benefit is that toil is completely removed AND the root cause is addressed, ensuring the healthiest resolution for the system.
So how do you choose?
It boils down to how much effort the task consumes versus how much effort it takes to address it. While we may all agree that option 3 is the right choice on paper, the teams may not have 2 weeks to dedicate to this issue, and if it’s only causing a 30-minute interruption four times a month, it could be a hard sell to prioritize.
One approach may be to choose between option 1 and option 2 and put a separate bug investigation ticket into the backlog. Assuming the root cause fix is being postponed, you now have to choose between spending 8 hours to make the response easier for on-call engineers (saving 24 hours a year) or spending 40 hours to save 48 hours a year and remove the response requirement altogether. At this point, it comes down to workload: if an engineer can dedicate 40 hours to removing the most toil possible, then option 2 may be prioritized. More likely is the smaller investment of 8 hours while teams research a possible root cause.
In the end, it all comes down to how much time a team has available and how much time it thinks can be saved. Unless you already know the root cause to be fixed, it will always be difficult to justify investing more hours than the toil itself consumes.
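One way to frame the trade-off between time available and time saved is a payback period: how many months of savings it takes to recover the hours invested. This is an added lens rather than something the scenario defines, and `payback_months` is a hypothetical helper.

```python
def payback_months(hours_to_address: float, yearly_savings: float) -> float:
    """Months until the yearly savings have repaid the hours invested."""
    if yearly_savings <= 0:
        return float("inf")  # the investment never pays back
    return hours_to_address * 12 / yearly_savings


# Hours invested and yearly hours saved for each option in the scenario.
for label, cost, savings in [
    ("Option 1", 8, 24),
    ("Option 2", 40, 48),
    ("Option 3", 80, 48),
]:
    print(f"{label}: pays back in {payback_months(cost, savings):g} months")
```

A payback period longer than a year, as with option 3, is the numeric form of investing more hours than the toil itself. It does not capture the value of removing the root cause, so it is best treated as one input to the decision.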