Toil management comprises the activities that drive the prioritization, execution, and de-duplication of efforts to eliminate toil within and across teams. As in software development project management, leadership must maintain a backlog of work prioritized by its business value to the organization and assign that work to SRE team members. Leadership must also ensure that the team does not duplicate work performed by other teams and that other teams can benefit from its efforts where relevant.
Prioritization and intake are particularly important in toil management. When an incident occurs, the work needed to keep its root cause from recurring must be added to the team’s backlog. These tasks must be prioritized relative to the work already in the backlog so that they receive the appropriate level of attention. The importance of preventing a recurrence has to be weighed against the importance of delivering new features and capabilities. This trade-off makes toil management a product issue at the business level, not just an engineering concern.
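The intake step described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than a prescribed method: the `BacklogItem` fields, the 1 to 5 scoring scales, the `risk_weight` parameter, and the item titles are all hypothetical. The sketch scores each item by business value plus a weighted recurrence risk, then re-ranks the backlog whenever an incident follow-up is taken in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BacklogItem:
    title: str
    kind: str                  # e.g. "feature", "toil", "incident_followup" (hypothetical labels)
    business_value: int        # 1 (low) to 5 (high); hypothetical scale
    recurrence_risk: int = 0   # 0 (none) to 5 (likely to recur); hypothetical scale


def priority(item: BacklogItem, risk_weight: float = 1.5) -> float:
    """Combine business value and recurrence risk into one score.

    risk_weight is an assumed tuning knob: raising it favors incident
    prevention, lowering it favors feature delivery.
    """
    return item.business_value + risk_weight * item.recurrence_risk


def intake(backlog: list[BacklogItem], new_item: BacklogItem) -> list[BacklogItem]:
    """Add new_item and return the backlog ranked from highest to lowest priority."""
    return sorted([*backlog, new_item], key=priority, reverse=True)


backlog = [
    BacklogItem("Ship self-service onboarding", "feature", business_value=5),
    BacklogItem("Automate certificate rotation", "toil", business_value=3, recurrence_risk=1),
]
followup = BacklogItem(
    "Add disk-usage alert", "incident_followup", business_value=2, recurrence_risk=4
)

ranked = intake(backlog, followup)
for item in ranked:
    print(f"{priority(item):4.1f}  {item.kind:18}  {item.title}")
```

With the assumed weight, the incident follow-up ranks above the feature work even though its direct business value is lower. Choosing that weight is the business-level decision the text describes, so in practice it would be set with product stakeholders rather than by engineering alone.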