Published date: April 15, 2024, Version: 1.0
On-call is the process of managing who is notified when an incident occurs in production workloads. It is organized into levels of support, each defining who is responsible for addressing an incident at that stage. The levels are usually set up like this:
L1: The first responder, responsible for triaging the incident and routing it to the correct team.
L2: The support handler for a specific application, service, or functional component, who can execute pre-defined runbooks for known or frequent issues. If this team is unable to handle the issue, they forward it to the next level.
L3: The domain experts, usually the development team who created the application or service themselves, who must respond and remediate whatever is wrong with the production deployment.
In this setup, the L3 team delegates one person on a rotating basis to be “on-call” for a set period (1 day, 1 week, etc.), responsible for handling all L3 escalations on behalf of the team.
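As a rough illustration of the rotation described above, the sketch below picks the on-call person for a given date using a simple round-robin. The roster names, the start date, and the `on_call_for` helper are all hypothetical; a real team would normally rely on its paging tool's schedule rather than code like this.

```python
from datetime import date


def on_call_for(day, engineers, rotation_start, shift_days=7):
    """Return who is on call on `day` in a simple round-robin rotation."""
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    # Count how many full shifts have elapsed, then wrap around the roster.
    shifts_elapsed = (day - rotation_start).days // shift_days
    return engineers[shifts_elapsed % len(engineers)]


# Hypothetical L3 roster and rotation start, for illustration only.
team = ["alice", "bob", "carol"]
start = date(2024, 4, 15)

for day in (date(2024, 4, 15), date(2024, 4, 22), date(2024, 5, 6)):
    print(day, on_call_for(day, team, start))
```

Changing `shift_days` to 1 would model a daily rotation instead of a weekly one.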
Goal of On-call & Support
As the On-Call Engineer (OCE), you should focus even more than usual on the successful production runtime of the applications you are responsible for monitoring. This means not just waiting for an incident to occur, but proactively reviewing dashboards and the alerting structure while partnering with other teams to determine root causes, prevent recurrence, and identify and remove toil in processes.
The main goal of the OCE is to make the on-call shift easier for the next person. If everyone takes this approach, we should see a noticeable drop in the time to detect and resolve incidents, and a healthier work-life balance with less incident noise! As the OCE, you want to leave it better than you found it! A major part of this is providing an on-call handoff, using the Confluence template, to keep the team and the next OCE informed of any struggles or major changes you faced during your shift.
Remember - lean on the team around you for support! If you need extra help troubleshooting an issue, need a new alert created or a dashboard adjusted, or want a better understanding of a process flow, make sure to reach out for input. Developers, App Admins, SREs, and the monitoring team are all invested in ensuring software works in production in the best way possible to meet the needs of the business, so a conversation or a Jira ticket when help is needed is just part of the flow!
Incident Response
Here are some tips for responding to incidents as the OCE:
When an alert or incident fires, acknowledge the alert and begin troubleshooting
Don’t be afraid to escalate to your secondary if you need help. It’s better to have more people involved and reduce the impact of a P2 than to struggle through it alone.
Update the knowledge base with any new guides needed
Take the lead on P1/P2 RCA
Work with problem management on RCAs
If you helped resolve an issue, co-own postmortem documentation for the weekly operational review
State of the System
Some tips to follow when monitoring the system:
Ensure you have no outstanding alerts and are acting on them in a timely manner
Review status of the SLOs and Error Budget for tier 1 applications
Review alerting policies – every alert should be actionable – TUNING is an action!
If there is nothing to do for an alert, review its setup against the application's runtime behavior with other engineers to determine a more appropriate threshold
Escalate to another team if you need help! If you need assistance tuning from HCL or SRE teams, start a conversation or create a Jira ticket!
Maybe a job runs every 4 hours and causes latency to breach for just long enough to trigger the alert. If the alert were evaluated over 10 minutes instead of 5 minutes, we could still capture real issues without the added noise
Maybe the application alerts at 80% disk usage, but at 80% you still have 1TB of free space. Should this be an informational email at 80% and an alert at 90%? Or maybe a weekly cleanup job runs and the instance often grows to 85% used before resetting, making 90% a better threshold?
If an alert is auto-resolving quickly – is it a bad threshold that is barely being breached or an odd traffic pattern?
CALCULATED tuning is important. Don’t change something just to reduce alerts if it will end up in a mad scramble to resolve it at a higher threshold later.
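To make the evaluation-window example above concrete, here is a minimal sketch of how a longer window filters out a short, expected spike while still catching a sustained problem. It assumes one latency sample per minute and an alert that fires only when every sample in the window is over the threshold; the 500 ms threshold, the spike durations, and the helper names are hypothetical, and real alerting tools may evaluate windows differently.

```python
def alert_fires(samples, threshold, window):
    """Return True if `window` consecutive samples all exceed `threshold`."""
    consecutive = 0
    for value in samples:
        # Reset the streak as soon as one sample is back under the threshold.
        consecutive = consecutive + 1 if value > threshold else 0
        if consecutive >= window:
            return True
    return False


def latency_series(total_minutes, baseline_ms, spike_ms, spike_start, spike_minutes):
    """Build one latency sample per minute containing a single spike."""
    return [
        spike_ms if spike_start <= minute < spike_start + spike_minutes else baseline_ms
        for minute in range(total_minutes)
    ]


THRESHOLD_MS = 500  # hypothetical latency threshold

# A scheduled job pushes latency over the threshold for 6 minutes.
job_spike = latency_series(60, baseline_ms=200, spike_ms=900, spike_start=10, spike_minutes=6)
# A real incident keeps latency over the threshold for 25 minutes.
incident = latency_series(60, baseline_ms=200, spike_ms=900, spike_start=10, spike_minutes=25)

for name, series in [("job spike", job_spike), ("incident", incident)]:
    for window in (5, 10):
        fires = alert_fires(series, THRESHOLD_MS, window)
        print(f"{name}, {window}-minute window: fires={fires}")
```

Under these assumptions the 6-minute job spike fires the 5-minute alert but not the 10-minute one, while the 25-minute incident fires both. The trade-off is that the longer window delays detection of a real incident by about 5 minutes, which is the kind of cost to weigh when tuning.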
Knowledge Management & Sharing
The Knowledge Base is meant to be a single tagged and indexed location that is searchable to help with remediation of incidents. Sharing troubleshooting techniques or how-to articles makes it easier to resolve issues in the future as well as onboard new engineers into the on-call process. Best of all, anyone can write them! It is appropriate to request that the development team create a new article when they introduce a new feature or function. The best sources for KB articles are discussions with development teams working on new features and ACTION ITEMS from a recent incident and postmortem.
Lean on SREs to help get started. If you have troubleshooting guides in multiple locations, ask for help in pulling them all together into the KB. As the space grows over time, it will be easy to sort by application names or by common issues (like restarting a web service) via the tagging feature. The KB is only as good as the information in it, so if you need to create tickets as action items to help ensure documentation is kept up to date, go ahead!
Write a how-to article in the KB space
“How to replay orders from the backlog queue”
Write a troubleshooting article in the KB space
“Troubleshooting high response times on the XYZ application”
Did a new incident present in an interesting way not seen before? Besides a KB article, could it be reproduced in a lower environment to share the experience with other engineers that were not on-call?
Make sure to reference any newly created articles in your on-call handoff!
KPI, Reports & Reviews
Attend SRE KPI meetings - know your metrics!
Make sure the on-call handoff has enough information to be referenced in the operational review meetings for SLOs and Error budgets
If an incident caused an application to slip below SLO, make sure it is noted as well as links to the RCA/postmortem and any action items (pending and completed)
Think about incident response metrics - MTTD, MTTR, MTBF
If you had 2 incidents in a week that were called in by users after a day without detection, expect the mean time to detect (MTTD) to increase, and be prepared to speak to the actions taken in response - new alerts created, etc.
If you had a prolonged outage, the mean time to recover (MTTR) may increase. Be prepared to speak to the incident and typical postmortem questions - what went wrong, what went right, where we got lucky, and action items for the future
The mean time between failures (MTBF) may be decreasing for an application if it’s having a lot of incidents. Reviewing what is causing so many incidents will be vital
Attend Weekly Operational Review Meeting
Attend Technical Advisory Board meetings
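The incident response metrics mentioned above (MTTD, MTTR, MTBF) can be sketched from a list of incident timestamps. Definitions vary between teams, so this example assumes MTTD is measured from incident start to detection, MTTR from incident start to resolution, and MTBF from one incident's resolution to the next incident's start. The `Incident` record and the sample timestamps are hypothetical and mirror the "two incidents undetected for a day" scenario.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime   # when the problem actually began
    detected: datetime  # when an alert fired or a user reported it
    resolved: datetime  # when service was restored


def mean(deltas):
    """Average a non-empty list of timedeltas."""
    if not deltas:
        raise ValueError("need at least one interval")
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents):
    """Mean time to detect: start -> detection (assumed definition)."""
    return mean([i.detected - i.started for i in incidents])


def mttr(incidents):
    """Mean time to recover: start -> resolution (assumed definition)."""
    return mean([i.resolved - i.started for i in incidents])


def mtbf(incidents):
    """Mean time between failures: resolution -> next start (assumed definition)."""
    ordered = sorted(incidents, key=lambda i: i.started)
    return mean([nxt.started - prev.resolved for prev, nxt in zip(ordered, ordered[1:])])


# Hypothetical week: two incidents, each reported by users a day after it began.
week = [
    Incident(datetime(2024, 4, 1, 9), datetime(2024, 4, 2, 9), datetime(2024, 4, 2, 11)),
    Incident(datetime(2024, 4, 4, 9), datetime(2024, 4, 5, 9), datetime(2024, 4, 5, 10)),
]

print("MTTD:", mttd(week))
print("MTTR:", mttr(week))
print("MTBF:", mtbf(week))
```

Computing these per application and per review period makes it easier to see which metric moved and to tie the change back to specific incidents in the handoff.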
Toil Review
Always keep an eye out for toil, log it, and identify ways it could be reduced. As you find something that meets the definition of toil, submit a ticket so that SRE teams can prioritize a fix.
· Add new toil to the toil backlog with the label toil
· Submit tickets for SRE and dev teams to prioritize and fix