SLOs are internally agreed-upon reliability targets for a service, based on measurements that correlate strongly with user happiness and application success. If you have an SLA with customers, the SLO should be stricter than the SLA to provide a buffer for incidents and changes. For example, if the contractual SLA states 99% availability, your internal SLO might be 99.9% so that you can begin working on an issue before you breach the SLA. SLOs should be defined to meet the minimum expectations of our customers so that resources are not wasted and we do not artificially inflate expectations. The guiding question is: what is the minimum experience we can deliver that still makes the end users happy with our services?
These internal targets are actionable, data-driven, customer-impacting measurements that can be trended, reviewed, and adjusted over time to ensure we are delivering the appropriate experience for our customers.
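To make the buffer between a contractual SLA and a stricter internal SLO concrete, the following minimal sketch converts an availability target into the downtime it tolerates. The function name `allowed_downtime_minutes` and the 30-day window are assumptions chosen for illustration; the 99% and 99.9% figures are the example values used above.

```python
def allowed_downtime_minutes(target_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime that an availability target tolerates
    over the window (30 days is an assumed window, not a requirement)."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - target_percent) / 100.0


sla_budget = allowed_downtime_minutes(99.0)  # contractual SLA
slo_budget = allowed_downtime_minutes(99.9)  # stricter internal SLO

print(f"SLA 99%   tolerates about {sla_budget:.1f} minutes per 30 days")
print(f"SLO 99.9% tolerates about {slo_budget:.1f} minutes per 30 days")
print(f"Buffer before the SLA is breached: about {sla_budget - slo_budget:.1f} minutes")
```

Under these assumptions the SLA tolerates roughly 432 minutes of downtime in 30 days while the SLO tolerates roughly 43, so a missed SLO still leaves several hours of headroom to respond before the contract is at risk.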
Simple SLOs
SLOs will start off simple for all applications. We will focus on availability, latency, and error rate at an overall level for each application. Historical values for these measurements can be charted and reviewed within New Relic.
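As a rough illustration of what these three application-level measurements look like, the sketch below computes each one as a ratio of good events to total events and compares it to a target. It does not use New Relic; the `Request` record, the health-check probe list, the sample data, and every threshold are hypothetical and only stand in for whatever your monitoring tool reports.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int         # HTTP status code returned to the caller
    duration_ms: float  # response time in milliseconds


def availability(probe_results: list[bool]) -> float:
    """Fraction of health-check probes that succeeded."""
    if not probe_results:
        raise ValueError("no probe results in the window")
    return sum(probe_results) / len(probe_results)


def error_rate(requests: list[Request]) -> float:
    """Fraction of requests that ended in a server error (5xx)."""
    if not requests:
        raise ValueError("no requests in the window")
    return sum(r.status >= 500 for r in requests) / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Fraction of requests answered within the latency threshold."""
    if not requests:
        raise ValueError("no requests in the window")
    return sum(r.duration_ms <= threshold_ms for r in requests) / len(requests)


# Hypothetical sample window and targets, for illustration only.
sample = [Request(200, 120.0), Request(200, 340.0),
          Request(500, 90.0), Request(200, 800.0)]
probes = [True, True, True, True]

results = {
    "availability": availability(probes) >= 0.999,
    "error rate": error_rate(sample) <= 0.01,
    "latency": latency_sli(sample, threshold_ms=500.0) >= 0.95,
}
for name, met in results.items():
    print(f"{name}: {'met' if met else 'missed'}")
```

Expressing every measurement as a good-to-total ratio keeps the three SLOs comparable and makes it straightforward to trend them over the same window.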
Maturation over time
As more information is gathered across services, it is natural to break down your overall SLOs into smaller components. Defining critical user journeys (CUJs) gives you a more granular view of the experience you are delivering. You may discover that while you are meeting your overall error rate SLO, the majority of the errors come from one particular flow, meaning a subset of users is very unhappy. By focusing on CUJs, you direct your efforts toward specific flows that are more or less important based on business decisions. Some components of your system might be “nice-to-haves”, for which a looser SLO may be acceptable; likewise, some of your workflows might be critical.
Over time, teams will work out their own internal flows for more granular SLOs, and as more teams are onboarded, they can begin to focus on cross-system workflows that are critical to end users. For example, how long should it take from the time an order is placed online until it is packed and ready for carrier pickup? Multiple applications are involved in that process, and multiple local SLOs will need to be understood before we can reach that level of detail. Iteration and maturation will drive improvements to the experience over time.
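The point about an overall error rate hiding an unhappy subset of users can be sketched with a per-journey breakdown. The journey names (`browse`, `checkout`), the request counts, and the targets below are all hypothetical; the sketch only assumes that each request can be tagged with the journey it belongs to.

```python
from collections import defaultdict


def error_rate_by_journey(events):
    """Compute the error rate per journey.

    events: iterable of (journey_name, succeeded) pairs.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for journey, succeeded in events:
        totals[journey] += 1
        if not succeeded:
            failures[journey] += 1
    return {journey: failures[journey] / totals[journey] for journey in totals}


# Hypothetical traffic: 900 browse requests with no failures,
# 100 checkout requests with 10 failures.
events = (
    [("browse", True)] * 900
    + [("checkout", True)] * 90
    + [("checkout", False)] * 10
)

overall = sum(1 for _, succeeded in events if not succeeded) / len(events)
per_journey = error_rate_by_journey(events)

# Hypothetical targets: looser for a nice-to-have flow, stricter for a critical one.
overall_target = 0.02
journey_targets = {"browse": 0.05, "checkout": 0.01}

print(f"overall: {overall:.1%} (target {overall_target:.0%})")
for journey, rate in per_journey.items():
    status = "met" if rate <= journey_targets[journey] else "missed"
    print(f"{journey}: {rate:.1%} (target {journey_targets[journey]:.0%}) {status}")
```

With this sample data the overall error rate of 1% sits comfortably inside a 2% target, yet one in ten checkout requests fails, which is exactly the situation a per-journey SLO is meant to surface.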