Service Level Indicators/Objectives

Published date: April 15, 2024, Version: 1.0

What are SLOs?

SLOs are internally agreed-upon measurements that indicate the reliability of a service and correlate strongly with user happiness and application success. If you have an SLA with customers, the SLO should be stricter than the SLA to provide a buffer for incidents and changes; for example, if the contractual SLA states 99% availability, your internal SLO should be 99.9% so you can begin working on an issue before you breach the SLA. SLOs should be defined to meet the minimum expectations of our customers so that resources are not wasted and we do not artificially inflate expectations. What is the minimum experience we can deliver that still makes end users happy with our services?

These internal targets are actionable, data-driven, customer-impacting measurements that can be trended, reviewed, and adjusted over time to ensure we are delivering the appropriate experience for our customers.

Why do we need SLOs?

SLOs are needed because, at the end of the day, reliability is the most important feature of any service. It doesn't matter how spectacular the new features are if the end user cannot access them! Think of SLOs as the non-functional requirements (NFRs) discussed during the project phase. You know what the new feature can do and how it should do it, but you are also aware that if it takes longer than 5 seconds, or if it fails and goes into a retry loop more than once, the user is likely to abandon the experience. By defining targets around user expectations, we can ensure we are meeting the needs of customers and drive inter-service contracts/agreements and application observability improvements.

Who defines an SLO?

Service level objectives ask the product owners and the development team, under the guidance of SRE, to come to an agreement on what user satisfaction with the service actually means. Product owners are best equipped to decide how fast results need to be returned or how critical a system's availability is, while the development and SRE teams work together to determine the feasibility of monitoring those workflows.

What are SLIs?

Service level indicators are the metrics that can easily be measured, monitored, and alerted on and that indicate a potential impact to an SLO. As an example, say orders are not being processed for shipping fast enough: the time from when an order is placed until it is ready to ship is exceeding the SLO. There are a number of reasons across multiple systems why this might occur, and each system will have its own SLOs for its portion of the process, but the indicators that tell a team why that specific part of the process is slow could be any number of things. The offending slow system might find its queue depth is 5x larger than normal while the system churns through updates as fast as it can, abnormally high CPU might be stealing resources from the core process, or data retrieved from a dependency may be coming back slowly or incomplete. If the role of SLOs is to keep health and alerts tied to end-user-impacting situations, SLIs are meant to track anything that might affect that.
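
To make the relationship concrete, here is a minimal Python sketch of an SLI expressed as "good events / total events": the fraction of orders that were ready to ship within a target window. The Order fields, the data, and the 4-hour target are hypothetical; supporting indicators such as queue depth or CPU would be tracked alongside this ratio to explain why it moves.

    # Hypothetical sketch: an order-to-ready-to-ship SLI as good events / total events.
    from dataclasses import dataclass

    @dataclass
    class Order:
        placed_at: float         # epoch seconds when the order was placed
        ready_to_ship_at: float  # epoch seconds when it was ready for shipping

    def order_latency_sli(orders: list[Order], target_seconds: float) -> float:
        """Return the fraction of orders fulfilled within the target window."""
        if not orders:
            return 1.0  # no traffic means nothing was violated
        good = sum(1 for o in orders if (o.ready_to_ship_at - o.placed_at) <= target_seconds)
        return good / len(orders)

    # Example: a 4-hour (14,400 s) order-to-ready target
    orders = [Order(0, 9000), Order(0, 20000), Order(0, 13000)]
    print(f"SLI: {order_latency_sli(orders, 14400):.2%}")  # -> SLI: 66.67%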

Crafting an SLO

Once an objective is selected based on user expectations (e.g. availability, latency, error rate), it is important to frame it in the appropriate context. A great place to start is historical trends: what is the existing data telling you is possible? Looking at your system, it appears you can respond to requests within 2 seconds most of the time at an average volume of 500 requests/second. First and foremost, ensure that 2 seconds is not an average! Averages mask the true experience of most of your consumers; instead, measure at the 90th or 95th percentile to understand how most of your customers experience your service. And don't forget to occasionally review your maximum values to check for extreme outliers. Now that you are measuring latency correctly, don't forget to frame it from the volume perspective. While you can confidently expect these latencies at 500 rps, what happens if traffic doubles or triples? A breaking point will exist where it is no longer possible to meet the SLO target, so finding it through load testing, or at least reviewing the behavior of the system to establish a "high water mark," is a good activity to further solidify your SLO.

Based on the above scenario, the latency SLO may look like 2.5 seconds @ 90th percentile @ 750 rps. Importantly, remember to start simple and build over time. The rps measurement may not be easily identifiable early in the process, and as you gather more data it will be necessary to iterate and update the SLO.
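
The sketch below shows why the percentile matters more than the average. The latency samples are invented and the 2.5 s @ p90 @ 750 rps target follows the illustrative scenario above; real numbers would come from your monitoring system.

    # Hypothetical sketch: checking a latency SLO at a percentile rather than an average.
    import statistics

    def percentile(samples: list[float], pct: float) -> float:
        """Nearest-rank percentile, e.g. pct=90 for the 90th percentile."""
        ordered = sorted(samples)
        rank = max(1, round(pct / 100 * len(ordered)))
        return ordered[rank - 1]

    latencies_s = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.4, 3.5]

    print(f"average: {statistics.mean(latencies_s):.2f}s")  # 1.11s, looks better than most users' experience
    print(f"p90:     {percentile(latencies_s, 90):.2f}s")   # 2.40s, closer to what most users actually see
    print(f"max:     {max(latencies_s):.2f}s")              # 3.50s, worth an occasional look for outliers

    slo_target_s, observed_rps, slo_rps = 2.5, 500, 750
    within_slo = percentile(latencies_s, 90) <= slo_target_s and observed_rps <= slo_rps
    print("within SLO" if within_slo else "SLO at risk")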

Dependencies

When defining expectations for your application, it is important to remember that you cannot promise more than your dependencies allow: your availability ceiling is the product of your dependencies' availabilities. This means if you have 2 dependencies that each promise to be available 99.5% of the time, you cannot promise your own availability to be more than 99.0025% (99.5% × 99.5%). These inter-service contracts (the SLOs they are promising to you) must be part of the data you take in when crafting your own SLO. If they are not able to meet the level you need to deliver, that is a signal for your team to focus on how you can lessen the dependency on that service: is it something you can live without, results you could cache, etc.?
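
A minimal sketch of that composite-availability math, assuming serially required dependencies and treating your own service as perfectly available for simplicity:

    # Hypothetical sketch: your availability ceiling is the product of dependency promises.
    import math

    def max_promisable_availability(dependency_slos: list[float]) -> float:
        """Upper bound on your availability given serially required dependencies."""
        return math.prod(dependency_slos)

    deps = [0.995, 0.995]  # two dependencies, each promising 99.5%
    print(f"{max_promisable_availability(deps):.4%}")  # -> 99.0025%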

Historical Trends

The best place to start crafting SLOs is the existing visibility you have into your application. Historical trends are the easiest way to gain an understanding of what your customers have come to expect from you. Based on what you know and can learn from your monitoring, begin selecting simple SLIs and creating simple SLOs tied to user satisfaction. In practice, the main SLOs will be availability, latency, and error rate. The SLIs that feed these may include the volume, latency, and HTTP status codes of requests to your application's endpoints, the queue depth or the time to process 100 items in a queue, and of course host-level metrics that may indicate why you are unable to deliver your workload.
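
As a rough illustration, here is a sketch of deriving availability and error-rate SLIs from HTTP status codes pulled out of historical monitoring data. The request records are a tiny invented sample and the "status >= 500 means bad" rule is an assumption you would tune per service.

    # Hypothetical sketch: availability and error rate derived from status codes.
    requests = [
        {"status": 200, "duration_s": 0.4},
        {"status": 200, "duration_s": 1.2},
        {"status": 200, "duration_s": 0.9},
        {"status": 200, "duration_s": 0.7},
        {"status": 500, "duration_s": 2.8},
    ]

    total = len(requests)
    server_errors = sum(1 for r in requests if r["status"] >= 500)

    availability = (total - server_errors) / total  # good requests / all requests
    error_rate = server_errors / total

    print(f"availability: {availability:.1%}")  # 80.0% on this tiny sample
    print(f"error rate:   {error_rate:.1%}")    # 20.0%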

Importantly, we don't necessarily want to alert on CPU if the response times, availability, and error rates are all within acceptable range. Some systems run hotter than others, meaning 85-100% of CPU might be utilized for a specific workload. As long as there is no impact to our end users (customers or consuming systems), there is little reason to page someone to review a healthy system. Certainly this is decided on a case-by-case basis, but the main idea is that if the critical workflows are not being negatively affected, what are we concerned about? If we are getting alerts on the latency SLO, however, we might find that CPU at 95%, or memory at 95% with constant garbage collection, are good SLIs for explaining why that is occurring.
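
One way to picture that policy is the sketch below: page only when a user-facing SLO is threatened, and surface resource SLIs such as CPU and memory purely as diagnostic context. The thresholds and metric names are illustrative, not a prescribed alerting rule.

    # Hypothetical sketch: page on SLO impact, attach resource SLIs as context.
    def should_page(p90_latency_s: float, latency_slo_s: float) -> bool:
        return p90_latency_s > latency_slo_s

    def diagnostic_context(cpu_pct: float, mem_pct: float, gc_pause_ms: float) -> list[str]:
        hints = []
        if cpu_pct >= 90:
            hints.append(f"CPU at {cpu_pct:.0f}%")
        if mem_pct >= 90:
            hints.append(f"memory at {mem_pct:.0f}% (check GC, pauses={gc_pause_ms}ms)")
        return hints

    # Hot but healthy: no page even at 95% CPU
    print(should_page(p90_latency_s=1.8, latency_slo_s=2.5))  # False
    # SLO breached: page, and attach the likely causes
    if should_page(p90_latency_s=3.1, latency_slo_s=2.5):
        print("PAGE:", diagnostic_context(cpu_pct=95, mem_pct=96, gc_pause_ms=450))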

Iteration

Once simple SLOs have been established, the work is not done. SLOs should be periodically reviewed to determine whether they are still valid. Existing measurements will of course be reviewed during the bi-weekly SRE KPI meetings to assess whether existing SLOs are being met, but periodically it is important to review whether the SLOs themselves still make sense. Any fundamental change to code, infrastructure, or consumption can drastically change the expectations on a system. Proactive review of SLOs is critical and should happen on an ad hoc basis tied to major changes to a system, as well as every time a PRR (production readiness review) is performed for a system (every 6 months to 2 years).

How do we spend the budget?

The error budget is the unreliability the SLO allows; a 99.9% availability target, for example, leaves a 0.1% budget. The budget is spent by anything that interrupts the ability to meet the SLO: a deployment that goes outside the maintenance window, an outage caused by a failing dependency or a race condition, or a series of workloads that take too long to process in a queue or respond too slowly to the end user. All of these things are expected, which is why we never anticipate being able to deliver satisfaction 100% of the time. As a counterpoint, if your service is very robust and has not missed any SLOs for an extended period of time, you may feel you are in that "waste" zone and can choose to take more risks. These risks might mean performing an extra release off-cycle to get more features out, or running a test in production to get more data about a new feature. While it is rare that a team actually chooses to "spend" the budget this way, it is important to track. If you find yourself in this situation, it may mean you need to revisit your SLO levels, as they may be too lenient for what you are delivering.
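
The arithmetic behind the budget is simple, as the sketch below shows. The 99.9% monthly availability SLO and the 12-minute incident are assumed purely for illustration.

    # Hypothetical sketch: error budget as the time the SLO leaves on the table.
    from datetime import timedelta

    slo = 0.999
    window = timedelta(days=30)

    budget = window * (1 - slo)    # total allowed "badness" per window
    spent = timedelta(minutes=12)  # e.g. a deploy that went outside the maintenance window
    remaining = budget - spent

    print(f"budget:    {budget.total_seconds() / 60:.1f} minutes")     # ~43.2
    print(f"remaining: {remaining.total_seconds() / 60:.1f} minutes")  # ~31.2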

Simple SLOs

SLOs will start off simple for all applications. We will focus on availability, latency, and error rate at an overall level for each application. These values can be reviewed historically and charted easily within New Relic.

Maturation over time

As more information is gathered across services, it is natural to break down your overall SLOs into smaller components. Defining these critical user journeys (CUJs) gives you a more granular view of the experience you are delivering. You may discover that while you are meeting your error rate SLO, the majority of those errors are coming from one particular flow, meaning a subset of users is very unhappy. By focusing on CUJs, you focus your efforts on specific flows that are more or less important based on business decisions. There might be components of your system that are "nice-to-haves," for which a looser SLO may be acceptable, while other workflows are truly critical.

Over time, teams will work out their own internal flows for more granular SLOs, and as more teams are onboarded, they can begin to focus on cross-system workflows that are critical to end users. For example, how long should it take from the time an order is placed online until it is packed and ready for carrier pickup? Multiple applications are involved in that process, and multiple local SLOs will need to be understood before we can reach that level of detail. Iteration and maturation will drive improvements to the experience over time.
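
To illustrate the CUJ breakdown, here is a sketch where the overall error rate looks acceptable but slicing the same events by flow shows one journey absorbing almost all of the failures. The flow names and counts are hypothetical.

    # Hypothetical sketch: overall error rate vs. per-critical-user-journey error rate.
    from collections import defaultdict

    events = [{"flow": "browse", "error": False}] * 47 + [
        {"flow": "checkout", "error": True},
        {"flow": "checkout", "error": True},
        {"flow": "checkout", "error": False},
    ]

    totals, errors = defaultdict(int), defaultdict(int)
    for e in events:
        totals[e["flow"]] += 1
        errors[e["flow"]] += e["error"]

    overall = sum(errors.values()) / len(events)
    print(f"overall error rate: {overall:.0%}")  # 4%, may still meet the overall SLO
    for flow in totals:
        print(f"{flow}: {errors[flow] / totals[flow]:.0%}")  # browse: 0%, checkout: 67%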