This article is a preview of a talk by Stephan Lips for SLOconf 2023, on May 15 – 18. To watch this talk and many more like it, register for free at sloconf.com.
Service level objectives (SLOs) are fast becoming the industry standard for measuring reliability and helping teams decide when to prioritize it. The first step in adopting an SLO culture is to identify the metrics that matter without drowning in noise and alert fatigue. This article explores how to apply the black box concept to aggregate granular metrics into service level indicators (SLIs) that focus on the user experience as an indicator of system reliability.
To SLI or Not to SLI
In general terms, SLOs define targets for the proper level of reliability of a given product, such as a service or a website. SLOs are applied to or informed by SLIs. An SLI is a measurement determined over a metric, or a piece of data, representing some property of a service. This is where we, as engineers, can get lost in the details: our perpetual proximity to the systems we build and support often leads us to think of system reliability in purely technical terms (e.g., response time, error rate, throughput). While these are certainly valuable metrics, the user experience may be compromised even if the error rate is zero and the response time is well within its SLO. Consider, for example, the response data. Even if well-formed, it may be out of date or flat-out wrong. An error-free and quick response is of no value to a user who expects current and correct data. Error rate and response time remain valuable metrics and SLIs, but focusing exclusively on them would leave higher-level issues undetected.
We could add freshness and correctness SLIs, but by doing so, we increase the number of signals we monitor. And with each signal—or SLI and associated SLO and error budget—we increase alert frequency and make reliability reports unnecessarily complex. In other words, adding SLIs may address a particular aspect of system reliability, but it also introduces additional complexities.
Tales of Black and White Boxes
So, let’s take a step back and borrow a concept from a related discipline: quality engineering, and software testing in particular. Tests commonly fall into one of two categories: black box tests or white box tests.
In systems theory, a black box is an abstraction representing a class of concrete open systems that can be viewed solely in terms of their inputs (stimuli) and outputs (reactions), without any knowledge of their internal workings. A black box test follows the same idea: a given input is expected to result in a particular output, without any consideration for the processing steps in between. Common examples include end-to-end tests.
White box tests, by contrast, are designed with knowledge of the internal structures and workings of an application, and exist to test them. Common examples include unit and integration tests.
User Journey as Black Box
Now that we understand the concept of black box versus white box tests, let’s apply it to our SLIs. As discussed above, a good SLI reflects the user experience, which means considering the entire user journey. Conceptually, a user journey aligns with the black box paradigm: for a given input, a particular output is expected. For example, requests to our API (the “input”) result in responses that provide fresh data to clients within a given time frame (the “output,” including success criteria). Several aspects of this SLI are worth noting:
● The SLI is applied at a system level.
● The SLI aggregates lower-level metrics implicitly and explicitly.
● The SLI is binary: it is either true or false.
Taken together, these aspects yield an SLI that represents the user experience (because it is applied at the system level), measures many indicators by measuring only a few (because it aggregates lower-level metrics), and supports pass/fail attribution to an SLO target (because it is binary). In other words, the user journey is measured as a black box SLI.
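As a rough illustration of such a binary SLI, the API example above could be sketched in Python as follows. The `ApiEvent` type, its field names, and both thresholds are hypothetical and chosen only for this sketch; they are not taken from any particular monitoring tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds, chosen only for illustration.
MAX_LATENCY = timedelta(milliseconds=500)
MAX_DATA_AGE = timedelta(minutes=5)


@dataclass(frozen=True)
class ApiEvent:
    """One request/response pair as observed from the client side (hypothetical)."""
    status_code: int
    latency: timedelta
    data_timestamp: datetime  # when the returned data was last updated
    responded_at: datetime    # when the response was sent


def is_good(event: ApiEvent) -> bool:
    """Binary black box SLI: True only if the whole journey met expectations."""
    succeeded = 200 <= event.status_code < 300
    fast_enough = event.latency <= MAX_LATENCY
    fresh = event.responded_at - event.data_timestamp <= MAX_DATA_AGE
    return succeeded and fast_enough and fresh


now = datetime(2023, 5, 15, 12, 0, tzinfo=timezone.utc)
fresh_event = ApiEvent(200, timedelta(milliseconds=120), now - timedelta(minutes=1), now)
stale_event = ApiEvent(200, timedelta(milliseconds=120), now - timedelta(hours=2), now)

print(is_good(fresh_event))  # True
print(is_good(stale_event))  # False: quick and error-free, but stale
```

The second event would pass an error-rate check and a latency check, yet it fails the journey-level check because the data is two hours old. One boolean per request covers status, speed, and freshness at once.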
White Box to Black Box: An Example
Let’s consider a concrete example. A user requests a new account for a website. After the request is processed successfully, the user receives a confirmation email with an activation link. The user follows the link to activate the new account and log in. This workflow is visualized in the following sequence diagram.
Of particular interest are the account creation and email notification steps, both of which occur asynchronously. The event processing engine, where the request for account creation is queued, offers several opportunities for insightful SLIs: queue length, average processing time, and so on. Those SLIs, however, fall into the white box category: they measure internals that contribute to the user experience but that the user never sees. From the user’s perspective, the system is a black box. The user journey begins with the initial request for account creation (input) and ends with the email containing the activation link (output). Rephrasing the example from earlier, a user-focused (black box) SLI could be: a request for a new account (the “input”) results in an email with a valid activation link being sent within 1 minute (the “output,” including success criteria). This single high-level SLI aggregates several lower-level metrics; it measures many things by measuring only a few.
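A minimal Python sketch of this account-creation SLI might look like the following. It assumes two timestamps can be recorded per journey (when the request arrived and when the email was sent) plus a flag for link validity. The function names and sample data are hypothetical; only the 1-minute deadline comes from the example above.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence

ACTIVATION_DEADLINE = timedelta(minutes=1)  # from the example above


def journey_is_good(
    requested_at: datetime,
    email_sent_at: Optional[datetime],
    link_is_valid: bool,
) -> bool:
    """Black box check: was a valid activation email sent within the deadline?"""
    if email_sent_at is None:  # the email was never sent
        return False
    on_time = email_sent_at - requested_at <= ACTIVATION_DEADLINE
    return on_time and link_is_valid


def sli(outcomes: Sequence[bool]) -> float:
    """Fraction of good journeys; an empty window counts as fully reliable (an assumption)."""
    if not outcomes:
        return 1.0
    return sum(outcomes) / len(outcomes)


t0 = datetime(2023, 5, 15, 12, 0, 0)
journeys = [
    (t0, t0 + timedelta(seconds=20), True),  # good
    (t0, t0 + timedelta(seconds=90), True),  # email sent too late
    (t0, None, False),                       # email never sent
    (t0, t0 + timedelta(seconds=45), True),  # good
]
outcomes = [journey_is_good(*journey) for journey in journeys]
print(outcomes)       # [True, False, False, True]
print(sli(outcomes))  # 0.5
```

Note that nothing in this check refers to the queue, its length, or its processing time. Those internals influence the result only through their effect on when, and whether, the email goes out.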
Let’s switch to the engineering mindset mentioned in the introduction and assume the processing queue is stuck. The high-level black box SLI does not capture queue-specific metrics, which might suggest that a more granular SLI specific to queue size is needed. However, a white box problem like this still affects the error budget burn of the SLO associated with the high-level black box SLI: while the queue is stuck, activation emails are late or never sent, so the affected user journeys count as failures. Monitoring and observability tools allow engineers to diagnose and troubleshoot particular issues, such as a stuck queue, while the black box SLO’s error budget and burn rate show the impact on system reliability. The solution to the stuck processing queue in this example is not an SLI dedicated to the queue, but reliability-focused work to diagnose and correct the root cause of the queue getting stuck.
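To make the link between a stuck queue and the error budget concrete, here is a small Python sketch of the usual burn rate arithmetic: the observed fraction of bad journeys divided by the fraction the SLO allows. The 99% target and the event counts are made-up numbers for illustration, not values from the talk.

```python
def error_budget(slo_target: float) -> float:
    """Allowed fraction of bad journeys, e.g. about 0.01 for a 99% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(good: int, total: int, slo_target: float) -> float:
    """How fast the budget is being consumed; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    bad_fraction = (total - good) / total
    return bad_fraction / error_budget(slo_target)


SLO_TARGET = 0.99  # hypothetical: 99% of account-creation journeys are good

# Normal window: 5 of 1000 journeys missed the 1-minute deadline.
print(round(burn_rate(good=995, total=1000, slo_target=SLO_TARGET), 2))  # 0.5

# Stuck-queue window: 200 of 1000 journeys never produced an email in time.
print(round(burn_rate(good=800, total=1000, slo_target=SLO_TARGET), 2))  # 20.0
```

Under these assumed numbers, the stuck queue appears as a burn rate of roughly 20 on the black box SLO even though no queue-specific SLI exists. The SLO signals that users are affected, and diagnostic tooling is then used to find out why.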
Summary
This article introduces a thought model for SLIs that borrows a common paradigm from quality engineering. It offers a different way to think about SLIs and supports the fundamental objective of SLIs and the reliability stack they inform: ensuring a positive and reliable user experience by measuring reliability and providing quantitative support for decisions on prioritizing development efforts. Only a happy user is a continuous user.