Is Site Reliability Engineering (SRE) another case of “Do what Google does”? Viewed with a discerning lens, the practices and tooling that high-performing engineering teams bring to the market can be very beneficial. The SRE model is one of them, and it stands to benefit DevOps teams substantially.
If I could change the name from Site Reliability to Service Reliability, I would. Applications are increasingly service oriented, architectures are based on services, and the SRE model is generally more customer centric than previous application support approaches. SRE has become the production support companion for modern application development.
While the primary objective of both the Network Operations Center (NOC) and SRE is application uptime, the responsibility of supporting applications is now much greater. Those who support modern applications are under pressure from the business and from development to respond quickly and accurately. Previously, this work was done by a NOC alone.
In its most traditional form, the NOC was a staffed room, running 24x7x365, where telemetry from the infrastructure and the application was presented on screens, usually with one screen dedicated to sports and video games. If you staffed the NOC, your job was to keep an eye on the application, spot any alarms and, if there was an alarm, find the right person (usually based on a spreadsheet) to fix it.
The NOC has also classically been more tied to the infrastructure than to the application. Its staff would get alarms for the application, but they would be unable to dig in and figure out what specifically in the application was causing the issue. The NOC model serves waterfall development and monolithic architectures just fine. But now, with rapid release velocity and new application architectures, a new approach is needed. This is where SRE comes in.
Customer Obsession
The first big difference between the SRE model and the traditional NOC model is that the SRE role does not start when something breaks; it starts before. The highest-performing SRE teams I’ve seen always work backwards from the customer. They use the customer to decide which metrics to measure and collect, and how to interpret them.
You will often hear about RED (Rate, Errors, Duration) metrics, a method that grew out of Google’s four golden signals (latency, traffic, errors and saturation), but these are in no way best practice for everyone, nor the only approach. They may be the best practice for the teams that defined them and for how their users interact with their applications. For your application, other signals, such as saturation, can be even more important. Also, it is not the individual metrics that matter; it is their relationship to each other and, ultimately, to the customer’s experience. The customer obsession bleeds into everything SREs do, from setting up automation to educating the rest of the technical teams.
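To make the RED idea concrete, here is a minimal Python sketch that summarizes request rate, error ratio and duration over one time window. The `Request` record shape, the `red_summary` name and the choice of a 95th-percentile duration are illustrative assumptions, not a prescribed standard; your own customer-facing signals may look quite different.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One handled request (hypothetical record shape for this sketch)."""
    duration_ms: float
    ok: bool


def red_summary(requests, window_seconds):
    """Summarize rate, error ratio and p95 duration for one time window."""
    if not requests:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": 0.0}
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile, using integer math to avoid float rounding.
    rank = (95 * len(durations) + 99) // 100
    errors = sum(1 for r in requests if not r.ok)
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p95_ms": durations[rank - 1],
    }


# Example: 20 requests in a 10-second window, two of which failed.
sample = [Request(duration_ms=10.0 * i, ok=(i > 2)) for i in range(1, 21)]
summary = red_summary(sample, window_seconds=10)
```

Reading the three numbers together is the point: a steady rate with a rising error ratio tells a different customer story than a falling rate with flat errors.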
To be customer obsessed, SRE teams need to have good visibility into what is going on.
Observability
I’m sure you have heard the popular new term “observability.” You might have thought to yourself, “Isn’t that just monitoring?” Observability’s foundation is monitoring, but its purpose is to go beyond the static, point-in-time view of applications that traditional monitoring has provided. It also addresses new challenges that show up as application architectures become increasingly fragmented into services and more complex to monitor holistically.
Observability in some ways is about removing the human observer. In the NOC, you have staff watching aspects of the application and infrastructure to decide when there is an issue. Platforms that are designed to be observable have detectors that tell you when there is a problem. They also do not assume that those on-call already know how every layer of the application works, because with modern applications that assumption no longer holds.
Observability is both a tooling approach and a strategy for building applications. Applications should be built to be observable, which means the application itself emits enough telemetry to support its observability. Monitoring is basically part of your application now.
Additionally, application degradation (such as increased latency) is the new downtime, so the ability to measure the depth and breadth of application activity is critical. Observability also helps identify reduced application performance and impending issues in services that could lead to catastrophic events.
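As a rough illustration of an application emitting its own telemetry, the Python sketch below wraps a function so that every call records its duration and outcome. The `observed` decorator, the in-memory `TELEMETRY` list and the `checkout` function are all hypothetical stand-ins; a real system would send these events to whatever telemetry pipeline it uses.

```python
import time
from functools import wraps

TELEMETRY = []  # assumption: stand-in for a real telemetry pipeline


def observed(operation):
    """Record the duration and outcome of every call to the wrapped function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            outcome = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                outcome = "error"
                raise  # the caller still sees the original failure
            finally:
                TELEMETRY.append({
                    "operation": operation,
                    "outcome": outcome,
                    "duration_ms": (time.perf_counter() - start) * 1000.0,
                })
        return wrapper
    return decorator


@observed("checkout")
def checkout(cart_total):
    """Hypothetical business function used only to demonstrate the decorator."""
    if cart_total <= 0:
        raise ValueError("cart is empty")
    return f"charged {cart_total}"
```

Calling `checkout(10)` appends one event with outcome `ok`; calling `checkout(0)` raises `ValueError` and appends one event with outcome `error`.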
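One way to picture a detector that treats degradation as seriously as downtime is the small Python sketch below. It compares recent latency with a baseline and reports `degraded` before the service is actually `down`. The function name, the thresholds and the use of the median are assumptions made for illustration, not a recommended tuning.

```python
from statistics import median


def service_state(baseline_ms, recent_ms, recent_errors,
                  slowdown_factor=2.0, max_error_ratio=0.5):
    """Classify a service as ok, degraded, down, or no_data.

    baseline_ms / recent_ms are lists of request latencies in milliseconds;
    recent_errors is how many of the recent requests failed.
    """
    if not baseline_ms or not recent_ms:
        return "no_data"
    if recent_errors / len(recent_ms) >= max_error_ratio:
        return "down"
    # Slower than the baseline by the chosen factor counts as degradation.
    if median(recent_ms) > slowdown_factor * median(baseline_ms):
        return "degraded"
    return "ok"


baseline = [100, 110, 90, 105]
state = service_state(baseline, [250, 300, 240, 260], recent_errors=0)
```

Here the service still answers every request, yet it is flagged as degraded because users are waiting more than twice as long as usual.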
Automation
There is one big assumption that comes with the SRE model: automation, and the willingness to automate. The SRE model enables greater context and visibility across the organization, but it has to do so without increasing human effort. Without automation, that is not possible.
Automation includes:
- Turning insights from observability into incident response.
- Gathering context from systems and reporting it to observability tools and incident response.
- Mechanisms to support those who are on-call with self-service remediation.
- Automation to understand similar incidents.
- Automation to alert/page/call people based on on-call schedules and policies.
- Automation to recommend responders based on their incident response history.
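The last two items in the list above can be sketched in a few lines of Python: looking up who is on-call from a schedule, and recommending responders from incident history. The schedule format, the engineer names and both function names are hypothetical; a real paging tool would also handle time zones, overrides and escalation policies.

```python
from collections import Counter
from datetime import datetime

# Hypothetical rotation: (start_hour, end_hour, engineer), one shared time zone.
SCHEDULE = [
    (0, 8, "dana"),
    (8, 16, "lee"),
    (16, 24, "sam"),
]


def on_call(schedule, when):
    """Return the engineer whose shift covers the given datetime, if any."""
    for start_hour, end_hour, engineer in schedule:
        if start_hour <= when.hour < end_hour:
            return engineer
    return None


def recommend_responders(history, service, limit=2):
    """Rank people by how many past incidents they handled for a service."""
    counts = Counter(who for svc, who in history if svc == service)
    return [who for who, _ in counts.most_common(limit)]


# Hypothetical incident history as (service, responder) pairs.
HISTORY = [
    ("payments", "lee"),
    ("payments", "sam"),
    ("payments", "lee"),
    ("search", "dana"),
]

page_target = on_call(SCHEDULE, datetime(2024, 1, 1, 9, 30))
suggested = recommend_responders(HISTORY, "payments")
```

With this sample data, the 09:30 page goes to `lee`, and `lee` is also ranked first for payments incidents, followed by `sam`.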
Anyone Can Be On-Call
One of the huge benefits of the SRE model is that anyone can be on-call. The practices and automation of the SRE model allow anyone to be added to a rotation and on-call schedule. Most companies have a dedicated team for this, and in companies transitioning from the traditional NOC to a modern one, that is exactly the first change that takes place: the team leaves the dedicated physical room and is on-call virtually instead.
It is not just the people who support the application who are on-call; the SRE model allows developers and quality engineers to be on-call as well. Arguably, you can’t embrace shift-left without embracing SRE.
Context
Automation without data does not create a truly observable and actionable environment. Leveraging data across your delivery chain that can support action is critical. For example, this means the ability to leverage tracing to pinpoint production issues down to a service, or embed more context as part of the payload of alerts to bring more clarity to those on-call.
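As a small illustration of embedding context in an alert payload, the Python sketch below builds an alert that carries a trace ID, the last deploy and a runbook reference when they are known. Every field name and value here is an assumption chosen for the example, not the schema of any particular alerting tool.

```python
import json


def build_alert(service, summary, trace_id=None, last_deploy=None, runbook=None):
    """Build an alert payload, including only the context that is available."""
    context = {
        "trace_id": trace_id,
        "last_deploy": last_deploy,
        "runbook": runbook,
    }
    return {
        "service": service,
        "summary": summary,
        # Drop unknown fields so responders are not shown empty placeholders.
        "context": {key: value for key, value in context.items() if value is not None},
    }


alert = build_alert(
    service="payments",
    summary="p95 latency above threshold",
    trace_id="abc123",            # hypothetical trace identifier
    last_deploy="2024-01-01T09:12",  # hypothetical deploy timestamp
)
payload = json.dumps(alert)
```

The person who is paged can then jump straight to the offending trace and see that a deploy happened shortly before the alert, instead of starting from a bare "service is slow" message.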
Collecting data is one thing, but data is useless if you do not know how you are going to leverage it. The SRE model is obsessed with leveraging data and running controlled experiments to ensure the data collected supports uptime and the user experience.
Stewardship and Strategy
Finally, SREs spend more time on stewardship and strategy than the traditional NOC. Some SRE groups run dojos for the organization to help spread the idea of quality and application support across all technical teams. They collaborate with developers to better understand which aspects of the application the SRE processes can leverage to support it. SREs also spend a lot of time creating the organization’s strategy for production application support: choosing key metrics, deciding how automation will be set up and consumed, and creating policies on how on-call should function.
Before you get upset, SRE does not mean the NOC is dead. The modern NOC can be driven by an SRE model, and I’ve seen this in place in several companies. The NOC is also necessary to support many on-prem applications.
What happens is that, instead of a staffed room, the NOC becomes a group of engineers who build automation, set policies and practices, and steward quality across all teams. Similar to DevOps, SRE has elements of culture, principles and practice. Organizations whose team structure makes the NOC responsible for application support often do not change that structure for SRE; the NOC becomes the SRE team.
The NOC is changing, and has to change, as modern applications become more complex, release velocity increases and users become more demanding. The SRE model is an approach to transforming the NOC into a modern version of itself or into a whole new organization and strategy. The goal is not only to support applications when something breaks, but also to make sure customers’ expectations are met and systems support current and future development efforts, without increasing the human workload.