Site reliability engineering (SRE) isn’t a new term or practice. The practice of applying software engineering skills and principles to operations problems and tasks happened even before site reliability engineer was a defined job title. But organizing a proactive approach to building and maintaining software drives long-term success in improving operational efficiency, data-driven roadmap planning and general uptime and reliability. All these advantages account for the widespread adoption of SRE.
In this article, I’ll dive into what it takes to get into site reliability engineering, how to adopt it within your own organization and some of the core principles and best practices you’ll need to keep in mind as you move forward in your SRE maturity journey.
What is SRE?
SRE, or site reliability engineering, is the practice of applying software engineering expertise to DevOps and operations problems. Often, this means proactively writing code and developing internal applications or services to combat reliability and performance concerns. SRE has been practiced for many years, but Google popularized it more recently by publishing Site Reliability Engineering: How Google Runs Production Systems in March 2016.
SRE is a complicated acronym because it refers both to a job title, site reliability engineer, and to a general practice adopted by development and IT teams, site reliability engineering. SRE teams are often organized differently depending on the organization and the services they support. Sometimes site reliability engineers (SREs) are sprinkled among development teams. This means they can vocalize concerns during roadmap planning and work closely with QA teams, collaborating on everything from shipping features to staging to managing production environments.
Another approach organizes an SRE team as a standalone group. The SRE team as a whole is focused on responsibilities including monitoring applications and infrastructure, establishing reliability metrics, tracking issues with new releases and handling on-call duties. No matter how they are organized, what remains the same for all these SREs is their focus: reliability and performance.
DevOps Vs. SRE
The rapid adoption of DevOps practices may have you wondering, “How is SRE different from DevOps?” Commonly, the two terms go hand-in-hand. SRE is a practice that doesn’t necessarily have to be part of DevOps culture and adoption but is often implemented by organizations that are further along in their DevOps journey. The further along you are in your DevOps maturity model, the more likely you are to have bought into SRE practices.
The conversation isn’t ‘SRE versus DevOps’; the conversation is about establishing a proactive SRE team and operating model to help strengthen your DevOps practice. Whether you choose an SRE framework integrated into your development and IT teams, or you want to make SRE a standalone business unit, you need to understand the general responsibilities of an SRE team.
SRE Team Responsibilities
Site reliability engineers are responsible for defining what performant and reliable really mean when you talk about your applications, services and infrastructure. The day-to-day tasks can range anywhere from instrumenting new monitoring solutions to building custom apps for technical support teams. They might ship new code to production to fix bugs or deliver new features, or they might respond to incidents in real-time and work closely with support teams to deliver positive customer experiences.
At the end of the day, site reliability engineers are the key to understanding exactly how customers experience your product and how to track system performance and reliability through your customers’ eyes. SRE teams need to figure out exactly where service boundaries exist between what development teams ship and what customers experience. From there, they need to find ways to monitor reliability and performance concerns in a way that helps your internal teams proactively identify risks and deliver better software.
SRE teams spread knowledge across the product and development teams to consistently define reliability across the entire organization. With everyone on the same page, the engineering teams can make data-driven decisions when it comes to releasing new features or improving current experiences in production.
Site Reliability Engineering Principles and Practices
At a very high level, Google defines the core of SRE principles and practices as an ability to ‘embrace risk.’ Site reliability engineers balance the organizational need for constant innovation and delivery of new software with the reliability and performance of production environments.
The practice of SRE grows as the adoption of DevOps grows because they both help balance the sometimes opposing needs of the development and operations teams. Site reliability engineers inject processes into the CI/CD and software delivery workflows to improve performance and reliability, but they also know when to sacrifice stability for speed. By working closely with DevOps teams to understand the critical components of their applications and infrastructure, SREs can also learn which components are non-critical and can tolerate more risk.
Creating transparency across all teams about the health of their applications and systems can help site reliability engineers determine a level of risk they can feel comfortable with. The level of desired service availability and acceptable performance issues that you can reasonably allow will depend on the type of service you support as well. SRE principles and practices embrace experimentation and require a dedication to proactively understanding the health of the services they support.
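To make a desired availability level more concrete, the short sketch below converts an availability target into the downtime it would permit over a window. The function name allowed_downtime and the 30-day default window are assumptions chosen for illustration; your own targets and windows will depend on the type of service you support.

```python
from datetime import timedelta


def allowed_downtime(
    availability_target: float,
    window: timedelta = timedelta(days=30),  # assumed window, not a standard
) -> timedelta:
    """Return how much downtime a target permits over the given window."""
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    # The permitted downtime is the share of the window NOT covered by the target.
    return window * (1.0 - availability_target)


if __name__ == "__main__":
    # Hypothetical targets: each extra "nine" shrinks the room for failure tenfold.
    for target in (0.99, 0.999, 0.9999):
        print(f"{target:.2%} availability -> {allowed_downtime(target)} of downtime")
```

Comparing the results for a few targets is a quick way to start a conversation about how much risk a given service can reasonably carry.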
An SRE Operating and Maturity Model
You can perform a lot of the responsibilities asked of a site reliability engineer and still have a software engineer job title. So, how do you know how mature your site reliability engineering practice is? Luckily, I’ll lay out a quick way to build an effective SRE operating model and track your maturity against it. An SRE operating model typically includes three elements, which you might achieve in phases:
- A team (or at least one person) dedicated to the practice of SRE.
- Deep integration and influence across product, development and operations teams.
- Autonomy to automate workflows and write code for nearly any part of your application or system.
Your SRE maturity depends on where your organization falls along these three elements of an SRE operating model. If you’ve taken the steps to make an SRE team or hire your first site reliability engineer, you’re at the beginning of your journey. If you have a team and they are a valued part of roadmap discussions, QA, deployment workflows and incident management processes, then you have a somewhat mature SRE practice.
An organization only reaches complete SRE maturity when the SRE business unit has the autonomy to automate workflows, build applications, own monitoring and alerting solutions, or interject themselves into nearly any conversation. Vocalizing performance and reliability concerns upfront and having a proactive discussion about those concerns is always better than simply ignoring them until it’s too late.
Monitoring, CI/CD and Organizational Automation
Site reliability engineers can and will automate just about anything and everything. If it can proactively detect, remediate or resolve an issue, it needs to be automated. From continuous integration and delivery practices to production environment monitoring, SREs should have some visibility into all of it. If they can identify ways to proactively uncover performance and reliability issues, then they need to have the authority to implement those changes.
Today’s DevOps and IT capabilities around automation, monitoring, artificial intelligence and machine learning give SRE teams a huge advantage when identifying issues, responding to them and fixing them. Organizations with mature DevOps and SRE practices can catch problems in staging and they can also build automated incident management workflows and self-correcting systems. By determining critical components in your applications and infrastructure, SREs can narrow down the scope of things that can cause major problems.
The Practice of Service Levels (SLIs, SLOs and SLAs)
Service levels come into play to help SRE teams communicate the true health of digital products and services to all stakeholders. This is done by identifying and measuring critical components that are key to delivering positive customer experiences. In particular, they need to know when one or more components expose functionality to external customers. We call these intersection points system boundaries. System boundaries are the place where site reliability engineers need to apply service-level indicators and objectives to their metrics in order to tell the real story of system performance and reliability.
- Service-level indicators (SLIs) are the key measurements to determine the availability of a system.
- Service-level objectives (SLOs) are goals you set for how much availability you expect out of a system.
- Service-level agreements (SLAs) are the legal contracts that explain what happens if the system doesn’t meet its SLO.
While SREs aren’t always responsible for managing service levels, it often falls within their purview. By tracking SLIs and tying them to SLOs, you can set goals around the performance of a system. Google’s SRE book defines the four golden signals of monitoring as latency, traffic, errors and saturation, and these are common starting points for SLIs. So, for example, you could look at an API call and track its number of successful/failed requests (the SLI) against a general percentage of requests that need to be successful for customers to have a good experience (the SLO).
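The API example above can be sketched in a few lines of Python. This is a minimal illustration, not a monitoring implementation: the function names, the request counts and the 99.9% target are all assumptions made up for the example.

```python
def availability_sli(successful: int, failed: int) -> float:
    """Return the fraction of requests that succeeded (the SLI)."""
    total = successful + failed
    if total == 0:
        # Assumption: with no traffic, nothing failed, so report full availability.
        return 1.0
    return successful / total


def meets_slo(sli: float, slo_target: float) -> bool:
    """Return True when the measured SLI is at or above the SLO target."""
    return sli >= slo_target


if __name__ == "__main__":
    # Hypothetical request counts for one API endpoint over some window.
    sli = availability_sli(successful=99_950, failed=50)
    print(f"SLI: {sli:.4%}")
    print("SLO met:", meets_slo(sli, slo_target=0.999))
```

In practice the counts would come from your monitoring system rather than hard-coded numbers, but the comparison between the measured indicator and the objective stays the same.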
SRE teams will often set strict SLOs on critical components within their applications and services to better understand how strict an SLA they can agree to with customers. From here, the team can apply error budgets, the amount of unreliability an SLO still permits over a given window, to understand how quickly they must resolve issues in order to stay compliant with their SLOs. Service levels allow teams to aggregate metrics and create a transparent view of uptime, performance and reliability across the entire organization. At a glance, business leaders can use service levels to monitor compliance across multiple teams, applications and services to gain a comprehensive understanding of their system’s health.
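One common way to express an error budget is as the number of requests that may fail in a window before the SLO is breached. The sketch below assumes a request-based SLO; the function names and the sample numbers are hypothetical and only meant to show the arithmetic.

```python
def error_budget(total_requests: int, slo_target: float) -> float:
    """Return how many requests may fail in the window without breaching the SLO."""
    return total_requests * (1.0 - slo_target)


def budget_remaining(total_requests: int, failed_requests: int, slo_target: float) -> float:
    """Return the unspent fraction of the error budget (negative once the SLO is breached)."""
    budget = error_budget(total_requests, slo_target)
    if budget == 0:
        # A 100% target (or no traffic) leaves no budget: any failure is a breach.
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / budget


if __name__ == "__main__":
    # Hypothetical window: one million requests against a 99.9% SLO, 250 failures so far.
    total, failed, target = 1_000_000, 250, 0.999
    print(f"Allowed failures: {error_budget(total, target):.0f}")
    print(f"Budget remaining: {budget_remaining(total, failed, target):.1%}")
```

A team might use the remaining fraction to decide whether to keep shipping features or to pause and spend time on reliability work.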
Adopting SRE Best Practices
Adopting SRE best practices and principles won’t happen overnight. It takes time and effort to proactively monitor your teams and systems for performance and reliability concerns. But, in the end, your DevOps teams and especially your customers will thank you for deciding to take advantage of site reliability engineering.