SRE (Part 1): A Modern Overview

Site Reliability Engineering (SRE) has become a popular topic of discussion over the last several years at many of the companies I have interacted with. This article is the first of two that I’m excited to share. I’ve come across many organizations that are trying to build and implement this ideology, and all of them are asking the same questions: What is SRE? Who are SREs? How do we get it? Where do we start?

I, just like everyone else, have opinions on this topic. However, there is common ground between us all. SRE is not only about tooling and tech, and it’s not only about the unicorn hires you can make; it is primarily a cultural shift within companies. It changes the way developers, product owners, admins, general engineers and others interact with one another. You have to shift your operating model and how you approach building in your organization.

Now, as a disclaimer, these are just my opinions and experiences from building organizations like this and talking to other companies that have implemented SRE or are currently implementing it. There is no prescription for building SRE. Every company will implement it at its own cadence and according to its own internal philosophies. Over time, you will build it in a way that fits your organization and internal operating model. Do not move too fast and attempt to force the model into your organization. Fail, but fail gracefully and learn from those failures. You’ll get to your Narnia at some point.

Definitions

Let’s start with some quick definitions of the acronyms we will be using. The explanations are short, but they give you the general outline.

SRE: Site Reliability Engineer, an individual who possesses an interest in infrastructure and operations, with a software engineering background.
SWE: Software Engineer, someone who writes the application/service software.
Infrastructure SRE: a variation of SRE that works on infrastructure-related projects (monitoring, provisioning, IaaS, etc.).
Application/Service SRE: a variation of SRE that is dedicated to a specific application/service group. These are the groups that build the products that generate revenue for your company (hopefully).

The definition of SRE is quite scattered, and I don’t think anyone truly knows what it means. And no, unicorns will not show up below.

This is what SRE means to me:

SRE is a core group of individuals who have a wide array of skills. They are technology generalists more than specialists. Their skill sets span operations, infrastructure, networking, development, hardware, distributed systems, monitoring, stability, capacity planning, software engineering, etc.
SREs are responsible for architecture and implementation of technical infrastructure and supporting services, focusing on stability, security and scalability of those platforms.
SREs should be building standard best practices, platforms, services, infra, etc., that reach the global organization.
SRE is not just about the technology. SRE is a mindset, thought process and cultural shift.
SRE shouldn’t be made mandatory for everyone in your company. Teams will have the choice to fund SRE support if they need/want it.

Areas of Responsibility

Technology has so many different aspects that it may seem hard to narrow down what SREs should actually be spending their time and energy on. Since these individuals may not be directly building public-facing products or features (depending on your company, of course), what exactly will they be doing?

The list below is by no means exhaustive, but it covers the general areas where you should start to focus your SRE organization. Without this underlying infrastructure plumbing being properly built, monitored and scaled, it will be difficult to preach this new cultural mindset to team members outside the group. The items below may seem very infrastructure heavy, and some are. Many of them can be split between application SRE and infrastructure SRE, where one is the primary producer and the other is the primary consumer.

Monitoring.
Configuration management and automation.
Infrastructure services/networking/platforms/architecture (including hardware and general compute).
Infrastructure tooling and capacity planning.
Big data/data warehousing/data analytics.
Documentation and runbooks.
Incident response and incident management process.

Is SRE a Single Team or Multiple Teams?

There is no prescription, and results will vary depending on multiple factors. However, I recommend splitting SRE logically into multiple teams if you have the staff to accommodate it. You can start small with a generalist team that covers multiple areas and eventually break it into smaller, more manageable pieces. The areas are:

Infrastructure (hypervisor, storage, operating systems, containers, automation, SDN).
Observability (monitoring, telemetry, event correlation, trend analysis, IN&IM).
Tooling (custom tools, configuration management, developer experience tools).
Services (databases, message queues, orchestration, micro-services).
Apps (mostly application-side knowledge and the ability to support/troubleshoot).

Even though the teams are split in some sense, the reporting chain must remain centralized throughout the organization. Decentralizing, in my opinion, will break the effectiveness and communication of the team and result in an unsuccessful SRE implementation. SREs should report to like-minded individuals whose mission is to push the SRE agenda and philosophies. Agenda is not the only reason. A central hierarchy allows for governance around tooling, best practices and general technical/architectural guidance. From there, you can let those folks cross-pollinate in other groups and teams and spread those standards across the board. A breakdown in structure and communication will inevitably cause a deterioration in the mission.

Another reason for centralization is that it lets you make sure your SRE hires do not become a dumping ground for operational work the app teams do not want to do or have pushed off. I have seen this happen in the past; it discourages the new hires and isolates them. They do not have a mission or a central team to work with, and they end up building things on their own that only fit the needs of whoever they are supporting, instead of seeing the bigger picture.

The application SWEs should still be doing operations-related work, since they are the ones who created the need for it. The application SRE can help offload some of that work and can learn through the central chain which tools other teams are using to do the same. Operational issues can quickly turn into error budgeting for the application teams.
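Error budgets come up only in passing above, so here is a minimal sketch of the arithmetic behind them, assuming the common convention that the budget is the share of a time window in which an availability target allows a service to be unavailable. The function names, the 30-day window and the 99.9% target are illustrative assumptions, not something prescribed in this article.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of unavailability a given availability target allows in the window.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # Assumed example: a 99.9% target over 30 days, with 21.6 minutes of downtime so far.
    print(round(error_budget_minutes(0.999), 1))
    print(round(budget_remaining(0.999, 21.6), 2))
```

With these assumed numbers, a 99.9% target over 30 days leaves roughly 43 minutes of budget, so operational issues that eat into it become visible to the application team quickly.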

Your SREs will likely be split into two disciplines: application SRE and infrastructure SRE. This allows you to cover both sides while pushing the same mission throughout all the teams. Infrastructure SREs are out there, and there are plenty of them. I have seen the role come as a natural evolution of DevOps, system engineers and system administrators. You should focus on building your internal talent pool first and then go outside for hires.

Having some of the SWEs on the team volunteer and shift work over to SRE helps speed up the process for that specific application. You can leverage their knowledge of the code base and have them learn and understand the centralized mission and tooling to bring down to their stack. These individuals can also train the central team on the app stack.

However, not every team in your company will need dedicated SRE support. Part two of this series will be a deeper dive into application and infrastructure SRE, and how to organize your teams from a management perspective.

— Anthony Caiafa

Anthony Caiafa

Anthony Caiafa is the chief technology officer at SS&C Technologies. Anthony is a technology leader and innovator. He has a deep understanding and experience across the entire stack, and focuses on taking his years of experience as a hands on engineer and mixing it with executive leadership.