Site Reliability Engineering (SRE) has become a popular topic of discussion over the last several years at many of the companies I have interacted with. This article is the first of two that I’m eager to share, as I’ve come across many organizations that are trying to build and implement this ideology, and all of them are asking the same questions: What is SRE? Who are SREs? How do we get it? Where do we start?
Like everyone else, I have opinions on this topic. However, there is common ground between us all: SRE is not only about tooling and tech, and it’s not only about the unicorn hires you can make. It is primarily a cultural shift within companies. It changes the way developers, product owners, admins, general engineers and others interact with one another. You have to shift your operating model and how you approach building in your organization.
Now, as a disclaimer, these are my opinions and experiences from building organizations like this and talking to other companies that have implemented SRE or are currently implementing it. There is no prescription for building SRE. Each and every company will implement it at its own cadence and according to its own internal philosophies. You will, over time, build it in a way that fits your organization and internal operating model. Do not move too fast and attempt to force the model into your organization. Fail, but fail gracefully and learn from those failures. You’ll get to your Narnia at some point.
Let’s start with some quick definitions of the acronyms we will be using. These will be short explanations but give you the general outline.
The definition of SRE is quite scattered, and I don’t think anyone truly knows what it means. And no, unicorns will not show up below.
This is what SRE means to me:
Technology has so many different aspects that it may seem hard to narrow down what SREs should actually be spending their time and energy on. Since these individuals may not be directly building public-facing products or features (depending on your company, of course), what exactly will they be doing?
This is by no means limited to the list below, but these should be the general areas in which you start to focus your SRE organization. Without this underlying infrastructure plumbing being properly built, monitored and scaled, it will be difficult to preach this new cultural mindset to the outside team members. The items below may seem very infrastructure heavy, and some may be. Many of these can be split between application SRE and infrastructure SRE. One is the primary producer, while the other is the primary consumer.
There is no prescription, and results will vary depending on multiple factors. However, I recommend that this be split logically into multiple teams if you have the staff to accommodate it. You can start small with a generalist team that covers multiple areas and eventually break it into smaller, more manageable pieces. The areas are:
Even though they are split in some sense, the reporting chain must remain centralized throughout the organization. Decentralizing, in my opinion, will break the effectiveness and communication of the team and result in an unsuccessful SRE implementation. SREs should report to like-minded individuals whose mission is to push the SRE agenda and philosophies. Agenda is not the only reason. A central hierarchy allows for governance around tooling, best practices and general technical and architectural guidance. From there, you can let those folks cross-pollinate with other groups and teams and spread those standards across the board. A breakdown in structure and communication will inevitably cause a deterioration in the mission.
Another reason for centralization is that it allows you to make sure your SRE hires do not become a dumping ground for operational work the app teams do not want to do or have pushed off. I have seen this happen in the past; it discourages the new hires and isolates them. They do not have a mission or a central team to work with, and they end up building things on their own that only fit the needs of whoever they are supporting, instead of seeing the bigger picture.
The application SWEs should still be doing operations-related work, since they are the ones who created the need for it. The application SRE can help offload some of that work and, through the central chain, find out what tools others are using to offload it. Operational issues can quickly turn into error budgeting for the application teams.
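To make the error budgeting idea a little more concrete, here is a minimal sketch of how a team might track it. It assumes the common definition in which the budget is the share of requests allowed to fail under a reliability target (an SLO) over some window; the function name, parameters and numbers are all hypothetical and only for illustration, not a prescription.

```python
def error_budget_remaining(slo_target: float,
                           total_requests: int,
                           failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    Assumption: the budget is (1 - slo_target) of all requests in the
    window. A negative result means the budget has been overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if failed_requests < 0:
        raise ValueError("failed_requests cannot be negative")

    # Number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    # Share of that allowance that is still unused.
    return 1.0 - failed_requests / allowed_failures
```

With a hypothetical 99.9% target, 1,000,000 requests and 250 failures, the team has used about a quarter of its budget and roughly 75% remains. How a team reacts once the budget runs low (for example, pausing feature work to focus on reliability) is a policy decision each organization has to make for itself.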
Your SREs will likely be split into two disciplines: application SRE and infrastructure SRE. This allows you to cover both sides while pushing the same mission throughout all the teams. Infrastructure SREs are out there, and there are plenty of them. I have seen the role come as a natural evolution for DevOps engineers, system engineers and system administrators. You should focus on building your internal talent pool first and only then go outside for hires.
Having some of the SWEs on the team volunteer and shift work over to SRE helps speed up the process for that specific application. You can leverage their knowledge of the code base and have them learn the centralized mission and tooling so they can bring both to their stack. These individuals can also train the central team on the app stack.
However, not every team in your company will need dedicated SRE support. Part two of this series will be a deeper dive into application and infrastructure SRE, and how to organize your teams from a management perspective.