Site reliability engineering (SRE) is the companion to DevOps. If we take my normal divisions of DEVops versus devOPS, traditional DevOps is mostly the developer side, and SRE is mostly the Ops side. This isn’t a perfect comparison, but it is a good starting point, and for this blog, I’ll run with it.
While DevOps focuses on getting releases out consistently, quickly and with minimum disruption, SRE focuses on putting those releases into a robust infrastructure. Tracking application issues, regardless of whether you’re using DevOps, always requires information about the software and hardware infrastructure. SRE takes proactive steps to make sure the environment is stable and designed to host the applications being thrown onto it. When that process fails, SRE attempts to identify the code or architectural problem and resolve it without blaming staff members; hence, the idea of “blameless.” It’s about the environment and improving it, not about the individual and punishing them.
Most IT groups could find a good starting point in several parts of SRE. Closely tracking errors across a cross-section of IT interest areas (by application, by infrastructure, by toolset) means problem areas can be definitively identified and fixed. But there is more behind SRE. In this example, once you are tracking where errors occur most frequently, you can selectively reduce deployments that use that piece of the IT puzzle until the issues are resolved. This is not unlike what we are doing today, but it is formalized and backed by measurements that generally are not in place.
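To make the tracking idea concrete, here is a minimal sketch in Python that tallies errors along each of the three dimensions mentioned above and picks out the worst offender. The record layout, the sample data, and the function names `tally_by_dimension` and `worst_offender` are all made up for illustration; a real setup would pull this from whatever monitoring or ticketing system you already run.

```python
from collections import Counter

# Hypothetical error records: (application, infrastructure, toolset).
errors = [
    ("checkout", "k8s-east", "jenkins"),
    ("checkout", "k8s-west", "jenkins"),
    ("search", "k8s-east", "gitlab-ci"),
    ("checkout", "k8s-east", "gitlab-ci"),
]

DIMENSIONS = ("application", "infrastructure", "toolset")


def tally_by_dimension(records):
    """Count errors separately along each tracked dimension."""
    tallies = {name: Counter() for name in DIMENSIONS}
    for record in records:
        for name, value in zip(DIMENSIONS, record):
            tallies[name][value] += 1
    return tallies


def worst_offender(tallies, dimension):
    """Return the (value, count) pair with the most errors in one dimension."""
    return tallies[dimension].most_common(1)[0]


tallies = tally_by_dimension(errors)
print(worst_offender(tallies, "application"))
print(worst_offender(tallies, "infrastructure"))
```

With counts like these in hand, the "selective reduction in deployments" becomes a decision you can point at a number for, rather than a gut call.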
Which brings us to Blameless, a company that just came out of stealth to champion SRE. While some large organizations with large application footprints are using SRE already, Blameless is offering a platform to make SRE more accessible to companies without a large staff or application footprint.
We took a bit of time to talk with Blameless last week, and their vision is a strong one: Address directly what DevOps addresses more indirectly. Build an environment that can measure and improve the entire application ecosystem.
If nothing else, the idea of error budgets is intriguing. Though counterintuitive at first glance, the idea of saying, “Piece X gets Y number of issues before we start backing off of it or raise the priority of fixing the underlying issues,” is a solid one.
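A rough sketch of that "Piece X gets Y issues" rule might look like the following Python. This is not how Blameless or any particular SRE platform implements it; the `ErrorBudget` class, its field names, and the two decision strings are assumptions invented here to show the shape of the idea.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical error budget for one piece of the IT puzzle."""
    component: str          # the "Piece X"
    allowed_issues: int     # the "Y" issues permitted in the tracking period
    observed_issues: int = 0

    def record_issue(self, count: int = 1) -> None:
        """Charge one or more issues against the budget."""
        self.observed_issues += count

    @property
    def remaining(self) -> int:
        """Issues left before the budget is spent (can go negative)."""
        return self.allowed_issues - self.observed_issues

    def decision(self) -> str:
        """Keep shipping while budget remains; otherwise back off and fix."""
        if self.remaining > 0:
            return "keep deploying"
        return "back off and prioritize fixes"


budget = ErrorBudget(component="piece-x", allowed_issues=3)
budget.record_issue(2)
print(budget.component, budget.remaining, budget.decision())
```

The point of writing it down this way is that the threshold is agreed on ahead of time, so backing off a troubled component is a policy outcome and not a judgment about the people who built it.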
If your organization is looking for a mechanism to improve reliability and stability while reducing time to resolve issues, SRE in general and Blameless specifically are worth a closer look.