In recent years, site reliability engineering (SRE) has garnered much interest. In 2019, LinkedIn listed site reliability engineer as the second most promising job in the United States. Now, in 2021, the role continues to grow and evolve within many organizations.
Initially spearheaded at Google and credited to engineer Ben Treynor, the discipline seeks smarter accountability for application reliability. An SRE team defines service level indicators, sets error budgets for new features and uses tools like application performance monitoring (APM) to visualize performance insights, among other tasks. SREs are quickly becoming a key check to increase business output in response to new digital innovations.
I recently met with AppDynamics Regional CTO Gregg Ostrowski to discuss the emergence of SRE and how the approach is evolving. In short, it appears SREs will only continue to grow in importance across more companies, adopting new tactics like chaos engineering, and diversifying their teams with new domain knowledge in response to increasing technological complexity.
What is an SRE?
Traditionally, organizations employed system administrators, or sysadmins, to maintain operations for large computing systems and services. However, this typically produced a dichotomy: product developers want to push new features, while operations teams (sysadmins) want to make sure the existing service doesn’t break. The split between developers and operations can be unhealthy and cause friction.
Whereas sysadmin positions are detached from development and involve a lot of manual work, SRE takes a more engineering-focused approach that automates operations. Google’s SRE book, which has become a bible of sorts, defines the role as “What happens when you ask a software engineer to design an operations team.”
So, what does an SRE do? At Google, SREs are in charge of availability, latency, performance, efficiency, change management, monitoring, emergency response and capacity planning for the services they operate. They spend no more than 50% of their time on traditional “ops,” like tickets, on-call and manual tasks, and are encouraged to use the remaining time to develop automation that will replace human operations labor.
5 Ways the SRE Approach is Evolving
It’s been a few years since Google introduced the SRE concept, and naturally, unique iterations have popped up within other enterprises. But, on the whole, where is the concept heading? Ostrowski sees SREs evolving in a few key areas: growth and maturity, increased diversification with domain-specific experts and new monitoring tactics.
1. Increased Adoption
First, not all companies have embraced an SRE model. A recent study by Blameless found “… 50% of respondents employ an SRE model with dedicated engineers focused on infrastructure and tooling, or an embedded model where full-time SREs are assigned to a service.”
The SRE model is gaining momentum, but there is still room for greater adoption. There is also room for internal growth. Ostrowski sees a single SRE team as a single point of failure. “It needs to be a whole department,” he said.
In addition, SREs are gaining a more prominent voice at the table, influencing feature rollout. “With proper and mature SRE involvement, teams can’t willy-nilly deploy,” he said. Ostrowski views these teams as maintaining a critical balance between business risk and introducing new technology.
2. Larger, Diversified SRE Departments
Many companies are experiencing rising user demands, and thus must rapidly scale their application networks. Simultaneously, there has been a Cambrian explosion of deployment types — systems could be using any assortment of legacy infrastructure, mainframe, microservices, cloud environments and multiple cloud vendors. “The complexity and topology of the IT space has grown substantially, with many interdependencies,” Ostrowski said.
Due to increasingly complex technical stacks, SRE must now cover many domains. Thus, SRE departments will likely require a more diversified team of domain experts. Ostrowski likens SREs to “Navy SEALs of the IT team.” For example, assigning engineers with experience navigating the nuances of Google Cloud Platform (GCP) or Microsoft Azure could help maintain service reliability in the respective clouds.
Personality matters for suitability to SRE, too, said Ostrowski. The role requires a specific type of innovator who considers the repercussions of code and loves improving processes and performance.
3. New Testing Tactics Will Emerge
Ostrowski also foresees SRE departments introducing new monitoring and testing approaches to maintain reliability. One of these is chaos engineering, which champions the idea of intentionally breaking application systems. Different strategies will undoubtedly emerge to help drive user experience (UX) and ensure that performance is always top of mind.
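To make the idea of intentionally breaking a system concrete, here is a minimal sketch of fault injection in Python. It is not taken from any particular chaos engineering tool: the names with_chaos, call_with_fallback, run_experiment and fetch_price are hypothetical, and fetch_price merely stands in for a real dependency call. The sketch wraps a function so that a chosen fraction of calls fail on purpose, then checks that the caller degrades gracefully instead of crashing.

```python
import random


class InjectedFault(Exception):
    """Raised by the chaos wrapper to simulate a dependency failure."""


def with_chaos(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("simulated failure")
        return func(*args, **kwargs)
    return wrapper


def call_with_fallback(func, fallback, *args):
    """Call func and return the fallback value if an injected fault fires."""
    try:
        return func(*args)
    except InjectedFault:
        return fallback


def fetch_price(item_id):
    # Hypothetical dependency; stands in for a real service request.
    return {"item": item_id, "price": 10}


def run_experiment(func, failure_rate, calls, seed):
    """Run a small experiment and count normal versus degraded responses."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    chaotic = with_chaos(func, failure_rate, rng)
    served = degraded = 0
    for i in range(calls):
        if call_with_fallback(chaotic, None, f"sku-{i}") is None:
            degraded += 1
        else:
            served += 1
    return {"served": served, "degraded": degraded}


if __name__ == "__main__":
    print(run_experiment(fetch_price, failure_rate=0.3, calls=100, seed=42))
```

In practice, faults are usually injected at the infrastructure level (killed instances, added network latency) rather than inside application code, but the principle is the same: introduce a controlled failure and confirm the system still behaves acceptably.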
4. Businesses Rely on SREs to Mitigate Risk
All things will inevitably fail at some point. SREs accept failure and learn to manage it, designing repeatable operational mitigation structures. “As companies become reliant on the consumerization of IT, business is dependent on SREs to drive business,” said Ostrowski. This reliance will likely increase as businesses tap SREs to maintain stability.
Whether it’s reducing mean time to repair (MTTR), programming service level indicators to monitor website load time, or forecasting error budgets for new feature introductions, SREs will be increasingly relied upon to maintain business stability and high performance.
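As a rough illustration of how a service level indicator and an error budget relate, the sketch below assumes a request-based availability objective. The ErrorBudget class, its field names and the 99.9% target are assumptions chosen for the example, not figures from Ostrowski or from any specific tool. The SLI is the fraction of good requests, and the error budget is the number of failures the objective tolerates over the window.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Error budget for a request-based availability objective (illustrative)."""
    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int    # requests observed in the measurement window
    failed_requests: int   # requests that violated the indicator

    @property
    def sli(self) -> float:
        """Measured indicator: fraction of requests that succeeded."""
        if self.total_requests == 0:
            return 1.0
        return (self.total_requests - self.failed_requests) / self.total_requests

    @property
    def allowed_failures(self) -> float:
        """Number of failures the objective tolerates in this window."""
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        """Share of the budget still unspent; negative once overspent."""
        allowed = self.allowed_failures
        if allowed == 0:
            return 0.0 if self.failed_requests else 1.0
        return 1.0 - self.failed_requests / allowed

    def can_ship(self) -> bool:
        """A simple policy: risky releases proceed only while budget remains."""
        return self.remaining_fraction > 0


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"SLI: {budget.sli:.5f}")
    print(f"Budget remaining: {budget.remaining_fraction:.0%}")
    print(f"OK to ship: {budget.can_ship()}")
```

With these example numbers, a 99.9% target over one million requests tolerates about 1,000 failures, so 250 failures leaves roughly three quarters of the budget for new feature rollouts. Real policies tend to be more nuanced than a single boolean, but the arithmetic is the core of the idea.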
“The SRE mindset is about coming to terms with a blameless environment,” said Ostrowski. SREs can help “balance a risk between the business and application team,” he added.
5. SREs Steer UX
In this digital-only environment, “The application has become the business,” said Ostrowski. If the application is core to the business, monitoring the user journey is necessary to improve it. Ostrowski believes SREs oversee a unique territory that could produce valuable business insights too.
In addition to monitoring uptime and ensuring URL response times meet SLAs, SREs could track UX-related insights, such as conversion rates or cart abandonment percentages. Tracking such analytics and setting standard baselines could help pinpoint problems affecting UX. It could also assist product development in designing better-performing software.
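One simple way to picture baseline tracking for UX metrics is sketched below. The function names, the sample counts and the tolerance value are all hypothetical; a real setup would pull these numbers from an analytics or APM platform. The sketch computes conversion and cart abandonment rates and flags a metric that has drifted above its baseline by more than an agreed tolerance.

```python
def conversion_rate(visits, orders_completed):
    """Fraction of visits that ended in a completed order."""
    if visits == 0:
        return 0.0
    return orders_completed / visits


def cart_abandonment_rate(carts_created, orders_completed):
    """Fraction of carts that never turned into a completed order."""
    if carts_created == 0:
        return 0.0
    return (carts_created - orders_completed) / carts_created


def exceeds_baseline(current, baseline, tolerance):
    """True when a metric has risen above its baseline by more than tolerance."""
    return current > baseline + tolerance


if __name__ == "__main__":
    # Hypothetical numbers for one day of traffic.
    abandonment = cart_abandonment_rate(carts_created=200, orders_completed=50)
    print(f"Conversion rate: {conversion_rate(visits=5000, orders_completed=50):.2%}")
    print(f"Cart abandonment: {abandonment:.2%}")
    if exceeds_baseline(abandonment, baseline=0.70, tolerance=0.03):
        print("Cart abandonment is above baseline; investigate recent changes.")
```

A flag like this does not explain why users are abandoning carts, but pairing it with response time data from the same period can point to whether a performance regression is the likely cause.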
Final Thoughts
“Anything that you do more than twice has to be automated.”
-Adam Stone, CEO, D-Tools
Traditional operations teams are typically detached from product development and are culturally very different. However, within an SRE approach, operations teams instead aim to run systems that automate work typically performed manually by sysadmins.
In this brief introduction, we barely scratched the surface of the topic. And of course, each company is unique in its approach to hybridizing product development and operations. DevOps tenets are closely aligned with SRE; often they are intermixed or one and the same.
Ultimately, truly reaping business value from this approach will depend on breaking down silos and opening conversations — or at least automated notifications — between disparate units. Only then can operations and development coexist in the most productive way.