Delivering fast, reliable digital services today is a lot like Olympic alpine skiing. These services must deftly navigate a series of perilous passages en route to end users, all while maintaining the astounding speed we now take for granted.
In an SRE’s world, those passages are today’s increasingly complex and interconnected internet infrastructure, through which SREs must deliver and support cutting-edge application functionality, all while rolling out updates at a relentless pace. It may look easy to end users, but SREs are working incredibly hard behind the scenes, and as the internet grows more intertwined and treacherous, the risk of catastrophe increases.
Judging from the results of our second annual SRE survey, striking this critical balance between supporting development velocity and maintaining site reliability is an onerous and stressful task. Here’s a look at some of the key findings and the takeaways for organizations increasingly reliant on this emerging role.
The SRE Role Is Relatively New
Even though Google coined the term in 2003, 64% of respondents noted their SRE disciplines have been in existence for three years or less. This means many SREs and SRE teams (professionals knowledgeable in both software development and adapting IT systems to meet the needs of particular software) are still adjusting to their roles and responsibilities. Many SRE teams are also small relative to the size and scope of the infrastructure they manage, and they remain a scarce resource, with many organizations highlighting skills shortages in this area.
IT Incidents Are the “New Normal”
Almost half (49%) of survey respondents reported they had worked on a service incident over the course of the past week. In the month of June alone, Google—which is known for having a very mature SRE practice—had two significant outages: One that hit YouTube, G Suite, and several popular third-party apps relying on Google Cloud for their back-ends (including Discord and Snapchat), and the Google Calendar outage just a few weeks later.
Catchpoint has detected a significant uptick in outages in recent months, likely the result of growing internet complexity, which naturally increases the potential for problems that affect digital service reliability. This is a challenge for all companies and their SRE teams, including the biggest and the strongest. The last few weeks have shown us no one is immune. Organizations can help their SREs maintain a healthy perspective by bearing this in mind.
Many SREs Don’t Have Clearly Defined Service Level Objectives (SLOs)
This is problematic, because without SLOs, it is nearly impossible to identify what counts as an incident in the first place. Twenty-seven percent of SREs reported they don’t have any SLOs. When flags are raised on everything, the result is more alerts, a greater number of false positives, and naturally, greater SRE fatigue and stress. Our own conversations with SREs have revealed that even among those with SLOs in place, the targets are often unrealistic or not correlated with end-user satisfaction. For example, a team may target 100% availability (impossible perfection) when 98% availability may suit end users just fine. If you’re striving for perfection, every little thing that goes wrong constitutes an emergency, even if customers don’t notice. Setting unrealistic targets and not letting customer satisfaction define SLOs is a recipe for SRE burnout.
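To make the gap between those two targets concrete, the arithmetic behind an availability target can be sketched in a few lines of Python. This is an illustrative calculation, not part of the survey: the 30-day measurement window, the function name `allowed_downtime_minutes`, and the 99.9% comparison point are all assumptions chosen for the example.

```python
# Illustrative sketch: how much downtime an availability target permits.
# The 30-day window and the function name are assumptions for this example,
# not figures or tooling from the survey.

WINDOW_DAYS = 30
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_target: float,
                             window_days: int = WINDOW_DAYS) -> float:
    """Return the downtime budget, in minutes, for a target such as 0.98."""
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    total_minutes = window_days * MINUTES_PER_DAY
    # Whatever fraction of the window is not promised as "up" is the budget.
    return (1.0 - availability_target) * total_minutes


if __name__ == "__main__":
    for target in (1.0, 0.999, 0.98):
        budget = allowed_downtime_minutes(target)
        print(f"{target:.1%} availability allows about {budget:.1f} "
              f"minutes of downtime per {WINDOW_DAYS} days")
```

Under these assumptions, a 98% target leaves roughly 14.4 hours of tolerable downtime in a 30-day window, while a 100% target leaves none, so every blip becomes a breach that someone has to respond to.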
Many SREs Feel Burnt Out and Stressed
According to our survey, 21% of respondents indicated they never experience post-incident stress. However, it was interesting to note the answers to the following question, where we asked what types of changes SREs notice after an incident, such as changes in concentration, ability to sleep, and mood. One would expect that an SRE who never experienced post-incident stress would select “none” for the second question, but that wasn’t the case. Rather, one third of SREs said they have noticed such symptoms. Sixty-seven percent of SREs who reported feeling stress after each incident indicated they don’t believe their company cares about their well-being.
We are aware of other surveys showing SREs are quite happy, particularly given their impressive compensation. While we don’t believe being an SRE automatically makes someone stressed and unhappy, organizations should never assume these professionals know how appreciated they are. There is often a reluctance to discuss topics such as job-related stress, as evidenced by the fact that 50% of the SREs we targeted opted not to complete our survey, perhaps a result of SREs’ predominant hero culture. In addition to compensation, organizations should consider other ways to show SREs, whether or not they vocalize their stress levels, that the company cares about their well-being, such as offering an extra vacation day or other perks after a particularly trying incident.
SREs Could Benefit from Automation, but Don’t Have Enough
Another thing we consistently heard in our conversations with SREs is they feel so busy putting out fires today they don’t get a chance to create a better tomorrow—in other words, they lack the opportunity to be proactive and preemptive in their approaches. A key factor in this is the amount of toil SREs face daily—meaning manual, repetitive, automatable, tactical work. Our survey found 59% of SREs believe there is too much of it in their jobs, and too little automation. Nobody strongly agreed with the statement, “We have used automation to reduce toil,” while almost half disagreed or strongly disagreed.
Considering SRE teams are often small and the talent pool tight, it becomes critical for organizations to automate as much work as possible, including areas such as application performance monitoring. The SRE team at Alaska Airlines recently automated performance monitoring and deep-dive diagnostics, and as a result has been able to reduce its alert noise by 92%. Ultimately, this has freed up the team to focus on real issues rather than wasting time investigating false positives. It has also helped the team bring down the mean time to detection for real issues from hours to less than 10 minutes.
Conclusion
The stage has been set for SREs to experience significant stress, at a time when organizations need their exceedingly rare skillsets the most. As internet complexity increases, it is critical for organizations to continually nurture, encourage and motivate these valuable team members. This means understanding and empathizing with the intense pressures they’re under; setting SLOs that are well-defined, realistic and meaningful; using soft skills to communicate effectively and non-monetary incentives to show appreciation; and providing the automation SREs need so they can focus on proactive improvements that advance the business.