DevOps.com


Best of 2022: Day in the Life of a Site Reliability Engineer (SRE)

By: Bill Doerrfeld on December 23, 2022

As we close out 2022, we at DevOps.com wanted to highlight the most popular articles of the year. Following is the latest in our series of the Best of 2022.

By now, most of us are familiar with the concept of site reliability engineering (SRE). The term was originally coined at Google, and SRE has gained traction in recent years as a role dedicated to increasing the resilience of digital ecosystems. A big part of the SRE doctrine is “automating your job away.”

Observability and monitoring tools are extremely important to both DevOps and SRE. But what is the day-to-day experience actually like for the average SRE? What does an SRE do, and what percentage of their time goes to particular tasks?

I recently met with James Curtis, lead site reliability engineer at a large multinational company. According to Curtis, the SRE approach takes a certain type of person—someone with Ops in their blood and a passion for eradicating repetitive tasks. Below, we’ll use his input and the input from other sources to better understand what it’s like being in the shoes of an SRE.

Understanding the SRE Role

Organizations have interpreted the role in different ways. But in a general sense, SREs attempt to maintain high reliability and availability for software applications and respond to incidents as they occur. To aid their efforts, an SRE tries to streamline and automate as many operations as possible to remove opportunities for human error.

A hallmark SRE goal is to reduce “toil.” Curtis defines toil as “tedious actions that really have no enduring value.” For example, say an admin must manually restart a service in Microsoft Exchange every time a triggering event interrupts the service—this action could certainly be automated away. SREs spend much of their time eliminating toil by coding automation and configuring internal tools to better interact with software infrastructure.
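The restart scenario above can be sketched as a small watchdog. This is only an illustration, not a tool Curtis described: `ensure_running`, `is_healthy` and `restart` are hypothetical names, and the two callables stand in for whatever status probe and restart command a real environment would provide.

```python
from typing import Callable


def ensure_running(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    max_attempts: int = 3,
) -> int:
    """Restart a service until it reports healthy; return the restart count.

    Both callables are hypothetical hooks: is_healthy might wrap a status
    probe, and restart might wrap the command that restarts the service.
    """
    restarts = 0
    while not is_healthy():
        if restarts >= max_attempts:
            # Stop retrying and hand the problem to a human.
            raise RuntimeError(
                f"service still unhealthy after {restarts} restarts"
            )
        restart()
        restarts += 1
    return restarts


# Simulated service that comes back after a single restart.
state = {"up": False}
restarts_needed = ensure_running(
    is_healthy=lambda: state["up"],
    restart=lambda: state.update(up=True),
)
print(restarts_needed)  # prints 1
```

Capping the number of attempts matters: a service that never recovers should page someone rather than be restarted forever.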

SREs are usually also in charge of logging and of setting benchmarks, using a tool like Splunk or Datadog to ingest and observe data. Curtis himself uses Cribl, which offers an observability pipeline to parse and route log data. Since SREs oversee internal service-level indicators, they are typically the ones who define what normal behavior looks like and who set SLOs and SLAs.

Common SRE Activities

An SRE juggles a lot of tasks. For proof, read the hour-by-hour day in the life by Yonatan Schultz, SRE at New Relic. Schultz’s average day is spent configuring infrastructure, jumping from project to project, and, of course, hopping into many meetings. Here are some other tasks an SRE might perform on a daily basis:

Monitoring service-level indicators (SLIs). An SLI could be the number of successful requests out of total requests; in this case, the goal is to keep the value high. SREs also track metrics such as availability, uptime performance, latency, error count and throughput. Regular monitoring is essential to ensure containers use resources properly and to avoid out-of-memory (OOM) errors.
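As a minimal sketch, the request-based SLI mentioned above is just a ratio. The function name and the sample counts are made up for illustration, and treating a window with no traffic as fully available is an assumption, not a rule from the article.

```python
def availability_sli(successful: int, total: int) -> float:
    """Return the fraction of requests that succeeded."""
    if total < 0 or not 0 <= successful <= total:
        raise ValueError("need 0 <= successful <= total")
    if total == 0:
        # Assumption: no traffic means nothing failed, so report 1.0.
        return 1.0
    return successful / total


# Hypothetical counts for one measurement window.
sli = availability_sli(successful=99_950, total=100_000)
print(f"{sli:.2%}")  # prints 99.95%
```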

Setting SLOs and SLAs and determining error budgets. Once you have determined baseline system performance, you can set service-level objectives (SLOs). These are typically internal targets like 99.99% availability. The gap between an SLO and 100% is the error budget: the amount of unreliability a service can accumulate before it misses its objective. While SREs typically oversee functional metrics, some teams set goals for non-functional metrics as well. SREs also help determine service-level agreements (SLAs), which are legally binding and typically partner-facing.
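To make the error budget concrete, here is a short sketch of the arithmetic. The 30-day window and the three minutes of downtime are assumptions chosen for the example; teams pick their own windows. At 99.99% availability, a 30-day window allows roughly 4.3 minutes of downtime.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Downtime permitted within `window` by an availability SLO (0..1)."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return window * (1.0 - slo)


def budget_remaining(
    slo: float, window: timedelta, downtime: timedelta
) -> timedelta:
    """Budget left after observed downtime; negative means the SLO is blown."""
    return error_budget(slo, window) - downtime


window = timedelta(days=30)  # assumed measurement window
budget = error_budget(0.9999, window)
left = budget_remaining(0.9999, window, downtime=timedelta(minutes=3))

print(f"budget: {budget.total_seconds() / 60:.2f} minutes")
print(f"remaining: {left.total_seconds():.1f} seconds")
```

A negative remainder is a useful signal in its own right: it tells the team that reliability work should take priority over new releases until the window rolls over.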

Responding to incidents. On-call SREs will be tasked with finding the root cause of issues as they arise. When triaging an incident, it’s helpful to have all the necessary logs and tools immediately at hand. This is one area where automation can assist by pulling relevant details to instantly build a case, said Curtis.
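One way to picture the automation Curtis describes is a helper that gathers the log lines surrounding an incident. This is a sketch under stated assumptions: `LogEntry`, `incident_context`, the window sizes and the sample data are all hypothetical, and a real setup would query a log platform rather than an in-memory list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class LogEntry:
    timestamp: datetime
    service: str
    message: str


def incident_context(
    entries: list[LogEntry],
    service: str,
    incident_at: datetime,
    before: timedelta = timedelta(minutes=10),
    after: timedelta = timedelta(minutes=5),
) -> list[LogEntry]:
    """Return one service's log entries around the incident, oldest first."""
    start, end = incident_at - before, incident_at + after
    relevant = [
        e for e in entries
        if e.service == service and start <= e.timestamp <= end
    ]
    return sorted(relevant, key=lambda e: e.timestamp)


# Made-up log data for a made-up incident at 12:00.
day = datetime(2022, 12, 23)
logs = [
    LogEntry(day.replace(hour=12, minute=2), "checkout", "HTTP 500 spike"),
    LogEntry(day.replace(hour=11, minute=45), "checkout", "deploy finished"),
    LogEntry(day.replace(hour=12, minute=3), "payments", "retry storm"),
    LogEntry(day.replace(hour=11, minute=55), "checkout", "latency rising"),
    LogEntry(day.replace(hour=12, minute=10), "checkout", "recovered"),
]
case = incident_context(logs, "checkout", day.replace(hour=12))
for entry in case:
    print(entry.timestamp.time(), entry.message)
```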

Writing postmortems. After an incident has been dealt with, it’s important to learn from it. Postmortems are common in cybersecurity practice and often fall under the responsibility of an SRE. These reviews work through a set of criteria to get to the heart of an incident and identify its root cause(s) so that it does not happen again.

Automating other system tasks. SREs will spend significant time coding and building tools for engineers to interact with infrastructure. For instance, an SRE might generate reliability reports that consider performance over long time periods.
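A reliability report over a long period might boil down to rolling daily counts up into monthly figures. The sketch below assumes a hypothetical input of `(day, successful_requests, total_requests)` tuples; the function name and the numbers are invented for illustration.

```python
from collections import defaultdict
from datetime import date
from typing import Iterable


def monthly_availability(
    daily: Iterable[tuple[date, int, int]],
) -> dict[tuple[int, int], float]:
    """Aggregate (day, successful, total) rows into availability per month."""
    totals: dict[tuple[int, int], list[int]] = defaultdict(lambda: [0, 0])
    for day, successful, total in daily:
        bucket = totals[(day.year, day.month)]
        bucket[0] += successful
        bucket[1] += total
    # Months without traffic are skipped to avoid dividing by zero.
    return {
        month: ok / total
        for month, (ok, total) in sorted(totals.items())
        if total > 0
    }


# Made-up daily request counts.
rows = [
    (date(2022, 1, 1), 990, 1000),
    (date(2022, 1, 2), 1000, 1000),
    (date(2022, 2, 1), 499, 500),
]
report = monthly_availability(rows)
for (year, month), value in report.items():
    print(f"{year}-{month:02d}: {value:.2%}")
```

Summing the counts before dividing weights busy days more heavily than quiet ones, which is usually what a request-based availability figure is meant to reflect.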

Cross-department collaboration. SREs don’t tend to own application code. Instead, they support multiple software divisions. This means checking in with other developers, disseminating best practices and reviewing new architectures to represent the reliability side of the equation.

As you can see above, the SRE role might blend many different activities, and keeping track of them all may be part of the job itself. Anika Mukherji, an SRE at Pinterest, shared that the company holds a weekly meeting where SREs report what they spent time on. For another helpful “day in the life” story, take this account of an average OpenShift SRE’s day: Nikita spends her day responding to open JIRA cards, handling incidents, pushing code to GitHub and syncing with SREs in other regions when shifts change.

Time Well Spent

So, how does an SRE allocate their time? As Curtis explained, the ideal is a 50/50 split between time spent doing operational work and time spent trying to automate that work away. Of course, this is more like a sliding scale, he admitted. When things are broken, attention naturally shifts toward more manual work, Curtis said. While there may be less automation upfront, the scale balances out the more you build.

The split is similar at other organizations. For example, in an interview with DZone, Paul Greig explained that half of his time is spent on service reliability upkeep and half on toil reduction. John Turner, SRE at Squarespace, said some 70% of his time is spent writing code, much of it automation code.

Who’s Fit to Be an SRE?

As mentioned above, the SRE job requires a specific attitude toward solving operational problems. Curtis described this person as “someone who hates the monotony of doing something over and over.” This person should have a drive to continuously solve new problems because as soon as you’ve automated one thing, you move on to the next, he said.

Don’t assume a role with high automation equals an easy paycheck. We’ve all heard the story of the system admin who secretly automated their job away and never told a soul. While this may fly in other roles, the job of an SRE is never finished. “You’re never going to run out of stuff to optimize,” Curtis said.

Again, the SRE role is a blend of many different activities, and you will likely interface with different developer teams. Since the role involves this type of interaction, communication skills and understanding are a must.

Final Thoughts

The SRE practice is wide and varied, and each company may have its own flavor. Looking to the future, there are plenty of use cases for machine learning to further empower SRE practices, said Curtis. This is especially relevant in security automation, where algorithms could be trained against real attacks to flag suspicious behavior. Or, with the right amount of data, predictive analytics could be applied to anticipate high CPU peaks and inform server utilization.
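As a very rough stand-in for the predictive analytics Curtis has in mind, simple exponential smoothing can produce a one-step-ahead CPU forecast. This is not machine learning and not a method from the interview; the function name, the smoothing factor, the 75% threshold and the sample readings are all assumptions for illustration.

```python
def forecast_next(samples: list[float], alpha: float = 0.5) -> float:
    """One-step-ahead forecast using simple exponential smoothing."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    level = samples[0]
    for value in samples[1:]:
        # Blend the newest reading with the running level.
        level = alpha * value + (1 - alpha) * level
    return level


# Made-up CPU utilization readings, in percent.
cpu_percent = [40.0, 50.0, 60.0, 80.0, 90.0]
predicted = forecast_next(cpu_percent)
needs_capacity = predicted > 75.0  # assumed alerting threshold

print(predicted, needs_capacity)  # prints 78.125 True
```

A higher `alpha` reacts faster to recent readings but is noisier; a production forecast would also need to account for trend and daily or weekly seasonality.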

In both scenarios, Curtis stressed the importance of observability. “It really gives you the ability to look at data and to ask a question later you didn’t realize you needed to ask,” he said. This power hinges on easy data transformation to make things speak the same language. That’s why his team opted for Cribl, which allowed them to normalize that data on the fly and replay it later. “The ability to change data and morph it: that’s the power observability gives you.”
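Making data "speak the same language" can be pictured as mapping differently shaped records onto one schema. The sketch below is plain Python, not Cribl's API; both source formats and every field name (`ts`, `lvl`, `time`, `severity`, `msg`) are invented for the example.

```python
from datetime import datetime, timezone


def normalize(record: dict) -> dict:
    """Map two hypothetical log shapes onto one common schema."""
    if "ts" in record:
        # Assumed source A: epoch seconds in "ts", level in "lvl".
        when = datetime.fromtimestamp(record["ts"], tz=timezone.utc)
        level = record["lvl"]
    else:
        # Assumed source B: ISO 8601 string in "time", level in "severity".
        when = datetime.fromisoformat(record["time"])
        level = record["severity"]
    return {
        "timestamp": when.isoformat(),
        "level": level.upper(),
        "message": record["msg"],
    }


from_a = normalize({"ts": 0, "lvl": "warn", "msg": "disk almost full"})
from_b = normalize({
    "time": "1970-01-01T00:00:00+00:00",
    "severity": "Warn",
    "msg": "disk almost full",
})
print(from_a == from_b)  # prints True
```

Once every record shares one shape, the same query or alert can run across all sources, including questions nobody thought to ask when the data was first collected.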


Powered by Techstrong Group, Inc.

© 2023 · Techstrong Group, Inc. All rights reserved.