As we close out 2022, we at DevOps.com wanted to highlight the most popular articles of the year. Following is the latest in our series of the Best of 2022.
By now, most of us are familiar with the concept of site reliability engineering (SRE). The term was originally coined by Google, and SRE has been gaining traction in recent years as a role dedicated to increasing the resilience of digital ecosystems. To that end, a big part of the SRE doctrine is “automating your job away.”
Observability and monitoring tools are extremely important to both DevOps and SRE. But what is the day-to-day experience actually like for the average SRE? What activities does an SRE do? And what percentage of their time does an SRE spend on particular tasks?
I recently met with James Curtis, lead site reliability engineer at a large multinational company. According to Curtis, the SRE approach takes a certain type of person: someone with Ops in their blood and a passion for eradicating repetitive tasks. Below, we’ll draw on his input and that of other sources to better understand what it’s like to be in an SRE’s shoes.
Understanding the SRE Role
Organizations have interpreted the role in different ways. But in a general sense, SREs attempt to maintain high reliability and availability for software applications and respond to incidents as they occur. To aid their efforts, an SRE tries to streamline and automate as many operations as possible to remove opportunities for human error.
A hallmark SRE goal is to reduce “toil.” Curtis defines toil as “tedious actions that really have no enduring value.” For example, say an admin must manually restart a service in Microsoft Exchange every time a triggering event interrupts the service—this action could certainly be automated away. SREs spend much of their time eliminating toil by coding automation and configuring internal tools to better interact with software infrastructure.
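To make the restart example concrete, here is a minimal sketch of what automating that kind of toil could look like. It is not taken from Curtis or from any particular tool: the function name `ensure_running` and the `is_healthy` and `restart` callables are hypothetical placeholders for whatever health check and restart command a team actually uses.

```python
import time
from typing import Callable


def ensure_running(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    max_attempts: int = 3,
    wait_seconds: float = 0.0,
) -> bool:
    """Restart a service until it reports healthy or attempts run out.

    Both callables are assumptions: plug in a real health probe and a
    real restart command for the service in question.
    Returns True if the service is healthy when the function exits.
    """
    if is_healthy():
        return True  # nothing to do
    for attempt in range(1, max_attempts + 1):
        restart()
        time.sleep(wait_seconds)  # give the service time to come back up
        if is_healthy():
            print(f"service recovered after {attempt} restart(s)")
            return True
    print("service still unhealthy; escalate to the on-call engineer")
    return False


if __name__ == "__main__":
    # Simulated service that recovers on the second restart (illustration only).
    state = {"restarts": 0}
    ok = ensure_running(
        is_healthy=lambda: state["restarts"] >= 2,
        restart=lambda: state.update(restarts=state["restarts"] + 1),
    )
    print("healthy:", ok)
```

Capping the number of attempts is a deliberate choice in this sketch: a restart loop that never gives up can hide a real outage, so the fallback is to hand the problem to a human.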
SREs are usually also in charge of logs and of setting benchmarks, using a tool like Splunk or Datadog to ingest and observe data. Curtis himself uses Cribl, which offers an observability pipeline to parse and route log data. Since SREs oversee internal service-level indicators, they are typically in charge of establishing what normal behavior looks like and setting SLOs and SLAs.
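The idea of parsing and routing log data can be illustrated with a small, tool-agnostic sketch. This is not Cribl's API or configuration format. The field names (`level`, `severity`, `message`, `msg`) and the destination names (`alerts`, `archive`) are assumptions made up for the example.

```python
import json
from collections import defaultdict


def parse_line(line: str) -> dict:
    """Parse a JSON log line or a plain 'LEVEL message' line into one schema."""
    line = line.strip()
    try:
        raw = json.loads(line)
    except json.JSONDecodeError:
        raw = None
    if isinstance(raw, dict):
        # Different producers name the same fields differently (assumed names).
        level = str(raw.get("level") or raw.get("severity") or "info")
        message = str(raw.get("message") or raw.get("msg") or "")
    else:
        # Fall back to treating the first word as the severity.
        level, _, message = line.partition(" ")
    return {"level": level.lower(), "message": message}


def route(event: dict) -> str:
    """Pick a destination by severity (destination names are hypothetical)."""
    return "alerts" if event["level"] in {"error", "critical"} else "archive"


def run_pipeline(lines) -> dict:
    """Parse every non-empty line and group the events by destination."""
    destinations = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        event = parse_line(line)
        destinations[route(event)].append(event)
    return dict(destinations)


if __name__ == "__main__":
    sample = [
        '{"severity": "ERROR", "msg": "disk full"}',
        "INFO service started",
    ]
    for name, events in run_pipeline(sample).items():
        print(name, events)
```

Normalizing every record into the same two fields before routing is what lets a single rule apply to logs from different sources.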
Common SRE Activities
An SRE juggles a lot of tasks. For proof, read the hour-by-hour day in the life by Yonatan Schultz, SRE at New Relic. Schultz’s average day is spent configuring infrastructure, jumping from project to project, and, of course, hopping into many meetings. Here are some other tasks an SRE might perform on a daily basis:
Monitoring service-level indicators (SLIs). An SLI could be the number of successful requests out of total requests. In this case, the goal would be to keep the SLI high. SREs track other metrics such as availability, uptime performance, latency, error count and throughput. Regularly monitoring systems is essential to ensure proper resource utilization of containers and to avoid out-of-memory (OOM) errors.
Setting SLOs and SLAs and determining error budgets. Once you have determined baseline system performance, you can set service-level objectives (SLOs). These are typically internal targets like 99.99% availability. While SREs typically oversee functional metrics, some teams set goals for non-functional metrics as well. SREs also help determine service-level agreements (SLAs), which carry legal weight and are typically partner-facing.
Responding to incidents. On-call SREs will be tasked with finding the root cause of issues as they arise. When triaging an incident, it’s helpful to have all the necessary logs and tools immediately at hand. This is one area where automation can assist by pulling relevant details to instantly build a case, said Curtis.
Writing postmortems. After an incident has been dealt with, it’s important to learn from it. Postmortems are common in cybersecurity practice and often fall under the responsibility of an SRE. These reviews work through a set of criteria to get to the heart of an incident and identify its root cause(s) so that it does not happen again.
Automating other system tasks. SREs will spend significant time coding and building tools for engineers to interact with infrastructure. For instance, an SRE might generate reliability reports that consider performance over long time periods.
Cross-department collaboration. SREs don’t tend to own application code. Instead, they support multiple software divisions. This means checking in with other developers, disseminating best practices and reviewing new architectures to represent the reliability side of the equation.
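The SLI, SLO and error budget items in the list above lend themselves to a short worked example. The sketch below assumes a request-based availability SLI (successful requests divided by total requests) and a 30-day window. The function names and the sample traffic numbers are made up for illustration and are not drawn from any team mentioned here.

```python
def availability_sli(successful: int, total: int) -> float:
    """Fraction of requests that succeeded."""
    if total == 0:
        return 1.0  # assumption: no traffic counts as meeting the objective
    return successful / total


def error_budget_remaining(successful: int, total: int, slo: float) -> float:
    """Share of the error budget still unspent (negative if overspent)."""
    allowed_failures = (1 - slo) * total
    actual_failures = total - successful
    if allowed_failures == 0:
        return 1.0 if actual_failures == 0 else float("-inf")
    return 1 - actual_failures / allowed_failures


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime a time-based SLO permits over the window, in minutes."""
    return (1 - slo) * window_days * 24 * 60


if __name__ == "__main__":
    slo = 0.9999  # the 99.99% availability target used as an example above
    successful, total = 999_950, 1_000_000  # hypothetical traffic
    print(f"SLI: {availability_sli(successful, total):.5f}")
    print(f"budget remaining: {error_budget_remaining(successful, total, slo):.0%}")
    print(f"allowed downtime: {allowed_downtime_minutes(slo):.2f} min per 30 days")
```

With these sample numbers, 50 of the 100 permitted failures have been used, so half of the budget remains. A 99.99% target over 30 days leaves roughly 4.3 minutes of downtime, which shows how quickly each extra "nine" tightens the budget.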
As you can see above, the SRE role might blend many different activities, and keeping track of them all may be part of the job itself. Anika Mukherji, an SRE at Pinterest, shared that her team holds a weekly meeting where SREs share what they spent time on. For another helpful “day in the life” story, take this account of an average OpenShift SRE’s day. In it, Nikita spends her day responding to open Jira cards, handling incidents, pushing code to GitHub and syncing with SREs in other regions when shifts change.
Time Well Spent
So, how does an SRE allocate their time? As Curtis explained, the ideal is a 50/50 split between time spent doing the work and time spent trying to automate that work away. In practice, he admitted, it is more of a sliding scale: when things are broken, attention naturally shifts toward manual work. There may be less automation upfront, but the scale balances out the more you build.
The split is similar at other organizations. For example, in an interview with DZone, Paul Greig explained that half of his time is spent on service reliability upkeep and half on toil reduction. John Turner, SRE at Squarespace, said some 70% of his time is spent writing code, much of it automation code.
Who’s Fit to Be an SRE?
As mentioned above, the SRE job requires a specific attitude toward solving operational problems. Curtis described this person as “someone who hates the monotony of doing something over and over.” This person should have a drive to continuously solve new problems because as soon as you’ve automated one thing, you move on to the next, he said.
Don’t assume a role with high automation equals an easy paycheck. We’ve all heard the story of the system admin who secretly automated their job away and never told a soul. While this may fly in other roles, the job of an SRE is never finished. “You’re never going to run out of stuff to optimize,” Curtis said.
Again, the SRE role is a blend of many different activities. You will likely interface with different developer teams, too. Since the role involves this type of interaction, communication skills and understanding are a must.
Final Thoughts
The SRE practice is wide and varied, and each company may have its own flavor. Looking to the future, there are plenty of use cases for machine learning to further empower SRE practices, said Curtis. This is especially relevant in security automation, where algorithms could be trained against real attacks to flag suspicious behavior. Or, with the right amount of data, predictive analytics could be applied to anticipate CPU peaks and inform server utilization.
In both scenarios, Curtis stressed the importance of observability. “It really gives you the ability to look at data and to ask a question later you didn’t realize you needed to ask,” he said. This power hinges on easy data transformation that makes everything speak the same language. That’s why his team opted for Cribl, which allowed them to normalize data on the fly and replay it later. “The ability to change data and morph it, that’s the power observability gives you.”