Recently, I was asked on a podcast to explain the relationship between site reliability engineering (SRE) and observability and why I believe organizations need to move their focus from the former to the latter. Those are great questions, but to answer them accurately, I first need to point out that many organizations claim to be doing SRE or claim to be doing observability but aren’t.
One big thing that SRE and observability have in common is that they are both often erroneously equated with monitoring.
SRE ≠ Monitoring
Google is credited with pioneering the SRE movement with the publication of the “Site Reliability Engineering” handbook in 2016. As the SRE movement gained momentum, organizations of all sizes pored over the handbook and began applying its principles. However, very few organizations match Google in the scale of their IT operations or the depth of their engineering resources.
So, as these “normal” organizations realized how difficult it was to follow the Google SRE approach in its entirety, they often opted instead to simply apply what they could. For many, the chapter on monitoring became the focus, so much so that monitoring has become synonymous with SRE in far too many organizations today.
Let me be clear: The role of a site reliability engineer is not to monitor alerts. The role of an SRE is to define how the engineering team should take ownership of their service. SREs are responsible for establishing a culture and creating ingrained processes focused on the quality and reliability of infrastructure.
SREs guide the engineering team on how an application or service should be observed and monitored, including what parameters should be looked at, what alerts they should have, and what routing path the alerts should go through. An SRE instructs on best practices, explores the tools that are being used and determines how to gather reliable data with consistency throughout the organization.
Many organizations we work with say they want to do SRE this way, but they’re not there yet. They are still stuck on monitoring every single metric they can find.
Observability ≠ Monitoring
Similarly, a lot of companies I meet say that they do observability, but what they are doing is just monitoring. They monitor every single CPU of every node of every pod of every machine that is running. They have alerts for some of these, and they may even have a playbook for some of them. This is not how SRE is supposed to work, and it’s certainly not what observability is all about.
More importantly, it’s not scalable as an organization grows to hundreds or thousands of developers and different teams that all share the same IT environment.
Two Viewpoints: Horizontal and Vertical
Now that we’ve established that SRE and observability are alike in that neither should be equated with monitoring, let’s contrast the two. They differ in the perspective each takes on the IT domain.
SREs see the world horizontally. They conceive of their IT domain on a horizontal plane where they can observe the capacity of resources being used. They see the Kubernetes cluster, and they see the cloud utilization. And they care about ensuring that what is deployed into the production environment is secure.
Although they see the pieces they manage on the horizontal infrastructure plane, they don’t see the ultimate impact on the application or customer on the planes “above.”
Observability is looking at the world vertically—that is, seeing how an application is or isn’t meeting customer needs. The whole goal of observability is to know if my application is serving my customers at the service level that I agreed to. With a vertical view from application up to customer needs and business impact, I don’t care how many pods are being run, how many nodes are being deployed or what comprises the cloud infrastructure underneath my application, as long as it works.
I do care about security insofar as I want to avoid introducing security vulnerabilities into the organization. But beyond that, all I need to know is that the application meets the customer’s expectations and the service level objectives (SLOs) that we have established.
Therein lies the heart of the matter: For a company to make the transition to observability, it needs to build the observability system around establishing and meeting SLOs.
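To make the idea of building around SLOs concrete, here is a minimal sketch of how a team might check a request-based availability SLO and see how much of its error budget has been spent. The `SLO` class, the `error_budget_report` function, the service name and the request counts are all hypothetical and chosen only for illustration; a real system would pull these numbers from its telemetry backend.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A request-based availability objective (hypothetical structure)."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed


def error_budget_report(slo: SLO, total_requests: int, failed_requests: int) -> dict:
    """Compare the observed success ratio (the SLI) against the SLO target."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    sli = (total_requests - failed_requests) / total_requests
    # The error budget is the number of failures the target tolerates.
    allowed_failures = (1 - slo.target) * total_requests
    if allowed_failures > 0:
        budget_used = failed_requests / allowed_failures
    else:
        # A 100% target leaves no budget at all.
        budget_used = 0.0 if failed_requests == 0 else float("inf")

    return {
        "slo": slo.name,
        "sli": sli,
        "met": sli >= slo.target,
        "allowed_failures": allowed_failures,
        "budget_used": budget_used,
    }


# Illustrative numbers only: 1,000,000 requests in the window, 400 failures.
report = error_budget_report(SLO("checkout-availability", 0.999), 1_000_000, 400)
print(report)
```

With these made-up numbers the objective is met and roughly 40% of the error budget is consumed. The point of the vertical view is that this single figure, not the CPU usage of individual pods, is what the team tracks and alerts on.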
The Challenges of Transitioning to a Vertical Perspective
Companies must move beyond an SRE and DevOps organization to being an observability and SLO organization. Only by doing this will organizations be capable of understanding the trade-offs within their IT environments between cost, security and availability and the ultimate impact those trade-offs will have on business objectives.
Today, making this transition — moving beyond SRE to observability — is difficult for a wide variety of reasons, not the least of which are:
Alert fatigue. Companies today are monitoring too many things they shouldn’t care about, and as a result, engineers are suffering from alert fatigue. That raises the risk of missing the alerts you do care about and of failing to address them promptly.
Ideally, observability solutions eliminate the irrelevant and unimportant alerts, add context to alerts that might need investigating, and correlate response data and docs to accelerate and coordinate the appropriate response of on-call responders.
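As a rough illustration of that filtering and enrichment, the sketch below drops alerts that are not tied to an SLO-backed service, suppresses duplicates, and attaches the relevant SLO and runbook to what remains. Everything here is an assumption made for the example: the `Alert` fields, the `SERVICE_CATALOG` mapping, the runbook path and the `triage` function do not come from any particular product.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Alert:
    service: str
    signal: str    # e.g. "cpu_high", "slo_burn"
    severity: str  # "info", "warning" or "critical"
    context: dict = field(default_factory=dict)


# Hypothetical catalog linking customer-facing services to their SLO and runbook.
SERVICE_CATALOG = {
    "checkout": {"slo": "checkout-availability", "runbook": "runbooks/checkout.md"},
}


def triage(alerts):
    """Return only actionable alerts, deduplicated and enriched with context."""
    seen = set()
    actionable = []
    for alert in alerts:
        entry = SERVICE_CATALOG.get(alert.service)
        if entry is None or alert.severity == "info":
            continue  # not tied to an SLO-backed service, or informational only
        key = (alert.service, alert.signal)
        if key in seen:
            continue  # the same signal for the same service was already kept
        seen.add(key)
        # Attach SLO and runbook references so the responder has context.
        actionable.append(replace(alert, context={**alert.context, **entry}))
    return actionable


incoming = [
    Alert("checkout", "slo_burn", "critical"),
    Alert("checkout", "slo_burn", "critical"),    # duplicate
    Alert("batch-worker", "cpu_high", "warning"),  # no SLO attached
    Alert("checkout", "cpu_high", "info"),         # informational only
]
for alert in triage(incoming):
    print(alert.service, alert.signal, alert.context["runbook"])
```

Of the four incoming alerts, only the first survives, and it reaches the on-call responder with its SLO and runbook already attached.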
Tool sprawl. Many of the companies we work with use a wide variety of separate tools for observability. Observability requires access to the data all of these tools produce, and problems can persist if the tools operate in data silos.
However, the problem can be solved if there is unified data collection across the toolset, ideally using OpenTelemetry. Cross-tool and cross-stack insights are particularly important for security purposes. Another issue with tool sprawl is that true observability requires that these tools, which each typically address just one aspect of observability, somehow all work together.
Unfortunately, current systems and processes lack the “connective tissue” to offer a consolidated observability solution.
Complexity. As we’ve seen in the results of our annual DevOps Pulse survey, mean time to resolution (MTTR) is going up instead of down, despite the number of tools that are available and the advancements made by the industry in recent years.
Complexity is a large part of the problem, specifically concerning Kubernetes. In our 2023 DevOps Pulse Report, we found that almost 50% of respondents cited Kubernetes as their main challenge to gaining full observability into their environment.
The adoption of Kubernetes is happening at lightning speed, even faster and more broadly than the adoption of cloud technology. But it’s not just the adoption of one Kubernetes. There are many different flavors, whether it’s Kubernetes running on managed services like Amazon EKS or Azure AKS, or Kubernetes running on serverless compute with AWS Fargate. And those are just a few of the many types of Kubernetes out there.
All of these options create flexibility, but wherever you create flexibility, you also create complexity. Complexity in the infrastructure increases the time it takes to troubleshoot and resolve issues.
Cost. Most of us have figured out that products that include every feature under the sun are a waste of money. We don’t need expensive and extraneous clutter. Likewise, gathering and processing too much irrelevant monitoring data adds significantly to costs. Investing in observability platforms that are overloaded with bells and whistles you simply don’t need is similarly costly.
The ideal observability platform taps AI and automation to help you easily see what’s most important to the business and respond quickly to core issues with precision and advanced insights.
Culture change. Going from monitoring to observability is akin to making a cultural quantum leap. Observability requires a different organizational structure and assigns different responsibilities to the teams. The transition requires change management skills and patient, thoughtful execution.
Though these challenges are significant, the observability industry is making rapid progress, and organizations can expect to see solutions and technology providers that overcome these challenges soon. So, be on the lookout for great opportunities coming down the pike.
In the meantime, I challenge you to change your perspective from horizontal to vertical. Set your organization on the path to move beyond DevOps and SRE to true observability, where your teams will not just be alerting and chasing issues for troubleshooting but working to establish real SLOs and driving toward those to enable the strategic goals of the business.