Site reliability engineers, or SREs, do many things. They help developers build reliability into applications. They manage SLAs and SLOs. They play a leading role in incident management and incident response.
For all of these tasks, SREs draw heavily on observability and monitoring. Although other parts of the IT organization also typically help to manage observability and monitoring, it’s virtually impossible for SREs to do their jobs effectively in today’s cloud-native world without observability and monitoring tools and data.
Here’s what observability and monitoring mean, why they’re so important for SREs in particular and how SREs can systematically measure the success of their monitoring and observability practices.
Monitoring is the collection of data from applications, infrastructure or other IT resources to track the status and performance of those resources. Monitoring has been a core IT practice since the late 1990s when the first modern monitoring tools, such as Nagios, appeared.
When compared to monitoring, observability is more intensive. Observability is the collection of discrete types of data from multiple systems and the correlation of those data sets to achieve actionable visibility into IT resources. One thing to note: Observability has a more technical definition in the field of control theory that has to do with using external outputs to make inferences about the internal state of a system; however, when people talk about observability in the context of IT and SRE, they usually mean the collection and correlation of broad sets of data in order to understand complex systems.
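To make the idea of correlating data sets concrete, here is a minimal sketch that joins two kinds of telemetry (log events and latency samples) on a shared trace ID, so that a slow request can be read alongside the log messages it produced. The record types, field names, and the 500 ms threshold are all assumptions made for illustration; real observability tools use their own data models.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class LogEvent:
    trace_id: str   # hypothetical ID shared by all telemetry for one request
    service: str
    message: str


@dataclass
class LatencySample:
    trace_id: str
    service: str
    latency_ms: float


def correlate(logs, samples, slow_ms=500.0):
    """Group log messages with the slow services seen in the same trace.

    Returns {trace_id: {"slow_services": [...], "messages": [...]}}
    for every trace that has at least one sample at or above slow_ms.
    """
    # Index log messages by trace so each slow trace can be given context.
    logs_by_trace = defaultdict(list)
    for event in logs:
        logs_by_trace[event.trace_id].append(f"{event.service}: {event.message}")

    report = {}
    for sample in samples:
        if sample.latency_ms >= slow_ms:
            entry = report.setdefault(
                sample.trace_id,
                {
                    "slow_services": [],
                    "messages": logs_by_trace.get(sample.trace_id, []),
                },
            )
            entry["slow_services"].append(sample.service)
    return report


if __name__ == "__main__":
    logs = [
        LogEvent("t1", "checkout", "db timeout"),
        LogEvent("t2", "cart", "ok"),
    ]
    samples = [
        LatencySample("t1", "checkout", 900.0),
        LatencySample("t2", "cart", 40.0),
    ]
    print(correlate(logs, samples))
```

A monitoring check alone would flag that `checkout` is slow; the joined view also points to the log message that helps explain why.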
The debate about the differences between observability and monitoring is ongoing, but the consensus is that, in most respects, monitoring is one step in the greater observability process. Observability involves additional practices, such as the correlation of various types of data, to achieve a level of context and actionability that monitoring alone can’t deliver.
Again, observability has become a core practice for engineers in a variety of roles. Developers can use observability tools to help measure and optimize application performance before deployment. IT engineers can leverage observability to gain visibility into issues that exist in a production environment. Observability tools can help QA engineers determine why an application failed a test.
But no group has more to gain from observability than SREs. Given that maximizing the reliability and performance of systems is the core mission of SREs, the ability to not just detect problems via monitoring but also to understand them through observability is critical for modern SRE teams.
That’s all the more true for SREs tasked with managing reliability for complex systems, like those based on microservices. In complex applications, identifying the root cause of a reliability issue—such as which specific microservice contains buggy code or which type of request triggers a performance bug—can be very difficult. By maximizing context into reliability problems, as well as allowing teams to trace requests across complex, distributed systems, observability places SREs in the strongest position to find and fix reliability problems quickly.
What’s more, beyond detecting issues in systems that have already been deployed, SREs can also leverage observability tools to help build reliability into applications before they are put into production. By observing applications in dev/test environments, SREs may be able to identify reliability risks and then use that insight to find ways to make the application inherently more reliable. Observability might reveal reliability weaknesses that stem from an application’s architectural design, for instance, or from the orchestration tool that manages the application.
Getting the most from observability and monitoring requires systematic measurement of observability and monitoring initiatives. It’s only by collecting metrics about observability and monitoring outcomes that SREs can ensure they are improving reliability rather than merely collecting and analyzing data to no particular end.
While a full discussion of observability and monitoring measurement is beyond the scope of this article, consider the following types of metrics for assessing these practices:
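As one hedged illustration, mean time to detect (MTTD) and mean time to resolve (MTTR) are commonly used incident metrics that can indicate whether monitoring and observability are helping teams find and fix problems faster. The sketch below computes both from a list of incident records. The `Incident` fields and sample timestamps are hypothetical, and MTTR is measured here from the start of the incident, which is one of several conventions in use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime    # when the problem began (assumed to be known after the fact)
    detected: datetime   # when monitoring or a person first noticed it
    resolved: datetime   # when service was restored


def mean_delta(deltas):
    """Average a collection of timedeltas; raise on empty input."""
    deltas = list(deltas)
    if not deltas:
        raise ValueError("at least one incident is required")
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents):
    """Mean time to detect: average of (detected - started)."""
    return mean_delta(i.detected - i.started for i in incidents)


def mttr(incidents):
    """Mean time to resolve: average of (resolved - started)."""
    return mean_delta(i.resolved - i.started for i in incidents)


if __name__ == "__main__":
    incidents = [
        Incident(
            started=datetime(2024, 1, 1, 10, 0),
            detected=datetime(2024, 1, 1, 10, 5),
            resolved=datetime(2024, 1, 1, 10, 35),
        ),
        Incident(
            started=datetime(2024, 1, 2, 12, 0),
            detected=datetime(2024, 1, 2, 12, 15),
            resolved=datetime(2024, 1, 2, 12, 45),
        ),
    ]
    print("MTTD:", mttd(incidents))
    print("MTTR:", mttr(incidents))
```

Tracking how these values change over time, rather than reading any single number in isolation, is what shows whether an observability initiative is improving reliability.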
The ultimate goals of SREs extend beyond observability and monitoring. However, observability and monitoring serve as essential means to the broader ends that SREs are tasked with achieving: maximizing uptime and performance. And while monitoring alone was once sufficient for supporting the goals of SREs, it’s difficult to imagine an SRE team today that doesn’t also make extensive use of observability, which adds the correlation and context that monitoring lacks.