In today’s technology landscape, organizations strive to champion innovative ideas, techniques and technologies to achieve success and outshine their competitors. For this reason, site reliability engineer (SRE) has become one of the fastest-growing enterprise roles, and site reliability engineering has emerged as a set of organizational practices for fast and reliable software delivery.
SREs use various tools and practices to manage services at scale, and observability is one of them. Observability is the ability to infer a system’s internal state from its external outputs, such as logs, metrics and traces. It provides actionable insights into when errors occur within a system and, more importantly, why they occur. For SREs, this actionable data can be important to providing secure and reliable applications.
To gain more insights into the benefits of observability for SREs, I asked SKILup Day participants and DevOps Institute Ambassadors to weigh in. Here’s what they shared:
Sponsor, Priya Satheesh, CEO, Instana
“As their name implies, site reliability engineers are tasked with keeping the applications, architectures and websites up and available. Observability solutions put alerts in front of them at the first indications of trouble, with proper context and information so they can take action before issues become incidents and start affecting their customers. In addition, AIOps and machine learning help them to predict events and standardize responses for common occurrences to reduce response and repair times.”
Helen Beal, chief ambassador, DevOps Institute
“Embracing observability supports SREs’ goals by:
- Reducing the toil associated with incident management—particularly around cause analysis—improving uptime and MTTR.
- Providing a platform for inspecting and adapting according to SLOs and ultimately improving teams’ ability to meet them.
- Offering a potential solution to improve when SLOs are not met and error budgets are overspent.
- Relieving team cognitive load when dealing with vast amounts of data, reducing burnout.
- Releasing humans and teams from toil while improving productivity, innovation and the flow and delivery of value.
- Supporting multifunctional, autonomous teams and the ‘We build it, we own it,’ DevOps mantra.
- Completing the value stream cycle by providing insights around value outcomes that can be fed back into the innovation phase.”
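To make the point about SLOs and overspent error budgets concrete, here is a minimal sketch of the usual error budget arithmetic: the budget is the share of requests an SLO allows to fail, and it is overspent once observed failures exceed that share. The class name `ErrorBudget`, its fields and the sample numbers are illustrative assumptions, not something the contributors described.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical helper for request-based SLO arithmetic (illustration only)."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int   # requests observed in the SLO window
    failed_requests: int  # requests that violated the SLO in that window

    @property
    def allowed_failures(self) -> float:
        # The budget: how many requests may fail without breaking the SLO.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def consumed_fraction(self) -> float:
        # Share of the budget already spent; values above 1.0 mean overspent.
        if self.allowed_failures == 0:
            return float("inf") if self.failed_requests else 0.0
        return self.failed_requests / self.allowed_failures

    @property
    def overspent(self) -> bool:
        return self.failed_requests > self.allowed_failures


if __name__ == "__main__":
    # Assumed sample numbers: a 99.9% SLO over one million requests.
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    print(f"budget consumed: {budget.consumed_fraction:.0%}, overspent: {budget.overspent}")
```

In practice, the request counts would come from whatever metrics the observability platform already collects, which is why the two topics are so closely linked.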
Mark Peters, technical lead, Novetta
“True system observability means any process or event within any operational or development phase can be monitored by SREs. Systems with built-in observability provide SREs, as subject matter experts, the opportunity to examine events from different perspectives. The more observable the system, the more different places can be examined to help improve the overall flow. SREs can then create the feedback and design experiments to observe from different perspectives. If the event is showing functional difficulties and the only monitoring point is the input/output of the function and there is no process observability, SREs have to go back to the beginning to find events. If they can observe the entire process chain from commit to deployment, one can find the different areas. Think about the A/B cycle: Limited deployments are planned, and the load shifted gradually to ensure the new system can handle the increased functions. If the SRE cannot observe the function, then they cannot help improve reliability.”
Ryan Sheldrake, field CTO, Lacework
“If you can ‘see’ or observe the complex and ever-changing environments that SREs are tasked with managing, creating and perhaps destroying, you can manage the risk posture associated with those environments and their corresponding workloads.”
Vishnu Vasudevan, head of product engineering and development, Opsera
“The top benefit for site reliability engineering (SRE) teams is the opportunity to work collaboratively with the business stakeholders and technology stakeholders to set goals and continuously improve. Based on these goals, SREs can create an approval process, which will help them understand where they are with respect to their own planning, security policies, quality metrics and operations practices. The ability to measure every release and verify the project helps improve speed to market and other factors once a robust observability practice is achieved. With DevOps orchestration across all the DevOps tools, applications being released and the different teams involved, SRE teams can then turn observability practices into a competitive advantage.”
Sushant Mehta, senior manager, application development, Diyar United Company
“Using observability, SREs can achieve a number of objectives in their day-to-day work, like identifying the root cause of production issues, faster resolution of issues and striving for self-healing infrastructure setup with no-code.”
Tiffany Jachja, engineering manager, Vox Media
“Every team that delivers software should be taking responsibility for their contributions to a software service. For SREs, having observability into an application means having the systems in place to help developers gain insights into how software applications function. Ultimately, it’s about building better software.”
Parveen Arora, co-founder and director, VVnT SeQuor
“SRE teams can leverage observability to get the following benefits:
- To keep SLOs intact by detecting customer-affecting issues faster and rolling back before an issue affects the SLOs.
- To have real-time health updates and transparency of information with regard to a service’s status.
- To create better workflows for debugging, optimizing workflows and resolving issues rapidly.
- To simplify root cause analysis and investigation of hypotheses.”
Jose Adan Ortiz, solutions engineer, Akamai Technologies
“SREs are mainly concerned with the reliability of systems and usually monitor toil, SLOs, error budgets and stability to keep systems working as expected.
You could ask, ‘How can an SRE achieve such an integrated view of all systems components?’ The answer is observability. Without observability tools, it was impossible for an SRE to provide an effective response, fast RCA and analysis of data to improve SLOs.”
Supratip Banerjee, solutions architect, Principal Global Services
“Observability can provide several advantages, including:
- Detecting customer-impacting errors faster and reverting before SLOs are broken.
- Fostering transparency and delivering real-time updates on the status of a service. This allows SREs to be more productive and saves a lot of time.
- Developing better methods for debugging, optimizing and quickly addressing issues.
- Allowing for feedback loops, which are crucial in the SRE role.
- To avoid downtime, the relevance of observability rises in real-world production systems. There should be proper notification processes in place.”
Maciek Jarosz, DevOps and process expert
“The benefits are quite great, I’d say. Unless one is living under a rock with one local system, then it’s rather inevitable that there will be dependencies here and there. And if you’re an SRE, then I’d say that you’d rather know a bit more about dependencies from other branches of your system, or other systems altogether, than less.”
Learn more about observability and similar topics by registering for an upcoming SKILup Day.