Opportunities and Challenges of Observability

Observability platforms that unify metrics, logs and traces across applications and IT infrastructure are transforming how IT is managed. Most DevOps teams today can continuously monitor IT environments using tools that track a set of predefined metrics. Observability platforms go further by aggregating data so that teams can run queries that help them uncover the root cause of IT issues and understand what is happening within their applications more effectively and efficiently.

Historically, applications were black boxes: it was (and in many cases still is) extremely difficult to know what was happening under the hood of the application.

With the advent and subsequent growth of observability tools and techniques in the DevOps space, teams have moved through three stages. First, they struggled to collect and understand the available data, which could vary from robust to almost nonexistent. Then they became able to easily aggregate data from numerous sources and understand what that data means. Now they are deluged with so much data that it is hard to keep up.

Gathering Data Is No Longer Enough

Josh Chessman, vice president of strategy and innovation at IT infrastructure monitoring specialist Netreo, explained that as applications become more complex (for example, running across multiple disparate environments with different technologies and in different locations), being able to collect data from these environments is no longer enough.

“Instead, DevOps engineers need to be able to collect the data and efficiently analyze it, too, to understand what is happening within the application,” he said. “Observability is the concept that drives this.”

Chessman pointed out there are numerous observability issues DevOps teams face, but that they vary from organization to organization (and sometimes from team to team within an organization).

“Part of the problem is that many organizations simply lack tools to provide the data collection and analysis needed to move to an observability paradigm,” he said.

This could be for numerous reasons, including budget (if you already have an existing tool, can you afford to acquire a new tool?), functionality (not all tools that claim to provide observability really do), implementation (this stuff is not easy and a poorly implemented tool could actually make things worse) and more.

“One of the larger challenges I’ve seen is that organizations will go out and acquire an ‘observability’ tool and assume that is all they have to do,” Chessman said. “Unfortunately, that is not the case. Acquiring the tool gets you the technology, but the team must also be structured and ready to take advantage of the concepts and functionality provided by the tool.”

Kenny Johnston, senior director of product management at GitLab, explained that DevOps teams use five different observability tools on average, and that those tools tend to connect developers to observability inadequately.

In turn, developers struggle to respond to incidents swiftly while jumping across multiple tools and sources of truth. They also struggle to assign service level objectives (SLOs) and error budget definitions to the responsible product development teams.

“To remedy this, we recommend adopting a unified software development solution that replaces the DIY ‘duct tape’ of products constraining developers and helps them deliver software more securely, at speed and scale,” he said.

Johnston said the most flexible and future-proof solutions today are open source tools like Prometheus and its ecosystem, as well as the OpenTelemetry toolset for more advanced use cases such as tracing.

Prometheus has become very common in the developer space and, unlike some other open source solutions or libraries, it defines a standard format in which instrumented applications emit their metrics.

“This allows for simple and future-proof interoperability, as one of the hardest things to change later down the line is code running,” he said. “Using open standards reduces the need for developers to rewrite and reintegrate the same type of code over and over again.”
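To make the interoperability point concrete, here is a minimal sketch of what emitting a metric in the Prometheus text exposition format can look like. It uses only the Python standard library rather than an official client, and the function names, the metric name and the label values are hypothetical, chosen purely for illustration.

```python
def escape_label_value(value: str) -> str:
    """Escape backslashes, double quotes and newlines in a label value."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def render_counter(name: str, help_text: str, samples) -> str:
    """Render one counter metric as Prometheus-style exposition text.

    `samples` is a list of (labels, value) pairs, where `labels` is a dict
    of label names to string values. Labels are sorted so that the output
    is stable between runs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric and labels, for illustration only.
EXAMPLE = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    [({"method": "GET", "code": "200"}, 1027)],
)
```

Because the output is plain text with a well-known shape, any scraper that understands the format can read it without knowing how the application was instrumented. In practice a team would more likely use an existing Prometheus client library than hand-roll the rendering.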

Observability Requires Understanding Your Goals

From Chessman’s perspective, understanding what your goals are is critical for observability.

“Observability tools can do many things, but their primary function is to collect and analyze data,” he said. “While that is all well and good, if you do not define your KPIs from the beginning, all the collected and analyzed data is meaningless since you have no concept of what good and bad are.”

Trying to define good and bad from the data collected is likely to result in failure because you are looking at where you are to figure out where you want to go.

KPIs vary from application to application and, if they are not identified and well defined before starting down the observability path, the work becomes that much harder, he said.

Most tools (observability or otherwise) are capable of collecting hundreds if not thousands of metrics, but not all—not even most, in many circumstances—will be relevant to an organization.

“Instead of having to sift through the cruft to find the valuable data, it is better to start with identifying the valuable data and then figure out how to get there,” Chessman said. “If you don’t identify your KPIs in advance you could end up using the wrong KPIs for the wrong scenarios.”
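One way to picture starting from the valuable data is to declare the KPIs first and then check collected metrics only against that list. The sketch below assumes a simple threshold model; the `Kpi` class, the metric names and the thresholds are all hypothetical and would differ for every application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kpi:
    """A KPI agreed on before any data is collected."""

    metric: str             # name of the metric the KPI is based on
    threshold: float        # boundary between "good" and "bad"
    higher_is_better: bool  # True for ratios, False for latencies

    def is_healthy(self, value: float) -> bool:
        if self.higher_is_better:
            return value >= self.threshold
        return value <= self.threshold


# Hypothetical KPIs for an imaginary checkout service.
KPIS = [
    Kpi("checkout_p95_latency_ms", 500.0, higher_is_better=False),
    Kpi("checkout_success_ratio", 0.995, higher_is_better=True),
    Kpi("signup_p95_latency_ms", 800.0, higher_is_better=False),
]


def evaluate(kpis, collected):
    """Report 'ok', 'breach' or 'missing' for each predefined KPI.

    Metrics in `collected` that no KPI refers to are ignored, which is
    the point: the KPI list decides what matters, not the data volume.
    """
    results = {}
    for kpi in kpis:
        if kpi.metric not in collected:
            results[kpi.metric] = "missing"
        elif kpi.is_healthy(collected[kpi.metric]):
            results[kpi.metric] = "ok"
        else:
            results[kpi.metric] = "breach"
    return results
```

A "missing" result is useful in its own right, since it shows that a KPI the team cares about is not yet being measured.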

Johnston added that defining KPIs, like DevOps itself, is an iterative process, and that starting with KPIs helps developers create systems that can meet performance and availability expectations.

“It’s important that DevOps teams continuously review their KPIs to allow for regular feedback on successes and failures,” he said.

From his perspective, the best place to start maintaining and improving observability is to store observability definitions, instrumentation, SLOs, thresholds, incident response templates and dashboards as code.

“This allows for rapid improvement via code changes by any team member, whether that’s a site reliability engineer (SRE), developer or platform team,” he said.
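As a rough illustration of what storing SLOs as code could mean, the sketch below keeps SLO definitions in a version-controlled Python module and derives the error budget from them. It is an assumption about one possible layout, not a description of any particular product; the `Slo` class, the service names, the team names and the targets are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A service level objective kept in the repository next to the code."""

    service: str
    owner_team: str  # the product team responsible for the objective
    target: float    # required fraction of successful requests, e.g. 0.999

    def __post_init__(self):
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")

    def allowed_failures(self, total_requests: int) -> float:
        """Error budget: how many requests may fail in the period."""
        return (1.0 - self.target) * total_requests

    def budget_remaining(self, total_requests: int, failed: int) -> float:
        """Fraction of the error budget left; negative means overspent."""
        allowed = self.allowed_failures(total_requests)
        if allowed == 0:
            return 1.0 if failed == 0 else float("-inf")
        return 1.0 - failed / allowed


# Hypothetical definitions. Changing a target is an ordinary code change
# that can be reviewed like any other.
SLOS = [
    Slo(service="checkout", owner_team="payments", target=0.999),
    Slo(service="search", owner_team="discovery", target=0.99),
]
```

With a 99.9% target and one million requests in the period, the budget is about 1,000 failed requests, so 250 failures would leave roughly three quarters of it unspent.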