Walking into an unfamiliar operations center some time ago, I immediately noticed database error alerts racing down the primary monitor faster than a Matrix screensaver morphs letters. Strangely, no one seemed particularly excited about it. The situation was a head-turner for me, coming as I did from the “everything must balance” banking world.
A senior ops tech quickly explained that airline reservation systems are highly optimized for speed and volume, sacrificing nearly everything else to meet peaks. “No worries; the agents just hit Enter again. Retry logic just slows us down,” they told me. The lesson here was that removing constraints can profoundly change a system, much as Dr. Eliyahu Goldratt described in his game-changing Theory of Constraints.
Today, cloud-native architecture, multi-cloud and hybrid cloud platforms, dynamic infrastructure-as-code (IaC), DevOps and our ability to store boundless amounts of data each change or remove constraints in their own way. Observability could be on this list, but I think observability’s transformation is still happening right in front of us.
Observability and OpenTelemetry
Through open source OpenTelemetry (OTel), highly competitive vendors collaborate to shift their offerings up the value chain by supplementing and replacing once highly prized proprietary agents, interfaces and data specs. There’s no better example of removing a constraint than replacing these once-proprietary agents with an open solution, a change that will have profound effects across the industry.
OTel is a thriving ecosystem in which 11 tech companies participate on various boards and committees, and it collaborates and integrates with open source projects such as Jaeger, Kubernetes, Prometheus and OpenMetrics. Over 20 observability companies natively support OTel and provide OTel distributions. OTel continues to mature, reaching a stable 1.0 release in 2021 and announcing its roadmap for metrics specifications in 2022.
OTel-native support represents a significant commitment from vendors because it means rearchitecting products to distribute data via the OpenTelemetry Protocol (OTLP) rather than using OTel SDKs and Collectors as a front end to internal interfaces. With SolarWinds’ October 2022 announcement, I suspect they will also join the ranks of companies supporting OTel natively in their commercial offerings.
Digital transformation, product and technology leaders see value in observability because of its potential to measure digital experiences and the performance of business and digital services. Doing this requires observability to meet three significant challenges.
First, observability must effectively cross the complex boundaries of microservices, containers, cloud and traditional applications, multiple cloud providers, database sources, SaaS services, infrastructure and internal and external APIs. Today’s challenge goes far beyond centrally aggregating large volumes of log data and suppressing non-essential alerts.
Most enterprise architectures look eerily similar to a breadboard wiring project with applications, systems and data sources crisscrossing each other, representing the various pathways and interfaces across systems. Virtually any of these elements could contribute to the degradation of a digital experience, and observability must operate across these elements whether they live in our tightly controlled data centers or are distributed in microservices, cloud services or third-party interfaces.
With this breadth and depth of visibility, we also need context to match and correlate what appears to be disconnected information and sources. Open source enthusiast Chris Engelbert describes this challenge well:
“The data correlation, the knowledge of how the infrastructure and services are deployed, as well as the dependency tree of applications, services and (eventually) hardware, must be taken into account when providing hard evidence of what is going on in your system.”
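To make the correlation idea concrete, here is a minimal sketch in Python. It assumes every record, whatever its source, carries a shared trace ID plus span and parent-span IDs, which is the kind of context that trace propagation is meant to provide. The field names, service names and sample records are all hypothetical, and a real pipeline would be considerably more involved.

```python
from collections import defaultdict

# Hypothetical telemetry records from unrelated sources. The field names
# (trace_id, span_id, parent_id, service) are assumptions for this sketch.
records = [
    {"source": "api-gateway", "trace_id": "t1", "span_id": "a",
     "parent_id": None, "service": "checkout-api"},
    {"source": "k8s-pod-logs", "trace_id": "t1", "span_id": "b",
     "parent_id": "a", "service": "pricing-service"},
    {"source": "db-proxy", "trace_id": "t1", "span_id": "c",
     "parent_id": "b", "service": "orders-db"},
    {"source": "api-gateway", "trace_id": "t2", "span_id": "d",
     "parent_id": None, "service": "search-api"},
]


def correlate(records):
    """Group records from any source by the trace ID they share."""
    by_trace = defaultdict(list)
    for rec in records:
        by_trace[rec["trace_id"]].append(rec)
    return dict(by_trace)


def dependency_edges(trace_records):
    """Derive (caller, callee) service pairs from parent/child span links."""
    service_of = {rec["span_id"]: rec["service"] for rec in trace_records}
    edges = set()
    for rec in trace_records:
        parent = rec["parent_id"]
        if parent in service_of:  # root spans have no parent and add no edge
            edges.add((service_of[parent], rec["service"]))
    return edges


traces = correlate(records)
edges = dependency_edges(traces["t1"])
```

Records that looked disconnected (a gateway entry, a pod log line, a database proxy event) end up in one group, and the parent links yield a small dependency tree for that request.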
Following the Silver Thread
With context and dependencies, observability can allow us to see what software developers call the “silver thread,” the ability to collect the components of systems involved in measuring an experience or triaging a performance issue. Then we can follow the pathway, or silver thread, across all the components, whatever or wherever they are, to find constraints or bottlenecks. For example, a particular API’s poor performance may be due to the location and traversal necessary to reach a needed data source rather than an issue in that API’s microservice or application.
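The API example can be sketched roughly as follows. This is an illustration only: the span names and timings are invented, and it assumes each span's children run one after another rather than in parallel, so that a span's own time is its duration minus the time spent in its direct children.

```python
# Hypothetical spans for a single request; names and timings are made up.
spans = [
    {"id": "a", "parent": None, "name": "GET /orders (API)", "ms": 950},
    {"id": "b", "parent": "a", "name": "orders-service", "ms": 900},
    {"id": "c", "parent": "b", "name": "cross-region DB query", "ms": 820},
]


def self_times(spans):
    """Return each span's duration minus the time spent in its direct children."""
    child_total = {}
    for s in spans:
        if s["parent"] is not None:
            child_total[s["parent"]] = child_total.get(s["parent"], 0) + s["ms"]
    return {s["name"]: s["ms"] - child_total.get(s["id"], 0) for s in spans}


def bottleneck(spans):
    """Return the name of the span with the largest self time."""
    times = self_times(spans)
    return max(times, key=times.get)


slowest = bottleneck(spans)
```

Judged by total duration, the API looks like the slow component, but following the thread shows that most of the time is spent in the remote data-source call beneath it.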
In summary, observability is invaluable to monitoring, operating and triaging modern applications and cloud infrastructure. The adoption and maturation of OpenTelemetry will deliver many benefits, including the removal (or, at least, greater transparency) of traditional boundary constraints. And with the context to understand and follow the silver thread, observability transforms beyond operations to measure digitally delivered business services and experiences.