Cloud Native Tracing and Observability: Why You Care

Is observability the new monitoring? Or are observability and tracing fundamentally different? As with any IT industry trend, the answer can be difficult to discern as many jump on the bandwagon, appropriately or not. The Splunk .conf 2019 event gave Splunk the opportunity to clarify the terms and explain the reasoning behind its acquisitions of SignalFx and Omnition.

Monitoring is mostly about feeding events, telemetry data and established data points and thresholds into alerting and problem management processes and tools. It’s tried and true, and has progressed with the evolution of typical monolithic and distributed apps and systems. Enter cloud-native applications, composed of containers, microservices and service meshes. With great flexibility comes complexity, and cloud native isn’t immune to that adage.

What would have been a monolithic application is now shattered into hundreds or even thousands of smaller pieces of application software and supporting technology. The complexity of monitoring cloud-native applications from the outside quickly surpasses the capabilities of most monitoring approaches.

Observability, as Arijit Mukherji, CTO of the recently acquired SignalFx, shared this week, is built upon three pillars: metrics, tracing and logs. Metrics show when you have a problem, tracing points you to the problem and logs help find the root cause. It is a reasonable way to define and segment observability. It also, not surprisingly, fits well with the logic of Splunk plus SignalFx plus Omnition.

Moving beyond monitoring requires instrumentation, built into the software and APIs as part of the software creation process, a DevOps process. Part of what SignalFx brought Splunk is auto-instrumentation, which identifies the frameworks and libraries in use within applications at run time and can capture tracing data from them. Omnition brings even deeper tracing chops, performing tracing across large service meshes of microservices. Add Splunk’s capability to correlate business, application and operations data and you complete the picture, with the ultimate goal of making observability easy for developers.

The above might explain why developers, ops and DevOps care about observability and tracing, but should the business care? SignalFx CEO Karthik Rau connected the dots nicely during a conversation this week. Digital transformation strategies require speed and agility, but also demand more risk-taking. Confidence in risk-taking comes when accompanied by the ability to respond rapidly to changes and failures. Software deployed 10 minutes ago may need to be backed out or corrected rapidly when the user experience turns negative. That requires a rapid determination of what the problem is, and the ability to take immediate corrective action, including automated action.
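
As a rough illustration of automated corrective action, the sketch below compares user-facing metrics from before and after a deployment and decides whether to back the release out. Everything here is an assumption made for the example: the WindowStats type, the should_roll_back function and the threshold values are hypothetical and are not drawn from any Splunk or SignalFx product.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """User-facing metrics aggregated over one time window."""
    requests: int
    errors: int
    p95_latency_ms: float


def should_roll_back(
    baseline: WindowStats,
    current: WindowStats,
    max_error_rate_increase: float = 0.02,  # assumed threshold: +2 points
    max_latency_ratio: float = 1.5,         # assumed threshold: 50% slower
    min_requests: int = 100,                # avoid deciding on too little data
) -> bool:
    """Return True when the new release looks worse for users than the baseline."""
    if current.requests < min_requests:
        return False
    baseline_rate = baseline.errors / max(baseline.requests, 1)
    current_rate = current.errors / current.requests
    if current_rate - baseline_rate > max_error_rate_increase:
        return True
    return current.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio


before = WindowStats(requests=10_000, errors=50, p95_latency_ms=200.0)
after_deploy = WindowStats(requests=1_000, errors=80, p95_latency_ms=220.0)

if should_roll_back(before, after_deploy):
    print("error rate regressed: back out the release")
```

The check deliberately looks at what users experience (failed and slow responses) rather than infrastructure measures such as CPU load.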

Cloud native and DevOps not only enable disruptive digital transformation strategies but must also be accompanied by rapid and automated responses when conditions arise that hurt the business and its customers. Customers don’t care when CPUs are taxed, but they do care when a mobile app’s responses fail or are slow. The move to cloud native is well served when backed by the capability to respond in near real time when problems occur.

— Mitchell Ashley