OpenTelemetry and AI are Unlocking Logs as the Essential Signal for “Why”

Logs have always been a cornerstone of system reliability. Metrics tell you the “what”: they are the system’s heartbeat, indicating when a threshold, such as high CPU or error rate, has been breached. Traces identify the “where,” mapping a request’s journey to pinpoint the service where an error originates. But logs tell you the “why.” They are information-rich and capture the smallest details of what is happening inside applications and infrastructure, making them indispensable for troubleshooting and root cause analysis.

Yet in today’s cloud native landscape, logs are also becoming one of the hardest signals to manage. Kubernetes, microservices, ephemeral workloads and now agentic AI applications generate unprecedented volumes of log data. With this scale come inconsistent formats, fragmented context and pipelines under pressure. SREs are left with a dilemma: either retain everything and risk drowning in data and cost, or filter aggressively and risk discarding the single line that could unlock the root cause of the next outage.

The Evolving Role of OpenTelemetry

This challenge has made OpenTelemetry (OTel) increasingly important. Originally focused on traces and metrics, the project is now rapidly maturing to put logs on an equal footing.

Schema: The community is standardizing log data models and semantic conventions. Elastic has contributed the Elastic Common Schema (ECS) to accelerate OTel Semantic Conventions, and OTel’s GenAI SIG is extending semantics to cover AI Agents, LLMs, and VectorDBs. These efforts aim to reduce variability and enable consistency across diverse log sources.

Transport: OTLP has been defined as the common transport for logs, providing a unified way to move telemetry data across systems without vendor lock-in. This ensures that logs, metrics, and traces can all share the same delivery mechanism.

Collection: The OTel Collector has been extended with receivers for file-based logs, system journals, and Kubernetes workloads. With Kubernetes-aware operators and Helm charts, practitioners can deploy pipelines either agent-side or centrally, depending on scale and architecture needs.

For practitioners, these advances ease many of the plumbing challenges: logs from containers, cloud services, and custom applications can flow more consistently into unified pipelines. While collection is becoming more reliable and flexible, the harder challenge remains turning raw log streams into real understanding.
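
To make the transport point above more concrete, here is a minimal sketch of what a single log record might look like when encoded as OTLP/JSON. The field names follow the OTLP JSON encoding as commonly documented; the service name, scope name and attribute values are hypothetical, and the exact shape should be checked against the OTLP specification before relying on it.

```python
import json
import time


def any_value(value):
    """Wrap a Python value in the OTLP AnyValue JSON shape."""
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return {"boolValue": value}
    if isinstance(value, int):
        return {"intValue": str(value)}  # 64-bit integers are strings in protobuf JSON
    if isinstance(value, float):
        return {"doubleValue": value}
    return {"stringValue": str(value)}


def key_values(mapping):
    """Convert a flat dict into the OTLP list-of-KeyValue form."""
    return [{"key": k, "value": any_value(v)} for k, v in mapping.items()]


def build_logs_payload(service_name, body, severity_text, severity_number,
                       attributes, time_unix_nano=None):
    """Build one OTLP/JSON logs export request holding a single log record."""
    if time_unix_nano is None:
        time_unix_nano = time.time_ns()
    record = {
        "timeUnixNano": str(time_unix_nano),
        "severityNumber": severity_number,
        "severityText": severity_text,
        "body": any_value(body),
        "attributes": key_values(attributes),
    }
    return {
        "resourceLogs": [{
            "resource": {"attributes": key_values({"service.name": service_name})},
            "scopeLogs": [{
                "scope": {"name": "example.logger"},  # hypothetical scope name
                "logRecords": [record],
            }],
        }]
    }


payload = build_logs_payload(
    service_name="auth-service",  # hypothetical service
    body="failed login",
    severity_text="ERROR",
    severity_number=17,  # 17 is the first value of the ERROR range in the OTel log data model
    attributes={"user.id": "123", "client.address": "192.168.1.1"},
    time_unix_nano=1_700_000_000_000_000_000,
)
print(json.dumps(payload, indent=2))
```

A payload like this would typically be sent to a Collector's OTLP/HTTP logs endpoint (conventionally the /v1/logs path); in practice an OTel SDK or the Collector itself builds and ships it, so hand-rolling the JSON is mainly useful for understanding the format.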

Logs: The Essential Path to Answering the ‘Why’

For much of the last decade, industry efforts around logging centered on collection. Forwarders, agents, and storage pipelines multiplied as organizations sought to capture every event. Today, with OpenTelemetry, collection is becoming easier and more standardized.

The real challenge lies in turning raw, unstructured text into actionable context and real answers for SREs. SREs need systems that can:

Automate the data management pipeline – parse logs automatically, partition them into meaningful groupings, and analyze them for critical signals. Without this transformation, logs remain a noisy firehose rather than a source of clarity.

Simplify the variability in logs – Kubernetes alone produces dozens of log formats depending on runtime, workload, and sidecar configuration. Add to that custom application logs, AI inference logs, and security signals, and the diversity becomes overwhelming.
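
One simple way to picture "partitioning logs into meaningful groupings" is template extraction: mask the variable parts of each line so that lines with the same shape fall into the same bucket. The sketch below is an illustrative assumption, not a description of any particular product; the masking rules and function names are hypothetical and deliberately minimal.

```python
import re
from collections import defaultdict

# Masks are applied in order: IPv4 addresses first, then any remaining digit runs.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\d+"), "<NUM>"),
]


def template_of(line):
    """Replace variable tokens with placeholders to get the line's template."""
    for pattern, placeholder in _MASKS:
        line = pattern.sub(placeholder, line)
    return line


def partition(lines):
    """Group raw log lines by their masked template."""
    groups = defaultdict(list)
    for line in lines:
        groups[template_of(line)].append(line)
    return dict(groups)


sample = [
    "[ERROR] User:123 failed login: IP 192.168.1.1",
    "[ERROR] User:456 failed login: IP 10.0.0.7",
    "[INFO] Cache warmed in 35 ms",
]
for template, members in partition(sample).items():
    print(len(members), template)
```

Static masks like these are exactly the kind of brittle rule that breaks when formats drift, which is why the grouping step is a natural candidate for adaptive, model-assisted parsing.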

This is where AI and Large Language Models (LLMs) deliver a crucial leap in investigative power. LLMs move beyond brittle static rules by offering dynamic pattern recognition and adaptive parsing. For instance, an LLM can analyze logs from a new application or service, automatically recognize patterns (even with slight format drift), and suggest or apply a structured schema on the fly, transforming raw text like [ERROR] User:123 failed login: IP 192.168.1.1 into a structure with fields for SeverityText, attributes.user.id, body, and attributes.client.address.
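
To show the target of that transformation, here is a minimal sketch of the structured record for the example line. It uses a hand-written regular expression as a stand-in for the schema an LLM might suggest, so it illustrates the output shape rather than the adaptive parsing itself. The function name is hypothetical, and treating "failed login" as the body is an assumption about how the message would be split.

```python
import re

# Hand-written pattern standing in for a schema that an LLM might propose.
LOGIN_FAILURE = re.compile(
    r"^\[(?P<severity>[A-Z]+)\]\s+User:(?P<user_id>\S+)\s+"
    r"(?P<message>.+?):\s+IP\s+(?P<ip>\S+)$"
)


def parse_line(line):
    """Map a raw line onto OTel-style fields, keeping the raw text on a miss."""
    match = LOGIN_FAILURE.match(line.strip())
    if match is None:
        # Unknown format: preserve the original text rather than dropping it.
        return {"SeverityText": None, "body": line, "attributes": {}}
    return {
        "SeverityText": match["severity"],
        "body": match["message"],
        "attributes": {
            "user.id": match["user_id"],
            "client.address": match["ip"],
        },
    }


print(parse_line("[ERROR] User:123 failed login: IP 192.168.1.1"))
```

The fallback branch matters as much as the happy path: a line that does not match is kept whole as the body, so nothing is lost while a new pattern is being learned or suggested.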

Agentic AI takes this a step further: it can actively monitor for log anomalies, identifying which events are truly significant (e.g., distinguishing a critical application error from routine service noise). The agentic system then automatically correlates these critical logs with related metric spikes, trace failures and recent deployment changes to propose the likely root cause and generate a plain-language summary of the incident before an engineer is notified. This transformation moves log analysis from a manual search operation to an automated, intelligent investigative process.
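
The correlation step can be illustrated with a deliberately simple heuristic: count the error logs that follow each deployment of the same service within a time window, and flag deployments that cross a threshold. This is a sketch under stated assumptions, not how any agentic system actually works; the data classes, window, threshold and summary wording are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class LogEvent:
    timestamp: datetime
    service: str
    severity: str


@dataclass
class Deployment:
    timestamp: datetime
    service: str
    version: str


def suspect_deployments(events, deployments,
                        window=timedelta(minutes=15), min_errors=3):
    """Return (deployment, error_count) pairs, most errors first."""
    suspects = []
    for dep in deployments:
        count = sum(
            1 for e in events
            if e.service == dep.service
            and e.severity == "ERROR"
            and dep.timestamp <= e.timestamp <= dep.timestamp + window
        )
        if count >= min_errors:
            suspects.append((dep, count))
    return sorted(suspects, key=lambda pair: pair[1], reverse=True)


def summarize(dep, count, window=timedelta(minutes=15)):
    """Produce a one-line, plain-language summary for a suspect deployment."""
    minutes = int(window.total_seconds() // 60)
    return (f"{count} errors in {dep.service} within {minutes} minutes "
            f"of deploying {dep.version}")


t0 = datetime(2025, 1, 1, 0, 0)
deployments = [Deployment(t0, "checkout", "v2.4.1")]
events = [LogEvent(t0 + timedelta(minutes=m), "checkout", "ERROR") for m in (1, 2, 5)]
events.append(LogEvent(t0 + timedelta(minutes=3), "search", "ERROR"))

for dep, count in suspect_deployments(events, deployments):
    print(summarize(dep, count))
```

A real system would weigh many more signals, such as metric spikes and failing traces, but the basic move is the same: line up significant log events against recent changes and rank the candidates.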

Consider the experience of an SRE on call at midnight. Instead of wading through thousands of unstructured log lines, they should see clear, contextual answers: a service restarting repeatedly due to memory exhaustion, an error-rate spike tied to a recent deployment, or an anomaly in startup failures. Behind the scenes, parsing, partitioning and enrichment must happen automatically. AI can adapt as formats evolve, while significant events are surfaced in real time. The SRE gets clarity instead of chaos.

Storing OTel Semantics Natively

While AI can help reduce variability and simplify pipelines, it is just as vital to adopt open semantic conventions and store logs in their native OpenTelemetry semantics. Instead of forcing logs into proprietary formats or custom schemas, teams can persist them in a standardized, OTel-aligned representation. This shift provides two major benefits.

First, it ensures consistency across signals. When logs share the same semantic conventions as traces and metrics, cross-signal correlation becomes straightforward. A log line tied to a trace ID or a Kubernetes namespace doesn’t require additional translation layers; it already speaks the same language as the rest of the telemetry.
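
As a rough illustration of why shared semantics make correlation straightforward, the sketch below joins log records to spans on a shared trace ID with no translation layer in between. The dictionaries are simplified, hypothetical stand-ins for stored OTel records; k8s.namespace.name is the OTel semantic convention attribute for a Kubernetes namespace, while the other names and values are assumptions.

```python
from collections import defaultdict


def index_spans_by_trace(spans):
    """Build a lookup from trace ID to the spans that belong to it."""
    by_trace = defaultdict(list)
    for span in spans:
        by_trace[span["trace_id"]].append(span)
    return by_trace


def logs_with_failing_spans(logs, spans):
    """Attach the names of failing spans to each log that shares their trace ID."""
    by_trace = index_spans_by_trace(spans)
    results = []
    for log in logs:
        related = by_trace.get(log.get("trace_id"), [])
        failing = [s["name"] for s in related if s["status"] == "ERROR"]
        if failing:
            results.append({
                "body": log["body"],
                "namespace": log["resource"].get("k8s.namespace.name"),
                "failing_spans": failing,
            })
    return results


spans = [
    {"trace_id": "abc123", "name": "POST /login", "status": "ERROR"},
    {"trace_id": "abc123", "name": "db.query", "status": "OK"},
    {"trace_id": "def456", "name": "GET /health", "status": "OK"},
]
logs = [
    {"trace_id": "abc123", "body": "failed login",
     "resource": {"k8s.namespace.name": "auth"}},
    {"trace_id": "def456", "body": "health check ok",
     "resource": {"k8s.namespace.name": "infra"}},
    {"body": "startup complete", "resource": {}},  # no trace context
]
print(logs_with_failing_spans(logs, spans))
```

Because both signals carry the same identifiers under the same names, the join is a plain dictionary lookup; with proprietary schemas, each source would first need its own field mapping.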

Second, it accelerates analysis. Because the data model is standardized, queries and visualizations don’t need to reinvent mappings for each source. Schema evolution also becomes less disruptive: if a log payload changes, downstream systems still understand its meaning within the OTel conventions. For SREs, this means faster querying, fewer blind spots, and quicker identification of anomalies across diverse workloads. By persisting logs in native OTel semantics, organizations build a foundation where log data is not only collected but also immediately usable in broader observability workflows.

From Signals to Understanding

AI is increasingly helping to automate the messy work of parsing, normalizing, and surfacing insights from raw logs. By adapting to changing formats and highlighting significant events, AI reduces noise and makes log data more actionable. A clear example is Elastic’s Streams, which applies AI-driven processing directly within pipelines to parse, normalize and enrich logs before surfacing insights and context to practitioners. When paired with native OTel storage, this lets SREs shift their focus from raw collection to answering the “why” behind failures. Combined with OpenTelemetry’s maturing semantics and consistent data models for cloud native and agentic AI, it makes cross-signal correlation faster and more reliable.

By treating logs not simply as raw text but as structured, enriched and intelligent data, these innovations move observability closer to becoming an adaptive system rather than a static toolset. Ultimately, embracing the richness of logs, the speed of metrics and the precision of traces, woven together through open standards and AI-driven analysis, will enable SREs to move beyond collection toward true understanding and keep pace with the complexity of modern, autonomous, cloud native systems.

KubeCon + CloudNativeCon North America 2025 is taking place in Atlanta, Georgia, from November 10 to 13.