Metrics, Logs and Traces: The Golden Triangle of Observability in Monitoring

It isn’t surprising that the job of monitoring infrastructure and application environments has grown more complex as the underlying technologies evolve. Nor is it surprising that the need to get monitoring right has expanded as our business reliance on IT grows ever deeper, so it is important to have a good grasp of the core tools at your disposal.

What you might call the golden triangle of observability includes metrics, logs and traces. Each plays a specific role in infrastructure and application monitoring, so you need to understand what each brings to the table.

Metrics

Metrics are probably the most valuable of the three monitoring tools because:

So many resources are designed to spit out various bits of health and performance information (and there are loads of tools to collect this information).
They are efficient.
They are frequently generated.
They are easily correlated across elements of your infrastructure.

Everything from operating systems to applications generates metrics which, at the least, include a name, a timestamp and a field to represent some value. Since so many resources come ready to tell us about themselves, metrics are an obvious place to start when it comes to monitoring.
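To make the name, timestamp and value shape concrete, here is a minimal sketch of a single metric data point. The class name, the metric name, the optional tags field and the plain-text line layout are all illustrative assumptions, not the format of any particular tool.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MetricPoint:
    """One metric sample: a name, a timestamp and a value."""
    name: str          # what is measured; "cpu.user.percent" below is a made-up name
    timestamp: float   # seconds since the Unix epoch
    value: float       # the measured value
    tags: dict = field(default_factory=dict)  # optional labels (an assumption, not in the text)


def to_line(point: MetricPoint) -> str:
    """Render a point as a simple 'name value timestamp' text line."""
    return f"{point.name} {point.value} {int(point.timestamp)}"


point = MetricPoint(
    name="cpu.user.percent",
    timestamp=1700000000.0,
    value=42.5,
    tags={"host": "web-1"},
)
print(to_line(point))
```

Real collectors add their own conventions on top of this (units, label syntax, metric types), so treat the sketch as the lowest common denominator rather than a wire format.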

Almost all metrics will enable you to tell if a resource is alive or dead, but if the target is valuable enough you’ll want to be able to ascertain what is actually wrong, or going wrong, with the system. As you can imagine, the latter requires detailed information about what is happening inside the system: so-called white-box monitoring, which relies on internal instrumentation. The more rudimentary black-box approach draws conclusions about the health of a system based on externally visible indicators (is it responding to requests?).

But perhaps the most important thing to understand about metrics is that last bullet about being able to correlate metrics across infrastructure components. Given the complex interdependencies common to IT environments today, the ability to stitch together metrics to get a bigger picture view is a real time saver. And it becomes even more critical as we move to cloud-native environments because of the dynamic nature of cloud infrastructure and the ever-changing relationship between that infrastructure and the applications.
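One simple way to picture this kind of correlation is to line up samples from different components on a shared time axis. The sketch below assumes hypothetical hosts, metric names and values, and uses fixed 60-second buckets as the join key; real tools typically correlate on richer labels as well.

```python
from collections import defaultdict

# Hypothetical samples: (source, metric name, Unix timestamp, value)
samples = [
    ("web-1", "http.latency_ms", 1700000003, 120.0),
    ("db-1", "db.cpu_percent", 1700000007, 35.0),
    ("web-1", "http.latency_ms", 1700000062, 950.0),
    ("db-1", "db.cpu_percent", 1700000065, 97.0),
]


def correlate(samples, bucket_seconds=60):
    """Group samples from different sources into shared time buckets."""
    buckets = defaultdict(dict)
    for source, name, ts, value in samples:
        bucket_start = ts - (ts % bucket_seconds)  # round down to the bucket boundary
        buckets[bucket_start][f"{source}:{name}"] = value
    return dict(buckets)


view = correlate(samples)
for bucket_start in sorted(view):
    print(bucket_start, view[bucket_start])
```

Read side by side, the second bucket shows the web latency spike landing in the same minute as the database CPU spike, which is the bigger-picture view a single metric cannot give you.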

An initial challenge of harnessing metrics is the variety of the information available and the number of tools needed to collect and make sense of that information. Then there is the question of how you store data in so many formats from so many resources. But the resultant upside more than makes up for the effort required to figure out how to harness the information.

It is also worth noting that, given there is no standard API for metric collection, many organizations use agents to collect data that is either pushed to a central location for analysis or pulled by that central resource. Gartner says agents frequently referenced by customers include push agents Collectd and Telegraf, while Prometheus is cited as a tool to pull information from targets.
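The difference between pushing and pulling comes down to who initiates the transfer. The toy below models both in a single process with no networking; every class and method name is hypothetical, and it is not meant to reflect how collectd, Telegraf or Prometheus are implemented.

```python
class Target:
    """A monitored resource that can report its current metrics."""

    def __init__(self, name, cpu_percent):
        self.name = name
        self.cpu_percent = cpu_percent

    def read_metrics(self):
        return {"cpu_percent": self.cpu_percent}


class Collector:
    """Central store that either receives pushed data or pulls it itself."""

    def __init__(self):
        self.store = {}

    def receive(self, source, metrics):
        # Push model: the agent decides when to send.
        self.store[source] = metrics

    def scrape(self, targets):
        # Pull model: the collector decides when to ask.
        for target in targets:
            self.store[target.name] = target.read_metrics()


class PushAgent:
    """Runs alongside a target and sends its metrics to the collector."""

    def __init__(self, target, collector):
        self.target = target
        self.collector = collector

    def flush(self):
        self.collector.receive(self.target.name, self.target.read_metrics())


web = Target("web-1", 42.0)
db = Target("db-1", 73.5)

pushed = Collector()
PushAgent(web, pushed).flush()   # agent-initiated

pulled = Collector()
pulled.scrape([web, db])         # collector-initiated
```

One practical consequence of the design: a pulling collector must know its list of targets up front, while a pushing agent must know where the collector lives.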

Logs

Logs are perhaps the second most important tool in the monitoring toolbox because virtually everything logs information about what it is doing at any given time. What’s more, logs tend to give more in-depth information about resources than metrics. So, if metrics show that a resource is dead, logs will help tell you why it died.

The problem with logs is there can be too much of a good thing. With everything in your environment tracking what it is doing and anxious to share that information, it is easy to see how that could result in a mountain of data. Instead of streamlining the monitoring process, you are simply creating a big new centralized haystack.

As with metrics, differences in log formats and the abundance of tools available to collect and make sense of logs complicate the job of getting the most out of this rich trove of material. There are, however, a number of common tools for collecting logs, including syslog and open source options such as Fluentd.

The trick to getting the most out of these tools is limiting what you collect to keep it manageable, and, where possible, to home in on common fields so you can more easily find the needles in the haystack.
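As a rough illustration of both halves of that advice, the sketch below keeps only higher-severity records and trims each one to a small set of shared fields. The JSON-lines input, the field names and the severity list are assumptions made for the example; they do not come from syslog, Fluentd or any other specific tool.

```python
import json

# Hypothetical log lines, one JSON object per line.
raw_lines = [
    '{"ts": "2024-01-01T00:00:01Z", "level": "DEBUG", "service": "cart", "msg": "cache hit"}',
    '{"ts": "2024-01-01T00:00:02Z", "level": "ERROR", "service": "cart", "msg": "db timeout", "pid": 411}',
    '{"ts": "2024-01-01T00:00:03Z", "level": "INFO", "service": "auth", "msg": "login ok"}',
    "not json at all",
]

KEEP_LEVELS = {"WARNING", "ERROR", "CRITICAL"}   # limit what is collected
COMMON_FIELDS = ("ts", "level", "service", "msg")  # fields shared across sources


def select_records(lines):
    """Keep higher-severity records and reduce them to the common fields."""
    kept = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # this sketch drops unparseable lines; a real pipeline might quarantine them
        if not isinstance(record, dict):
            continue
        if record.get("level") in KEEP_LEVELS:
            kept.append({key: record.get(key) for key in COMMON_FIELDS})
    return kept


kept = select_records(raw_lines)
print(kept)
```

Because every surviving record has the same four keys, searching across services becomes a matter of filtering on one field rather than parsing several formats.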

Traces

Last but not least in the monitoring triangle is application trace data, which follows specific application operations from start to finish. With so many application interdependencies these days, these operations will typically involve hops through multiple services, with the work done at each hop recorded as a so-called span.
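A trace can be pictured as a small tree of spans that share one trace ID. The sketch below uses made-up services and timings, and hypothetical field names rather than the schema of any tracing library. It also assumes each span's children run one after another without overlapping, so a span's own time is its duration minus that of its children.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str              # shared by every span in one operation
    span_id: str
    parent_id: Optional[str]   # None marks the root span
    service: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


# One hypothetical request hopping through three services.
spans = [
    Span("t1", "a", None, "gateway", 0, 180),
    Span("t1", "b", "a", "cart", 10, 170),
    Span("t1", "c", "b", "inventory-db", 20, 160),
]


def self_time_ms(span, all_spans):
    """Time spent in this span itself, excluding its direct children."""
    children = [s for s in all_spans if s.parent_id == span.span_id]
    return span.duration_ms - sum(child.duration_ms for child in children)


self_times = {s.service: self_time_ms(s, spans) for s in spans}
print(self_times)
```

In this example the request takes 180 ms end to end, but almost all of it is spent in the innermost span, which is the kind of answer that is hard to get from per-service metrics alone.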

Traces, then, add critical visibility into the health of an application end-to-end. They are, however, solely focused on the application layer and provide limited visibility into the health of the underlying infrastructure. So, even if you collect traces, you still need metrics to get the full story of your environment. APM tools feed trace information to a centralized metrics store, so traces provide a nice source of data for an app-centric view.

The need for the viewpoint that traces provide is even greater in container-based microservice architectures, which are essentially collections of stitched-together services. These environments can be addressed with something called distributed tracing, which follows a single request across service boundaries.

Do You Need All Three?

There is obviously overlap among metrics, logs and traces. But the quick answer to the question of whether you need all three types of monitoring tools is, “It depends.” The simpler your environment and the more tolerant you are of performance degradation and outages, the fewer tools you’ll need to keep things running. Basic metrics will probably work fine for you.

Conversely, more complex environments that have to be up and running at all times or need to be fixed as quickly as possible will require a mix of tools that answer more than the question of, “Is it broken?” Metrics and logs will be base requirements.

And finally, if your environment consists of a lot of intricately interwoven pieces, then adding traces will save you effort when it comes to birddogging problems. If you’re not there yet, but see containers and microservices playing a bigger role in your future, you’ll probably want to start getting familiar with trace tools today.

Keep in mind that each of these tools requires storage considerations. Yes, there are options to support multiple tools from the same repository, but you’ll want to consider those needs as you build out your monitoring repertoire. And ultimately, you’ll want analysis and alerting tools that can span the environments, so keep that in mind as you consider the options.

Together, metrics, logs and traces make up the golden triangle of observability and will help you stay on top of the ever-churning IT world driving business today.

— Apurva Dave