
The Challenge of Scaling Metrics in Distributed Systems

Whether you are migrating from a monolithic architecture or building a distributed system from the ground up, a microservices architecture offers many benefits: faster software deployment cycles, enhanced scalability and improved isolation of risk. However, distributed systems are complex, and that complexity brings many challenges of its own, particularly when it comes to the scalability of metrics.

The Importance of Observability in Distributed Systems

To realize the benefits of a distributed system, a strategy of data-driven optimization is essential. The paradox is that visibility simplifies complexity, yet achieving that visibility becomes dramatically more complex in a distributed system.

As New Relic points out in their eBook Cloud Native Is the New Normal, traditional monitoring approaches need to be adapted to meet the demands of decentralized architectures, including:

  • Overcoming the loss of end-to-end monitoring capabilities by learning how to connect the dots across a complex network of environments.
  • Developing tactics to quickly drill down into issues across interconnected parts to reduce mean time to resolution (MTTR).
  • Introducing proactive reliability strategies that focus on measurement first, followed by improvement.

Finding the Right Balance Between Monitoring Nothing and Monitoring Everything

Most application engineers use some sort of metrics tooling, such as Grafana, Prometheus or Graphite, to track the vital signs of their systems. While monitoring is essential, there is such a thing as bloat when it comes to metrics ingestion, and that overload translates into scalability challenges that affect performance, cost and reliability.

So, how do you know what to monitor? To derive the most valuable insights from your system, you must first identify its components and, more specifically, which attributes within those components drive key metrics. Knowing this requires an intimate familiarity with how all the pieces of your distributed system work together.

Google’s Site Reliability Engineering book provides a framework for determining what to monitor in distributed systems. It breaks signals down into four categories, known as the four golden signals (a brief instrumentation sketch follows the list):

  • Latency: Understand resource requirements and identify bottlenecks.
  • Traffic: Monitor system response to varying levels of load; most useful when paired with other indicators.
  • Errors: Track frequency and nature of interruptions in request/response flow.
  • Saturation: Surface capacity-related issues.
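
To make these categories concrete, here is a minimal sketch of instrumenting the four golden signals with the Prometheus Go client (prometheus/client_golang). The service, handler path and metric names are illustrative assumptions, not part of Google’s framework or of any particular system described here:

package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical collectors, one per golden signal, for a single HTTP service.
var (
	// Latency: how long requests take to serve.
	requestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "Request latency in seconds.",
		Buckets: prometheus.DefBuckets,
	}, []string{"path"})

	// Traffic: how much demand the service is handling.
	requestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Total HTTP requests served.",
	}, []string{"path"})

	// Errors: how often requests fail.
	requestErrors = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "http_request_errors_total",
		Help: "Total HTTP requests that returned a 5xx status.",
	}, []string{"path"})

	// Saturation: in-flight requests as a rough proxy for capacity.
	inFlight = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "http_requests_in_flight",
		Help: "Number of requests currently being handled.",
	})
)

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// instrument wraps a handler and records all four signals for each request.
func instrument(path string, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		inFlight.Inc()
		defer inFlight.Dec()

		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		start := time.Now()
		next(rec, r)

		requestDuration.WithLabelValues(path).Observe(time.Since(start).Seconds())
		requestsTotal.WithLabelValues(path).Inc()
		if rec.status >= 500 {
			requestErrors.WithLabelValues(path).Inc()
		}
	}
}

func main() {
	http.HandleFunc("/work", instrument("/work", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	}))
	http.Handle("/metrics", promhttp.Handler()) // endpoint scraped by Prometheus
	http.ListenAndServe(":8080", nil)
}

Scraping this endpoint with Prometheus (and, at larger scale, remote-writing it into a backend such as M3, discussed below) gives you the four signals per path without ingesting every attribute the service could possibly emit.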

Even when you’ve identified the most important metrics to track, scalability is still a challenge that must be addressed, particularly when dealing with applications that handle massive volumes of data across a complex application architecture.

M3: Uber’s Open Source Large-Scale Metrics Platform

A solution directly addressing the challenges of distributed metrics at scale is M3, an open source project developed by Uber’s engineering team.

Before using M3, the Uber team sent metrics to a Graphite stack, with storage handled using the Whisper file format in a shared Carbon cluster. From there, they created Grafana dashboards to visualize the data and used Nagios for alerting, based on Graphite threshold checks run via source-controlled scripts. As the volume of metrics continued to expand, the manual processes required to scale the solution became unsustainable, and repeated failures resulting in permanent data loss led the team to build a more fault-tolerant, scalable system.

M3 is focused on providing reliable storage for large-scale metrics over long retention periods, addressing the limitations inherent in Prometheus’s single-node model by offering a multi-tenant store for Prometheus metrics. A key component of this solution is M3DB, the scalable storage backend for M3.

M3DB, built in Go, is a distributed time series store and reverse index with configurable out-of-order writes. By enabling the storage of real-time metrics over long periods of retention, M3DB can replace a Prometheus cluster. In developing the M3DB solution, the Uber team essentially deconstructed Prometheus into microservices to enable the scaling of each component individually. By taking this approach, M3DB is able to offer better reliability, efficiency and performance.

At a minimum, M3DB can be leveraged as remote storage for Prometheus (or used to replace any other remote storage solution). There are, however, other features the solution offers, including an external aggregation engine and a separate query engine.
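
As a rough illustration of that remote-storage pattern, a Prometheus server can be pointed at an M3 coordinator with a remote read/write configuration along the following lines. This is a sketch assuming a local m3coordinator listening on its default port (7201); adjust the URLs to match your deployment:

remote_write:
  - url: "http://localhost:7201/api/v1/prom/remote/write"

remote_read:
  - url: "http://localhost:7201/api/v1/prom/remote/read"
    read_recent: true

With this in place, Prometheus continues to scrape and serve queries as usual, while M3DB handles durable, replicated long-term storage behind the coordinator.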

The M3DB project is still under development with a growing community of contributors.

To learn more about containerized infrastructure and cloud native technologies, consider coming to KubeCon + CloudNativeCon NA, November 18-21 in San Diego.

Molly Wojcik

Molly Wojcik is director of education and awareness at Section, a DevOps-centric Edge Compute Platform that empowers web application engineers to run any workload, anywhere by simplifying distributed computing at scale. With over 12 years of experience in marketing leadership roles, Molly is particularly passionate about early-stage category education and awareness to drive technology innovation and adoption. In addition to her role at Section, Molly helps lead the Landscape Working Group within the LF Edge Glossary project, an open source project that aims to mature the lexicon within the edge compute landscape and establish a definitive ecosystem map.
