Achieving full observability across enterprise applications is challenging. Third-party services such as New Relic or Datadog make end-to-end observability easier, but as your application grows in complexity, so does the telemetry data, and with it the cost. Migrating to an open-source stack is a better solution because it gives you control over your telemetry data and reduces observability costs, even though existing commitments to a service provider can complicate the move.
Migrating to open-source solutions is straightforward when the architecture and tech stack are simple. However, with many microservices, diverse languages and frameworks, and heavy reliance on the cloud, transitioning the entire observability stack to open source demands meticulous planning, tool and framework assessments, testing, risk analysis and cross-team communication.
For the past two years, I have been working on migrating an observability platform from a proprietary solution provider to an open-source stack. A cloud observability platform usually absorbs 20–30% of overall infrastructure spending, but in some cases it can reach 50–60%. If your observability expenses exceed 50% of infrastructure spending, transitioning to open-source solutions is a sensible way to mitigate costs. Let’s dive into the key steps necessary for such a migration:
1. Finalize the Key Telemetry Data and Systems
Managed observability platforms offer comprehensive insights into system health, necessitating a significant volume of high-quality telemetry data. By default, the agents tasked with capturing and transmitting telemetry data are configured to gather as much information as possible, facilitating the creation of extensive dashboards and reports. However, this extensive data collection contributes to the overall cost.
In my experience, the quantity of telemetry data you actually need is significantly lower than what third-party services typically collect. You can divide the data into two buckets: must-have and nice-to-have. The ‘must-have’ data aligns with the well-established ‘four golden signals’, and they are:
• Latency – The amount of time taken to serve a request
• Traffic – The volume of requests that a system is currently handling
• Error Rate – The number of requests that fail or return unexpected responses
• Resource Saturation – The percentage of available resources being consumed.
Observability platforms built on these signals can cover 80% of typical use cases. In most cases, these fundamental signals serve as the foundation for key dashboards and alerts.
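To make the four golden signals concrete, the sketch below pairs each one with an illustrative PromQL expression. The metric names (http_request_duration_seconds_bucket, http_requests_total, node_cpu_seconds_total) are assumptions based on common Prometheus client-library and node_exporter conventions; your own instrumentation will likely expose different names.

```typescript
// Illustrative PromQL expressions for the four golden signals.
// Metric names follow common Prometheus client-library and node_exporter
// conventions and are assumptions; substitute the names your services expose.
export const goldenSignalQueries: Record<string, string> = {
  // Latency: 95th-percentile request duration over the last 5 minutes
  latency:
    'histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))',
  // Traffic: requests per second across the service
  traffic: 'sum(rate(http_requests_total[5m]))',
  // Error rate: share of requests returning 5xx responses
  errorRate:
    'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))',
  // Saturation: fraction of CPU time that is not idle
  saturation: '1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))',
};
```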
Understanding the metadata requirements for each signal is also crucial. For example, when monitoring service-to-service latency, you must decide whether you need service IP addresses, instance IDs or header information attached to each measurement. The amount of metadata captured directly impacts the complexity and cost of the observability platform.
Furthermore, selecting which key systems to monitor is a critical decision. A typical enterprise application includes systems such as databases, caching databases, container orchestrators and numerous cloud services. Engineers often monitor more services than necessary, leading to unnecessary complexities and costs. Streamlining the monitoring scope to essential systems is often sufficient.
2. Select the Relevant Stack
Managed observability platforms such as New Relic and Datadog provide comprehensive monitoring for various enterprise components. However, when migrating to an open-source stack, you must assess and integrate various tool stacks to meet diverse monitoring requirements. One of the most important aspects you must consider is scaling — how do you plan to manage the huge amount of data generated every minute across various systems?
To address scaling challenges, you must focus on two fronts — selecting the stack for storing and processing telemetry data (logs, metrics and traces) and devising methods for capturing and pushing telemetry data from diverse systems.
Telemetry Signal Stack Selections
Stack for Logs
You require a stack that can efficiently and economically process and store the substantial volume of logs generated by the system. In the past, I utilized the ELK stack for storing and searching logs, but it is a generic solution and not tailored specifically for logs.
I recommend Grafana’s Loki for its effective management of large log volumes and LogQL, a language akin to PromQL. If you are familiar with PromQL, navigating through extensive log data with LogQL will be easy and intuitive.
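As an illustration, here is a minimal sketch of querying Loki’s HTTP API with a LogQL expression from Node.js (18+ for the global fetch). The Loki URL and the labels in the query (app, level) are assumptions for illustration; adjust them to your own deployment.

```typescript
// Minimal sketch: query Loki's query_range endpoint with a LogQL expression.
// Assumes Node.js 18+ (global fetch); LOKI_URL and the labels are illustrative.
const LOKI_URL = process.env.LOKI_URL ?? "http://localhost:3100";

async function queryRecentErrors(): Promise<void> {
  // LogQL reads much like PromQL: select a log stream by labels, then filter it.
  const logql = '{app="checkout", level="error"} |= "timeout"';
  const params = new URLSearchParams({
    query: logql,
    start: String((Date.now() - 60 * 60 * 1000) * 1e6), // nanoseconds, last hour
    end: String(Date.now() * 1e6),
    limit: "100",
  });

  const res = await fetch(`${LOKI_URL}/loki/api/v1/query_range?${params}`);
  if (!res.ok) throw new Error(`Loki query failed with status ${res.status}`);

  const body = await res.json();
  for (const stream of body.data.result) {
    for (const [ts, line] of stream.values) {
      console.log(new Date(Number(ts) / 1e6).toISOString(), line);
    }
  }
}

queryRecentErrors().catch(console.error);
```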
Stack for Metrics
Prometheus is a popular, battle-tested database for storing time-series metrics. However, it has a scaling limitation: a single Prometheus server cannot scale horizontally out of the box. Alternatives such as Thanos, Grafana’s Mimir and VictoriaMetrics offer better out-of-the-box support for horizontal scaling.
In a similar migration project, I researched the options and ultimately chose Grafana’s Mimir to store metrics. This decision was based on Mimir’s remote-write ingestion and long-term storage capabilities, as well as its scalable and highly available architecture.
Stack for Tracing
In a microservices architecture, distributed tracing is a must for easily identifying bottlenecks in latency and resolving performance-related bugs. Grafana’s Tempo could be a good option for several reasons.
• It integrates seamlessly as a tracing backend for Grafana dashboards. Consolidating dashboards in Grafana avoids jumping between multiple applications; switching to another platform only for tracing while using Grafana for everything else would not be practical.
• It provides TraceQL, a query language similar in spirit to PromQL and LogQL, which streamlines interaction with the trace database and significantly reduces the learning curve.
• It is compatible with popular open-source tracing protocols, including Zipkin and Jaeger. If a team is already emitting traces in one of these protocols, transitioning to Grafana’s Tempo is smoother because the existing instrumentation can keep working unchanged.
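To give a feel for TraceQL, the following is a minimal sketch that runs a search against Tempo’s HTTP API. The Tempo URL, service name and duration filter are assumptions, and the exact response fields depend on the Tempo version you run.

```typescript
// Minimal sketch: run a TraceQL search against Tempo's HTTP API.
// Assumes Node.js 18+ (global fetch); TEMPO_URL, service name and threshold are illustrative.
const TEMPO_URL = process.env.TEMPO_URL ?? "http://localhost:3200";

async function findSlowTraces(): Promise<void> {
  // TraceQL feels familiar if you already write PromQL or LogQL:
  // select spans by attributes, then filter on duration.
  const traceql = '{ resource.service.name = "checkout" && duration > 500ms }';
  const params = new URLSearchParams({ q: traceql, limit: "20" });

  const res = await fetch(`${TEMPO_URL}/api/search?${params}`);
  if (!res.ok) throw new Error(`Tempo search failed with status ${res.status}`);

  // The response lists matching traces (IDs, root span names, durations);
  // the exact fields depend on the Tempo version.
  console.log(JSON.stringify(await res.json(), null, 2));
}

findSlowTraces().catch(console.error);
```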
Capturing and Pushing Telemetry Data
To capture and push telemetry data, you need to install an agent on each system, and the installation steps vary depending on the type of system being monitored. A typical enterprise setup involves monitoring cloud integrations, processes, infrastructure hosts, Kubernetes clusters, applications, and browser and mobile clients.
Cloud Integrations: This includes monitoring for cloud services such as SQS, SNS, EMR and EC2.
Process Monitoring: This involves monitoring processes running on bare-metal machines. With the advent of Docker and Kubernetes, there is now a standardized way of starting an application, but in the past there was no fixed mechanism. For example, a Java application might be launched with the ‘java -jar’ command, deployed to Tomcat or managed by the OS via systemctl; a Node.js application might be started with npm or PM2. Each team or service may have had its own way of starting the process.
Infrastructure Host: This entails monitoring the machine itself for metrics such as CPU usage, memory, disk IO and Network IO or checking if the machine is offline.
Kubernetes Monitoring: This involves monitoring the Kubernetes cluster, such as instances where Kubernetes is unable to schedule a pod due to insufficient resources, etc.
Application Monitoring: This monitoring focuses on overseeing the services the team has created using various programming languages, designs and architectures. While each service may differ in its development approach and choice of tech stack, from an observability perspective, they need to be treated the same, except for some custom metrics.
Browser and Mobile Monitoring: This involves tracking and analyzing various metrics to ensure optimal performance and user experience across different platforms. Browser monitoring covers page load times, rendering performance, JavaScript errors and resource usage. Mobile monitoring covers app crashes, latency, battery usage, network requests and device-specific metrics to ensure smooth operation and user satisfaction on mobile devices.
Before the advent of OpenTelemetry, there was no standardized method for monitoring applications. OpenTelemetry emerged from the merger of two earlier projects: OpenTracing and OpenCensus. It stands as a vendor- and tool-agnostic framework designed to instrument applications or systems irrespective of their language, infrastructure or runtime environment. OpenTelemetry represents a significant community effort, with its popularity and stability steadily growing.
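For a Node.js service, for example, OpenTelemetry instrumentation can be as small as the sketch below. The package names come from the OpenTelemetry JavaScript SDK, but the service name and collector endpoint are placeholders, and configuration options can differ slightly between SDK versions.

```typescript
// tracing.ts: minimal OpenTelemetry bootstrap for a Node.js service.
// Load this module before the rest of the application starts.
// The service name and collector endpoint are placeholders.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const sdk = new NodeSDK({
  serviceName: "checkout-service",
  traceExporter: new OTLPTraceExporter({
    // OTLP/HTTP endpoint of an OpenTelemetry Collector (or Tempo's OTLP receiver)
    url: "http://localhost:4318/v1/traces",
  }),
  // Auto-instrument common libraries (http, express, pg, ...) without code changes
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush pending telemetry and stop the SDK cleanly on shutdown
process.on("SIGTERM", () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```

An OpenTelemetry Collector can then fan this data out to Tempo, Mimir and Loki, which keeps the per-service setup largely identical regardless of language.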
3. Validate Observability Stack on Application Architecture
Applications vary in architecture and development stage. To address this diversity, you need multiple proofs-of-concept (POCs) across different systems, such as infrastructure, cloud services, and backend and frontend applications.
Migrating the observability platform from one tech stack to another is simple when the application architecture is a monolith. However, when the underlying architecture is microservices, it becomes difficult.
While the microservices architecture offers flexibility in choosing different tech stacks for building services, devising a standard solution for capturing telemetry data becomes challenging due to the diverse range of tech stacks involved.
Conducting POCs for various tech combinations is necessary to build a self-service tool that individual teams can follow to migrate their services easily. For instance, if your service uses Java 11, Spring Boot 3.x.x and PostgreSQL, these POCs can provide you with standardized steps to enable application monitoring.
4. Migrate Core Components
Once the stack selection and POCs are complete, the next step is setting up the observability platform. This involves configuring the stacks for metrics, logs and traces, and installing the necessary agents on the various systems. One advantage of an observability migration is that there is no need to transfer telemetry data from the old platform to the new one: during the testing phase, you will have an overlapping period in which enough useful data accumulates to build the required dashboards and alerts.
Even though the data itself does not need to move, alerts and dashboards must still be migrated from the old system to the new one. Changes in how metrics are captured usually change the metric names, which in turn changes every query expression used in alerts and dashboards.
Some metrics may not be available out of the box in the new stack, particularly derived metrics, although these can usually be recreated from the underlying metrics.
While migrating query expressions manually is an option, the manual process is error-prone and time-consuming when a large number of alerts is involved. I experienced a similar situation, with somewhere between 100 and 400 alerts to migrate. To streamline the migration, we developed a Node.js script that programmatically converted New Relic’s query expressions into Prometheus expressions (a simplified sketch follows the list below). The script performed the following high-level steps:
• Connected to the New Relic API server and fetched all the alerts configured for cloud integration
• Converted New Relic alert expressions into PromQL alert expressions
• Wrote all the Prometheus alert expressions to a YAML rules file.
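Below is a trimmed-down sketch of that approach, not the exact script we ran. The simplified condition shape, environment variables and metric-mapping table are illustrative assumptions, and the New Relic endpoint, parameters and header should be verified against the current REST API documentation.

```typescript
// Sketch of the alert-migration script (hypothetical shapes and mappings).
// Assumes Node.js 18+ (global fetch) and the "yaml" npm package.
import { writeFileSync } from "node:fs";
import { stringify } from "yaml";

// Simplified view of the fields we care about in a New Relic alert condition.
interface NewRelicCondition {
  name: string;
  metric: string;               // e.g. a cloud-integration metric identifier
  operator: "above" | "below";  // simplified; real conditions have more operators
  threshold: number;
  durationMinutes: number;
}

interface PrometheusRule {
  alert: string;
  expr: string;
  for: string;
  labels: Record<string, string>;
  annotations: Record<string, string>;
}

// Hypothetical mapping from New Relic cloud-integration metrics to Prometheus
// metric names; in the real migration this table was built per integration.
const METRIC_TO_PROMQL: Record<string, string> = {
  "provider.approximateNumberOfMessagesVisible": "aws_sqs_approximate_number_of_messages_visible",
  "provider.cpuUtilization": "aws_ec2_cpuutilization_average",
};

async function fetchConditions(policyId: string): Promise<NewRelicCondition[]> {
  // Endpoint, parameter and header follow New Relic's REST v2 API;
  // verify them against the current documentation before relying on this.
  const res = await fetch(
    `https://api.newrelic.com/v2/alerts_conditions.json?policy_id=${policyId}`,
    { headers: { "X-Api-Key": process.env.NEW_RELIC_API_KEY ?? "" } }
  );
  if (!res.ok) throw new Error(`New Relic API error: ${res.status}`);
  const body = await res.json();
  return (body.conditions ?? []).map((c: any) => ({
    name: c.name,
    metric: c.metric,
    operator: c.terms?.[0]?.operator ?? "above",
    threshold: Number(c.terms?.[0]?.threshold ?? 0),
    durationMinutes: Number(c.terms?.[0]?.duration ?? 5),
  }));
}

function toPrometheusRule(c: NewRelicCondition): PrometheusRule | null {
  const metric = METRIC_TO_PROMQL[c.metric];
  if (!metric) return null; // unmapped metrics were migrated manually
  const cmp = c.operator === "above" ? ">" : "<";
  return {
    alert: c.name.replace(/\W+/g, ""),
    expr: `${metric} ${cmp} ${c.threshold}`,
    for: `${c.durationMinutes}m`,
    labels: { severity: "warning" },
    annotations: { summary: `Migrated from New Relic condition "${c.name}"` },
  };
}

async function main(): Promise<void> {
  const conditions = await fetchConditions(process.env.NR_POLICY_ID ?? "");
  const rules: PrometheusRule[] = [];
  for (const c of conditions) {
    const rule = toPrometheusRule(c);
    if (rule) rules.push(rule);
  }
  const ruleFile = { groups: [{ name: "migrated-from-new-relic", rules }] };
  writeFileSync("migrated-alerts.yaml", stringify(ruleFile));
  console.log(`Wrote ${rules.length} of ${conditions.length} alert rules`);
}

main().catch(console.error);
```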
Implementing this technique allowed us to successfully migrate all alerts and dashboards in just four to five days, a process that would normally demand a month’s manual labor. Additionally, scripting significantly minimized the risk of human error.
5. Test the Migration
Verifying each alert and dashboard is essential to ensure successful migration. During testing, both observability platforms — old and new — are run in parallel. This parallel operation allows for thorough testing of the metrics. However, running both systems concurrently may lead to a performance hit due to services sending telemetry data to two locations. Additionally, there may be a slight increase in costs until the complete transition.
Running both systems in parallel provides an ideal setup to ensure the migrated dashboards and alerts are functioning correctly.
Testing alerts that happen to trigger during this period is straightforward, but evaluating those that have not yet triggered poses a challenge. In one of my projects, I temporarily reduced the threshold of each alert so that it would fire and verified its behaviour. Approximately 90% of the alerts generated through the custom scripts worked seamlessly, with only 10% requiring some manual tweaking.
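A naive way to automate that threshold-lowering pass is sketched below. It assumes the rules live in a Prometheus-style YAML file (such as the hypothetical migrated-alerts.yaml from the earlier sketch) and that thresholds appear as a trailing comparison in each expression; anything more complex needs manual review.

```typescript
// Naive sketch: temporarily scale down numeric thresholds in a Prometheus rule
// file so that every alert fires during the test window. A regex rewrite like
// this only handles simple "expr > <number>" expressions; review anything else by hand.
import { readFileSync, writeFileSync } from "node:fs";
import { parse, stringify } from "yaml";

const FACTOR = 0.1; // lower thresholds to 10% of their original value

const ruleFile = parse(readFileSync("migrated-alerts.yaml", "utf8"));
for (const group of ruleFile.groups ?? []) {
  for (const rule of group.rules ?? []) {
    rule.expr = String(rule.expr).replace(
      /([><]=?)\s*(\d+(?:\.\d+)?)\s*$/,
      (_match: string, op: string, num: string) => `${op} ${Number(num) * FACTOR}`
    );
  }
}
writeFileSync("migrated-alerts.test.yaml", stringify(ruleFile));
```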
6. Migrate Other Related Components
There are other systems that depend on the data generated by the observability platform. For example:
• Notification systems, such as email and Slack channels, that send alerts when an incident occurs
• Incident management tools, such as PagerDuty, that provide a streamlined way to handle incidents.
These systems rely on the alert payload to work properly, and a change of observability platform changes that payload. It is therefore crucial to update the notification templates so that they integrate seamlessly with the new alert format.
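For example, if Slack messages are delivered through a small webhook bridge, that bridge is exactly the kind of template that must be rewritten for the new payload. The sketch below assumes Prometheus Alertmanager’s webhook format and a Slack incoming webhook URL; Grafana Alerting’s payload is different and would need its own mapping.

```typescript
// Sketch: a small webhook bridge that reformats Prometheus Alertmanager's
// payload into a Slack message. The Slack webhook URL and port are placeholders.
import { createServer } from "node:http";

const SLACK_WEBHOOK_URL = process.env.SLACK_WEBHOOK_URL ?? "";

createServer((req, res) => {
  let raw = "";
  req.on("data", (chunk) => (raw += chunk));
  req.on("end", async () => {
    const payload = JSON.parse(raw); // Alertmanager webhook body
    const lines = payload.alerts.map(
      (a: any) => `[${a.status}] ${a.labels.alertname}: ${a.annotations?.summary ?? ""}`
    );
    await fetch(SLACK_WEBHOOK_URL, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ text: lines.join("\n") }),
    });
    res.writeHead(200).end("ok");
  });
}).listen(8080);
```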
Similarly, incident management tools such as PagerDuty require modifications to routing rules, escalation policies and schedules. Fortunately, ready-made open-source migration tools are available to ease the move from PagerDuty to Grafana OnCall. For other tools, you may not have such out-of-the-box migration support, in which case manual migration or script writing may be required.
Timeframe for Migration
Planning a migration, which involves finalizing key telemetry data and systems, selecting relevant stacks and validating the observability stack on the application architecture, typically takes around three weeks.
The duration of completing the migration depends on factors such as the number of teams involved and the quantity of active services. For instance, a migration to open-source with 100 microservices and involvement from 10 different teams may take approximately four months to complete.
Conclusion
Transitioning your observability platform to an open-source stack offers a promising avenue for cost reduction and increased control over telemetry data. However, accomplishing this migration demands meticulous planning and execution, encompassing essential steps including feature prioritization, stack selection, POCs, core component migration, testing and migration of related systems. Despite the challenges posed by diverse architectures and technologies, a systematic approach, comprehensive assessment and collaboration among teams can facilitate a smooth migration process and guarantee its success.