Achieving full observability across enterprise applications is challenging. While third-party services like New Relic or Datadog make end-to-end observability easier, as your application grows in complexity, so does the telemetry data, leading to increased costs. Migrating to an open source stack gives you control over telemetry data and reduces observability costs, despite challenges with existing service provider commitments.
Migrating to open source solutions is straightforward when the architecture and tech stack are simple. Transitioning an entire observability stack, however, demands meticulous planning, tool and framework assessments, testing, risk analysis and cross-team communication, especially given the number of microservices, the heavy reliance on cloud services and the diverse languages and frameworks involved.
A cloud observability platform usually absorbs 20%-30% of overall infrastructure spending, but in some cases it can reach 50%-60%. If observability consumes more than half of your infrastructure budget, transitioning to open source solutions is a sensible way to bring those costs down.
For the past two years, I have been working on migrating the observability platform from a proprietary solution provider to an open source stack. Let’s dive into the key steps necessary for such a migration.
Finalize the key telemetry data and systems
Managed observability platforms offer comprehensive insights into system health, necessitating a significant volume of high-quality telemetry data. By default, the agents tasked with capturing and transmitting telemetry data are configured to gather as much information as possible, facilitating the creation of extensive dashboards and reports. However, this extensive data collection contributes to the overall cost.
Select which systems you actually need to monitor. A typical enterprise application includes databases, caching layers, container orchestrators and numerous cloud services. Engineers often monitor more services than necessary, which leads to unnecessary complexity and cost. Streamlining the monitoring scope to essential systems is usually enough.
In my experience, the quantity of telemetry data you actually need is significantly lower than what third-party services collect by default. You can divide the collected data into two buckets: must-have and nice-to-have. The “must-have” data aligns with the well-established four golden signals:
- Latency: the amount of time it takes to serve a request
- Traffic: the volume of requests that a system is currently handling
- Error rate: the rate of requests that fail or return unexpected responses
- Resource saturation: the percentage of available resources being consumed
These fundamental signals serve as the foundation for key dashboards and alerts. Observability platforms built on these signals cover 80% of typical use cases.
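As a rough illustration, here is how these signals might look as PromQL expressions, assuming a Prometheus-instrumented HTTP service and node_exporter. The metric names (http_requests_total, http_request_duration_seconds, node_memory_*) follow common conventions; yours may differ.

```promql
# Latency: 95th-percentile request duration over the last 5 minutes
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Traffic: requests served per second
sum(rate(http_requests_total[5m]))

# Error rate: share of requests returning 5xx
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Saturation: fraction of memory in use on a host
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```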
Understanding each signal’s metadata requirements is crucial. The extent of metadata captured directly impacts the observability platform’s complexity and cost. For example, when monitoring service-to-service latency, consider the necessity for service IP addresses, instance IDs, or header information.
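One way to keep that metadata in check, assuming you scrape with Prometheus, is to drop labels you have decided you do not need at collection time. The job, target and label names below are placeholders for whatever high-cardinality metadata your agents attach:

```yaml
# prometheus.yml (excerpt) - illustrative sketch, not a complete config
scrape_configs:
  - job_name: "checkout-service"            # hypothetical service
    static_configs:
      - targets: ["checkout:8080"]
    metric_relabel_configs:
      # Drop high-cardinality labels that the dashboards and alerts never use
      - regex: "instance_id|client_ip"
        action: labeldrop
```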
Select the relevant stack
Managed observability platforms such as New Relic and Datadog provide comprehensive monitoring for various enterprise components. However, when you migrate to an open source stack, you must assess and integrate the tool stacks to meet diverse monitoring requirements.
One important aspect to consider is scaling: how to manage the huge amount of data generated every minute across all those systems. Focus on two fronts: selecting the stack for storing and processing telemetry data (logs, metrics and traces) and devising methods for capturing and pushing telemetry data from diverse systems.
Stack for logs
You require a stack that can efficiently and economically process and store the substantial volume of logs generated by the system. In the past, I used the ELK stack for storing and searching logs, but it is a generic solution and not tailored specifically for logs.
I recommend Grafana’s Loki for its effective management of large log volumes and LogQL, a language akin to PromQL. If you are familiar with PromQL, navigating through extensive log data with LogQL is easy and intuitive.
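For example, a LogQL query that tracks the rate of error lines per service reads almost like PromQL. The env and app labels here are assumptions about your labeling scheme:

```logql
# Error log lines per second over the last 5 minutes, grouped by service
sum by (app) (rate({env="production"} |= "error" [5m]))
```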
Stack for metrics
Prometheus is a popular database for storing time-series metrics data. However, it has scaling-related limitations; it cannot scale horizontally. Alternatives like Thanos, Grafana’s Mimir, and VictoriaMetrics offer better out-of-the-box solutions for horizontal scaling.
I conducted some research and ultimately chose Grafana’s Mimir to store metrics in a similar migration project. This decision was based on Mimir’s remote storage capability, as well as its scalable and highly available architecture.
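In practice, pointing Prometheus at Mimir is mostly a remote_write configuration. The endpoint and tenant ID below are assumptions for a typical in-cluster deployment:

```yaml
# prometheus.yml (excerpt) - ship scraped samples to Mimir via remote_write
remote_write:
  - url: "http://mimir-gateway.monitoring.svc:8080/api/v1/push"   # assumed Mimir endpoint
    headers:
      X-Scope-OrgID: "platform"   # tenant ID, required when Mimir multi-tenancy is enabled
```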
Stack for traces
Distributed tracing is a must in a microservices architecture for identifying latency bottlenecks and resolving performance-related bugs. Grafana’s Tempo could be a good option for several reasons:
- It can seamlessly integrate as a backend for Grafana’s dashboard. Consolidating dashboards in Grafana helps avoid navigating between multiple applications. Switching platforms just for tracing while using Grafana for other aspects of observability would not be practical.
- Tempo’s query language, TraceQL, offers an experience similar to PromQL and LogQL, which flattens the learning curve (see the query sketch after this list).
- Tempo is compatible with popular open source tracing protocols such as Zipkin and Jaeger. If any team already uses these protocols, transitioning to Grafana’s Tempo would be smoother.
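As a sketch, a TraceQL query for slow requests to one service looks like this; the service name and latency threshold are placeholders:

```traceql
# Traces where a span from the checkout service took longer than 500ms
{ resource.service.name = "checkout" && duration > 500ms }
```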
You have to install an agent on each system to capture and push telemetry data, and the steps vary based on the system being monitored. Typical enterprise setups involve monitoring cloud integrations, processes, infrastructure hosts, Kubernetes clusters and applications. Among the possibilities:
- Cloud integrations: This includes monitoring for cloud services such as SQS, SNS, EMR and EC2.
- Process monitoring: This involves monitoring processes running on bare-metal machines. With the advent of dockerization and Kubernetes, there is now a standardized way of starting an application; in the past, there was no fixed mechanism. For example, to run a Java application, you could use the ‘java -jar’ command, an application server such as Tomcat, or the OS’s systemctl. In the case of Node.js, it could be npm or PM2. Each team or service may have had its own way of starting the process.
- Infrastructure host: This entails monitoring the machine itself for metrics such as CPU usage, memory, disk I/O and network I/O, or checking whether the machine is offline.
- Kubernetes monitoring: This covers the health of the Kubernetes cluster itself, such as cases where Kubernetes is unable to schedule a pod due to insufficient resources.
- Application monitoring: This focuses on the services your teams have built. Each service may differ in its development approach and choice of tech stack, but from an observability perspective they are usually treated the same.
- Browser and mobile monitoring: These metrics ensure optimal performance and user experience across different platforms. Browser monitoring includes tracking page load times, rendering performance, JavaScript errors and resource usage. For mobile, it includes monitoring app crashes, latency, battery usage, network requests and device-specific metrics.
Before the advent of OpenTelemetry, there was no standardized method for monitoring applications. OpenTelemetry emerged from the merger of two earlier projects, OpenTracing and OpenCensus. It stands as a vendor- and tool-agnostic framework to instrument applications or systems irrespective of language, infrastructure or runtime environment. OpenTelemetry represents a significant community effort, and its popularity and stability are steadily growing.
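As a minimal sketch, this is roughly what OpenTelemetry auto-instrumentation looks like for a Node.js service exporting traces over OTLP. The collector endpoint and service name are assumptions, and other languages follow the same pattern with their own SDKs or agents.

```javascript
// tracing.js - load before the app starts, e.g. `node -r ./tracing.js server.js`
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  serviceName: 'checkout-service',                    // hypothetical service name
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',      // assumed collector endpoint
  }),
  instrumentations: [getNodeAutoInstrumentations()],  // auto-instrument HTTP, Express, pg, etc.
});

sdk.start();
```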
Validate observability stack on application architecture
Applications vary in architecture and development stage. To address this diversity, you need multiple proofs-of-concept (POCs) across different systems, such as infrastructure, cloud services, and backend and frontend applications.
Migrating an observability platform from one tech stack to another is simple when the application architecture is a monolith. However, when the underlying architecture is microservices, it becomes difficult.
While the microservices architecture offers flexibility in choosing different tech stacks for building services, devising a standard solution for capturing telemetry data becomes challenging due to the diverse range of tech stacks involved.
Conducting POCs for various tech combinations is necessary to build a self-service tool that individual teams can follow to migrate their services easily. For instance, if your service uses Java 11, Spring Boot 3.x.x and PostgreSQL, these POCs can provide standardized steps to enable application monitoring.
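For a combination like that, the POC output can be as simple as a documented startup recipe. The sketch below assumes the OpenTelemetry Java agent and an in-cluster collector; the service name and paths are placeholders.

```bash
# Startup recipe from a POC for a Java 11 / Spring Boot service (illustrative)
# The OpenTelemetry Java agent instruments the JVM without code changes.
export OTEL_SERVICE_NAME=orders-service                      # hypothetical service name
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
java -javaagent:/opt/otel/opentelemetry-javaagent.jar -jar orders-service.jar
```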
Migrate core components
Setting up the observability platform involves configuring the stack for metrics, logs and traces, as well as installing the necessary agents. One advantage of migrating the observability platform is that there’s no need to transfer old telemetry data. During the testing phase, you will have an overlapping period to accumulate enough useful data to create specific dashboards and alerts.
Migrating alerts and dashboards to the new system is essential even when data migration is not. Changes in the underlying techniques for capturing metrics may change the metric names, and with them every query expression used in alerts and dashboards. Some metrics may not exist at all in the new system, especially derived metrics, which instead have to be recreated from the underlying raw metrics.
While migrating query expressions manually is an option, a manual process is error-prone and time-consuming for a large number of alerts. In a similar situation, I have dealt with anywhere from 100 to 400 alerts.
To streamline the migration, we developed a Node.js script that programmatically converted New Relic’s query expressions to Prometheus expressions. The script performed the following high-level steps:
- Connected to the New Relic API server and fetched all the alerts configured for cloud integration
- Converted New Relic alert expressions into PromQL alert expressions
- Wrote all the Prometheus alert expressions into the YAML file
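A stripped-down sketch of that script is shown below. The condition fields, the metric-name mapping and the New Relic API call are simplified assumptions, not the full implementation.

```javascript
// convert-alerts.js - simplified sketch of the New Relic-to-Prometheus conversion
const fs = require('fs');
const yaml = require('js-yaml');   // assumed dependency for writing the rule file

// Placeholder: the real script called the New Relic alerts API here and
// returned every condition configured for our cloud integrations.
async function fetchNewRelicConditions() {
  return [
    { name: 'High SQS queue depth', metric: 'aws_sqs_approximate_number_of_messages_visible', threshold: 1000, duration: '5m' },
  ];
}

// Map one New Relic condition onto a Prometheus alerting rule.
// The metric-name mapping was the hard part and was table-driven in practice.
function toPrometheusRule(condition) {
  return {
    alert: condition.name.replace(/\s+/g, ''),
    expr: `${condition.metric} > ${condition.threshold}`,
    for: condition.duration,
    labels: { severity: 'warning' },
    annotations: { summary: condition.name },
  };
}

async function main() {
  const conditions = await fetchNewRelicConditions();
  const ruleFile = {
    groups: [{ name: 'migrated-from-new-relic', rules: conditions.map(toPrometheusRule) }],
  };
  fs.writeFileSync('migrated-alerts.yaml', yaml.dump(ruleFile));
}

main();
```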
Implementing this technique allowed us to successfully migrate all alerts and dashboards in four or five days, a process that would normally demand a month’s manual labor. Additionally, scripting significantly minimized the risk of human error.
Test the migration
Verify each alert and dashboard. Run both the old and new platforms during testing; operating them in parallel lets you compare metrics thoroughly and confirm that the migrated dashboards and alerts behave as expected.
However, running both systems concurrently may cause a performance hit, since services send telemetry data to two destinations, and costs will be slightly higher until the transition is complete.
Testing alerts that trigger during this process is straightforward, but evaluating those that have not yet triggered poses a challenge. In one project, I reduced the threshold value of each alert and tested its functionalities. Approximately 90% of the alerts generated through custom scripts worked seamlessly, with only 10% requiring some manual tweaking.
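Lowering a threshold is usually a one-line change in the rule file; the rule name and metric below are illustrative:

```yaml
# alert-rules.yaml (excerpt) - threshold lowered temporarily so the alert fires under normal traffic
- alert: HighCheckoutErrorRate           # hypothetical alert
  # expr: checkout_error_ratio > 0.05    # production threshold
  expr: checkout_error_ratio > 0.0001    # temporary test threshold
  for: 5m
  labels:
    severity: critical
```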
Migrate other related components
Other systems depend on the data generated by the observability platform. For example:
- Notification channels such as email and Slack that alert the team when an incident occurs
- Incident management tools such as PagerDuty that provide a streamlined way to handle incidents
These systems rely on the alert’s payload to work properly, and a change in the observability platform changes that payload. Update the notification templates so they integrate seamlessly with the new alert payload.
Similarly, incident management tools like PagerDuty require modifications to routing rules, escalation policies and scheduling. Fortunately, ready-made open source migration tools are available to facilitate migration from PagerDuty to Grafana’s OnCall. For other solutions, you may not have access to out-of-the-box migration tools; in such cases, manual migration or script writing may be required.
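If you route alerts through Alertmanager, that typically means reworking receiver templates around the new payload’s labels and annotations. The channel and fields below are assumptions:

```yaml
# alertmanager.yml (excerpt) - Slack receiver built around the new alert payload
receivers:
  - name: "slack-oncall"
    slack_configs:
      - channel: "#oncall-alerts"                # assumed channel
        title: "{{ .CommonLabels.alertname }}"
        text: "{{ .CommonAnnotations.summary }}"
```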
Timeframe for the migration
Planning a migration involves finalizing the key telemetry data and systems, selecting the relevant stacks and validating the observability stack against the application architecture. This planning typically takes around three weeks.
The project duration depends on factors such as the number of teams involved and the quantity of active services. For instance, a migration to open source with 100 microservices and involvement from 10 different teams may take four months to complete.
Migration Complete!
Transitioning your observability platform to an open-source stack offers a promising avenue for cost reduction and increased control over telemetry data. However, accomplishing this migration demands meticulous planning and execution, encompassing essential steps like feature prioritization, stack selection, POCs, core component migration, testing, and the migration of related systems. Despite the challenges posed by diverse architectures and technologies, a systematic approach, comprehensive assessment, and collaboration among teams can facilitate a smooth migration process and guarantee its success.