Observability is becoming a keystone of contemporary DevOps practices. Even departments that weren’t traditionally a part of DevOps are seeing the benefits of being brought under the auspices of observability teams. In 2023, however, organizations are finding that the road to adoption is bumpier than expected. Here are seven of the biggest challenges DevOps teams face with observability and some suggestions for mitigating them.
Increasing MTTR
MTTR, or mean time to recovery, is the average time it takes to get a system back up and running after an outage or bug. A longer MTTR means more downtime and poorer service for clients. Worryingly, the DevOps Pulse Report indicates that the average MTTR is increasing. This year, 73% of its respondents reported an MTTR of multiple hours; last year’s figure was only 64%.
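To make the metric concrete, here is a minimal sketch of how MTTR could be computed from an incident log. The incident timestamps and the function name `mean_time_to_recovery` are hypothetical, chosen only for illustration; real incident data would come from your own alerting or ticketing system.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average (restored - started) over a list of (started, restored) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((restored - started for started, restored in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incident log: (outage start, service restored)
incidents = [
    (datetime(2023, 3, 1, 9, 0), datetime(2023, 3, 1, 10, 30)),     # 1h30m
    (datetime(2023, 3, 8, 14, 0), datetime(2023, 3, 8, 17, 0)),     # 3h
    (datetime(2023, 3, 20, 22, 15), datetime(2023, 3, 20, 23, 45)), # 1h30m
]

mttr = mean_time_to_recovery(incidents)
print(mttr)  # 2:00:00
```

With these three made-up incidents, the total downtime is six hours, so the MTTR works out to two hours, which would fall into the "multiple hours" bracket the report describes.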
A long MTTR often results from an inability to diagnose incidents quickly because data silos impair observability. An observability platform that lets engineers see the big picture can shorten it.
Costs of Telemetry Data
Along with increasing MTTR, many organizations have to deal with costs incurred by high telemetry data volumes. It’s a big problem. An IDC survey of 200 companies found that 53% of respondents highlighted the costs of storing log data.
Much of the problem is due to an outmoded tiered pricing model. Many vendors charge per GB of data, so if your data volumes fluctuate, so will your data costs. Vendor pricing is also often opaque, meaning many organizations find it difficult to know what they’re paying for. At Coralogix, we’ve created a new business model that costs one-third as much as standard log storage solutions.
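The effect of per-GB pricing on a fluctuating workload can be sketched with some simple arithmetic. The price of $0.50 per GB and the daily volumes below are invented for illustration and do not reflect any vendor's actual rates.

```python
PRICE_PER_GB = 0.50  # hypothetical rate, not a real vendor price


def monthly_cost(daily_gb, price_per_gb=PRICE_PER_GB):
    """Cost of a month of ingestion under flat per-GB pricing."""
    return sum(daily_gb) * price_per_gb


# A steady month versus the same month with a three-day logging spike
steady = [100] * 30
spiky = [100] * 27 + [400] * 3

steady_cost = monthly_cost(steady)
spiky_cost = monthly_cost(spiky)
increase = (spiky_cost - steady_cost) / steady_cost

print(f"steady: ${steady_cost:,.2f}, spiky: ${spiky_cost:,.2f}")
# steady: $1,500.00, spiky: $1,950.00
print(f"increase: {increase:.0%}")
# increase: 30%
```

Under these assumptions, three noisy days out of thirty raise the monthly bill by 30%, which is the kind of unpredictability that makes budgeting for telemetry difficult.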
Tool Sprawl
To understand why this is a challenge, we need to answer the question: What is observability? Effective observability requires integrating data from every aspect of your application. Because many organizations implement monitoring with multiple tools, they suffer from tool sprawl. This has the effect of siloing telemetry data, making it harder to correlate data and extract insights into system performance.
There is a range of options for mitigating tool sprawl, such as thoroughly evaluating a tool’s costs and benefits before including it in your DevOps strategy. The most effective solution is a “single pane of glass” tool that combines insights synoptically on a single dashboard.
Kubernetes Complexity
Elastic reports that organizations are increasingly turning to cloud-native platforms such as Kubernetes for their DevOps. Kubernetes can supercharge organizations with its ability to scale infrastructure dynamically as needed, reducing the cost overhead of dedicated servers.
However, Kubernetes is complex and comes with its own set of challenges. Kubernetes’ scalable architecture comes from containerization, a paradigm in which applications are packaged and run in isolated units called containers. Because a single deployment can involve many containers, services and configuration objects, developing in Kubernetes means keeping a lot of plates spinning at once.
A good way to combat this is better training in organizations. Additionally, breaking down silos allows different teams to transfer knowledge.
Security Challenges
Kubernetes’ popularity brings security challenges. These can include privilege escalation, where a user gains privileges beyond those they were granted, such as write access, and security misconfigurations, where developers leave insecure default configurations unchanged.
There are several strategies for mitigating Kubernetes security risks. These include scoping roles to particular namespaces, using service meshes, and enhancing security with Coralogix’s Kubernetes Operator.
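As an illustration of scoping roles to a namespace, the sketch below builds a read-only Kubernetes RBAC Role and a matching RoleBinding as plain Python dictionaries and prints them as JSON, which kubectl accepts alongside YAML. The namespace `payments`, the user `jane`, the role name `pod-reader` and both helper functions are hypothetical; the manifest fields follow the standard `rbac.authorization.k8s.io/v1` schema.

```python
import json

RBAC_API = "rbac.authorization.k8s.io"


def namespaced_read_only_role(namespace, role_name="pod-reader"):
    """Role granting read-only access to pods in one namespace only."""
    return {
        "apiVersion": f"{RBAC_API}/v1",
        "kind": "Role",  # a Role (unlike a ClusterRole) is confined to its namespace
        "metadata": {"name": role_name, "namespace": namespace},
        "rules": [
            {
                "apiGroups": [""],  # "" is the core API group
                "resources": ["pods", "pods/log"],
                "verbs": ["get", "list", "watch"],  # no write verbs
            }
        ],
    }


def bind_role(namespace, role_name, user):
    """RoleBinding attaching the Role to a single user in the same namespace."""
    return {
        "apiVersion": f"{RBAC_API}/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{role_name}-{user}", "namespace": namespace},
        "subjects": [{"kind": "User", "name": user, "apiGroup": RBAC_API}],
        "roleRef": {"kind": "Role", "name": role_name, "apiGroup": RBAC_API},
    }


# Hypothetical namespace and user
role = namespaced_read_only_role("payments")
binding = bind_role("payments", "pod-reader", "jane")
print(json.dumps([role, binding], indent=2))
```

Because the role lists only read verbs and lives in a single namespace, a compromised or mistaken account bound to it cannot modify workloads or read pods elsewhere in the cluster, which limits the room for privilege escalation.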
Beyond Kubernetes, there’s the larger issue of integrating security into an observability strategy, which is becoming a challenge for an increasing number of IT businesses. To counter this, more and more businesses are starting to incorporate observability and security monitoring under a single umbrella. Solutions such as infrastructure and application metrics can enhance security and monitoring.
Scaling Platforms
To meet the challenges posed by rising data costs and increasing cloud complexity, businesses are turning to open-source solutions. These come with their own challenges, however. According to the DevOps Pulse Report, around 30% of enterprises surveyed had problems with infrastructure management, scaling and upgrading relevant components. Because many open-source platforms require specialized knowledge to maintain, businesses have trouble sourcing the skills and expertise for them.
Tools such as OpenTelemetry can make scaling easier by integrating with platforms such as Coralogix.
Troubleshooting Data Pipeline Performance
Implementing observability requires having a reliable and high-performance pipeline for telemetry data. However, organizations using open-source platforms often have trouble monitoring and troubleshooting the performance of their data pipeline. This can impair observability because the telemetry data that arrives is of lower quality.
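One simple signal of pipeline health is ingestion lag: the gap between when an event was emitted and when the pipeline delivered it. The sketch below is a minimal illustration of that idea, assuming each record carries both timestamps as epoch seconds. The function name `lag_report`, the 60-second threshold and the sample records are all hypothetical.

```python
from statistics import mean


def lag_report(records, max_lag_seconds=60.0):
    """Summarize delivery lag for (emitted_at, received_at) pairs in epoch seconds."""
    lags = [received - emitted for emitted, received in records]
    return {
        "mean_lag": mean(lags),
        "max_lag": max(lags),
        "late": sum(lag > max_lag_seconds for lag in lags),  # count over threshold
    }


# Hypothetical records: two delivered promptly, one delayed by two minutes
records = [(0.0, 5.0), (10.0, 20.0), (20.0, 140.0)]
report = lag_report(records)
print(report)
```

Tracking a summary like this over time gives a baseline, so a sudden rise in the maximum lag or in the number of late records points to a bottleneck worth investigating before it degrades dashboards and alerts.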
Data engineer Abraham Alcantara suggests ten key steps to troubleshoot data pipelines successfully. These include identifying data pipeline software and infrastructure, reproducing and isolating issues and automating issue scenarios. Another strategy is to apply machine learning, as Coralogix does.