A global survey of 500 IT professionals suggests organizations are making little progress toward truly observing their application environments, even as those environments become more complex with each passing day.
The survey, conducted by Logz.io, a provider of an observability platform, found that only one in 10 respondents reported having full observability into their application environments.
Logz.io CTO Asaf Yigal said that while more DevOps teams than ever are collecting logs, metrics and traces, most have yet to determine how to operationalize all the data being collected. As data is ingested, it doesn’t just need to be stored; it also needs to be correlated to the various services that make up an application, he noted.
The more organizations deploy cloud-native applications in production environments, the more pressing this issue becomes. The microservices that make up those applications generate massive amounts of telemetry data, which in turn produces more alerts than ever.
Not surprisingly, the biggest challenge organizations encounter when managing Kubernetes clusters in production environments is monitoring/troubleshooting (40%), followed closely by security (37%) and networking (33%).
Many organizations simply don’t have the skills required to manage cloud-native applications. Nearly half of respondents (48%) specifically cited a lack of knowledge as the biggest challenge they encountered when trying to observe these types of applications. From an IT management perspective, most microservices use the same basic template, so it’s difficult for DevOps teams to identify which ones are likely to have the biggest impact on service level objectives (SLOs) and service level agreements (SLAs) in the event of a disruption, noted Yigal.
Without the ability to determine the actual root cause of an issue, alert storms increase fatigue, which eventually results in higher levels of burnout across the DevOps team, noted Yigal. In fact, 82% of respondents said their mean-time-to-resolution (MTTR) during production incidents was more than an hour.
The survey also noted that more than half of respondents (52%) worked for organizations that are simultaneously trying to rein in monitoring costs. More than three-quarters (76%) of respondents said that open source OpenTelemetry (OTEL) or OTEL-centric tooling was at least somewhat important to their overall observability strategy.
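To make the OTEL point concrete, the sketch below shows what instrumenting a service with the OpenTelemetry Python SDK can look like, tying the telemetry a process emits back to a named service so it can be correlated downstream, as Yigal describes. The service name, span name and attributes here are hypothetical placeholders for illustration only, not anything drawn from the survey.

```python
# Minimal, illustrative OpenTelemetry (Python SDK) instrumentation sketch.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Tag every span emitted by this process with the service it belongs to,
# so a backend can correlate telemetry back to that service.
resource = Resource.create({"service.name": "checkout"})  # hypothetical service name

provider = TracerProvider(resource=resource)
# ConsoleSpanExporter keeps the example self-contained; in practice an OTLP
# exporter would ship spans to a collector or observability backend.
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout.instrumentation")

# Record one unit of work as a span, with attributes that aid correlation.
with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.id", "hypothetical-12345")
    # ... application logic would run here ...
```

The key detail is the service.name resource attribute: every span the process emits carries it, which is what lets a backend group telemetry by service rather than leaving it as undifferentiated data.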
In addition, 87% of respondents said their organization is already using some form of platform engineering to manage DevOps workflows at scale.
It’s not clear how quickly organizations are adopting observability tools and platforms, but too many DevOps teams lack the visibility needed to identify the root cause of an issue. As a result, variations of the same problems keep resurfacing because previous remediation efforts simply didn’t go deep enough to resolve the core issue.
Of course, there may come a day when machine learning algorithms, along with other forms of artificial intelligence (AI), will make it simpler to surface these issues. The challenge, in the meantime, is laying down the observability foundation today to provide access to the data that will be required to train those AI models.