Nowadays, DevOps and SRE teams have many tools to access and analyze logging data. However, there are two challenges that prevent these teams from resolving issues in a timely manner:
- They aren’t equipped with all the data they need
- Detecting and resolving issues is reactive and manual
In this article, I’m going to break down why these challenges persist and propose a new approach to observability that will help you overcome them.
Challenge: You’re not analyzing all your log data
DevOps and SRE teams need to understand the behavior of their applications and services to spot anomalies as they occur and keep them up and running. However, the cost of traditional logging platforms prevents these teams from analyzing 100% of their observability data.
Instead, they have adopted mechanisms to reduce ingestion volumes. In some cases, that means truncating or filtering out datasets; in others, it means moving data to a colder storage tier where analytics are never applied. Engineers also have to predict upfront which datasets are important enough to meet the criteria for indexing, and only those datasets can be analyzed in real time. The rest are ultimately neglected.
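To make this concrete, here is a minimal sketch of the kind of ingestion filter described above. The function name, severity levels, and sampling scheme are illustrative assumptions, not a real product's API; the point is that most lines never reach the analytics tier.

```python
import random


def reduce_ingest(loglines, keep_levels=frozenset({"ERROR", "WARN"}), sample_rate=0.1):
    """Hypothetical ingestion filter: keep high-severity lines, sample
    the rest. Everything not kept is invisible to downstream analytics."""
    kept = []
    for line in loglines:
        tokens = line.split()
        level = tokens[0] if tokens else ""
        if level in keep_levels or random.random() < sample_rate:
            kept.append(line)
    return kept
```

With a 10% sample rate, roughly 90% of INFO and DEBUG activity is simply lost before anyone can query it.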
The net impact is that teams are left with an incomplete picture of their environment, making it difficult to understand how applications and services are behaving. As a result, they cannot pinpoint every issue as it occurs.
Challenge: Traditional platforms create reactive and manual processes
When working with traditional observability platforms, teams must constantly define logic to detect issues in their environment. And often, they are only able to detect the issues that they are aware of and have had time to build logic for. To keep alerts firing correctly, you have to stay on top of different data structures, changing data shapes, new schemas and libraries, and fluctuating baselines.
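A brittle, hand-written alert rule of the kind described above might look like the following sketch (the field names and threshold are hypothetical). It only fires for failure modes someone anticipated, and it silently ignores lines whenever the log schema drifts away from what the rule expects.

```python
import json


def check_alert(logline, threshold_ms=500):
    """Hand-coded alert rule: page on HTTP 500s or slow requests.

    Fragile by design: if the schema changes (e.g. "status" is renamed,
    or lines stop being JSON), the rule stops firing without warning.
    """
    try:
        event = json.loads(logline)
    except json.JSONDecodeError:
        return False  # unstructured line: silently dropped
    return event.get("status") == 500 or event.get("latency_ms", 0) > threshold_ms
```

Every schema change, new service, or shifted baseline means revisiting rules like this one by hand.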
However, not every team is equipped to detect issues they haven’t seen before. When faced with unknown behavior, teams are left conducting a post-mortem investigation, which is inherently reactive. Resolving the issue quickly is made harder still by the fact that logging data is often unstructured and difficult to search: teams have to manually hunt through loglines to pinpoint the activity and the affected systems or components. This approach can add hours or even days to an investigation.
Improving Incident Response
The solution: Machine learning-enabled anomaly detection and resolution
To catch every issue and speed up incident investigations, teams can apply machine learning at the data source to detect potential issues and surface the information they need to keep applications and services running. This approach allows engineers to analyze 100% of their log data and compile it into a consumable, actionable view, so they can spot changes in behavior and resolve anomalies more quickly than with the traditional approach.
As loglines are created, they can be automatically analyzed and converted into metrics that describe application and service behavior. From there, federated machine learning can automatically detect and alert on an issue as it occurs: it baselines key metrics, recognizes when they fall outside typical ranges, and determines the likelihood of anomalous behavior.
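The two steps above, converting loglines into a metric and baselining that metric to flag outliers, can be sketched in simplified form as follows. The log format, window size, and z-score threshold are assumptions for illustration; real systems would baseline many metrics and use more robust statistics.

```python
import math
import re
from collections import deque

# Assumed log shape for this sketch: "<timestamp> <LEVEL> <message>"
LOG_PATTERN = re.compile(r"^\S+ (?P<level>ERROR|WARN|INFO)\b")


def error_rate(loglines):
    """Convert raw loglines into one metric: the fraction of ERROR lines."""
    if not loglines:
        return 0.0
    errors = 0
    for line in loglines:
        m = LOG_PATTERN.match(line)
        if m and m.group("level") == "ERROR":
            errors += 1
    return errors / len(loglines)


class RollingBaseline:
    """Learn a rolling mean/stddev of a metric and flag large deviations."""

    def __init__(self, window=60, threshold=3.0):
        self.values = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value):
        """Return True if `value` deviates sharply from the learned baseline."""
        anomalous = False
        if len(self.values) >= 10:  # wait for some history before judging
            mean = sum(self.values) / len(self.values)
            var = sum((v - mean) ** 2 for v in self.values) / len(self.values)
            std = math.sqrt(var)
            if std > 0 and abs(value - mean) / std > self.threshold:
                anomalous = True
        self.values.append(value)
        return anomalous
```

Because the baseline is learned from the data itself, there is no per-service threshold to hand-tune or update as traffic patterns shift.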
There is no need to continually build and refine logic, effectively eliminating overly manual engineering practices.
When an issue occurs, developers and DevOps engineers will receive a detailed report of the specific changes in activity within their systems. In this report, they’ll see the exact window of time of the incident and the affected systems or metadata. This report also includes a pinpointed capture of the relevant raw data that contributed to the anomaly, as well as systems, components, or services that were involved.
Finally, the relevant full-fidelity logs can be dynamically streamed to a central observability platform (Splunk, Datadog, etc.) before, during and after the anomaly. This gives teams the exact data they need to resolve the issue in a timely manner.
In all, this kind of edge observability allows system administrators to analyze all of their logs all of the time, not just the ones that match pre-defined queries in a centralized setup, saving significant time when it comes to incident response.