Improving Observability With ML-Enabled Anomaly Detection

Nowadays, DevOps and SRE teams have many tools to access and analyze logging data. However, there are two challenges that prevent these teams from resolving issues in a timely manner: 

  • They aren’t equipped with all the data they need
  • Detecting and resolving issues is reactive and manual

In this article, I’m going to break down why these challenges persist and propose a new approach to observability that will help you overcome them.

Observability Obstacles

Challenge: You’re not analyzing all your log data

DevOps and SRE teams need to understand the behavior of their applications and services to spot anomalies as they occur and keep them up and running. However, the cost of traditional logging platforms prevents these teams from analyzing 100% of their observability data. 

Instead, they have adopted mechanisms to reduce ingestion volumes. In some cases, this means truncating or filtering out datasets. In others, it means moving data to a less active storage tier or target where analytics are not applied. Furthermore, engineers have to predict upfront which datasets are important enough to meet the criteria for indexing. Only those indexed datasets can be analyzed in real time; the rest are ultimately neglected.

The net impact is that teams are left with an incomplete picture of their environment, making it difficult to understand how applications and services are behaving. As a result, they cannot pinpoint every issue as it occurs.

Challenge: Traditional platforms create reactive and manual processes

When working with traditional observability platforms, teams must constantly define logic to detect issues in their environment. Often, they are only able to detect the issues that they are aware of and have had time to build logic for. To keep alerts firing correctly, teams have to stay on top of different data structures, changing data shapes, new schemas and libraries, and fluctuating baselines.
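To make that maintenance burden concrete, here is a minimal sketch of the kind of hand-written, static alert rule teams end up owning. The metric name, field layout and threshold are hypothetical, chosen purely for illustration.

```python
# Hypothetical hand-maintained alert rule. The "error_rate" field and the
# threshold value are illustrative assumptions, not any product's schema.
import json

ERROR_RATE_THRESHOLD = 0.05  # a static value someone has to pick and keep tuning

def should_alert(logline: str) -> bool:
    """Return True if this logline should fire an alert.

    Rules like this break silently whenever the log schema changes
    (for example, a renamed field) or the "normal" error rate drifts
    above the hard-coded threshold.
    """
    event = json.loads(logline)
    return event.get("error_rate", 0.0) > ERROR_RATE_THRESHOLD
```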

However, not every team is equipped to detect issues they haven’t seen before. When teams are faced with unknown behavior, they are left conducting post-mortem debugging, which is an inherently reactive process. Furthermore, it is harder to resolve the issue quickly because logging data is often unstructured and difficult to search. Teams have to manually hunt through loglines to pinpoint the activity and the affected systems or components. This approach can add hours or even days to an investigation.

Improving Incident Response

The solution: Machine learning-enabled anomaly detection and resolution

To catch every issue and speed up incident investigations, teams can apply machine learning at the data source to detect potential issues and surface the information needed to keep applications and services running. This kind of approach allows engineers to analyze 100% of their log data and compile it into a consumable, actionable view so that they can spot changes in behavior and resolve anomalies more quickly than they could with a traditional approach.

As loglines are created, they can be automatically analyzed and converted into metrics that describe application and service behavior. From there, federated machine learning can be used to automatically detect and alert on an issue as it occurs. It does so by baselining key metrics, understanding when they are outside of typical ranges, and determining the likelihood of anomalous behavior, all through machine learning.
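As a rough sketch of the baselining idea (not any vendor's actual implementation), a metric derived from loglines can be compared against a rolling mean and standard deviation, flagging values that fall far outside the typical range. The window size and z-score threshold below are assumptions for illustration.

```python
# Minimal sketch of baseline-driven anomaly detection on a log-derived metric.
# The window size and z-score threshold are illustrative assumptions.
from collections import deque
from statistics import mean, stdev

class BaselineDetector:
    def __init__(self, window: int = 60, threshold: float = 3.0):
        self.history = deque(maxlen=window)  # rolling baseline of recent values
        self.threshold = threshold           # how many standard deviations counts as anomalous

    def observe(self, value: float) -> bool:
        """Record a new metric value and return True if it looks anomalous."""
        is_anomaly = False
        if len(self.history) >= 2:
            baseline = mean(self.history)
            spread = stdev(self.history) or 1e-9  # avoid division by zero on flat data
            z_score = abs(value - baseline) / spread
            is_anomaly = z_score > self.threshold
        self.history.append(value)
        return is_anomaly

# Example: feed per-minute error counts extracted from loglines.
detector = BaselineDetector()
for count in [2, 3, 2, 4, 3, 2, 3, 40]:  # the final spike should be flagged
    if detector.observe(count):
        print(f"Anomalous error count: {count}")
```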

There is no need to continually build and refine detection logic, which effectively eliminates overly manual engineering practices.

When an issue occurs, developers and DevOps engineers receive a detailed report of the specific changes in activity within their systems. In this report, they see the exact window of time of the incident and the affected systems and related metadata. The report also includes a pinpointed capture of the raw data that contributed to the anomaly, along with the systems, components or services that were involved.
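The exact report format varies by tool; as a purely illustrative sketch, the fields such a report might carry could look something like the following. All field names are hypothetical and do not reflect any specific vendor's schema.

```python
# Hypothetical shape of an anomaly report; field names are illustrative only.
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List

@dataclass
class AnomalyReport:
    start: datetime                       # beginning of the anomalous window
    end: datetime                         # end of the anomalous window
    affected_services: List[str]          # services, components or hosts involved
    metrics: Dict[str, float]             # observed values for the key metrics that deviated
    sample_loglines: List[str] = field(default_factory=list)  # raw data that contributed to the anomaly
```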

Finally, the relevant full-fidelity logs from before, during and after the anomaly can be dynamically streamed to a central observability platform (Splunk, Datadog, etc.). This gives teams the exact data they need to resolve the issue in a timely manner.
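One way to picture "before, during and after" streaming is a rolling buffer at the edge that is flushed to the central platform when an anomaly fires. This is a sketch under assumptions: the buffer sizes and the forward_to_platform() function are hypothetical stand-ins, not a real agent API.

```python
# Sketch of streaming full-fidelity logs around an anomaly window.
# forward_to_platform() is a hypothetical stand-in for whatever API or agent
# ships data to Splunk, Datadog, etc.
from collections import deque

PRE_ANOMALY_BUFFER = 500    # how many recent loglines to keep at the edge
POST_ANOMALY_LINES = 500    # how many lines to keep streaming after detection

recent_logs = deque(maxlen=PRE_ANOMALY_BUFFER)
lines_left_to_stream = 0

def forward_to_platform(line: str) -> None:
    """Hypothetical sink; in practice this would call the platform's ingestion API."""
    print("SHIP:", line)

def handle_logline(line: str, anomaly_detected: bool) -> None:
    global lines_left_to_stream
    recent_logs.append(line)
    if anomaly_detected:
        # Flush the "before" context, then keep streaming for a while afterward.
        for buffered in recent_logs:
            forward_to_platform(buffered)
        recent_logs.clear()
        lines_left_to_stream = POST_ANOMALY_LINES
    elif lines_left_to_stream > 0:
        forward_to_platform(line)
        lines_left_to_stream -= 1
```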

In all, this kind of edge observability allows system administrators to analyze all of their logs all of the time, not just those that match a pre-defined query in a centralized setup. That saves significant time when it comes to incident response.

Ozan Unlu

Ozan Unlu is currently co-founder and chief executive officer at Edge Delta.
