From Twitter and Netflix to Salesforce and Confluent, developers at the world's leading digital businesses are building modern analytics applications that serve real-time analytics to hundreds or even thousands of concurrent users.
Looking more closely at what these apps do reveals some critical, strategic technology decision-making. In particular, why do developers increasingly turn to Apache Druid instead of the myriad other databases out there?
In general, they look to Druid to build apps that deliver on a range of objectives, including creating operational visibility at scale, delivering customer-facing analytics, enabling rapid drill-down exploration and powering real-time ML pipelines. Let's take a look at each of these in turn:
Operational Visibility at Scale
Building apps for operational visibility spans a wide range of use cases, such as product analytics, digital operations, observability, fraud detection and IoT. In each case, developers need to minimize the gap between the moment an event is created and the moment it yields insight.
From a technology standpoint, several technologies can analyze events in real time, including stream processors like Apache Flink or ksqlDB, time-series databases like InfluxDB or TimescaleDB, and key-value stores such as Redis. The problem is that all of them share the same inherent limitation for analytics: they cannot process OLAP-style queries effectively at any meaningful scale.
For instance, if the app in question calls for analyzing time- and non-time-based dimensions over the past month, and ingestion runs at one million events per second, that equates to more than 2.6 trillion events. Any database that supports stream ingestion can query the current status (setting aside the challenge of handling ingestion at this scale) and, to an extent, compute basic aggregations of numbers and counters. Difficulties arise, however, when the app requires aggregation across all of those events, "group by" on non-time-based attributes or high concurrency. Using the wrong database leaves users looking at the "spinning wheel of death" as they wait for results.
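To make the shape of such a query concrete, here is a minimal sketch, assuming a Druid router at localhost:8888 and a hypothetical clickstream datasource, that runs exactly this kind of OLAP-style query, a month-long aggregation grouped by a non-time attribute, through Druid's SQL-over-HTTP API:

```python
import requests

# Druid's SQL-over-HTTP endpoint; the router address and all table and
# column names below are hypothetical placeholders.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

query = """
SELECT
  country,                        -- a non-time dimension
  COUNT(*)        AS events,
  SUM(bytes_sent) AS total_bytes
FROM clickstream                  -- hypothetical datasource
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '30' DAY
GROUP BY country
ORDER BY events DESC
LIMIT 20
"""

resp = requests.post(DRUID_SQL_URL, json={"query": query}, timeout=30)
resp.raise_for_status()
for row in resp.json():  # Druid returns a JSON array of row objects by default
    print(row)
```

The point is less the syntax than the workload: a "group by" on a non-time attribute over a month of raw events is precisely the query shape that stream processors, time-series stores and key-value stores struggle to answer interactively.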
In these circumstances, devs turn to Apache Druid. A real-world example shows how: Salesforce engineers built an analytics app on Druid to monitor their product experience. The app lets engineers, product owners and customer service reps query any combination of dimensions, filters and aggregations on real-time logs for performance analysis, trend analysis and troubleshooting. As a result, queries return in seconds even with billions or trillions of streaming events ingested every day.
Customer-Facing Analytics
On the face of things, the technologies offered by Twitter, Forescout and Atlassian don't have a huge amount in common. Look beneath the surface, however, and they all apply analytics to more than just internal decision-making: each has built analytics apps that provide customers with powerful insights as part of a value-added service or a core product offering.
What does that teach other organizations? When building analytics for external users, a whole different set of technical challenges must be addressed. Concurrency is clearly a key consideration: the database should support current and future user growth with ease. Concurrency also has to be delivered economically, however, or it can have a huge impact on infrastructure costs.
Similarly, a good interactive experience depends on sub-second query response times, which have become a minimum user requirement almost everywhere. Instead of fighting PostgreSQL, Hive or another analytics database to meet these performance requirements at scale, developers find Druid delivers the performance they and their users expect.
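Because concurrency and latency have to hold up together, it is worth measuring them together. Here is a minimal smoke-test sketch, assuming the same hypothetical router address and the "wikipedia" sample datasource that ships with the Druid quickstart tutorials, that fires 50 concurrent copies of one query and reports rough latency percentiles:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"  # assumed router address
# The 'wikipedia' datasource ships with the Druid quickstart tutorials.
QUERY = "SELECT channel, COUNT(*) AS edits FROM wikipedia GROUP BY channel"

def timed_query(_: int) -> float:
    """Run one query and return its wall-clock latency in seconds."""
    start = time.monotonic()
    resp = requests.post(DRUID_SQL_URL, json={"query": QUERY}, timeout=30)
    resp.raise_for_status()
    return time.monotonic() - start

# Fire 50 copies of the same query concurrently and report rough percentiles.
with ThreadPoolExecutor(max_workers=50) as pool:
    latencies = sorted(pool.map(timed_query, range(50)))

print(f"p50: {latencies[24]:.3f}s   p95: {latencies[47]:.3f}s")
```

A test like this is no substitute for a production load test, but it makes the external-user requirement tangible: the p95, not the single-user best case, is what customers actually experience.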
Take Confluent, a popular event-streaming platform, for example. They built their cloud-based monitoring service, Confluent Health+, on Druid. Having initially used a different database, they found that, as data volumes grew, their legacy pipeline struggled to keep up with ingestion and query loads. By switching to Druid, they were able to create the kind of user experience developers everywhere are focused on delivering.
Rapid Drill-Down Exploration
Generally speaking, the primary role of reports is to fulfill a simple requirement: deliver data in a consumable form. What organizations increasingly want, however, is a deeper level of insight that shows why something happened. That understanding can then be applied in many ways, from solving current problems to anticipating and preventing future ones, or spotting new opportunities.
Answering these "why?" questions means data has to be analyzed at tremendous speed to reveal root causes. This is a simple proposition when the data set is small, but with data drawn from today's high-volume sources, such as cloud services, clickstreams and IoT/telemetry, the underlying database has to correlate anywhere from a few to hundreds of dimensions across high-cardinality data with billions to trillions of rows. That is a much more difficult challenge.
Indeed, when data is analyzed at this scale, full table scans are simply too slow. Exacerbating the problem, the typical query-shaping techniques used to speed up performance, such as pre-aggregating data into rollup tables, come with expensive tradeoffs: they trade away the flexibility to drill into the raw events.
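As a sketch of what ad hoc drill-down looks like in practice (the datasource, columns and the 403-spike scenario are all hypothetical), each refinement is just another filter or dimension added to the SQL, with no pre-built rollup required:

```python
import requests

DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"  # assumed router address

# Hypothetical drill-down: an analyst has spotted a spike in 403 responses
# and narrows in by grouping two high-cardinality dimensions over the last hour.
drilldown = """
SELECT
  user_agent,
  edge_pop,
  COUNT(*)                         AS requests,
  APPROX_COUNT_DISTINCT(client_ip) AS unique_ips
FROM web_requests                   -- hypothetical datasource
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
  AND status_code = 403             -- the anomaly under investigation
GROUP BY user_agent, edge_pop
ORDER BY requests DESC
LIMIT 100
"""

resp = requests.post(DRUID_SQL_URL, json={"query": drilldown}, timeout=30)
resp.raise_for_status()
for row in resp.json():
    print(row)
```

Every click in a drill-down UI generates a fresh, unpredictable query like this one, which is why pre-shaped rollups cannot cover the workload: the database itself has to answer arbitrary combinations of filters and dimensions quickly.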
Looking at real-world use cases once again demonstrates the role of Apache Druid in situations like these. Developers at cybersecurity company HUMAN, for example, built an analytics app for internet bot detection using Apache Druid. Previously, their cloud data warehouse was too slow to keep up with their need to drill into massive volumes of web traffic quickly. Today, their Druid-powered app gives their data scientists the ability to aggregate and filter across trillions of events instantly, letting them classify anomalies and train the machine learning models that detect malicious activity in real time.
Real-Time ML Pipelines
For many people, analytics is synonymous with viewing a synthesized data set presented via a UI like a chart, graph, report or dashboard. Increasingly, however, new operational workflows are completely automated without any human intervention or data visualization.
Google Maps is one example, optimizing routing based on real-time and historical traffic flows and patterns. Another is Reddit's use of Apache Druid to deliver real-time ad placement within milliseconds of a new visitor arriving on the app. In this context, there is no data visualization to interpret; it has been replaced by automated decisions derived through inference. These performance requirements play directly to Druid's strengths.
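To illustrate the pattern (this is a generic sketch, not Reddit's actual pipeline; the datasource, columns and feature choices are assumptions), an automated decision loop might pull a fresh feature vector straight from Druid and hand it to a model, with no dashboard anywhere in sight:

```python
import requests

DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"  # assumed router address

def recent_features(user_id: str) -> dict:
    """Aggregate the last five minutes of one user's events into a feature
    vector. Datasource and columns are hypothetical; Druid SQL's dynamic
    parameters ('?' placeholders) keep the query injection-safe."""
    query = """
    SELECT
      COUNT(*)                AS events_5m,
      COUNT(DISTINCT page_id) AS pages_5m,
      SUM(dwell_ms)           AS dwell_ms_5m
    FROM user_events
    WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '5' MINUTE
      AND user_id = ?
    """
    resp = requests.post(
        DRUID_SQL_URL,
        json={
            "query": query,
            "parameters": [{"type": "VARCHAR", "value": user_id}],
        },
        timeout=1,  # an automated decision is only useful if it arrives fast
    )
    resp.raise_for_status()
    return resp.json()[0]  # single aggregate row

# features = recent_features("user-123")  # then: model.predict(features)
```

The tight timeout is the defining constraint of this use case: the query result feeds a model prediction rather than a human's screen, so the whole round trip has to fit inside the decision window.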
Across a growing variety of demanding use cases, devs are adopting Druid to build inference-based capabilities into their apps. From diagnostics and recommendations to automated decisions, today's organizations need instant query responses against high-cardinality, highly dimensional event data, at scale and in real time. Delivering that level of performance is becoming increasingly important for businesses focused on innovation, quality of service and competitive advantage.