Stream Big, Think Bigger: Analyze Streaming Data at Scale

Data streams are all the rage. Once a niche element of data engineering, streaming data is the new normal—more than 80% of Fortune 100 companies have adopted Apache Kafka, the most common streaming platform, and every major cloud provider (AWS, Google Cloud Platform and Microsoft Azure) has launched its own streaming service.

This shift is fueled by the growing demand to deliver events quickly, reliably and at scale to support a wide range of internal and external applications. From always-on, customer-facing applications to mission-critical operations, a growing number of use cases depend on subsecond updates—further upping the ante for real-time data processing and dissemination.

There’s no doubt 2023 will be a big year for streaming technology.

Meanwhile, as streaming moves toward ubiquity, we’re seeing another shift in the way businesses use the data: Analyzing events as they’re created for real-time insights. With the right tools in place, people can instantly compare what’s happening now to what happened in the past, triage problems as they occur and make time-sensitive decisions in the moment.

With the proliferation of data streams, a new set of use cases and requirements for real-time analytics has emerged. To unlock the power of streaming in 2023, data teams will need to look beyond the traditional batch-oriented stack and adopt a streaming-native approach to the analytics architecture.

The Streaming Paradigm Shift

Data has always been at the center of how businesses operate. Until recently, we lived in a batch-dominant world of business and data. The goal of data infrastructure was to capture data at a fixed point in time and store it for eventual use. The evolution from daily batch operations on mainframes to the internet-driven, always-on world has replaced fixed, static “data at rest” with fast-moving “data in motion,” as information flows freely between applications and data systems within and between organizations.

Data at rest still exists, and batch systems continue to serve a variety of reporting purposes. But reality isn’t fixed or static, and to meet the demands for seamless, authentic data experiences, the systems we build must be designed for data in motion.

Hence the rise of streaming technology and, with it, a new mindset for how we think about data. As adoption grows, streaming data platforms increasingly act as an organization’s central nervous system, connecting its different functions and driving mission-critical operations. New technologies are being purpose-built to support data-in-motion systems: stream processors and event databases.

Apache Druid falls into the purpose-built category. As a real-time analytics database, it is designed to let users query every event as it joins the data stream, at tremendous scale, with subsecond response times on a mix of stream and batch data.

Thousands of businesses are already using streaming platforms like Kafka and Amazon Kinesis with Druid to build cutting-edge systems that make terabytes to petabytes of streaming data accessible to applications and humans in milliseconds. Some of those businesses, including Reddit, Citrix and Expedia, were highlighted at Current 2022, the annual streaming event organized by Confluent.

The next evolution of data intelligence is here, and it’s based on the ability to react to events as they are happening. But on the growth chart of streaming adoption, we are only at the beginning of an upward curve—a critical inflection point where streaming (and the technology that is built for it) starts to become the bedrock of everyone’s data architecture.

So, when it comes to enabling scalable, subsecond analytics on streaming data, many developers and data innovators are wondering, ‘How?’

We’re Streaming! Now What?

At Current, we spoke with hundreds of Kafka users who were turning their attention to building out the next phase of their data streaming pipelines and embedding them more deeply into their organizations.

While streaming adoption is becoming more widespread, most companies still have just one or two use cases they’re solving with a streaming platform. At Current, many people spoke about how Kafka effectively set their data in motion. But when it came time to analyze or put those streams to work in a user-facing application, their “data in motion” turned into “data in waiting” because their analytics system was designed for batch data, not streaming.

To solve this problem, a new type of database is needed, and that’s why Druid was created. Together, a streaming platform like Kafka and Druid can turn billions or trillions of events into streams that thousands of users can query simultaneously as soon as the events arrive. That combination has unlocked capabilities that enable a new class of developer-built analytics applications.

For example, Reddit generates tens of gigabytes of event data per hour from advertisements on its platform. To let advertisers understand their impact and decide how to target their spending, Reddit needs to enable interactive queries across the last six months of data, which means hundreds of billions of raw events to slice and dice. It also needs to let advertisers see user groups and sizes in real time, adjusting by interests (sports, art, movies, technology…) and locations to find how many Reddit users fit their target. To achieve this, Reddit built a Druid-powered application that ingests data from Kafka and enables its advertising partners to make real-time decisions and get the best possible ROI on their campaigns.
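To make the "slice and dice" idea concrete, here is a minimal sketch of what an audience-sizing query against Druid's SQL API could look like. This is not Reddit's actual schema or code: the datasource name `ad_events` and the columns `interest`, `country` and `user_id` are hypothetical, and the helper only builds the JSON request body that would be POSTed to Druid's SQL endpoint.

```python
import json

# Druid exposes SQL over HTTP at this path on the router or broker.
DRUID_SQL_PATH = "/druid/v2/sql"


def audience_query_payload(interests, countries, datasource="ad_events"):
    """Build a Druid SQL request body that sizes an audience.

    The datasource and column names are hypothetical placeholders.
    Filter values are passed as dynamic parameters rather than being
    spliced into the SQL text.
    """
    if not interests or not countries:
        raise ValueError("need at least one interest and one country")

    interest_marks = ", ".join("?" for _ in interests)
    country_marks = ", ".join("?" for _ in countries)

    query = (
        "SELECT interest, country, APPROX_COUNT_DISTINCT(user_id) AS users "
        f'FROM "{datasource}" '
        # Restrict to the last six months using Druid's TIME_SHIFT function.
        "WHERE __time >= TIME_SHIFT(CURRENT_TIMESTAMP, 'P6M', -1) "
        f"AND interest IN ({interest_marks}) "
        f"AND country IN ({country_marks}) "
        "GROUP BY interest, country "
        "ORDER BY users DESC"
    )
    # Parameters bind to the ? placeholders in order of appearance.
    parameters = [
        {"type": "VARCHAR", "value": value} for value in [*interests, *countries]
    ]
    return {"query": query, "parameters": parameters}


payload = audience_query_payload(["sports", "art"], ["US"])
print(json.dumps(payload, indent=2))
```

Because freshly ingested events and historical data sit behind the same query interface, one request like this can cover both the last few seconds and the last six months.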

Stream-to-Batch Vs. Stream-to-Stream

Reddit chose Druid as the database layer of its application in large part because of its close integration with Kafka, as Druid was designed to ingest and analyze streaming data. This sets Druid apart from virtually every other analytics database, nearly all of which were built for batch ingestion.

Batch ingestion works exactly the way it sounds—it takes chunks of stream data, dumps them into a file, processes the file, then loads it into the database. Using a batch-based system to analyze streams introduces a bottleneck to your streaming pipeline. By contrast, Druid provides a connector-free integration with the top streaming platforms and handles the latency, consistency and scale requirements for high-performance stream analytics cost-effectively.
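The bottleneck is easy to see with a little arithmetic. The sketch below is a toy model, not a benchmark: it assumes one event per second, a five-minute batch window, a one-minute file load and a half-second event-by-event ingestion latency (all illustrative numbers), and compares how long an event waits before it can be queried under each approach.

```python
from dataclasses import dataclass


@dataclass
class Event:
    offset: int
    created_at: float  # seconds since the start of the stream


def batch_queryable_times(events, batch_interval, load_time):
    """When each event becomes queryable under file-based batch loading.

    An event waits for its batch window to close, then for the file to be
    processed and loaded into the database.
    """
    result = {}
    for event in events:
        window = int(event.created_at // batch_interval)
        window_close = (window + 1) * batch_interval
        result[event.offset] = window_close + load_time
    return result


def streaming_queryable_times(events, per_event_latency):
    """When each event becomes queryable under event-by-event ingestion."""
    return {e.offset: e.created_at + per_event_latency for e in events}


def worst_case_delay(events, queryable):
    """Longest gap between an event's creation and its availability."""
    return max(queryable[e.offset] - e.created_at for e in events)


# Ten minutes of traffic at one event per second (illustrative numbers).
events = [Event(offset=i, created_at=float(i)) for i in range(600)]
batch = batch_queryable_times(events, batch_interval=300.0, load_time=60.0)
stream = streaming_queryable_times(events, per_event_latency=0.5)

print("batch worst-case delay (s):", worst_case_delay(events, batch))
print("stream worst-case delay (s):", worst_case_delay(events, stream))
```

Under these assumptions, an event that arrives at the start of a batch window waits the full window plus the load time, while the streaming path is bounded by the per-event latency alone.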

In addition to out-of-the-box integrations, Druid has built-in indexing services that provide event-by-event ingestion. This means streaming data is ingested into memory and made immediately available for use. No waiting for events to be processed through batch processes before they can be queried. This capability, coupled with exactly-once semantics, guarantees data is always fresh and consistent. If there’s a failure during stream ingestion, Druid will automatically continue to ingest every event exactly once (including events that enter the stream during the outage) to prevent any duplicates or data loss.
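One common way to get exactly-once behavior is to commit ingested rows and the stream position together, so a restart always resumes from the last committed offset. The sketch below is a toy model of that general technique, not Druid's implementation: the `Store` class, the `ingest` function and the `crash_at` fault injection are all hypothetical.

```python
class CrashBeforeCommit(Exception):
    """Simulated failure that hits after reading but before committing."""


class Store:
    """Toy store that commits rows and the next offset in a single step."""

    def __init__(self):
        self.rows = []
        self.next_offset = 0

    def commit(self, rows, next_offset):
        # Rows and offset change together, so they can never disagree.
        self.rows = self.rows + rows
        self.next_offset = next_offset


def ingest(stream, store, chunk_size, crash_at=None):
    """Ingest from the last committed offset to the end of the stream."""
    while store.next_offset < len(stream):
        start = store.next_offset
        end = min(start + chunk_size, len(stream))
        pending = stream[start:end]
        if crash_at is not None and start <= crash_at < end:
            # Nothing from this chunk was committed, so nothing is half-written.
            raise CrashBeforeCommit(f"failed while holding offsets {start}..{end - 1}")
        store.commit(pending, end)


stream = [f"event-{i}" for i in range(10)]
store = Store()

try:
    ingest(stream, store, chunk_size=4, crash_at=5)
except CrashBeforeCommit:
    pass  # the first chunk is committed; the second was discarded

# Events keep arriving on the stream during the outage.
stream.extend(["event-10", "event-11"])

# On restart, ingestion resumes from the committed offset.
ingest(stream, store, chunk_size=4)

print(len(store.rows), len(set(store.rows)))
```

Because the uncommitted chunk is simply re-read after the restart, every event ends up in the store once, including the events that arrived during the outage.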

However, perhaps the most powerful reason to use Druid for streaming data analytics is its near-infinite scalability. For even the largest, most complex ingestion jobs, Druid can easily scale into the tens of millions of events per second. With variable ingestion patterns, Druid avoids resource lag (as well as overprovisioning) by enabling dynamic scaling. It combines the query and ingestion performance of shared-nothing architecture with a cloud data warehouse’s flexibility and non-stop reliability. This means you can add computing power and scale out without downtime, and rebalancing is handled automatically.

In essence, while using a cloud data warehouse to serve both batch-oriented and real-time use cases might sound efficient, doing so defeats the purpose of a streaming pipeline.

“The weakest link in the chain sort of dominates everything,” Databricks CEO Ali Ghodsi told the audience at Current. “If you have one step of your pipeline that is batch-processed and really slow, then it doesn’t matter how fast you are in the rest of the pipeline.”

That’s why when Confluent’s data engineering team needed to build a real-time observability platform, they chose to use Kafka and Druid together. Today, the platform ingests over five million events per second from Kafka and handles hundreds of concurrent queries from a large team of analysts.

“We do a lot of things today that we didn’t imagine doing three or four years ago,” Matt Armstrong, who leads the observability and data platform team at Confluent, said in a Druid Summit 2022 interview. “From a [product] perspective, you could think of Druid as being able to power a great product [Confluent Cloud] with a great support experience—but perhaps more importantly—an unconstrained future.”

It’s Time to Think Big

We’re already shifting to an entirely new way of thinking about how data is moved, analyzed and shared. Streaming technology has opened the door for a plethora of new use cases and products to come to fruition. We believe organizations that view streaming data as a force multiplier and implement a streaming-native approach to analytics will gain a competitive advantage in 2023 and beyond.

Druid is purpose-built for these new streaming use cases, but it’s also built for a new type of application—one that draws characteristics from both the analytics and transactional database worlds. Industry-leading teams have built mission-critical applications that enable operational visibility, customer-facing analytics, drill-down exploration and real-time decision-making using streaming data and Druid. This trend is now starting to take hold and the next wave of analytics applications is already being built.