In fact, there are plenty of situations where DevOps engineers, SRE teams and other observability-focused departments need to explore reams of data rapidly and flexibly. An SRE investigating the root cause of latency may need to trace the movement of data across endpoints to find the malfunctioning one. Another engineer may need to compare historical and current performance to spot anomalous behavior. A DevOps engineer may have to dissect user performance metrics to determine whether a slowdown is global or regional in scope.
Each of these use cases requires dashboards that can present detailed data through a wide variety of visualizations, including choropleth maps, stacked area charts, pie charts and bar graphs. Ideally, a dashboard will enable teams to isolate dimensions, apply filters, and dive into data for deeper insights.
Unfortunately, not all dashboards can support this type of fast, flexible analysis. Many still rely on outdated technologies that lack the flexibility, agility and scalability required to explore data seamlessly. Many were also designed without the urgency that today's data requires; in previous years, most uses for dashboards (such as internal reporting) were not time-sensitive.
Other dashboards are constrained by templates, which offer a finite array of widgets, tools and drill-down capabilities. These dashboards may feel unwieldy and sluggish, unsuited for a rapidly evolving situation like an application outage.
Requirements
To truly fulfill the promise of self-service, ad-hoc analytics across huge datasets, teams need to work in real time, which means they need a database capable of timely responses.
When an application goes down, an SRE may not know what to look for and therefore needs to comb through lots of data quickly. Some dashboards rely on workarounds for faster queries, such as pre-aggregations, precomputed results, or rollups, but these only help when the questions are known in advance. An SRE can't necessarily predict what will go wrong, and even assumptions drawn from previous failures may not apply, because the current issue could be entirely different.
Therefore, dashboards must provide a broad range of functionality and visualizations. Users should be able to filter data by time, isolate variables like location, and zoom into specific time intervals with a few clicks. Dashboards should also accommodate diverse data types, including intricate parent-child relationships and nested columns.
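To make this concrete, here is a minimal sketch of the kind of query a dashboard might issue against Druid's SQL API when a user isolates one dimension over a narrow time window. The endpoint, the app_metrics datasource, and its columns (region, latency_ms) are hypothetical placeholders, not a reference implementation; only the __time column is built into Druid:

```python
import requests

# Hypothetical Druid router endpoint and datasource.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

# Slice a single dimension (region) over a one-hour window, the kind of
# query a dashboard fires when a user zooms into a time interval.
query = """
SELECT region, AVG(latency_ms) AS avg_latency
FROM app_metrics
WHERE __time >= TIMESTAMP '2024-01-01 00:00:00'
  AND __time <  TIMESTAMP '2024-01-01 01:00:00'
GROUP BY region
ORDER BY avg_latency DESC
"""

resp = requests.post(DRUID_SQL_URL, json={"query": query, "resultFormat": "object"})
resp.raise_for_status()
for row in resp.json():
    print(row["region"], row["avg_latency"])
```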
Further, dashboards must offer depth of insight. An online streaming media platform may need to assess user metrics (such as latency or load times) across different devices, operating systems, and regions to fine-tune performance. A cloud provider has to monitor its physical hardware for high temperatures, slow network switches or devices, and other anomalies.
Because data is now so central to success, many more people across a company, including data scientists, product managers and external users, require data-driven insights. In these situations, dashboards must handle the increased user traffic and query activity, maintaining fast responses even under load. This is especially important because a single user operation (such as a zoom) triggers multiple queries on the backend.
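As a sketch of that fan-out, the snippet below shows how one zoom might translate into one query per dashboard panel, issued concurrently. The endpoint and the app_metrics schema are the same hypothetical placeholders as above:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"  # hypothetical endpoint

# One zoom redraws several panels at once; each panel is its own query
# over the newly selected time window.
panel_queries = [
    "SELECT COUNT(*) AS events FROM app_metrics WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '15' MINUTE",
    "SELECT region, COUNT(*) AS events FROM app_metrics WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '15' MINUTE GROUP BY region",
    "SELECT device, AVG(latency_ms) AS avg_latency FROM app_metrics WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '15' MINUTE GROUP BY device",
]

def run_query(sql):
    resp = requests.post(DRUID_SQL_URL, json={"query": sql})
    resp.raise_for_status()
    return resp.json()

# Fire the panel queries concurrently, as a dashboard backend would.
with ThreadPoolExecutor(max_workers=len(panel_queries)) as pool:
    results = list(pool.map(run_query, panel_queries))
```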
A database must also scale seamlessly. If an organization's environment generates millions of events an hour, that adds up to billions of events every week, a challenge for any database to ingest, store, analyze and query. In fact, many databases cannot provide fast response times while managing large datasets and high query volume. Transactional (OLTP) databases can often query rapidly but cannot execute analytics at scale, while analytical (OLAP) databases can analyze massive volumes of data, but not at speed.
Apache Druid for Independent, Ad-hoc Data Exploration
This is where open source Apache Druid comes in. Combining the scale and advanced analytics of an OLAP database with the speed of an OLTP database, Druid offers swift, real-time data exploration.
Upon ingesting data, Druid makes it immediately available for querying and analysis, with no need to first batch or aggregate it. Druid also integrates natively with streaming technologies like Apache Kafka and Amazon Kinesis, eliminating the need for workarounds or connectors.
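As a rough sketch of what that native integration looks like in practice, the snippet below submits a Kafka ingestion supervisor to Druid's Overlord API. The endpoint, topic, datasource and column names are hypothetical, and a production spec would carry considerably more configuration:

```python
import requests

# Hypothetical Overlord endpoint, Kafka topic, and datasource.
SUPERVISOR_URL = "http://localhost:8090/druid/indexer/v1/supervisor"

spec = {
    "type": "kafka",
    "spec": {
        "ioConfig": {
            "type": "kafka",
            "topic": "app-metrics",
            "consumerProperties": {"bootstrap.servers": "kafka:9092"},
            "inputFormat": {"type": "json"},
        },
        "dataSchema": {
            "dataSource": "app_metrics",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["region", "device"]},
            "granularitySpec": {"segmentGranularity": "HOUR", "rollup": False},
        },
    },
}

# Once the supervisor is running, events become queryable as they stream in.
resp = requests.post(SUPERVISOR_URL, json=spec)
resp.raise_for_status()
```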
Druid powers interactive visualizations with millisecond response times, enables more versatile exploration across a wider range of dimensions and filters, and maintains subsecond speeds even as user and query volumes surge.
Druid's unique architecture also enables easy scaling. Druid divides key duties among distinct node types: data nodes for storage, master nodes for managing data ingestion and availability, and query nodes for executing queries and returning results via the scatter/gather method. Each node type can be scaled independently based on need, and Druid automatically rebalances data across nodes to maintain consistent performance.
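The scatter/gather idea itself is simple enough to capture in a toy sketch. The snippet below is not Druid's implementation, only the pattern it names: partial aggregations run where the data lives, and a query node merges the partial results:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy illustration of scatter/gather: a query node "scatters" a partial
# aggregation to every data node holding relevant segments, then
# "gathers" and merges the partial results. Data here is invented.
segments_by_node = {
    "data-node-1": [{"region": "us", "events": 120}, {"region": "eu", "events": 80}],
    "data-node-2": [{"region": "us", "events": 95}, {"region": "apac", "events": 40}],
}

def partial_count(rows):
    # Each data node aggregates only the segments it stores locally.
    counts = {}
    for row in rows:
        counts[row["region"]] = counts.get(row["region"], 0) + row["events"]
    return counts

# Scatter: run the partial aggregation for every node in parallel.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(partial_count, segments_by_node.values()))

# Gather: merge the partial results into the final answer.
merged = {}
for counts in partials:
    for region, n in counts.items():
        merged[region] = merged.get(region, 0) + n

print(merged)  # {'us': 215, 'eu': 80, 'apac': 40}
```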
Salesforce: A Druid Success Story
Salesforce pioneered the customer relationship management (CRM) space, serves 150,000 customers worldwide and earns billions in annual revenue.
The Edge Intelligence Team is the division of Salesforce that tackles the massive task of ingesting, processing, filtering, aggregating, and querying the entirety of its log data: anywhere from billions to trillions of lines daily. Each minute, Salesforce ingests 200 million metrics; each day, it processes five billion events globally. In total, Salesforce accumulates dozens of petabytes of data in its transactional store, five petabytes of logs in its data centers, and almost 200 petabytes in its Hadoop storage.
Salesforce teams use Druid to unlock real-time insights into product performance and user experiences, diving into large datasets instantly. Anyone from engineers to account executives can apply a wide variety of dimensions, filters and aggregations to better understand trends, troubleshoot issues as they arise and set strategies for the future.
By using Druid's compaction capabilities, Salesforce also decreased the number of Druid rows by 82%, reducing its overall storage footprint by 47% and accelerating query response times by 30%.
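For reference, compaction like this can be enabled per datasource through Druid's Coordinator API. Below is a minimal sketch with a hypothetical endpoint and datasource, and illustrative values rather than Salesforce's actual settings:

```python
import requests

# Hypothetical Coordinator endpoint and datasource.
COMPACTION_URL = "http://localhost:8081/druid/coordinator/v1/config/compaction"

config = {
    "dataSource": "app_metrics",
    # Leave the most recent day of data alone while it is still arriving.
    "skipOffsetFromLatest": "P1D",
    # Merge many small segments into larger daily ones, reducing segment
    # count and storage overhead while speeding up queries.
    "granularitySpec": {"segmentGranularity": "DAY"},
}

resp = requests.post(COMPACTION_URL, json=config)
resp.raise_for_status()
```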