A Beginner’s Guide to Apache Druid

Beginners Guide to Apache Druid

Apache Druid is a stream-native, cloud-native, analytics database, designed to be used in situations where real-time analytics queries and ingest really matter.

Knoldus states that Apache Druid is a high-performance columnar data storage system that can be used to execute real-time analytics on a massive data set. 

Druid has been utilized by large companies such as Airbnb and Netflix because it scales well. Companies can easily manage real-time data input, update their database, and create output in the blink of an eye. 

As someone new to Apache Druid Architecture, understanding the basic building blocks is the first step in utilizing this analytics database. As such, this article hopes to offer insight into what Apache Druid does differently, and how it does it.


How Does Apache Druid Work?

As Towards Data Science reports, Druid is the intersection of OLAP databases, Timeseries databases, and search systems. At its heart, Druid utilizes different types of nodes, and each one is designed to handle a specific part of the system’s processing. 

The cluster of nodes forms the druid system. These nodes may exist on a single machine or they can be distributed throughout dozens or even hundreds of linked systems. Druid has four distinct types of notes that work together:

  • Real-time Nodes: These perform the functions of real-time querying and storing the results in a columnar fashion on a buffer, enabling fast results for the query.
  • Historical Nodes: These utilize the information produced by the real-time nodes, serving them for other processes. They aren’t aware of other nodes, and so a failure in a single node doesn’t impact the others.
  • Broker Nodes: These nodes handle sending queries to historical and real-time nodes. They are also responsible for aggregating the data that historical and real-time nodes produce.
  • Coordinator Nodes: These interact specifically with historical nodes, informing them of when data is outdated and when they need to reload from the source. They can also serve as redundant backup points.


Why is Druid so Good at Real-time Analytics?

Druid is very fast in how it delivers results to users. Druid’s processing speed was benchmarked by Xavier Léauté and promoted by Apache, showing how it outperformed SQL queries significantly on a large data set. The reason Druid is so good for real-time analytics boils down to a few key elements of the system.

Column-Oriented Data Storage

Data stored in columns is usually more efficient for compression ratios. Additionally, Druid doesn’t need to load all the columns to perform a query. At any point in time, a query will only need to access a single column of information, shortening the load time. String columns are even further compressed using LZF compression techniques.

Estimation of Cardinality

Cardinality estimates are a problem with larger data sets. Druid overcomes this by using the HyperLogLog algorithm. As stated by Flajolet et al. in the Conference on Analysis of Algorithms in 2007, the algorithm produces an estimation for the cardinality with approximately 97% of accuracy. 

The functionality offered by the HyperLogLog algorithm means that sites can get a real-time estimate for the number of users that are online at any point in time.

Caching and Load Balancing

Broker nodes maintain a pre-segment cache of previously executed queries which they can return when needed, once the data set doesn’t change. Historical and real-time nodes also perform caching to increase the speed of scans. 

Coordinator nodes work to distribute segments among historical nodes, so there isn’t an imbalance. Multiple historic nodes can then serve several queries instead of burdening a single node with all the processing.

Partitioning Based on Time

Timestamps are a significant part of Druid’s performance efficiency. The system can sort data according to timestamps, and so older data can be broken down into months, weeks, and days. Timestamps sorting further enables the system to speed up its writing to disk since it can safely ignore older data as a non-priority. Partitioning can also aid in replicating and distributing segments more efficiently.

Unnecessary Scanning Avoided

By having an indexed list of the locations that a particular value occurs, Druid can limit its selection to only the records in which the value is present. The use of this reference table makes accessing the data within the system much more efficient. 

Unlike a relational database, there is no need to search through every single row and column to locate the search term. Druid can access the locations directly and pull the data without having to search through redundant records unnecessarily.

When Should a Business Use Druid?

Apache Druid is an excellent distributed data store for real-time analytics, but there are situations where relational databases may be more useful. Searching through a data set using a primary key to update a record is better with traditional database systems. Druid doesn’t handle real-time updating, as those jobs happen in the background batches.

If speed isn’t a concern in your business analytics,  Apache Druid might not be a proper fit for you. Druid is a solution to the problem of real-time data analytics and processing massive stores of data quickly and efficiently. 

The system can scale with the growth of the business seamlessly, meaning that there is less need for upgrading those systems. 

Related Posts

Share on facebook
Share on twitter
Share on linkedin
Share on reddit
Share on pinterest