Will Deduplication Solutions Work for Cassandra?

Cassandra is a popular next-generation (NoSQL) database system that powers the back end of high-performance, web-scale applications in enterprises. It is database software for cloud applications, and it helps organizations support the growing number of cloud applications that require data distribution across data centers and clouds.

While Cassandra offers high availability, data protection requirements still have to be met, and creating a cluster-consistent, space-efficient backup of a distributed database can be a challenging task. In this blog post, I will highlight existing deduplication solutions for Cassandra, why deduplication matters, and how to achieve deduplication in Cassandra.

What is Cassandra?

Here is a brief introduction. You might be curious, “Aren’t existing deduplication solutions enough to save space for Cassandra snapshot files?” There are multiple deduplication solutions that can eliminate redundant data from the different levels of the storage layer. So why do we need a new one for Cassandra cluster backups?

Let’s begin with some background information.

Cassandra is a distributed database that is becoming increasingly popular with the emergence of big data applications such as software as a service (SaaS), internet of things (IoT) and real-time analytics. These applications prioritize high availability and scalability over consistency. Cassandra supports eventual consistency rather than the strict consistency provided by traditional database systems such as Oracle, MySQL and IBM DB2. “Eventually consistent” means that consistency will be achieved eventually rather than immediately. As the well-known CAP theorem states, a single distributed system cannot provide all three of the following properties at the same time: consistency, availability and partition tolerance. In short, Cassandra is an eventually consistent database that provides high performance, high availability and high scalability, but not strong consistency.

Replication and its Role in Deduplication

One of the most important mechanisms of distributed scale-out database systems such as Cassandra is data replication. By replicating the same data on different nodes across failure boundaries, distributed database systems can continue to service application requests even with a certain number of node failures. The downside is the performance overhead of maintaining multiple data copies: both write and read operations are slower because the system has to create multiple copies and check consistency among them. Although asynchronous data replication can be used to minimize the performance overhead of writes, it also lowers the level of guaranteed consistency, which is what is called eventual consistency.
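To make the replication idea concrete, here is a minimal sketch of how rows might be placed on several nodes so that reads survive node failures. It is a simplified illustration, not Cassandra's actual placement logic: the node names, the `replicas_for` and `is_readable` helpers, and the hash-modulo ring are all assumptions made for the example.

```python
import hashlib

NODES = ["node1", "node2", "node3", "node4"]  # hypothetical four-node cluster
REPLICATION_FACTOR = 3                         # copies kept of every row


def replicas_for(key, nodes=NODES, rf=REPLICATION_FACTOR):
    """Pick rf consecutive nodes on a ring, starting at a hash of the key.

    This is a simplified stand-in for real token-based placement.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(rf)]


def is_readable(key, failed_nodes):
    """A row stays readable while at least one of its replicas is on a live node."""
    return any(node not in failed_nodes for node in replicas_for(key))


keys = [f"user:{i}" for i in range(1000)]

# With three distinct replicas per row, losing any two nodes cannot
# remove every copy of a row.
two_down = {"node1", "node2"}
print(all(is_readable(k, two_down) for k in keys))  # True
```

The same picture explains the cost: every write has to reach three nodes, and every row occupies three times its size across the cluster.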

As I just explained, replication plays a very important role in a distributed database system, and therefore we should not remove the redundancy from a live Cassandra cluster to save storage space.

The situation becomes different when we think about backup files (or secondary data) from a Cassandra cluster. Like any other database system in an enterprise organization, Cassandra needs backups. This is not because Cassandra is unreliable or insufficiently available, but primarily because people make mistakes (“fat fingers”) and enterprise applications sometimes have to keep the history of their databases. As they say, to err is human!

Cassandra has a nice node-level snapshot feature, which can persist the complete state of an individual node to snapshot files. One very important point is that a Cassandra snapshot is a “per-node” operation, which does not guarantee anything about the state of the cluster as a whole, as shown in the figure below.

Backing up a Cassandra Cluster

To create a backup of a Cassandra cluster, we have to trigger a snapshot operation on each node, collect the resulting snapshot files, and then declare the set of collected snapshot files to be a backup. In this backup, the replicated data exists as is, so the backup will be roughly N times the size of the user data, where “N” is the replication factor. Replication plays an important role in a “live” Cassandra cluster by providing high availability and scalability, but what’s the use of replication for backups? If we upload the backup files to an object store such as S3 or Swift, the “already replicated” data will be replicated again by the object store, which provides reliability and availability on its own. In short, there are redundant data copies in Cassandra backup files (secondary data). If we can eliminate the redundant copies, we will save massive storage space for Cassandra backups without sacrificing retention periods. Saving storage space translates directly to saving big bucks on the operational cost of maintaining and running a big data system across an organization.
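A quick back-of-the-envelope calculation shows how the copies pile up. The figures below are hypothetical, chosen only for illustration: 1,000 GB of user data, a replication factor of 3, and an object store that is assumed to keep 3 copies of whatever it receives (real object stores do not necessarily work this way).

```python
def backup_size_gb(user_data_gb, replication_factor):
    """Size of a backup built by collecting every node's snapshot files as is."""
    return user_data_gb * replication_factor


def stored_size_gb(backup_gb, object_store_copies):
    """Physical footprint once the object store adds its own redundancy."""
    return backup_gb * object_store_copies


user_data_gb = 1_000       # hypothetical amount of unique user data
replication_factor = 3     # "N" in the text
object_store_copies = 3    # assumption: copies kept by the object store

raw_backup = backup_size_gb(user_data_gb, replication_factor)
deduplicated_backup = backup_size_gb(user_data_gb, 1)  # one copy of each row

print(raw_backup, stored_size_gb(raw_backup, object_store_copies))
# 3000 9000
print(deduplicated_backup, stored_size_gb(deduplicated_backup, object_store_copies))
# 1000 3000
```

Under these assumptions, a single backup that keeps every replica occupies nine times the size of the user data, and removing the replicas from the backup cuts that to three times. The gap is multiplied again by every backup kept within the retention period.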

But let’s return to the question of whether existing deduplication solutions would work for Cassandra backup files. You can test this by collecting Cassandra database files and placing them on a deduplication system. Will the solution reduce storage consumption? No! Existing deduplication solutions don’t work for Cassandra data files, for the following two reasons:

Cassandra has a masterless peer-to-peer architecture. Each node receives a different set of rows and, therefore, there are no identical nodes in a cluster, which means data files will look different, as shown in the following figure. Because each data file stores a different combination of rows, data files are rarely identical, even at the chunk level.
Cassandra data files are compressed in 64KB chunks regardless of row boundaries. If you know enough about deduplication algorithms, you can easily understand why Cassandra data files cannot be easily deduplicated. Fixed-length, chunk-based deduplication will not work because of the chunk alignment, and variable-length, chunk-based deduplication will not work because of compression.
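The two reasons above can be reproduced with a small experiment on synthetic data. The sketch below is a toy model under stated assumptions, not a measurement of real Cassandra files: the rows are generated text, the two "nodes" hold hypothetical overlapping row sets, `zlib` stands in for whatever compressor the cluster uses, the 4KB block size is arbitrary, and `variable_chunks` is a deliberately simple content-defined chunker built for this example.

```python
import hashlib
import random
import zlib

WORDS = ("alpha beta gamma delta epsilon zeta eta theta iota kappa lambda mu "
         "nu xi omicron pi rho sigma tau upsilon phi chi psi omega").split()


def make_row(i, rng):
    """One synthetic record: a unique key followed by compressible text."""
    body = " ".join(rng.choice(WORDS) for _ in range(120))
    return f"key{i:06d}|{body}\n".encode("ascii")


def data_file(rows):
    """Concatenate a node's rows in key order, like one uncompressed data file."""
    return b"".join(rows[i] for i in sorted(rows))


def compress_64k(data, chunk=64 * 1024):
    """Compress fixed 64KB slices independently, ignoring row boundaries."""
    return b"".join(zlib.compress(data[i:i + chunk])
                    for i in range(0, len(data), chunk))


def fixed_blocks(data, size=4096):
    """Full fixed-length blocks only (a short tail block is ignored)."""
    return [data[i:i + size] for i in range(0, len(data) - size + 1, size)]


def variable_chunks(data, window=16, mask=0x3F, min_size=32):
    """Toy content-defined chunking: cut where a checksum of the last
    `window` bytes matches a bit pattern, at least min_size after the last cut."""
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if end - start >= min_size and zlib.crc32(data[end - window:end]) & mask == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared_fraction(chunks_a, chunks_b):
    """Fraction of B's bytes that sit in chunks already present in A."""
    seen = {hashlib.sha256(c).digest() for c in chunks_a}
    total = sum(len(c) for c in chunks_b)
    dup = sum(len(c) for c in chunks_b if hashlib.sha256(c).digest() in seen)
    return dup / total if total else 0.0


rng = random.Random(42)
rows = {i: make_row(i, rng) for i in range(301)}
# Hypothetical placement: rows with i % 3 == 2 live on both nodes,
# the remaining rows live on only one of them.
node_a = {i: r for i, r in rows.items() if i % 3 != 0}
node_b = {i: r for i, r in rows.items() if i % 3 != 1}

raw_a, raw_b = data_file(node_a), data_file(node_b)
zip_a, zip_b = compress_64k(raw_a), compress_64k(raw_b)

row_level = shared_fraction(list(node_a.values()), list(node_b.values()))
fixed_raw = shared_fraction(fixed_blocks(raw_a), fixed_blocks(raw_b))
fixed_zip = shared_fraction(fixed_blocks(zip_a), fixed_blocks(zip_b))
cdc_raw = shared_fraction(variable_chunks(raw_a), variable_chunks(raw_b))
cdc_zip = shared_fraction(variable_chunks(zip_a), variable_chunks(zip_b))

print(f"duplicate rows on node B:            {row_level:.0%}")
print(f"fixed 4KB blocks, uncompressed:      {fixed_raw:.0%}")
print(f"fixed 4KB blocks, compressed:        {fixed_zip:.0%}")
print(f"variable-length chunks, uncompressed: {cdc_raw:.0%}")
print(f"variable-length chunks, compressed:   {cdc_zip:.0%}")
```

In this toy setup, about half of node B's bytes are rows that node A also holds, yet fixed-length blocks find nothing in common: every 4KB block also contains at least one row that exists on only one node. Variable-length chunking recovers part of the duplicate data on the uncompressed files, and that advantage should largely disappear once the same bytes have been compressed in 64KB slices. The exact percentages depend on the synthetic data and the chunking parameters.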

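Cassandra’s compaction and small record size are other reasons why the existing block- or file-level deduplication solutions will not work for Cassandra backup files. Compaction is an independent operation that merges multiple database files into a new data file. Because each node runs compaction on its own, replicas of the same rows end up in files with different layouts on different nodes. In addition, small records can hardly be deduplicated when the deduplication chunks are larger than the records themselves.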