DevOps.com

Will Deduplication Solutions Work for Cassandra?

By: Jeannie Liou on July 27, 2016

Cassandra is a popular next-generation NoSQL database system that powers the back end of high-performance, web-scale applications in enterprises. It is database software built for cloud applications, accelerating an organization's ability to support the growing number of applications that require data distribution across data centers and clouds.


While Cassandra offers high availability, meeting data protection requirements for it takes real work: creating a cluster-consistent and space-efficient backup of a distributed database is a challenging task. In this blog post, I will look at existing deduplication solutions for Cassandra, why deduplication matters, and how to achieve deduplication in Cassandra.


What is Cassandra?

Before we get there, here is a brief introduction. You might be wondering, "Aren't existing deduplication solutions enough to save space for Cassandra snapshot files?" There are, after all, multiple deduplication solutions that eliminate redundant data at different levels of the storage stack. So why do we need a new one for Cassandra cluster backups?

Let’s begin with some background information.

Cassandra is a distributed database that is becoming increasingly popular with the emergence of big data applications such as software as a service (SaaS), internet of things (IoT) and real-time analytics. These applications prioritize high availability and scalability over consistency. Cassandra supports eventual consistency rather than the strict consistency provided by traditional database systems such as Oracle, MySQL and IBM DB2. "Eventually consistent" means consistency will be achieved eventually rather than immediately. As the CAP theorem states, no single system can provide all three of the following properties at once: consistency, availability and partition tolerance. In short, Cassandra is an eventually consistent database that provides high performance, high availability and high scalability—but not strong consistency.
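The trade-off between strong and eventual consistency can be made concrete with the Dynamo-style rule behind Cassandra's tunable consistency levels: a read is guaranteed to see the latest write whenever the read and write replica sets must overlap. A minimal sketch (the function name and values are illustrative, not Cassandra APIs):

```python
def read_sees_latest_write(n: int, w: int, r: int) -> bool:
    """True if any set of R replicas read must intersect any set of
    W replicas written, i.e. the quorum overlap condition R + W > N."""
    return r + w > n

# With replication factor N=3:
assert read_sees_latest_write(3, 2, 2)      # QUORUM writes + QUORUM reads overlap
assert not read_sees_latest_write(3, 1, 1)  # ONE/ONE: only eventual consistency
```

Cassandra exposes exactly this dial per-query via consistency levels such as ONE, QUORUM and ALL, which is how it trades consistency against availability and latency.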

Replication and its Role in Deduplication

One of the most important mechanisms of distributed scale-out database systems such as Cassandra is data replication. By replicating the same data on different nodes across failure boundaries, distributed database systems can continue to service application requests even when a certain number of nodes fail. The downside is the performance overhead of maintaining multiple data copies; both write and read operations become slower, since writes must create multiple copies and reads must check consistency among them. Although asynchronous replication can be used to minimize the performance overhead of writes, it also lowers the guaranteed level of consistency, which is what is called eventual consistency.
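To make the replication mechanism concrete, here is a deliberately simplified sketch of ring-based replica placement: hash the partition key onto a ring of nodes, then store copies on the next `rf` distinct nodes. (This is a toy model; Cassandra's real placement uses token ranges, partitioners and replication strategies such as NetworkTopologyStrategy.)

```python
import hashlib

def replica_nodes(key: str, nodes: list, rf: int) -> list:
    """Toy Cassandra-style placement: hash the partition key to a
    starting position on the ring, then take the next rf distinct
    nodes clockwise as replica holders."""
    start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(rf)]

nodes = ["node1", "node2", "node3", "node4"]
owners = replica_nodes("user:42", nodes, rf=3)
assert len(set(owners)) == 3  # three distinct nodes each hold a copy of the row
```

The point of the sketch: every row ends up on `rf` different nodes, which is exactly the redundancy that keeps the cluster available—and exactly the redundancy that bloats backups, as discussed below.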

As I just explained, replication plays a very important role in a distributed database system, and therefore we should not remove the redundancy from a live Cassandra cluster to save storage space.

The situation becomes different when we think about backup files (or, secondary data) from a Cassandra cluster. Like any other database system in an enterprise organization, backups are needed for Cassandra; it is not because Cassandra is not reliable or not available enough, but primarily because people make mistakes (“fat fingers”) and enterprise applications sometimes have to keep the history of their databases. As they say, to err is human!

Cassandra has a nice node-level snapshot feature, which can persist the complete state of an individual node to snapshot files. One very important point is that a Cassandra snapshot is a "per-node" operation, which guarantees nothing about cluster-wide state, as shown in the figure below.

[Figure 1: A Cassandra snapshot is per-node; each node's snapshot captures only that node's state, not a consistent cluster-wide state.]

Backing up a Cassandra Cluster

To create a backup of a Cassandra cluster, we have to trigger a snapshot operation on each node, collect created snapshot files, then claim the set of collected snapshot files as a backup. In this backup, the replicated data exist as is, and the size of the backup will be N times bigger than the size of user data, where “N” is the replication factor. Replication has an important role in a “live” Cassandra cluster to provide high availability and scalability, but what’s the use of replication for backups? If we upload the backup files to an object store such as S3 or Swift, the “already replicated” data will be replicated again by the object store to provide reliability and availability for their own sake. In short, there are redundant data copies in Cassandra backup files (secondary data). If we can eliminate the redundant data copies we will save massive storage space for Cassandra backups without sacrificing retention periods. Saving on storage space directly translates to saving big bucks for the operational cost of maintaining and operationalizing a big data system across an organization.
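The storage blow-up described above is simple multiplication, and it compounds. A quick illustrative calculation (the factor of 3 for the object store is an assumption for the sake of the example; real stores like S3 use their own replication or erasure-coding schemes):

```python
def backup_footprint_gb(user_data_gb: float, rf: int, store_copies: int) -> float:
    """Raw footprint of a naive cluster backup: the snapshot files carry
    rf copies of every row, and the object store replicates the uploaded
    files again store_copies times for its own durability."""
    return user_data_gb * rf * store_copies

# 1 TB of user data, RF=3 cluster, object store keeping 3 copies:
footprint = backup_footprint_gb(1000, rf=3, store_copies=3)
assert footprint == 9000  # 9 TB of raw storage for 1 TB of user data
```

Deduplicating the replicated rows before upload would cut the first factor from 3 to 1, which is where the savings come from.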

But let’s return to the question of whether existing deduplication solutions would work for Cassandra backup files. You can test this by collecting Cassandra database files and placing them in a deduplication system. Will the solution reduce storage consumption? No! Existing deduplication solutions won’t work for Cassandra data files, for the following two reasons:

  • Cassandra has a masterless, peer-to-peer architecture. Each node receives a different set of rows; therefore, no two nodes in a cluster are identical and their data files look different, as shown in the following figure. Since each data file stores a different combination of rows, the data files will hardly be identical even at the chunk level.
    [Figure 2: Each node stores a different combination of rows, so data files differ across nodes.]
  • Cassandra data files are compressed in 64KB chunks regardless of row boundaries. If you know enough about deduplication algorithms, you can easily see why Cassandra data files cannot be deduplicated: fixed-length, chunk-based deduplication fails because of chunk misalignment, and variable-length, chunk-based deduplication fails because of compression.
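The chunk-misalignment problem is easy to demonstrate. The sketch below chunks a byte stream into fixed 64KB blocks, compresses each block (as an SSTable would), and hashes the result; inserting a single byte at the front shifts every chunk boundary, so nothing dedupes even though almost all of the underlying data is identical. The data and sizes are illustrative, not taken from a real SSTable:

```python
import hashlib
import os
import zlib

CHUNK = 64 * 1024  # Cassandra compresses SSTable data in 64KB chunks

def chunk_hashes(data: bytes) -> set:
    """Split into fixed-size chunks, compress each chunk, and fingerprint it,
    mimicking what a fixed-length dedup system sees on compressed files."""
    return {
        hashlib.sha256(zlib.compress(data[i:i + CHUNK])).hexdigest()
        for i in range(0, len(data), CHUNK)
    }

base = os.urandom(1024 * 1024)  # 1 MiB standing in for row data
shifted = b"X" + base           # same rows, one extra byte at the front

common = chunk_hashes(base) & chunk_hashes(shifted)
assert len(common) == 0  # every chunk boundary shifted: zero chunks dedupe
```

Compression makes things worse for variable-length (content-defined) chunking too: boundaries chosen from the compressed bytes no longer correspond to stable content in the underlying rows, so near-identical rows produce entirely different compressed output.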

Cassandra’s compaction and its small record sizes are further reasons why existing block- or file-level deduplication solutions will not work for Cassandra backup files. Compaction is an independent operation that merges multiple data files into a new data file, and small records can hardly be deduplicated with larger chunks.

Summary

The existing file- and block-level deduplication solutions will not work for Cassandra backup files because:

  1. Each Cassandra data file contains a different set of records.
  2. Cassandra data files are compressed regardless of row boundaries.
  3. Cassandra runs compactions independently on each node.
  4. Cassandra records can be smaller than the deduplication chunk size.

Filed Under: Blogs, DevOps Toolbox Tagged With: Cassandra, clusters, data, data storage, database, database backup, deduplication solutions
