DevOps.com

Will Deduplication Solutions Work for Cassandra?

By: Jeannie Liou on July 27, 2016

Cassandra is a popular next-generation (NoSQL) database system that powers the back end of high-performance web-scale applications in enterprises. It is database software for cloud applications that helps organizations support the growing number of cloud applications that require data distribution across data centers and clouds.

While Cassandra offers high availability, it also poses significant challenges for meeting data protection requirements. Creating cluster-consistent and space-efficient backups of a distributed database can be a difficult task. In this blog post, I will highlight existing deduplication solutions for Cassandra, why deduplication matters, and how to achieve deduplication in Cassandra.

What is Cassandra?

You might be wondering, “Aren’t existing deduplication solutions enough to save space for Cassandra snapshot files?” Multiple deduplication solutions can eliminate redundant data at different levels of the storage layer. So why do we need a new one for Cassandra cluster backups?

Let’s begin with some background information.

Cassandra is a distributed database that is becoming increasingly popular with the emergence of big data applications such as software as a service (SaaS), internet of things (IoT) and real-time analytics. These applications prioritize high availability and scalability over consistency. Cassandra supports eventual consistency rather than the strict consistency provided by traditional database systems such as Oracle, MySQL and IBM DB2. “Eventually consistent” means that replicas converge to the same state over time rather than immediately after each write. As the CAP theorem states, a single system cannot provide all three of the following properties at once: consistency, availability and partition tolerance. In short, Cassandra is an eventually consistent database that provides high performance, high availability and high scalability, but not strong consistency.

Replication and its Role in Deduplication

One of the most important mechanisms of distributed scale-out database systems such as Cassandra is data replication. By replicating the same data on different nodes across failure boundaries, distributed database systems can continue to service application requests even with a certain number of node failures. The downside is the performance overhead of maintaining multiple copies of the data: writes are slower because multiple copies must be created, and reads are slower because consistency must be checked across those copies. Asynchronous replication can minimize the write overhead, but it also lowers the guaranteed consistency level to what is called eventual consistency.
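
To make the replication idea concrete, here is a minimal sketch of how a row might be assigned to several nodes on a hash ring. It is a simplification under stated assumptions, not Cassandra's actual placement code: the node names, the use of MD5 as the hash, and the "walk clockwise to the next nodes" rule are illustrative choices, and real clusters add virtual nodes and topology-aware strategies.

```python
import hashlib
from bisect import bisect_right

def token(value: str) -> int:
    """Map a string to a ring position (MD5 is used only for illustration)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

def replicas_for(key: str, nodes: list[str], replication_factor: int) -> list[str]:
    """Return the nodes holding a copy of `key`: the first node clockwise
    from the key's token, followed by the next nodes on the ring.
    Assumes replication_factor <= len(nodes)."""
    ring = sorted(nodes, key=token)
    tokens = [token(n) for n in ring]
    start = bisect_right(tokens, token(key)) % len(ring)
    return [ring[(start + i) % len(ring)] for i in range(replication_factor)]

def surviving_replicas(key: str, nodes: list[str], replication_factor: int,
                       failed: set[str]) -> list[str]:
    """Replicas of `key` that are still reachable after `failed` nodes go down."""
    return [n for n in replicas_for(key, nodes, replication_factor) if n not in failed]

if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]  # hypothetical cluster
    owners = replicas_for("user:42", nodes, replication_factor=3)
    print("replicas:", owners)
    # Even if two of the three replicas fail, one copy can still serve requests.
    print("survivors:", surviving_replicas("user:42", nodes, 3, failed=set(owners[:2])))
```

With a replication factor of 3, every row lives on three different nodes, which is what keeps the data available through node failures and also what multiplies the amount of data stored.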

As I just explained, replication plays a very important role in a distributed database system, and therefore we should not remove the redundancy from a live Cassandra cluster to save storage space.

The situation is different when we think about backup files (or secondary data) from a Cassandra cluster. Like any other database system in an enterprise organization, Cassandra needs backups. This is not because Cassandra is not reliable or available enough, but primarily because people make mistakes (“fat fingers”) and enterprise applications sometimes have to keep the history of their databases. As they say, to err is human!

Cassandra has a nice node-level snapshot feature, which can persist the complete state of an individual node to snapshot files. One very important point is that a Cassandra snapshot is a “per-node” operation, which does not guarantee anything about the state of the cluster as a whole, as shown in the figure below.

[Figure 1]
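
Because a snapshot is taken per node, a cluster backup needs one snapshot invocation for every node. The sketch below only builds the commands and does not run them. It assumes SSH access to each node and uses hypothetical host names, keyspace name and tag; `nodetool snapshot -t <tag> <keyspace>` is the standard per-node snapshot command.

```python
def snapshot_commands(hosts: list[str], keyspace: str, tag: str) -> list[list[str]]:
    """Build one `nodetool snapshot` invocation per node.

    Nothing is executed here; each command would have to be run on
    (or against) its own node, since a snapshot covers only that node.
    """
    return [["ssh", host, "nodetool", "snapshot", "-t", tag, keyspace]
            for host in hosts]

if __name__ == "__main__":
    hosts = ["cass-1.example.com", "cass-2.example.com", "cass-3.example.com"]  # hypothetical
    for cmd in snapshot_commands(hosts, keyspace="orders", tag="backup-2016-07-27"):
        print(" ".join(cmd))
```

Note that the commands complete at slightly different times on different nodes, which is one reason the resulting set of snapshots says nothing about a single cluster-wide point in time.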

Backing up a Cassandra Cluster

To create a backup of a Cassandra cluster, we have to trigger a snapshot operation on each node, collect the created snapshot files, then treat the set of collected snapshot files as a backup. In this backup, the replicated data exists as is, so the backup will be N times the size of the user data, where “N” is the replication factor. Replication has an important role in a “live” Cassandra cluster to provide high availability and scalability, but what’s the use of replication for backups? If we upload the backup files to an object store such as S3 or Swift, the “already replicated” data will be replicated again by the object store for its own reliability and availability. In short, there are redundant data copies in Cassandra backup files (secondary data). If we can eliminate the redundant data copies, we will save massive storage space for Cassandra backups without sacrificing retention periods. Saving storage space translates directly into saving big bucks on the operational cost of running a big data system across an organization.
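
The size arithmetic above can be sketched in a few lines. The numbers are made up for illustration, and the number of copies kept by the object store is an assumption (it depends on the store and its configuration); compression is ignored.

```python
def backup_size_gb(user_data_gb: float, replication_factor: int,
                   object_store_copies: int = 1) -> float:
    """Physical space used by a backup made of raw per-node snapshots.

    Each row appears `replication_factor` times in the collected snapshots,
    and the object store may keep several copies of every uploaded byte.
    """
    return user_data_gb * replication_factor * object_store_copies

if __name__ == "__main__":
    user_data = 100  # GB of unique user data (illustrative)
    raw = backup_size_gb(user_data, replication_factor=3)
    stored = backup_size_gb(user_data, replication_factor=3, object_store_copies=3)
    deduplicated = backup_size_gb(user_data, replication_factor=1, object_store_copies=3)
    print(f"snapshots collected: {raw} GB")
    print(f"physically stored without deduplication: {stored} GB")
    print(f"physically stored with replicas removed: {deduplicated} GB")
```

Under these assumptions, removing the replicas before upload cuts the stored volume by a factor equal to the replication factor, for every backup kept during the retention period.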

But let’s return to the question of whether existing deduplication solutions would work for Cassandra backup files. You can test this by collecting Cassandra database files and placing them in a deduplication system. Will the solution reduce storage consumption? No! Existing deduplication solutions won’t work for Cassandra data files for the following two reasons:

  • Cassandra has a masterless peer-to-peer architecture. Each node receives a different set of rows, so no two nodes in a cluster are identical and their data files look different, as shown in the following figure. Because each Cassandra data file stores a different combination of rows, data files are hardly ever identical, even at the chunk level. [Figure 2]
  • Cassandra data files are compressed in 64KB chunks regardless of row boundaries. If you know enough about deduplication algorithms, you can easily see why Cassandra data files cannot be easily deduplicated. Fixed-length, chunk-based deduplication will not work because the chunk boundaries do not line up across files, and variable-length, chunk-based deduplication will not work because of compression.
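
The two reasons above can be reproduced with a small experiment. This is a sketch under loud assumptions, not a model of the real on-disk format: rows are synthetic, `zlib` stands in for whatever compressor the cluster uses, the 4KB deduplication block size is arbitrary, and the two row sets are constructed so that exactly half of node B's rows also exist on node A. Only fixed-length block deduplication is simulated.

```python
import hashlib
import zlib

CHUNK = 64 * 1024  # compression chunk size named in the text
BLOCK = 4 * 1024   # deduplication block size (an assumption for this sketch)

def make_row(i: int) -> bytes:
    """Synthetic row of about 200 bytes, derived deterministically from its id."""
    payload = hashlib.sha256(f"row-{i}".encode()).hexdigest() * 3
    return f"{i:08d},{payload}\n".encode()

def data_file(row_ids) -> bytes:
    """Rows sorted by id, cut into 64KB chunks regardless of row boundaries,
    each chunk compressed on its own."""
    raw = b"".join(make_row(i) for i in sorted(row_ids))
    return b"".join(zlib.compress(raw[o:o + CHUNK]) for o in range(0, len(raw), CHUNK))

def block_hashes(data: bytes) -> set:
    """Fingerprints of fixed-length blocks, as a block-level deduplicator would see them."""
    return {hashlib.sha256(data[o:o + BLOCK]).digest() for o in range(0, len(data), BLOCK)}

# Two nodes with overlapping but different row sets.
rows_a = {i for i in range(6000) if i % 3 != 0}
rows_b = {i for i in range(6000) if i % 3 != 1}

blocks_a = block_hashes(data_file(rows_a))
blocks_b = block_hashes(data_file(rows_b))

shared_rows = len(rows_a & rows_b) / len(rows_b)
shared_blocks = len(blocks_a & blocks_b) / len(blocks_b)

print(f"rows of node B that also exist on node A:   {shared_rows:.0%}")
print(f"blocks of node B that also exist on node A: {shared_blocks:.0%}")
```

The row overlap is 50% by construction, yet the block overlap should come out at or near zero: the redundancy is real, but it is invisible at the block level once different row combinations have been packed and compressed.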

Cassandra’s compaction and small record size are other reasons why existing block- or file-level deduplication solutions will not work for Cassandra backup files. Compaction is an operation that each node runs independently to merge multiple database files into a new data file, so replicas of the same rows end up in differently laid-out files. Small records can hardly be deduplicated with larger chunks, because a chunk matches only if every record in it matches.
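
A second small experiment illustrates the compaction and record-size effect, with compression left out on purpose to isolate it. Again this is only a sketch under assumptions: the rows are synthetic, the 4KB block size is arbitrary, and "compaction" is reduced to merging two sorted files into one. Both nodes hold exactly the same rows; node A has already compacted, node B has not.

```python
import hashlib

BLOCK = 4 * 1024  # deduplication block size (an assumption for this sketch)

def make_row(i: int) -> bytes:
    """Synthetic 74-byte record, far smaller than one deduplication block."""
    return f"{i:08d},{hashlib.sha256(str(i).encode()).hexdigest()}\n".encode()

def data_file(row_ids) -> bytes:
    """Uncompressed file with rows sorted by id."""
    return b"".join(make_row(i) for i in sorted(row_ids))

def block_hashes(files) -> set:
    """Fingerprints of every fixed-length block across a node's files."""
    return {hashlib.sha256(f[o:o + BLOCK]).digest()
            for f in files for o in range(0, len(f), BLOCK)}

ids = list(range(4000))
even_ids = [i for i in ids if i % 2 == 0]
odd_ids = [i for i in ids if i % 2 == 1]

node_a_files = [data_file(ids)]                            # one file after compaction
node_b_files = [data_file(even_ids), data_file(odd_ids)]   # two files, not yet compacted

records_per_block = BLOCK // len(make_row(0))
same_rows = set(ids) == set(even_ids) | set(odd_ids)
shared_blocks = len(block_hashes(node_a_files) & block_hashes(node_b_files))

print(f"records per {BLOCK}-byte block: {records_per_block}")
print(f"both nodes hold the same rows: {same_rows}")
print(f"identical blocks across the two nodes: {shared_blocks}")
```

Each block spans dozens of records, and the compacted file interleaves rows that sit in separate files on the other node, so no block on one node matches a block on the other even though the logical content is identical.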

Summary

The existing file- and block-level deduplication solutions will not work for Cassandra backup files because:

  1. Each Cassandra data file contains a different set of records.
  2. Cassandra data files are compressed regardless of row boundaries.
  3. Each Cassandra node runs compactions independently.
  4. Cassandra records can be smaller than the deduplication chunk size.

