Why Reinvent Deduplication? Isn’t Cloud Storage Cheap?

Most people assume cloud storage is cheaper than on-premises storage. After all, why wouldn’t they? You can rent object storage for $276 per terabyte per year or less, depending on your performance and access requirements. Enterprise storage costs between $2,500 and $4,000 per terabyte per year, according to analysts at Gartner and ESG.

This comparison makes sense for primary data, but what happens when you make backups or copies of data for other reasons in the cloud? Imagine that an enterprise needs to retain three years of monthly backups of a 100TB data set. In the cloud, this can easily equate to 3.6PB of raw backup data, or a monthly bill of roughly $83,000. That’s about $1 million a year, even before factoring in any data access or retrieval charges.
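The arithmetic behind those figures can be sketched in a few lines. This is a rough estimate only: it assumes 36 full monthly backups are all held at once, a flat $23 per TB per month for standard object storage, and decimal terabytes. Actual per-GB billing may come out slightly higher, and the function and constant names are made up for illustration.

```python
# Assumed figures: 100 TB data set, monthly full backups retained for
# three years, standard object storage at a flat $23 per TB per month.
DATASET_TB = 100
RETAINED_BACKUPS = 36          # 3 years of monthly backups
PRICE_PER_TB_MONTH = 23.0      # USD, assumed flat rate


def retained_backup_cost(dataset_tb, backups, price_per_tb_month):
    """Return (raw TB stored, monthly cost, yearly cost) with no deduplication."""
    raw_tb = dataset_tb * backups
    monthly = raw_tb * price_per_tb_month
    return raw_tb, monthly, monthly * 12


raw_tb, monthly, yearly = retained_backup_cost(
    DATASET_TB, RETAINED_BACKUPS, PRICE_PER_TB_MONTH
)
print(f"raw backup data: {raw_tb / 1000:.1f} PB")
print(f"monthly bill:    ${monthly:,.0f}")
print(f"yearly bill:     ${yearly:,.0f}")
```

Changing the retention period or the per-TB price shows how quickly the bill scales with the number of retained copies rather than with the size of the primary data.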

That is precisely why efficient deduplication is hugely important for both on-premises and cloud storage, especially when enterprises want to retain their secondary data (backup, archival, long-term retention) for weeks, months and years. Cloud storage costs can add up quickly, surprising even astute IT professionals, especially as data sizes grow with web-scale architectures: data gets replicated, and they discover it can’t be deduplicated in the cloud.

The Promise of Cloud Storage: Cheap, Scalable, Forever Available

Cloud storage is viewed as cheap, reliable and infinitely scalable, which is generally true. Object storage such as AWS S3 is available at just $23/TB per month for the standard tier, or $12.50/TB for the Infrequent Access tier. Many modern applications can take advantage of object storage. Cloud providers also offer their own file or block options, such as AWS EBS (Elastic Block Store), which starts at $100/TB per month, prorated hourly. Third-party solutions also exist that connect traditional file or block storage to object storage as a back end.

Even AWS EBS, at $1,200/TB per year, compares favorably to on-premises solutions that cost 2 to 3 times as much and require high upfront capital expenditures. To recap, enterprises are gravitating to the cloud because the OPEX costs are significantly lower, there’s minimal upfront cost and you pay as you go (vs. traditional storage, which you have to buy far ahead of actual need).

Copies, Copies Everywhere: How Cloud Storage Costs Skyrocket

The direct cost comparison between cloud storage and traditional on-premises storage can distract from managing storage costs in the cloud, particularly as more and more data and applications move there. There are three components to cloud storage costs to consider:

Cost for storing the primary data, either on object or block storage
Cost for any copies, snapshots, backups or archive copies of data
Transfer charges for data

We’ve covered the first one. Let’s look at the other two.

Copies of data. The issue is not how much data you put into the cloud: uploading data is free, and storing a single copy is cheap. Costs spiral when you start making multiple copies of data, whether for backups, archives or any other reason. Even if you don’t make explicit copies, applications and databases often have built-in redundancy and replicate data automatically (in database parlance, according to a replication factor).

In the cloud, each copy you make of an object incurs the same cost as the original. Cloud providers may do some deduplication or compression behind the scenes, but this isn’t generally credited back to the customer. For example, in a consumer cloud storage service such as Dropbox, whether you make one copy or 10 copies of a file, each copy counts against your storage quota.

For enterprises, this means data snapshots, backups and archived data all incur additional costs. As an example, AWS EBS charges $0.05/GB per month for storing snapshots. While the snapshots are compressed and only store incremental data, they’re not deduplicated. Storing a snapshot of that 100TB dataset could cost $60,000 per year, and that’s assuming it doesn’t grow at all.

Data access. Public cloud providers generally charge for data transfer either between cloud regions or out of the cloud. For example, moving or copying a TB of AWS S3 data between Amazon regions costs $20, and transferring a TB of data out to the internet costs $90. Combined with GET, PUT, POST, LIST and DELETE request charges, data access costs can really add up.
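As a rough illustration of how transfer charges accumulate, the sketch below applies the two per-TB rates quoted above to a hypothetical scenario. The scenario and the function name are assumptions, per-request fees are ignored, and real pricing varies by provider, region and volume tier.

```python
# Per-TB transfer rates quoted in the text (USD); treated here as flat rates.
INTER_REGION_PER_TB = 20.0
INTERNET_EGRESS_PER_TB = 90.0


def transfer_cost(tb_between_regions, tb_to_internet):
    """Estimate data transfer charges, ignoring per-request fees."""
    return (tb_between_regions * INTER_REGION_PER_TB
            + tb_to_internet * INTERNET_EGRESS_PER_TB)


# Hypothetical scenario: copy a 100 TB data set to a second region once,
# then restore 10 TB out to an on-premises site.
cost = transfer_cost(tb_between_regions=100, tb_to_internet=10)
print(f"estimated transfer charges: ${cost:,.0f}")
```

Every redundant copy that is moved pays these rates again, which is why reducing the data before it is transferred matters as much as reducing what is stored.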

Why Deduplication in the Cloud Matters

Cloud applications are distributed by design and are typically deployed on massively scalable non-relational databases. In non-relational databases, most data is redundant before you even make a copy. Many blocks and objects are common across the data set, and databases such as MongoDB or Cassandra typically use a replication factor (RF) of 3 to ensure data integrity in a distributed cluster, so you start out with three copies.

Backups or secondary copies are usually created and maintained via snapshots (for example, using EBS snapshots as noted earlier). The database architecture means that when you take a snapshot, you’re really making three copies of the data. Without any deduplication, this gets really expensive. And existing solutions, designed for on-premises legacy storage, can’t help.
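To see what the replication factor does to the snapshot bill, here is a minimal sketch. It assumes the $0.05/GB per month snapshot rate mentioned earlier, decimal terabytes, a single full snapshot of every replica, and no growth; the function name and the `dedup_replicas` flag are hypothetical.

```python
SNAPSHOT_PRICE_PER_GB_MONTH = 0.05   # USD, snapshot rate assumed from the text
GB_PER_TB = 1000                     # decimal TB, an assumption


def yearly_snapshot_cost(logical_tb, replication_factor, dedup_replicas=False):
    """Yearly cost of holding one snapshot of every replica in a cluster.

    With dedup_replicas=True, the replicas are assumed to collapse to a
    single logical copy before being stored.
    """
    copies = 1 if dedup_replicas else replication_factor
    stored_gb = logical_tb * copies * GB_PER_TB
    return stored_gb * SNAPSHOT_PRICE_PER_GB_MONTH * 12


naive = yearly_snapshot_cost(100, replication_factor=3)
deduped = yearly_snapshot_cost(100, replication_factor=3, dedup_replicas=True)
print(f"RF=3, no deduplication:   ${naive:,.0f} per year")
print(f"RF=3, replicas collapsed: ${deduped:,.0f} per year")
```

Under these assumptions, collapsing the three replicas alone cuts the snapshot cost to a third, before any deduplication across snapshots over time.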

Not Just Deduplication — Semantic Deduplication

Most deduplication technology works at the storage layer, deduplicating blocks of data. This is highly efficient on centralized SAN or NAS storage, but breaks down if the data layer is abstracted from the storage—as it is in a distributed database such as MongoDB. Deduplication in this world needs to address two fundamental issues:

It has to work at the data layer, not the storage layer. To deduplicate data from a distributed cluster, the software has to understand and interpret the underlying data structure.
It has to eliminate redundant data before it gets written to the database. Once data is written, it gets replicated within the cluster, so it needs to be deduplicated in-flight.
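The two requirements above can be illustrated with a toy example. This is only a sketch of the general idea, not the semantic deduplication technology discussed in this article: it assumes records are simple JSON-serializable dictionaries, fingerprints each record by its logical content rather than its on-disk bytes, and drops duplicates while the records are streamed from each replica. All names here are hypothetical.

```python
import hashlib
import json


def record_fingerprint(record):
    """Hash a record's logical content, independent of field order."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedup_stream(replica_streams, seen=None):
    """Yield each logically distinct record once across all replica streams."""
    seen = set() if seen is None else seen
    for stream in replica_streams:
        for record in stream:
            fp = record_fingerprint(record)
            if fp not in seen:      # first time this content is seen
                seen.add(fp)
                yield record


# Hypothetical three-node cluster with RF=3: every node holds the same two
# records, sometimes with fields or rows in a different order.
node_a = [{"_id": 1, "name": "alice"}, {"_id": 2, "name": "bob"}]
node_b = [{"name": "alice", "_id": 1}, {"_id": 2, "name": "bob"}]
node_c = [{"_id": 2, "name": "bob"}, {"_id": 1, "name": "alice"}]

unique = list(dedup_stream([node_a, node_b, node_c]))
print(f"{len(unique)} unique records out of 6 streamed")
```

A block-level deduplicator would likely miss these duplicates, because each node may lay out the same records differently on disk. Working on the interpreted records, in-flight, is what lets the replicas collapse to one copy before anything is written.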

The good news is that semantic deduplication technology exists that works efficiently with distributed cloud applications and can help cut storage costs by up to 80 percent for databases such as MongoDB and Cassandra.

— Shalabh Goyal