I got my first DBA job 15 years ago, and quite a few things have changed since then. The amount of data has exploded, NoSQL rose to prominence, and business logic moved out of relational databases and into applications.
To this day, the three pillars of reliability, scalability and maintainability remain the bane of a DBA’s existence. To spice things up, your monthly cloud bill for managed databases just made your CTO lose the will to live.
So, now you have two extra things to worry about in addition to these three pillars: cost efficiency and vendor-agnosticism.
In this article, I’ll (try to) run an open source, new-generation database, benchmark it and touch on some of the challenges of being cloud-neutral.
TL;DR
- Solving the SQL vs. NoSQL dilemma in the Kubernetes world
- How to leverage a distributed SQL database in a multi-cloud world
- Taming performance – single cloud versus multi-cloud
- Running TPC-C as a transactional benchmark
- Taming the costs – the question of egress costs
New Packaging For a New Age
Just five years ago, people were like, “Containerizing stateless microservices? Sure! No problem.”
But actually containerizing a database? That felt like asking your persistence layer for trouble.
Over time, containers grew on us. Now they’re the default way of packaging and shipping software. Heck, even AWS Lambda caved and started running container images. Databases couldn’t resist for long, either.
Once a database is containerized – say, a PostgreSQL database – Kubernetes takes over dealing with some of the more painful aspects of reliability and maintainability (with the help of middleware-specific operators).
Containers decouple data from the runtime, so upgrades and configuration changes are easier through infrastructure-as-code (YAML). Hardware failures and maintenance windows are covered by Kubernetes’ built-in ability to reschedule pending pods onto surviving nodes. Scheduled full backups and point-in-time recovery are in the hands of the operator’s automation.
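To make the decoupling concrete, here is a minimal, hand-rolled sketch in plain Kubernetes (names, image tag and sizes are placeholders; a real deployment would usually go through an operator’s CRD instead): the pod can be deleted, upgraded or rescheduled, while the persistent volume claim holding the data survives.

```yaml
# Illustration only: a production PostgreSQL deployment would normally be
# described through an operator's CRD rather than a raw StatefulSet.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: pg                        # placeholder name
spec:
  serviceName: pg
  replicas: 1
  selector:
    matchLabels:
      app: pg
  template:
    metadata:
      labels:
        app: pg
    spec:
      containers:
        - name: postgres
          image: postgres:13      # an upgrade is a one-line change, re-applied
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:           # the volume outlives any individual pod
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi
```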
But what about scalability?
Kubernetes’ vertical pod autoscaler can adjust container resources by deleting pods and recreating them with new resource settings. But as a traditional relational database, PostgreSQL can have only one primary instance that accepts writes, which means you’ll face a short downtime while the new pod comes online. Most workloads are provisioned with roughly five times the resources they actually use, so while resizing is painful, it beats not resizing at all (please, have mercy on your CTO’s budget).
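As a sketch of what that looks like, here is a VerticalPodAutoscaler object (the target name is hypothetical and matches the StatefulSet sketch above); “Auto” mode is exactly what triggers the delete-and-recreate cycle, and with a single-primary database, the short write downtime described above:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: pg-vpa                   # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: pg                     # the database StatefulSet from the sketch above
  updatePolicy:
    updateMode: "Auto"           # VPA evicts and recreates pods with new requests
  resourcePolicy:
    containerPolicies:
      - containerName: postgres
        minAllowed:
          cpu: "1"
          memory: 2Gi
        maxAllowed:
          cpu: "8"
          memory: 32Gi
```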
NoSQL has none of these limitations. Databases like Cassandra were built for scaling horizontally without downtime.
Enter Distributed SQL
What if we could blend Google Spanner’s ability to operate across geographic distances with strong ACID consistency, Cassandra’s horizontal scaling and PostgreSQL compatibility?
This is where YugabyteDB helps.
YugabyteDB is an open source, relational, distributed SQL database management system. It was designed to handle large amounts of data across multiple availability zones and geographic regions, all the while delivering single-digit millisecond latency, high availability and no single point of failure.
I hate when the new kid on the block brings their own syntax language because, you know, their case is so very special – just like everybody else’s.
Yugabyte doesn’t do that. It pairs a wicked-smart storage layer (inspired by Google Spanner) with the stock PostgreSQL query engine. This means no mental strain on the development team – you get to use the time-tested, marketable and incredibly powerful PostgreSQL language-level features that many are already comfortable with.
If Google Spanner is so inspiring, why isn’t it taking over the world of distributed SQL? Well, it’s available only as SaaS and depends heavily on the specifics of Google Cloud hardware, like atomic clocks (TrueTime), and Google’s backbone network. If your private data center or favorite cloud provider is not GCP – well, no Spanner for you, my friend. Personally, I’d be hesitant to marry into the ultimate vendor lock-in. Yugabyte has no such dependencies and will happily run on any cloud, in a private data center, on VMs or on bare metal – and if your coffee machine runs Docker containers, you can run Yugabyte there, as well.
If Yugabyte is robust, scalable and easily maintainable, you have the three main pillars covered. But what about the two extra concerns your CTO is nagging you about (cost efficiency and a vendor-agnostic approach)?
How to Make Your Database Vendor-Agnostic
Being vendor-agnostic means being able to switch between major cloud providers at any point without depending on proprietary, prohibitively expensive services like DynamoDB. This approach also lets you exploit inefficiencies in cloud pricing, so you can look for new opportunities to reduce cost and dynamically shift resources between cloud providers on the fly.
This is something CAST AI can do.
The CAST AI platform acts as a cloud abstraction layer and makes the whole multi-cloud, vendor-agnostic requirement almost trivial.
Developers just interact with standard Kubernetes (no CRDs), and the complexity of juggling several clouds is abstracted away. A Kubernetes cluster in CAST AI can span several cloud providers at the same time, bringing some unique cost optimization opportunities.
What if spot/preemptible instance interruption wasn’t totally random anymore, but something you could actually predict? You could be proactive and—by operating in several compute bidding markets at the same time—reduce the risk of running out of capacity.
Why not slash your monthly cloud bill radically by running some of the database containers on spot/preemptible nodes?
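As a sketch, steering some pods toward spot capacity is a matter of tolerations and node affinity in the pod template. The taint and label names below are hypothetical; every provider and platform exposes its own spot/preemptible markers.

```yaml
# Fragment of a pod template spec; the "node-lifecycle: spot" taint/label
# pair is hypothetical and stands in for your provider's spot markers.
spec:
  tolerations:
    - key: "node-lifecycle"
      operator: "Equal"
      value: "spot"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50
          preference:
            matchExpressions:
              - key: "node-lifecycle"
                operator: In
                values: ["spot"]
```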
Highly Available Stateful Application on Multi-Cloud
Yugabyte on a multi-cloud CAST AI Kubernetes cluster is a robust design without a single point of failure.
Active/passive solutions are so ’90s: failovers cause brief downtime, and DBAs tend to keep failover manual because occasional short-lived network blips would otherwise trigger it spuriously. Active/passive solutions also require constant testing and validation. Will your disaster recovery work on a darker day? Because these tests cause downtime, nobody runs them regularly.
The best highly available solutions are active-active, don’t require user failover, and don’t rot if not tested regularly.
CAST AI has a tutorial that takes you through every step of building the above solution from scratch in about 10 minutes.
Go ahead, open your cloud provider’s console and release your inner chaos monkey by killing Yugabyte, business app pods or whole Kubernetes nodes.
CAST AI, Yugabyte and Kubernetes were all built around the same core tenet: nothing is guaranteed, and nothing should be shared. Just keep in mind that consensus algorithms only work while a majority (N/2 + 1) is left to vote; with a replication factor of 3, you can lose one replica and keep accepting writes, but not two.
You may be asking, what’s the performance impact of increasing resiliency when going multi-cloud with a distributed database?
Let’s find out.
Let’s benchmark a distributed Yugabyte database on top of CAST AI Kubernetes spanning AWS and GCP. These cloud providers are arbitrary choices here; Azure or DigitalOcean can be used in the mix, as well. We picked GCP and AWS because their VMs boot up a bit faster.
Mono-Cloud Vs. Dual-Cloud Network Benchmark Results
Let’s compare a Yugabyte containerized database on Kubernetes in a single-cloud scenario versus a distributed database working in two different clouds.
Setup
We took the open source Yugabyte Helm chart, image version 2.5.1.0, with a replication factor of 3.
| Role | YB-Master | YB-TServer |
| --- | --- | --- |
| Description | Control plane, consensus | Serves data |
| CPU | 3 CPUs | 7 CPUs |
| RAM | 3 GiB | 14 GiB |
| Storage | 50 GiB | 300 GiB |
Remote network block storage:
- AWS EBS gp2 (300 IOPS)
- GCP standard persistent disk (225 read / 450 write IOPS)
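For reference, the sizing above roughly corresponds to Helm values along these lines. Treat this as a hedged sketch: the key names follow the open source YugabyteDB chart but may differ between chart versions, and the master/tserver pod split is an assumption.

```yaml
# values.yaml (approximate; verify the key names against your chart version)
image:
  tag: 2.5.1.0
replicas:
  master: 3          # replication factor of 3
  tserver: 6         # assumption: 3 masters + 6 tservers = nine pods in total
resource:
  master:
    requests:
      cpu: 3
      memory: 3Gi
  tserver:
    requests:
      cpu: 7
      memory: 14Gi
storage:
  master:
    size: 50Gi
    storageClass: standard   # gp2 on AWS, pd-standard on GCP
  tserver:
    size: 300Gi
    storageClass: standard
```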
Nine pods, each on a separate VM to ensure plenty of resources:
Mono Cloud
Regional GKE Kubernetes cluster with 9x GCE n2d-standard-8 instances distributed across three availability zones. A regular VPC network in Google’s us-east4 region.
Dual Cloud
AWS EC2 3x c5a.2xlarge
GCP GCE 6x n2d-standard-8
The network between cloud providers was set up with a WireGuard-based full-mesh VPN that encrypts traffic end-to-end (also within the same VPC).
We didn’t optimize for cost here. To minimize noise from compute and local storage efficiency, we wanted to isolate and better understand how a distributed database behaves when its nodes are single-digit milliseconds apart (in this example, less than 9 ms).
Nine milliseconds of network round-trip latency gives you a lot of flexibility in choosing between data center providers located within a few hundred miles of each other.
The majority of cloud provider regions lack imagination and are clustered in the geographic vicinity of a densely populated metropolis: California, northern Virginia, Montreal, São Paulo, London, Frankfurt, Amsterdam, Mumbai, Tokyo, Singapore, Sydney, etc.
So, what are your options in, say, Frankfurt? AWS, Azure, GCP, DigitalOcean, Oracle, Alibaba, IBM Cloud, Hetzner, Linode—not to mention the white-label providers.
We carried out tests on U.S. East (Ashburn) and Europe Central (Frankfurt) with similar results.
How We Measured It
If we want to test relational database performance, the gold standard ruler to date is the synthetic TPC-C benchmark.
Why did we choose it? Because it’s the most mature OLTP benchmarking suite and has stood the test of time. As the TPC describes it:
“While the benchmark portrays the activity of a wholesale supplier, TPC-C is not limited to the activity of any particular business segment, but, rather represents any industry that must manage, sell, or distribute a product or service.”
We loaded an identical Yugabyte configuration in both scenarios with 300 warehouses’ worth of data and then ran the benchmark several times.
Each test ran for half an hour, and the output was the single most important metric: transactions per minute. The test also reported results per transaction type: average and 99th percentile duration in milliseconds.
Results
The table below compares TPC-C results (average durations) for the single cloud on Google versus AWS + Google, both in the US-East region.
| 300 warehouses | Mono Cloud | Dual Cloud |
| --- | --- | --- |
| New Order transactions per 1,800 s | 113,071 | 113,247 |
| TPM-C | 3,769 | 3,774 |
| Efficiency | 97.6% | 97.8% |
| NewOrder latency, avg (ms) | 580.2 | 498.8 |
| Payment latency, avg (ms) | 30.8 | 46.7 |
| OrderStatus latency, avg (ms) | 33.8 | 32.4 |
| Delivery latency, avg (ms) | 123.1 | 174.9 |
| StockLevel latency, avg (ms) | 239.4 | 465 |
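A quick note on reading these numbers: TPM-C counts only New Order transactions per minute, so 113,071 transactions over the 1,800-second run works out to roughly 3,769. The TPC-C spec caps throughput at 12.86 tpmC per warehouse, so 300 warehouses allow at most about 3,858; 3,769 out of 3,858 is roughly 98%, which is what the efficiency row reflects.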
And comparing the 99th percentile:
| 300 warehouses | Mono Cloud | Dual Cloud |
| --- | --- | --- |
| NewOrder latency, p99 (ms) | 6985.6 | 5498.7 |
| Payment latency, p99 (ms) | 110.5 | 157.5 |
| OrderStatus latency, p99 (ms) | 146.3 | 185.1 |
| Delivery latency, p99 (ms) | 331.5 | 696.8 |
| StockLevel latency, p99 (ms) | 712.3 | 1083.3 |
As you can see, the difference between the mono cloud and dual cloud scenarios is quite small.
Some operations, like OrderStatus, are plain read operations. The nearest server can quickly respond with data to the client—usually, a microservice running on the same K8s cluster.
Yugabyte’s lowest isolation level is Snapshot. This means those reads are protected from the undesirable “dirty read” effect.
What about writes?
Writes are a bit more expensive. Raft consensus, hybrid logical clocks and lockless multi-version concurrency control (MVCC) algorithm magic: all of this means a lot more chatter and replication between nodes to get a write committed.
We see that in the Payment and Delivery operation durations (30 ms vs. 46 ms and 123 ms vs. 174 ms, respectively), the geographic distance starts to become noticeable.
The actual network latency between nodes in different availability zones in the mono cloud scenario, without network overlay and encryption, averages about 0.6 ms. Network latency between the two clouds, with overlay (VXLAN) and encryption (WireGuard), is about 5 ms.
The laws of physics aren’t going away anytime soon. But as we can see, the results are much better than expected.
YMMV, but the industry-accepted norm is that the majority of I/O operations are reads, at around 70%, with only the remaining 30% being writes, which reduces the impact of the speed of light even further.
Having said that, the gold standard for measuring OLTP database performance over the last two decades says the results are virtually identical (TPM-C 3,769 vs. 3,774).
Data Gravity and Egress Costs
Data gravity refers to the tendency of compute to gravitate closer to where the data lives.
For example, if you have a hybrid cloud or multi-cloud strategy but your data stays in a single location, like an on-premises data center, most of the compute running stateless services will tend to stay in that same data center, closer to the data.
The whole point of multi-cloud and a high-performing distributed database was to get rid of data gravity. If we have the same data on two or more clouds, then the data gravity effect disappears.
If all your data is replicated to several locations, what about network traffic costs?
All cloud providers allow unlimited incoming (ingress) data. It doesn’t matter where the data is coming from: your on-premises data center, a different cloud provider, the dark web or most cloud-native services. All incoming data is free.
“You can bring as much data as you want to us. But if you want to take your data back – sorry, you need to pay.” This is universal across all cloud providers: all outgoing (egress) traffic from your availability zone is paid for! Some cloud providers, like Oracle Cloud, are far more generous with egress pricing, and companies like Zoom saved a ton of money when they moved part of their infrastructure to Oracle Cloud.
DigitalOcean gives a generous monthly egress allowance per Droplet (VM), depending on its size and how long you run it; at a minimum, it starts at a terabyte for the smallest Droplet. The monthly egress allowance is aggregated across your account, and you pay only for the combined overage.
All clouds have some minimal free tier, but in general, egress costs start at around 10 cents per GB for data sent outside the VPC and, on average, 2 cents per GB for data moved between availability zones in the same VPC. (I bet you didn’t know that in AWS and Azure you pay 1 cent/GB for incoming traffic arriving from another AZ in the same VPC/VNet.)
If you send more traffic, your cost per GB falls. The first price break comes at 10 TB, and the curve generally flattens out at around 5 cents per GB at roughly 150 TB.
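To put those list prices in perspective: pushing 10 TB a month out of a VPC at 10 cents per GB comes to roughly $1,000, while moving the same 10 TB between availability zones at 2 cents per GB comes to roughly $200.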
How does it make you feel when cloud providers charge an eye-watering 97% margin on that egress traffic? It’s like we’re back in 1995 with dial-up modems, right?
There is a more cost-effective alternative. But first, let’s answer the “Will it blend?” question: toss Direct Connect, ExpressRoute, Cloud Interconnect and Layer 2 or Layer 3 providers like Equinix or Megaport into the blender, and yes, you get a dedicated fiber link smoothie.
A dedicated fiber link slashes costs to a mere 2 cents per GB. And if you push more than 7 TB of egress traffic a month, it’s more cost-effective than relying on the cloud providers’ weird discount models.
You might be thinking, “2 cents for every GB?! That’s way too much! I’m using RDS (yes, I have vendor lock-in and data gravity), but at least I have none of these problems with egress costs, resilience, scalability and the overall affordability of the service, right?”
That’s just the result of an opaque billing design.
https://twitter.com/QuinnyPig/status/1372598432095891458
Traffic between availability zones is paid for in all clouds, at about 2 cents per GB of combined egress + ingress. In fact, traffic between availability zones in the same region is the only place you pay for ingress at all (on AWS and Azure). Some managed services offer free replication, but they charge double on compute costs, etc.
Through the ServiceTopology feature, Kubernetes can keep your application pods’ traffic inside the same availability zone and the same cloud, avoiding random, round-robin calls between clouds. This means you get the best latency and pay no traffic premium.
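As a sketch (the Service name and label selector here are hypothetical, topologyKeys requires the alpha ServiceTopology feature gate, and newer Kubernetes releases replace it with topology-aware routing), a topology-aware Service in front of the YugabyteDB tservers could look like this:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: yb-tserver-local            # hypothetical name
spec:
  selector:
    app: yb-tserver                 # illustrative label selector
  ports:
    - port: 5433                    # YSQL, the PostgreSQL-compatible port
  topologyKeys:
    - "kubernetes.io/hostname"          # prefer the same node
    - "topology.kubernetes.io/zone"     # then the same availability zone
    - "topology.kubernetes.io/region"   # then the same region
    - "*"                               # only then, anywhere
```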
Most of the database activity is reads (70/30, YMMV), and those are effectively free egress-wise, since they are served from the local AZ thanks to Kubernetes topology awareness. So the only traffic you replicate across clouds, and the only traffic subject to egress costs, is your database’s change delta per month.
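As an illustrative back-of-the-envelope figure: if your database changes by 1 TB per month, replicating that delta to a second cloud at the 2-cents-per-GB rate above costs roughly 1,000 GB × $0.02 ≈ $20 per month.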
Now you can see that the issue of egress costs is an incredibly overstated problem.
The overwhelming push to the public cloud over the last decade brought on new challenges like vendor lock-in, data gravity and eye-watering monthly infrastructure bills.
During that time, innovation in non-traditional RDBMS, containerization and workload scheduling removed some of the old barriers.
Running a hands-free, strongly consistent, highly available and scalable database in heterogeneous data centers on a budget was a monumental task a decade ago. Today, it has become an affordable best practice for running our business applications.