Data Monoliths Go Cloud-Native with Kubernetes

Breaking monolithic applications into independently scalable services is the core of the cloud-native application model. By making cloud-native apps easy to deploy and manage, Kubernetes has contributed to a big change in IT practices.

Initial cloud-native conquests started with stateless services and quickly grew to encompass applications that manage data. Database management systems have long relied on centralized servers, but cloud-native architectures are now reaching even the largest monoliths. Among the most interesting cloud-native converts are analytic databases, which hold data arranged in storage formats optimized for extremely fast analytic queries over billions, or even trillions, of rows.

The transformation of databases to cloud-native models has profound implications for data-driven applications. Intersecting trends in analytic database architecture and Kubernetes promise substantial benefits to the users of such applications as well as companies that operate them.

The Rising Importance of Analytics

Over the last decade, the quantity of data that businesses generate has decoupled from the size of the business itself. Ad networks, website analytics and service observability applications often ingest 1 million events per second (or far more) and may have tables running into trillions of records. Businesses need to analyze this information rapidly, for example to diagnose faults in cloud applications or to retrain models for displaying online ads. Flexible, real-time analytics have gone from nice-to-have to business-critical.

Meanwhile, new analytic engines have emerged that eliminate the dependencies on specialized hardware and complex networking that legacy data warehouses required. Engines such as Druid, ClickHouse and Elasticsearch run on commodity hardware and depend only on basic compute, network and storage. Everything else, such as high availability (HA), is built into the application. Newer analytic databases offer built-in horizontal scaling using fleets of server processes. The server processes themselves typically run well in containers.

With these changes, analytic databases are increasingly cloud-friendly if not cloud-native. It is a short hop to deployment on Kubernetes.

Kubernetes Operators for Databases

At the same time as analytic databases have shifted their operational model, Kubernetes itself has become more welcoming for databases.

One particularly important innovation for databases is custom resource definitions, or CRDs, which extend the Kubernetes API to define new types of manageable resources. Using CRDs makes it possible to have a single definition that declares the desired state of a complex application such as a data warehouse.
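As a rough illustration of what a single definition might look like, the sketch below models a custom resource as a plain Python dictionary shaped like a Kubernetes manifest. The API group (example.com/v1), the kind (AnalyticWarehouse) and every field under spec are hypothetical names invented for this example; they are not taken from any real operator.

```python
import json

# Hypothetical custom resource. The apiVersion, kind, and spec fields are
# invented for illustration and do not belong to any real operator.
warehouse = {
    "apiVersion": "example.com/v1",
    "kind": "AnalyticWarehouse",
    "metadata": {"name": "events"},
    "spec": {"shards": 2, "replicas": 3, "storagePerReplicaGi": 500},
}


def describe(resource: dict) -> str:
    """Summarize the desired state declared by the resource."""
    spec = resource["spec"]
    return (
        f"{resource['metadata']['name']}: "
        f"{spec['shards']} shards x {spec['replicas']} replicas"
    )


# The whole desired state lives in one document that can be stored and reviewed.
print(json.dumps(warehouse, indent=2))
print(describe(warehouse))
```

The point of the sketch is only that the desired state is data: one document says what the warehouse should look like, and something else is responsible for making it so.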

Another addition is Kubernetes operators, which manage the custom resources that CRDs define. Operators track custom resources and take action when they change. They can also track other information that might be relevant to managing the custom resource, such as changes to pods, storage or even related cloud-native applications. It’s like having a human operator who is constantly watching the application, except the rules are programmed into software that runs inside Kubernetes itself.
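The watch-and-act behavior described above can be sketched as a reconcile step: compare the desired state with what is actually running and work out the actions that close the gap. The example below is a minimal in-memory model, assuming a hypothetical resource named analytics-db whose only setting is a replica count. It calls no Kubernetes API, and a real operator would handle far more cases.

```python
from __future__ import annotations

# Hypothetical, in-memory model of an operator's reconcile step.
# No real Kubernetes API is called; pod names are plain strings.


def reconcile(
    desired_replicas: int,
    running_pods: set[str],
    name: str = "analytics-db",
) -> list[tuple[str, str]]:
    """Return the actions needed to move observed state to desired state."""
    wanted = {f"{name}-{i}" for i in range(desired_replicas)}
    actions = [("create", pod) for pod in sorted(wanted - running_pods)]
    actions += [("delete", pod) for pod in sorted(running_pods - wanted)]
    return actions


# The custom resource asks for 3 replicas, but only one pod is running.
print(reconcile(3, {"analytics-db-0"}))
# When observed state already matches desired state, there is nothing to do.
print(reconcile(1, {"analytics-db-0"}))
```

An operator would run a step like this every time the custom resource or one of the objects it watches changes, which is what makes the resource self-correcting.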

The operator pattern is a perfect match for analytic databases, which are complex distributed applications and may require definition of dozens or even hundreds of individual Kubernetes resources in a running system. Operators encapsulate these details as well as important procedures such as upgrade, scale-out and other operations that require careful manipulation of resources to prevent outages.
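To see how quickly the resource count grows, consider the sketch below, which expands one high-level definition into a list of Kubernetes resource kinds and names. The layout is an assumption made for illustration (one StatefulSet, one Service and one PersistentVolumeClaim per server, plus a shared ConfigMap and a client-facing Service); real operators arrange their resources differently.

```python
from __future__ import annotations

# Hypothetical expansion of one definition into many Kubernetes resources.
# The per-server layout is an assumption, not any real operator's design.


def expand(name: str, shards: int, replicas: int) -> list[tuple[str, str]]:
    """Expand one high-level definition into (kind, name) pairs."""
    resources = [
        ("ConfigMap", f"{name}-config"),  # shared server configuration
        ("Service", name),                # single entry point for clients
    ]
    for s in range(shards):
        for r in range(replicas):
            server = f"{name}-{s}-{r}"
            resources += [
                ("StatefulSet", server),
                ("Service", server),
                ("PersistentVolumeClaim", f"data-{server}"),
            ]
    return resources


small = expand("events", shards=1, replicas=2)
large = expand("events", shards=4, replicas=3)
print(len(small), len(large))  # 8 38
```

Even under this simplified layout, a modest cluster of 4 shards with 3 replicas each yields 38 resources, all derived from three numbers in a single definition.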

It’s not surprising that operators are popular with database developers. The OperatorHub.io registry for database operators alone has 19 entries covering databases of every kind, including analytic databases such as ClickHouse and Elastic. Many more exist outside the registry.

A Cloud-Native Way to Consume Analytic Databases

New data warehouse architectures and Kubernetes operators mean users can now use analytic databases very differently from legacy data warehouses. Our work on ClickHouse, including development of the ClickHouse Kubernetes Operator, prompts two observations.

First, operators make spinning up analytic databases simple and fast for all users. Amazon Redshift kicked open the door to data warehouse-as-a-service in the public cloud. Kubernetes operators enable similar capabilities not just in public clouds but in any Kubernetes cluster. Individual developers can cycle through dozens of configurations in a single day while testing new analytic applications. Thanks to quick pod spin-up, new analytic databases can be running in minutes or even less.

More profoundly, users can use Kubernetes to move to an on-demand model for analytics. Rather than forcing all applications to compete for resources on a single shared cluster, each application can have an analytic database tuned to its specific requirements. Analytic databases can integrate tightly with applications in the same locations where those applications already run. Businesses can scale databases flexibly to deliver quality analytics at exactly the time they are needed and at reasonable cost. This is the heart of the cloud-native programming model.

Modern database architectures combined with Kubernetes features, such as the operator pattern, are creating a new path to manage enormous volumes of data. For readers who are hesitant about managing data on Kubernetes, it’s time for another look.

To learn more about containerized infrastructure and cloud native technologies, consider coming to KubeCon + CloudNativeCon NA, November 18-21 in San Diego.

— Robert Hodges