It’s Spring 2020, and even in challenging and unprecedented times, love continues to bloom. That said, data pipelines are never far from my thoughts and the confluence of those two things got me thinking about data engineers and how their initial love for creating data pipelines often turns to loathing.
Data pipelines are the backbone of modern data systems, yet there’s still much to be desired when it comes to connecting and orchestrating the movement of data between data systems, including business intelligence, data processing, data warehousing or other applications. As an industry, we haven’t paid as much attention to the happiness of the people who build and maintain data pipelines as we have to the volume or velocity they support. As a consequence, today’s data engineers are building increasing resentment toward the technologies and tools that previously offered so much promise.
What is happening here, and can it be fixed?
Love at First Sight
When we first met, everything was magical. We traveled the data world, far and wide, no longer bound by the constraints of relational databases and ETL tools. Vertical scaling was a thing of the past, and with horizontal scaling, we felt we could reach the end of the earth together.
The introduction of Hadoop, Spark, Kafka and more was a truly magical evolutionary stage for data pipelines. We could store and process more data than ever before, derive insights never thought possible, and do it all at a fraction of the cost.
What was there not to love?
As our courting phase came to a close, we were happier than ever. We were spending more and more time together, eagerly building a future. We were both getting accolades at work, accompanied by exciting promotions and responsibilities. We were also growing increasingly comfortable with each other, and while at times it felt like you were being a bit clingy, it was nice to be needed.
With pipeline technology continuing to advance, it felt like scale and speed had few, if any, limits. We built more and more pipelines, linking outputs and inputs, tuning performance and performing increasingly complex transformations, all as part of an increasingly interconnected data ecosystem.
Our data teams grew, and with more data engineers we built even more pipelines. We also started to notice that these pipelines were quite a bit higher maintenance than we originally thought. They require frequent tuning and optimization, yet still break far more often than they used to, usually at the least desirable times, like 3 a.m.
This is definitely turning into a love/hate kind of thing.
10 Things I Hate About You
I miss the old days. What happened? I'm just trying to hang out with my friends and you won't stop calling; it's suffocating me. When we do spend time together, the romance is gone. Horizontal scalability has lost its magic.
Most data engineering teams we encounter are firmly planted in this stage of their relationship with data pipelines. They're honestly feeling a bit trapped. They've built a ton of remarkable pipelines that do incredible things. These pipelines also require a tremendous amount of maintenance, and each new one they build only compounds the problem. As a result, we see most data engineering teams actively distancing themselves from data pipeline technology, building increasingly large layers of abstraction that sit between them and the raw technology.
Surely, all is not lost?
Something’s Gotta Give
Right when I thought we were at our tipping point, we had a breakthrough. We weren't communicating effectively; our love languages were entirely different. I wanted to focus on the destination, the end result, and you wanted to focus on the journey, the path to get there. There clearly was a gap, and we needed help bridging it.
Fortunately for the data engineer and their muse, the data pipeline, there are advances in pipeline technology that bridge the gap and offer a path toward reconciliation. One such path combines declarative programming with an underlying control system that understands the declarative model. Such control planes translate declarative models into imperative instructions for the underlying infrastructure, making pipeline development far less about plumbing and far more about architecting. Declarative programming allows data engineers to focus on the big picture, drawing on their experience and creativity to design what the optimal data pipeline should look like, rather than spending time repairing failures, tuning parameters or manually tracing code.
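To make the idea concrete, here is a minimal sketch, not any particular product's API, of what that split looks like: the engineer declares datasets and their dependencies, and a toy "control plane" translates that declarative model into an imperative execution order. The `Pipeline` class, decorator, and step names are all invented for illustration.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+


class Pipeline:
    """Toy declarative pipeline: you declare WHAT each dataset depends on;
    the control plane decides HOW and in what order to build it."""

    def __init__(self):
        self.deps = {}      # dataset name -> set of upstream dataset names
        self.builders = {}  # dataset name -> function that produces it

    def dataset(self, name, depends_on=()):
        """Decorator that registers a dataset in the declarative model."""
        def register(fn):
            self.deps[name] = set(depends_on)
            self.builders[name] = fn
            return fn
        return register

    def run(self):
        """The 'control plane': turn the declared dependency graph into
        imperative, topologically ordered execution steps."""
        results = {}
        for name in TopologicalSorter(self.deps).static_order():
            inputs = [results[d] for d in sorted(self.deps[name])]
            results[name] = self.builders[name](*inputs)
        return results


pipe = Pipeline()

@pipe.dataset("raw_events")
def raw_events():
    return [1, 2, 3, 4]

@pipe.dataset("cleaned", depends_on=["raw_events"])
def cleaned(raw):
    return [x for x in raw if x % 2 == 0]  # keep only even event ids

@pipe.dataset("report", depends_on=["cleaned"])
def report(rows):
    return sum(rows)
```

The engineer never writes the "run this, then that" wiring; calling `pipe.run()` derives it from the declared graph, which is exactly the plumbing-to-architecting shift described above.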
The underlying control system, or control plane as it’s called, imbues the pipeline with autonomy, data awareness and intelligent persistence. Autonomous data pipelines are less needy and can handle data changes on their own (rather than making the data engineer responsible for everything). Data awareness allows pipelines to intuitively know what changed and what subsequently needs to change. Intelligent persistence allows pipelines to be considerate of cost and resources and makes sure everything is fast, efficient and never duplicated.
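Data awareness and intelligent persistence can be sketched in a few lines as well. In this hypothetical example (the `CachingStep` class is invented, not a real library), a step fingerprints its inputs, recomputes only when something upstream actually changed, and otherwise serves the persisted result so work is never duplicated:

```python
import hashlib
import json


class CachingStep:
    """Toy 'data-aware' pipeline step: fingerprints its inputs and only
    recomputes when the data changed, persisting earlier results."""

    def __init__(self, fn):
        self.fn = fn
        self.cache = {}  # input fingerprint -> persisted result
        self.runs = 0    # count of actual computations performed

    def __call__(self, *inputs):
        # Data awareness: a stable hash of the inputs tells us what changed.
        key = hashlib.sha256(
            json.dumps(inputs, sort_keys=True).encode()
        ).hexdigest()
        if key not in self.cache:
            # Intelligent persistence: compute once, then reuse.
            self.runs += 1
            self.cache[key] = self.fn(*inputs)
        return self.cache[key]


@CachingStep
def aggregate(rows):
    return sum(rows)
```

Calling `aggregate([1, 2, 3])` twice performs the computation once; changing the input triggers exactly one recomputation. Real control planes apply the same idea across whole dependency graphs rather than single functions.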
I think it's safe to say that those of us who chose to enter data engineering didn't do so imagining ourselves as infrastructure plumbers, spending hours upon hours maintaining data pipelines that fuel someone else's data-driven creativity. No, data engineers want to create digital services and contribute business value, too; they're just trying to keep everybody else out of the data muck in the meantime.
This is the dawn of a new era in data engineering, and many early adopters of this latest approach to pipeline development have found that it enables creation with far less time, code and maintenance. By embracing this new method of development, data engineers can again see a happy future in which more of their time and skills are devoted to the things for which they, and the business, have real passion.