Core DataOps concepts are making their way into data engineering teams and, from there, into the broader enterprise. Data engineers are retooling how they create data products, and much of this work revolves around creating data pipelines.
DataOps pipelines offer the kind of observability that traditional data integration and ETL processes don't or can't provide. They allow you to continuously integrate and test new data sources and to deliver data in streaming or batch contexts with higher quality and reliability than traditional, siloed approaches. They can also support machine learning efforts by preparing data for training, testing and deployment.
DataOps pipelines shorten the cycle time between a data consumer's request and the data producer's fulfillment of it, all while observing service quality throughout the pipeline's life cycle. The challenge teams face is how to construct those pipelines. As I see it, there are two approaches: build it yourself with open source, or buy a commercial product.
Open source certainly has its vocal adherents and some visible successes, but there are trade-offs to taking the open source route.
Building or Buying Your Pipeline
One advantage of open source is its flexibility and availability. Open source licenses, excluding the SSPL, give users incredible freedom over what they can do with the software. If you have the skill, you can compose a DataOps pipeline that can take any data, enrich it and route it to the right place. That flexibility, though, is also a downside: while you can do anything you want, you also have to do everything yourself. Open source projects like Kafka, Pulsar, Spark, Airflow and Flink don't know anything about the data they're handling. That's up to the developer.
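To make that concrete, here is a minimal sketch, assuming the kafka-python client and a hypothetical "raw-events" topic. Kafka hands your application opaque bytes; every line of parsing, validation and error handling is code you write and maintain.

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# The topic name and broker address are illustrative assumptions.
consumer = KafkaConsumer("raw-events", bootstrap_servers="localhost:9092")

def parse_event(raw: bytes) -> dict:
    """Hypothetical parser: Kafka knows nothing about the payload's format."""
    return json.loads(raw.decode("utf-8"))

for message in consumer:
    try:
        event = parse_event(message.value)  # message.value is raw bytes
    except (UnicodeDecodeError, json.JSONDecodeError):
        continue  # dead-letter handling, schema drift, retries: also yours
    # Enrichment and routing logic would follow here, also hand-built.
```

Every one of those decisions, from deserialization to error handling, is yours to build again for each source you onboard.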
This may not sound like a problem, but today's data engineers are handling dozens of data types in hundreds, or even thousands, of different formats. If you add in operational data, you're also looking at data flooding in from firewalls, containers, SNMP traps and HTTP sources. And that's just what's coming at you. You also need to fetch data from object stores, multiple event hubs and other messaging sources. No open source project natively supports the variety and volumes of data required in a modern DataOps pipeline. Every new data source you add means starting from primordial components rather than higher-level abstractions, adding time and undifferentiated effort to your work.
You’ll also need essential features like per-source backpressure, support for a range of protocols and data stores, role and permission management and so on. That’s a lot for overworked data engineering teams to build and maintain over the life of a pipeline.
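As one illustration, here is a toy sketch of per-source backpressure using only Python's standard library. The source names and queue sizes are assumptions for the example, not a production design; the point is that each source needs its own bounded buffer so one flooding source can be slowed or shed without starving the others.

```python
import queue
import threading

SOURCES = ["firewall", "containers", "http"]  # hypothetical sources
queues = {name: queue.Queue(maxsize=1000) for name in SOURCES}

def ingest(source: str, record: bytes) -> bool:
    """Producer side: a full queue signals this source, and only this
    source, to slow down."""
    try:
        queues[source].put(record, timeout=0.1)
        return True
    except queue.Full:
        return False  # caller can retry, buffer upstream or drop

def drain(source: str) -> None:
    """Consumer side: each source drains at its own pace."""
    q = queues[source]
    while True:
        record = q.get()
        # ... enrich and route the record here ...
        q.task_done()

for name in SOURCES:
    threading.Thread(target=drain, args=(name,), daemon=True).start()

# Example: a flooding firewall feed gets pushback while HTTP data flows on.
ingest("firewall", b"<134>deny tcp 10.0.0.1")
ingest("http", b'{"status": 200}')
```

Building this robustly, with persistence, metrics and graceful shutdown, is exactly the kind of undifferentiated work that accumulates over a pipeline's life.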
There's no 'easy button' for proprietary products, either. Many commercial products support only a handful of sources and destinations, and they may carry excessive licensing costs tied to data ingest rates or per-connector fees. They can also be challenging to scale from an operations perspective. A slick drag-and-drop interface may look great during the demo, but it quickly becomes a cluttered mess when you're staring at thousands of similar data processing pipelines.
The advantages of commercial products center on the features enterprises need, like governance, management and security. They may also integrate with other components more readily than open source does. Finally, having someone to call when you need support is essential when running in production, although open-core companies offer support as well.
Take a Higher-Level View
As you’re evaluating DataOps pipeline options, keep two things in mind. First, regardless of which path you choose, ensure your chosen technology offers observability and monitoring features. DataOps is about reacting to change faster, and you can’t react if you don’t know what’s going on. Building observable DataOps pipelines ensures you are delivering a valuable data product to the range of data consumers in your enterprise.
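To show the kind of signal worth insisting on, here is a minimal observability sketch using the prometheus_client library. The metric names and the process() stub are assumptions for the example; the point is that record counts, error rates and latency are all visible per source.

```python
import time
from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names; adapt them to your own conventions.
RECORDS = Counter("pipeline_records_total", "Records processed", ["source"])
ERRORS = Counter("pipeline_errors_total", "Records failed", ["source"])
LATENCY = Histogram("pipeline_latency_seconds", "Per-record processing time")

def process(record: bytes) -> None:
    """Stand-in for real enrichment and routing logic."""
    record.decode("utf-8")  # raises on malformed input

def handle(source: str, record: bytes) -> None:
    start = time.monotonic()
    try:
        process(record)
        RECORDS.labels(source=source).inc()
    except UnicodeDecodeError:
        ERRORS.labels(source=source).inc()
    finally:
        LATENCY.observe(time.monotonic() - start)

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    handle("http", b'{"status": 200}')
```

Whether you build or buy, make sure equivalent per-source counts, error rates and latencies are available out of the box or are cheap to add.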
Next, get past the technology. DataOps is more a people discipline than a technology practice, and a key component of a successful implementation is aligning metrics and incentives across teams. If the staff running the DataOps pipeline are disconnected from the business outcomes their data products inform and influence, they have no incentive to run the pipeline as a mission-critical infrastructure component.
Before you decide on your DataOps pipeline implementation, there are some things you should do.
- Determine who your customers are and what they expect in terms of data delivery, enrichment, user experience and observability. Are you serving only data scientists, or are you also serving infrastructure and operations, business intelligence and marketing? Knowing your customers allows you to make better decisions.
- Test the software in a real-world environment. Everyone dislikes being told to run a proof of concept before buying or building, because POCs are time-consuming. Remember: they're less time-consuming than picking the wrong tool and looking foolish later.
Engineering teams sometimes choose technologies that are in fashion rather than selecting the right tool for the job. Taking a pragmatic view of performance, manageability and engineering effort, as well as involving a wide array of stakeholders, will give you a higher chance of success with DataOps.