In traditional DevOps, there are the complementary forms of development-centric operations (what I call DEVops) and operations-centric development (what I like to call devOPS). Between them, they automate the toolchain and bring the people working on getting an application out to users onto the same team: not necessarily the same organizational team, but one using their strengths in tandem to meet the needs of the business.
DataOps is an interesting extension of DevOps. You still need all of the coding side to get data into the system and keep queries consistently maintained. You still need all of the operations side to get your database/NoSQL/whatever up and running. In fact, while the development side of a DataOps environment might be (but isn't necessarily) lighter, the operations side is almost always more complex. Whatever big data engine is in use, it is a complex system layered on top of everything else a normal environment already supports. My first installation of a big data environment (Cloudera, as it happens) was a weeks-long learning voyage. Only after I'd completed it did I use an automation tool (which is no longer available) to make it easy. My second round took hours … but it assumed the knowledge I had gained in the first attempt.
The other bit of DevOps—ongoing monitoring and management—is also more complex in a big data environment, but we’ll come back to that in a moment.
Once the normal DevOps systems are in place, ETL/data import tools will need to be supported as well. The volume these tools crunch through in a day makes their inclusion in DevOps critical in a data-heavy environment. If data uptake is slow, or the data itself inaccurate, the organization feels the impact downstream. This step also requires the inclusion of data scientists, people who traditionally have not been pulled into the DevOps model. But they are the ones who can gauge the accuracy of the data, and they are normally responsible for data acquisition anyway.
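As a rough illustration, here is a minimal Python sketch of the kind of uptake check that belongs around an ETL batch. The thresholds and the `load_row`/`validate_row` hooks are hypothetical stand-ins for whatever your actual pipeline exposes, not a prescription:

```python
import time

# Hypothetical thresholds; real values would come from business SLAs.
MIN_ROWS_PER_SEC = 500        # alert if uptake falls below this rate
MAX_BAD_ROW_RATIO = 0.01      # alert if more than 1% of rows fail validation

def run_etl_batch(rows, load_row, validate_row):
    """Load rows while tracking throughput and validation failures.

    `load_row` and `validate_row` are placeholders for the real
    pipeline's sink and record-validation logic.
    """
    start, loaded, bad = time.monotonic(), 0, 0
    for row in rows:
        if not validate_row(row):
            bad += 1              # a real pipeline would quarantine this row
            continue
        load_row(row)
        loaded += 1
    elapsed = max(time.monotonic() - start, 1e-9)

    rate = loaded / elapsed
    bad_ratio = bad / max(loaded + bad, 1)
    if rate < MIN_ROWS_PER_SEC:
        print(f"ALERT: uptake slow ({rate:.0f} rows/sec)")
    if bad_ratio > MAX_BAD_ROW_RATIO:
        print(f"ALERT: {bad_ratio:.1%} of rows failed validation")
    return loaded, bad

# Example: a trivial batch with one malformed record.
rows = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": None}]
run_etl_batch(
    rows,
    load_row=lambda r: None,                          # stand-in for the real sink
    validate_row=lambda r: r["amount"] is not None,
)
```

The point is not the toy checks themselves but where they live: in the same automated pipeline DevOps already watches, so slow or dirty uptake surfaces as an alert rather than as a surprise in a report weeks later.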
Which brings us to the monitoring and management stage of DataOps. Normally when we talk monitoring and management in DevOps, we are talking about tools that ultimately help with things like availability, responsiveness, auto-scaling and recovery. DataOps needs all of these things, like any other application does, and we've already mentioned that it has a complex environment embedded in your complex environment. But it also (some would argue more importantly) needs data monitoring. Knowing the state of the data is imperative. Whether the cause is a piece of code gone wrong or deliberate data poisoning, monitoring the health of big data archives is essential to keep results from getting out of whack. This monitoring will again involve the people who know the data (data scientists, by whatever title). It is an iterative process that will need to run more or less constantly, checking on whatever makes the data valuable: normally cross-field checks to verify that data hasn't somehow been corrupted, or comparisons of a trending average against the current average to catch sudden swings in results.
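To make those two techniques concrete, here is a minimal Python sketch of both a cross-field check and a trending-average comparison. The field names, window size and tolerance are assumptions chosen for illustration; your data scientists would pick what actually makes your data valuable:

```python
from collections import deque

def cross_field_ok(record):
    """Hypothetical cross-field check: line items should sum to the total."""
    return abs(sum(record["line_items"]) - record["total"]) < 0.01

class DriftMonitor:
    """Compare each batch's average to a trailing average of recent
    batches and flag sudden swings. Window and tolerance are
    illustrative defaults, not recommendations."""

    def __init__(self, window=30, tolerance=0.20):
        self.history = deque(maxlen=window)
        self.tolerance = tolerance

    def check(self, batch_values):
        current = sum(batch_values) / len(batch_values)
        if self.history:
            trend = sum(self.history) / len(self.history)
            if trend and abs(current - trend) / abs(trend) > self.tolerance:
                print(f"ALERT: batch mean {current:.2f} deviates "
                      f"{abs(current - trend) / abs(trend):.0%} "
                      f"from trend {trend:.2f}")
        self.history.append(current)

# Example usage
print(cross_field_ok({"line_items": [40.0, 60.0], "total": 100.0}))  # True
monitor = DriftMonitor()
monitor.check([100, 102, 98])     # establishes the trend
monitor.check([250, 260, 255])    # a sudden swing worth investigating
```

Checks like these are cheap to run constantly, which is exactly what the iterative monitoring loop described above calls for.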
Of course, for a system that spins up, finds results and spins down (and in research there are many), the monitoring part is less essential. But for anything that keeps accessing and adding to the data pool, it is the key to a healthy application, or application portfolio, built upon the dataset.
If all that sounds a bit intimidating, you're not alone. The phrase "DataOps" has been around for a few years, and while data scientists use it, it hasn't yet gained serious traction in the enterprise. Here's how I would approach it as CIO. Assuming DevOps is already running on other projects, that is the low-hanging fruit. Given how quickly server needs change (as queries, particularly long-running ones, come and go, for example), traditional DevOps offers big data a greater benefit than it does many other corporate systems. So roll out DevOps first, without serious concern for DataOps beyond how DevOps impacts ETL: not ETL data quality, but ETL data availability/performance.

Only when the system and the team are in place, and the software/hardware/code part of the overall system is fully in a DevOps environment, should you start looking at ways to move data into it. Data monitoring will be the big one, but the initial round of DevOps implementation will have shown where ETL could be improved, and ETL data integrity checks can be added at that point, improving those areas as well. A measured, step-by-step approach that builds off of organizational experience is key here. Data monitoring and management will require different tools than normal DevOps monitoring and management, so having the data scientists look into available toolsets early on, and how they mesh with existing tools, will help with eventual implementation.
Make no mistake, moving large, data-driven environments to DevOps is a big job. But the benefits are increased awareness of data quality and data quality issues, along with the more responsive systems and teams that DevOps offers every business application. The amount of data available to any given organization is increasing massively as IoT, open datasets, partnerships and ML feed ever-larger pools. It is worth getting the automation and cooperation of DataOps into the mix as early as possible.