Matt Casters, chief solution architect for Neo4j, explains how the open source Apache Hop project advances the integration of DevOps and DataOps. The video is below, followed by a transcript of the conversation.
Mike Vizard: Hey, guys, thanks for the throw. We’re here with Matt Casters, who is Chief Solution Architect at Neo4j, and we’re talking about a new open source project called Apache Hop that helps you move data around more easily. Matt, welcome to the show.
Matt Casters: Thank you very much for having me, Michael.
Vizard: So walk us through what exactly Apache Hop is and what problem we’re trying to solve. It seems like we’re trying to move more data than ever, and when I was a kid people told me moving data was not necessarily a great idea, but we’re doing it anyway. So what’s the challenge and what are we trying to solve?
Casters: Oh yeah, yeah. So that’s a great observation. I remember when everything was going to be the new mainframe: do everything in one box, right? And the opposite happened; we saw more and more services on prem, in the cloud, and all mixed up. More specialized services, graph databases, and then the advent of containerization, virtual machines, and containers. So it’s become a really complicated ecosystem out there, and I think a lot of organizations are struggling to find the expertise to service and cater to all those needs.
I think what we’re trying to do is make the life of the developer easy by wrapping user interfaces and easy-to-use tools around all these technologies, so that you get UML-like diagrams that you can work with and that can be easily understood not only by the experts but by people with a medium level of knowledge of the subject, right? For example, if you’re working with a Kafka queue you might need to know how it works in general, but you don’t need to dive into the Java or Python APIs to get the best out of it. I mean, that’s kind of it.
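To make that contrast concrete, here is a minimal sketch of the raw consumer code a tool like Hop hides behind a few dialog fields, using the standard Apache Kafka Java client. The broker address and topic name are hypothetical placeholders.

```java
// What reading a Kafka topic looks like in raw Java -- the boilerplate a
// visual tool wraps behind a dialog. Broker and topic are hypothetical.
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class RawKafkaRead {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
        props.put("group.id", "demo-group");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("sales-events")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("key=%s value=%s%n", record.key(), record.value());
                }
            }
        }
    }
}
```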
So lower the maintenance costs. And this comes down to the DevOps side of things. While we try to make the life of the developer easy, we position ourselves squarely on the side of the organization. We’re not here just to make life fun for the developer; we really try to make life better for the company or organization that uses the software.
So with that in mind you come to the conclusion that, okay, we also need to focus on configuration management, life cycle management, version control, unit testing, security, and on protecting the investments that companies make: often many years of work of data orchestration, data integration, ETL work, right? So that is what I think Hop has been focused on for the last couple of years: making sure that all these technologies are standard in the tool and easy to use.
Vizard: Do you think that we’re reducing the need for specialists? Because back in the day you had to have somebody who knew how to program ETL, and now with this we’ve got more of a visual tool. Maybe some people still prefer CLIs and APIs, but this is designed for developers to just easily handle this task themselves without any outside intervention, right?
Casters: Yeah, that’s the goal. We fully acknowledge that it’s like a 95/5-percent kind of deal, right? You’re never going to do the last 5 percent without some scripting or coding. But solving the 95 percent is a big deal; it means that all of a sudden you open up possibilities to people we would never have expected to use Apache Hop. Right? I’ve seen salespeople use it to extract sales numbers from Salesforce. I would never have expected that in the past. It’s like, “Oh, they can figure it out.” “Oh, I can just enter the details: username, password.” “Oh, I can read this,” and before you know it, people start to use it.
Vizard: Citizen developers are here with us today. One of the things about Hop is that it’s lightweight and I don’t think people appreciate just how much heavy lifting went into ETL historically. So why does it matter that this thing is lightweight and metadata-driven and what’s the impact?
Casters: Well, at some point we saw the advent of lambdas, you know, microservices. But also, in its previous life as Kettle, the project was used for things like tracking trains, in more of an IoT setting. Call these edge devices: maybe Raspberry Pis or very tiny devices. And there’s a real use case out there for a runtime that doesn’t have to do much of anything; it’s just trimmed down to the bare essentials for what it needs to do. Maybe that’s measuring temperature, air humidity and pressure, and then sending it off every 30 seconds or so over some REST call.
And then this whole Hop library trims down to something like 30-40 megabytes, starts up fast and uses next to no memory. That translates to saying, oh, maybe I run it on an AWS Lambda in a small service. And it also gives us the ability to create new packages, right? You can not only remove functionality, but you can add functionality quite easily as well. And that is really important. We’ve also already seen people implement machine learning libraries or Python integration as plug-ins for Hop. So this will be our next goal: to create a marketplace around that kernel-and-plug-in kind of architecture.
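The kernel-and-plug-in idea Casters mentions follows a familiar Java pattern: a tiny core that discovers optional functionality on the classpath at runtime. The sketch below illustrates that general pattern with the standard java.util.ServiceLoader; it is not Hop’s actual plug-in API, and the TransformPlugin interface is a hypothetical stand-in.

```java
// A generic sketch of a plug-in kernel: functionality is added by dropping
// jars on the classpath, not by changing the core. Not Hop's real API.
import java.util.ServiceLoader;

// Hypothetical plug-in contract; implementations are declared in
// META-INF/services/TransformPlugin inside each plug-in jar.
interface TransformPlugin {
    String name();
    String process(String row);
}

public class PluginKernel {
    public static void main(String[] args) {
        // The kernel itself does almost nothing: it just finds and runs
        // whatever plug-ins are present at runtime.
        ServiceLoader<TransformPlugin> plugins = ServiceLoader.load(TransformPlugin.class);
        for (TransformPlugin plugin : plugins) {
            System.out.println("Loaded plug-in: " + plugin.name());
            System.out.println(plugin.process("example row"));
        }
    }
}
```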
Vizard: Do you think that the functions of DevOps and DataOps are starting to converge more? Is this becoming more of a team sport or is DataOps just going to be folded into DevOps? What’s the relationship?
Casters: Well, what I’ve seen personally is that now that we have the ability to log everything that a workflow or pipeline does into a graph, a Neo4j graph, it makes it easy to see where the error was, right? You can use graph algorithms to figure out not just where an error was, but what the execution path was, in milliseconds. So there’s the convergence: the developer has an easy time figuring out what went wrong, but also, on Monday morning when you come into the office and see that the Saturday night run failed for some reason, or the monthly load of the data warehouse or whatever people are doing, you don’t have to trawl through millions of lines of logging text for a large workflow; you just see straightaway what went wrong and where it went wrong. That saves a lot of time.
So is that DevOps? For sure. For sure DevOps. It’s also ease of development. There’s really an overlap there, and we need to do better at providing a lot of these tools. They’re still too primitive in a lot of our software, as if logging text is enough, right? We can do so much more. [Laughs] We’re executing the stuff; we know where it went wrong. Why guess? Why go through the effort? [Laughs]
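To illustrate the kind of lookup Casters is describing, here is a hedged sketch using the official Neo4j Java driver: instead of trawling log text, a failed step is one graph query away. The node label, relationship type and properties (Execution, EXECUTES, status, name, startedAt) are hypothetical stand-ins, not Hop’s actual logging schema.

```java
// Querying an execution graph for the failed step of recent runs.
// The graph schema here is hypothetical; only the driver API is standard.
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;

public class FindFailedStep {
    public static void main(String[] args) {
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", System.getenv("NEO4J_PASSWORD")));
             Session session = driver.session()) {
            Result result = session.run(
                "MATCH (e:Execution {status: 'FAILED'})-[:EXECUTES]->(t) "
                + "RETURN e.name AS run, t.name AS failedStep "
                + "ORDER BY e.startedAt DESC LIMIT 5");
            result.forEachRemaining(record ->
                System.out.println(record.get("run").asString()
                    + " failed at " + record.get("failedStep").asString()));
        }
    }
}
```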
Vizard: There’s a lot of focus these days on the volume of data that people are trying to work with, but it seems to me it’s also the speed at which that data is emerging. People are trying to work in near-real time, and the data is constantly being updated. We’re moving beyond batch. So are we really prepared for not just the volume of data, but the speed at which it arrives? And for that matter, there are a lot more types of data, right?
Casters: Yeah. So Hop from the beginning opted to use Apache Beam as an abstraction layer so that we could execute on Spark, Flink and Dataflow for the really fast use cases, but also for real-time streaming. Flink streaming, Spark streaming and Cloud Dataflow are really cool. And this comes along with the advent of all the streaming back ends and messaging services: Kafka, Pulsar, Google Cloud Pub/Sub; there’s a whole repertoire, right? I probably forgot some, like Azure. [Laughs]
But yeah, these services have become really popular in making it possible to create cool hub-and-spoke architectures, where you can send stuff in all directions. So it falls under the streaming category, and you can do all sorts of cool load-it-and-see exercises there. And opening that up is quite easy for us in Hop; all the pipelines are streaming, whether they run in batch or not. So yeah.
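For readers who have not used Apache Beam, the sketch below shows the write-once, run-anywhere idea Casters refers to: the same pipeline runs on Spark, Flink or Cloud Dataflow depending only on a command-line flag. This is plain Beam Java for illustration, not Hop’s internal Beam integration; the output path is hypothetical.

```java
// One pipeline, many engines: --runner=SparkRunner, --runner=FlinkRunner or
// --runner=DataflowRunner switches the execution engine without code changes.
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class PortablePipeline {
    public static void main(String[] args) {
        // The runner is chosen from the command-line args, not in code.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply(Create.of(Arrays.asList("hop", "beam", "hop")))
         .apply(Count.perElement())
         .apply(MapElements.into(TypeDescriptors.strings())
                 .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
         .apply(TextIO.write().to("counts")); // hypothetical output prefix

        p.run().waitUntilFinish();
    }
}
```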
Vizard: So what’s your best advice to folks as they look at these DevOps/DataOps challenges? What do you see as the best practices to make sure people do the right thing? And for that matter, what kind of mistakes are people making that they should avoid?
Casters: So when it comes to these best practices, we’re talking about the classical list: put everything in version control. Keep your passwords and usernames and everything separate. Don’t hardcode anything. These are not specific to ETL or data orchestration; they’re general programming best practices, development best practices. So even if it is graphical programming, they still apply. Same with unit testing and integration testing.
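As a small illustration of the “don’t hardcode anything” rule, the sketch below resolves connection details from environment variables in plain Java. The variable names and JDBC URL are hypothetical; within Hop itself this is typically done with ${VARIABLE} substitution in the pipeline metadata.

```java
// Credentials and connection details come from the environment, so nothing
// sensitive ever lands in version control. Variable names are hypothetical.
public class DbConfig {

    public static String jdbcUrl() {
        return "jdbc:postgresql://" + getenvOrFail("DB_HOST") + "/" + getenvOrFail("DB_NAME");
    }

    public static String password() {
        return getenvOrFail("DB_PASSWORD"); // kept out of the repository
    }

    private static String getenvOrFail(String key) {
        String value = System.getenv(key);
        if (value == null || value.isEmpty()) {
            throw new IllegalStateException("Missing environment variable: " + key);
        }
        return value;
    }
}
```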
What we’ve seen in the past is that if you want to do, for example, unit testing, there’s a lot of plumbing that you need to put in place: specialized Docker containers and so on and so on. What we’ve tried to do is always look at the return on investment of metadata. So what is the return on investment for setting up a lot of infrastructure and a lot of manual plumbing? It’s very low, right? You see some benefit from a unit test like that, but it’s not great. So by building integration testing and unit testing into Hop, into the GUI, and making it very easy to create, we’re saying, “Okay, now you really don’t have any reason not to do it anymore.” Right? [Laughs] Well, apart from laziness or not knowing what the value is. But at least those best practices are no different from any other development job you might have.
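A minimal sketch of the kind of cheap unit test Casters is arguing for, written with JUnit 5: when transformation logic is a plain function, testing it needs no containers or plumbing. The cleanPhoneNumber helper is a hypothetical stand-in for a pipeline transform.

```java
// Unit-testing data transformation logic without any infrastructure.
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

class TransformTest {

    // Hypothetical transform: normalize a phone number field.
    static String cleanPhoneNumber(String raw) {
        return raw.replaceAll("[^0-9+]", "");
    }

    @Test
    void stripsFormattingCharacters() {
        assertEquals("+15551234567", cleanPhoneNumber("+1 (555) 123-4567"));
    }

    @Test
    void leavesCleanInputUntouched() {
        assertEquals("123456", cleanPhoneNumber("123456"));
    }
}
```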
Vizard: All right, so we’ve finally reached a point where we’re going to automate the orchestration of data, just like any other DevOps process. So it’s onwards and upwards. Matt, thanks for being on the show.
Casters: Hey, you’re welcome. It’s good to talk to you, Michael.
Vizard: All right. Back to you guys in the studio.