DevOps Chat: CHAOSSEARCH New Spin on Big Data Analysis

Analysis is the dark cloud overhanging big data: it’s difficult without the right tools. Elasticsearch by Elastic has become ubiquitous in helping organizations, but it can get unwieldy, because tightly coupling storage with compute is tricky to scale. CHAOSSEARCH is an early-stage startup that aims to make searching massive amounts of data easy using the Elasticsearch API and Amazon S3.

Pete Cheslock, VP of Products at CHAOSSEARCH and a good friend of DevOps.com, joined me in this DevOps Chat to discuss his latest company’s technology and how it might make big data searches less daunting. The key for big data search and analysis is separating storage from compute, using Amazon S3 and Elasticsearch.

As usual, the streaming audio is immediately below, followed by the transcript of our conversation.

Audio

Transcript

Alan Shimel: Hey everyone, it’s Alan Shimel, DevOps.com, and you’re listening to another DevOps Chat. Today’s DevOps Chat features a returning guest to DevOps Chat. He’s actually probably been on DevOps Chat a number of times, right from I think the very beginning of when we first launched DevOps.com. He’s pretty well-known in the DevOps community and really needs no introduction from me. It’s Mr. Pete Cheslock. Pete, welcome to DevOps Chat.

Pete Cheslock: Hey, thanks a lot. Thanks for having me.

Shimel: Yep. Pete, always a pleasure to have you, man. You know, I remember speaking with you – actually it was probably before I even launched DevOps.com, up in Boston. And you kind of gave me your views on what was then a very kind of just starting to blossom DevOps community and DevOps movement. And since that time obviously a lot of water under the bridge and many different hats that you’ve worn, Pete. Here to introduce today though with your latest position in helping a company grow, and that company is called CHAOSSEARCH, correct?

Cheslock: Yeah, that’s correct.

Shimel: And Pete, you’re VP of Products?

Cheslock: Yeah, yeah, VP Products, Products, VP of Products? I don’t know, it’s a working title. We’re still trying to figure out what the best way to –

Shimel: As long as they don’t call you a DevOps engineer, it’s all good. [Laughs]

Cheslock: Exactly.

Shimel: I hear ya. So Pete, well let’s start off with this. I’m going to assume a lot of people listening to this today don’t really know or are not familiar with CHAOSSEARCH, so why don’t we start with that? What’s CHAOSSEARCH?

Cheslock: Yeah, absolutely. And yeah, I think that’s definitely a good assumption. You know, we’re an early-stage startup. We’re just going through seed funding right now, so we’re just going through an A round of fundraising, so then we’re going to be coming out to the market in a few months here. But basically what CHAOSSEARCH is is they’re trying to make it easier to access and make query-able your data as it exists on object stores. So if you think of an object store like Amazon’s S3 service. And so they developed basically a technology that allows you to store data. They basically process your data from S3 and store it back into your bucket. So what’s really interesting is they don’t hold any of your data. They’re simply – you know, you give them like read-only access to your S3 data. They’ll read in that data and shove the compressed, processed data back into your bucket. And then from there they expose APIs to allow you to access it.

So for example, they’ll extend the S3 API to give you some really interesting capabilities of slicing and dicing your data. But the thing that they’ve exposed that’s most interesting, which got me excited about the company, is that they’ve exposed the Elasticsearch APIs on top of your data on S3. And so it’s really compelling, because I spent a very long time as a really early user of Elasticsearch, and it’s a great technology; it’s a great open source database. But it gets unwieldy over time. And so what really blew me away when I saw the product and what they’d built is that I can point open source log tools like Kibana to the CHAOSSEARCH APIs, accessing my own data on S3. No Elasticsearch required. And so the kind of short pitch, we’re trying to make that long tail of data available to people, whether it’s log data or event data of some kind, where customers can run much smaller Elasticsearch clusters of their log data or event data, and then they can essentially push the data up to S3. We’ll process it, and now they can get access to that data for weeks, months or even years.
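[Editor’s sketch: to make “the Elasticsearch APIs on top of your data on S3” concrete, the snippet below builds a standard Elasticsearch `_search` request of the kind Kibana or any other client would send. The base URL, index name, and field names (`message`, `@timestamp`) are hypothetical placeholders, not actual CHAOSSEARCH values, and the request is only constructed, never sent.]

```python
import json
from urllib.request import Request

# Hypothetical endpoint for illustration only; not a real CHAOSSEARCH URL.
BASE_URL = "https://search.example.invalid"


def build_search_request(index, term, since="now-90d"):
    """Build (but do not send) an Elasticsearch-style _search request."""
    body = {
        "query": {
            "bool": {
                # Full-text match on an assumed "message" field.
                "must": [{"match": {"message": term}}],
                # Restrict to recent documents via an assumed "@timestamp" field.
                "filter": [{"range": {"@timestamp": {"gte": since}}}],
            }
        },
        "size": 10,
    }
    return Request(
        url=f"{BASE_URL}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_search_request("app-logs", "timeout")
print(req.get_method(), req.full_url)
```

[The point of the sketch is that the client side stays the same: if a service accepts this query shape, existing Elasticsearch tooling should only need a different endpoint.]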

Shimel: Yeah. And Pete, that really highlights sort of a soft white underbelly or a dark corner around the whole big data issue. It’s great to have – I mean, big data and – big data is only as good as the analysis we can do with it, right? That we can perform on it. And with big data we could certainly get some really keen insights to show us relationships, trends, facts that we might otherwise miss. But the dirty little secret was, for big data, to really get the value out of it, generally you needed big money kind of search to perform that kind of analysis. Because the bigger your data got, the harder it was to actually manage it, search it, analyze it. And so CHAOSSEARCH seems like a real solution for those people who are literally drowning in data.

Cheslock: You’re exactly right. I worked at a company eight or nine years ago called Sonian, which – not a lot of people have heard of Sonian. They were one of the earliest users of Amazon’s S3 and EBS. At the time we were probably a top-three user of S3 and maybe a top-five user of EBS. What’s amazing about Sonian is the number of companies that actually came out of that place. Companies like CloudHealth, which just got acquired by VMware. Sensu was a technology that was built inside of there. Stackdriver was another one. And the reason all these startups came out of Sonian is because they were so early in the cloud space, they had to build out the technology to support it.

One of the other ones that kind of came out on the side, not really related to us, was Elasticsearch. We were trying to index customer e-mail data, like an e-mail archiving product. And so we got connected with Shay from Elasticsearch before he really started the company because we were so blown away by what he had built. And Elasticsearch at the time made sense, right? Using Lucene to store indexes of customer e-mail data allowed the indexes to be much smaller than the source e-mail files. But then over time, over the last say eight years or so that Elastic’s been around, there’s been this explosion of using Elasticsearch for log data. And in many cases, like structured JSON that we’re shoving into Elasticsearch. And due to the inverted index nature of Lucene, the data blows up inside of Lucene where you might actually store four or five times more in disk storage on Elasticsearch.

So if you shove in a terabyte of JSON logs into your Elasticsearch cluster, that could be three or four terabytes of Lucene indexes. And then you have to think about like, “Well, I want durability and reliability, so now I need to like run a replica index.” So now you’re talking for every terabyte into Elasticsearch you need like eight terabytes of physical disk. What was really amazing about what the founders and the early team here have created is the format, kind of solving for the Lucene problem. They’ve actually been able to reduce the size, yet still making the data query-able, using kind of the same APIs, which is what really blew me away, in that it’s now – you know, so many people have to make the choice. Do I keep my data for analysis, at a really high price of Elasticsearch? Or, do I throw it away? And I think a lot of people just throw that data away, because they just can’t make it worthwhile in any sort of cost effective way.
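[Editor’s sketch: the back-of-the-envelope arithmetic Cheslock walks through can be written out as below. The expansion factor and replica count are the rough figures quoted in the conversation, not measured values, and the function name is ours.]

```python
def physical_disk_tb(raw_tb, index_expansion=4.0, replicas=1):
    """Estimate the physical disk needed for raw log data.

    raw_tb:          terabytes of raw JSON logs ingested
    index_expansion: how much larger the Lucene indexes are than the raw
                     data (the interview quotes roughly 3x to 4x)
    replicas:        number of replica copies kept for durability
    """
    # The primary copy plus each replica stores the full expanded index.
    return raw_tb * index_expansion * (1 + replicas)


print(physical_disk_tb(1.0))                       # 8.0
print(physical_disk_tb(1.0, index_expansion=3.0))  # 6.0
```

[With a 4x expansion and one replica, one terabyte of logs becomes the “eight terabytes of physical disk” mentioned above; at 3x it is closer to six.]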

Shimel: Right, not in an economic way. Pete, we’re far from a big data solution here at MediaOps, between our seven or eight websites, but I will tell you, just pure search, like an index search on DevOps.com has become a struggle. Because a lot of the native WordPress search – you know, especially if it’s on server, is killing resources on your web server and so forth. So you want to do it sort of off-server search and it’s just a pain in the butt. Take my word for it. So this is great. You said the company’s launching in a few months though? Not that it’s launching – it’s looking for seed money now. Product will be in market. Where exactly is it now in terms of – is it in beta? Is it alpha?

Cheslock: So yeah, we’re probably going to be closing our A fundraising round over the next 30 days or something like that. The company actually started about two and a half years ago, and so they’ve been essentially building this technology out over the past couple of years and really trying to refine it as they were kind of coming up with this like product-market fit. Like what’s the market we’re going into? And as the founders really looked at the space, they realized that log and event management is where a lot of pain is right now. What’s interesting is when people start using the product, they realize that they can process any kind of data with this and query it. And that’s where there’s a lot of interest, where maybe it’s not just log and event data. Maybe it’s like some long tail of data that you want to get access to or to monetize for your customers.

But where we’re at right now, and what was exciting about me coming onboard is not only did my role change completely – I used to run Technical Operations for Threat Stack, which is a pretty successful cloud security company that’s still just crushing it. I changed my role over to become the Head of Product. So working with customers to understand their use cases and how they’re using tools like Elasticsearch. So where we’re at right now is I’m basically looking for what we’re calling design partners, where – the best way to describe the product is in that beta phase, where we’re looking for people with really interesting datasets and maybe people who are on Amazon and have data on S3 or are using Elasticsearch, and getting them into our proof of concept setup right now, where we can have them run queries, test the analysis, test how they might use it and really help us refine the final bits as we kind of _____ things up towards GA.

And then, once we go to GA, towards – I would say October timeframe I think is what we’re aiming for, it’s really to continue to refine these really specific use cases like, “I’ve got data on S3 and I want to process this data.” We’ve had some customers that want to push their Elasticsearch indices up through the service, which is a really interesting idea, because those Lucene indexes already have a defined schema. And so that way, instead of storing seven days of data in Elasticsearch, maybe they just store one day and that’s in their hot cluster in Elasticsearch on EC2 or something. And then days two through 365 are up on S3 and are available through the CHAOSSEARCH service. So yeah, that’s kind of what the next month or two looks like for me. You know, really trying to work with customers, people with interesting datasets. You know, people who have just way too much Elasticsearch and they want less of it. You know, they want to kind of run less servers, spend less money – which I feel like is kind of everyone out there right now.
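[Editor’s sketch: the hot/cold split described here (one day in a small Elasticsearch cluster, days two through 365 on S3) amounts to routing by data age. The tier names and the one-year cutoff below are illustrative assumptions taken from the numbers in the conversation, not product behavior.]

```python
from datetime import date, timedelta

HOT_DAYS = 1          # days kept in the hot Elasticsearch cluster (assumed)
RETENTION_DAYS = 365  # total days kept before data expires (assumed)


def tier_for(index_day, today, hot_days=HOT_DAYS, retention_days=RETENTION_DAYS):
    """Return which tier should serve queries for a given day's index."""
    age = (today - index_day).days
    if age < hot_days:
        return "hot-elasticsearch"  # day one: small cluster on EC2
    if age < retention_days:
        return "s3-archive"         # days two through 365: data on S3
    return "expired"                # older than the retention window


today = date(2018, 9, 1)
for days_back in (0, 1, 30, 364, 365):
    print(days_back, tier_for(today - timedelta(days=days_back), today))
```

[Because older data is queried far less often, a split like this lets the hot cluster be sized for one day of data instead of seven.]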

Shimel: Excellent. Very cool stuff. I mean, Pete, look, for you to get up and leave Threat Stack and come over here, obviously it had to be something really appealing and compelling, to bring you over there. Pete, it’s interesting. My background prior to DevOps.com was founding and building companies, and one of the things I learned is it’s great to have these grand – come up with the idea of Docker for containers and change the way – you know, the basic platform that we use, right? For our apps and for IT. But those revolutionary kinds of things are really kind of few and far between, especially the ones that succeed. Really IT moves along in evolutionary kind of ways, where it’s a thing like this. It’s things like what CHAOSSEARCH is doing, that make it easier, better, faster, more economical to really get a hold of your big data and do the analysis and get a handle on it that you need to make it valuable. Right?

Cheslock: Yeah. Exactly right.

Shimel: And so, you know, this just feels that way to me – that this is a lot – some people may say, “Well, it’s not earthshattering.” But no, but it’s the logical step. It’s the glue that you need to – you know, big data is a big concept and it’s great and it’s going to revolutionize things when we get it. But you need the CHAOSSEARCH or sort of the glue to make these things happen, right?

Cheslock: Yeah, what’s amazing is, as I’ve talked to people running a lot of Elasticsearch, a lot of data on S3, it’s a really easy sell, because I can go to them and get them away from running these huge clusters that – they may not even take a lot of queries too. That’s what’s interesting is maybe they’re only querying their data in an automated way the last hour or 24 hours. But then you start looking out seven days, a few weeks, maybe a few months; those queries are significantly less. And so you have to either build out like a tiered storage. But at the end of the day the storage and compute are coupled so tightly that your scalable unit on Amazon or any cloud provider is a lot of disk and a lot of CPU all together. And what CHAOSSEARCH has done is uncouple storage and compute, where the storage is on S3, specifically your S3 account, and the compute is the CHAOSSEARCH service.

And so it’s a totally different way of thinking about, “How do I access my data? How do I make it query-able?” And now, because it costs so little – and we’re talking like go-to-market pricing a tenth of the price of running Elasticsearch yourselves, and significantly less than a lot of the other kind of hosted log management vendors. You know, it’s like now I can keep my data for months or a year and I can run machine learning models cost-effectively, because I can keep so much data. That’s the part that’s super interesting, is, how does this change data science and those companies that really want to have the long tail of data for ML/AI testing, and how does this change how they work their models, how they build it in? And we’re even trying to build in like TensorFlow integration into the application so that you can just access it from within without needing to have all this crazy data science experience.

Shimel: Excellent. It really is. Pete, you know, 15 minutes goes quick. We’re coming up on our thing here. You’re VP of Product. Where do you see this going six months, a year, two years out? It’s a long way to look, but where do you see yourself taking the product direction here?

Cheslock: Yeah, no, I always like to joke that the next six months are clear and in focus. Things get really fuzzy really quick after that. But the nice thing is I think we have a pretty clear plan going out of it. Through the end of the year, reaching GA, bringing some customers onboard that we can solve a problem for them. But really, early next year we’re going to start beta-ing a self-service kind of model, where people can come in and try it out. And probably doing even a freemium model, because since the data is in your S3 bucket and we’re just processing that data, cost is very low. So being able to provide someone a couple of gigs a day of data processing, it’s pretty easy for us. And then that way people can kind of get their feet wet and try it out and we can really start building a community around it.

And then from there, I think the sky’s the limit. You know, we’re looking at some time end of next year, the ability to run it within your own data centers. Right now we want to be kind of pure SaaS so that we can continue and build it out and improve it in our kind of very quick, iterative development cycle. But really it’s a data platform we’re building. I mean, we’re going after that log and event management space because it’s so painful for so many people and it costs a lot of money. But the long-term for us is, what other APIs could we expose on top of your data on S3? Elasticsearch seemed to make a lot of sense, given the market. We’ve extended S3 with some interesting features. But like what else is there? What are different ways people are trying to access different data? And so I think that will be a lot of fun to kind of chat with people, understand their pain and maybe actually bring them a nice solution to that pain.

Shimel: Yeah, absolutely. And I’m sure you’ll have a ball doing it, Pete. Anyway, we’re over time here. Pete Cheslock, newly minted VP Products, CHAOSSEARCH. Thanks for joining us on this episode of DevOps Chat, and we look forward to hearing more about CHAOSSEARCH and about your adventures, Pete.

Cheslock: Yeah, thanks a lot. Thanks for having me. This was a lot of fun.

Shimel: Always a pleasure. Pete Cheslock, CHAOSSEARCH. This is Alan Shimel for DevOps.com. You’ve just listened to another DevOps Chat.

— Alan Shimel