Introducing chaotic, unpredictable test software into your methodical testing regime is a good idea, right? Yes, it’s a branch of testing called Chaos Engineering, or Chaos Testing. Netflix’s Chaos Monkey famously introduced many of us to the idea that resilient systems, networks and software become more resilient and less brittle if we use chaotic testing methods to find their weak points before customers do.
An innovative engineer at xMatters launched a new open source chaos testing tool named after H.P. Lovecraft’s nightmarish character Cthulhu. Cthulhu is designed to test across multiple cloud providers, initially supporting Google Cloud with plans to support Amazon Web Services. It’s open source, free to use, looking for more contributors, and is available on GitHub.
We are joined on this DevOps Chats by Tobias Dunn-Krahn, CTO, and Gabrielle Gasse, lead engineer on the Cthulhu open source project, both at xMatters.
As usual, the streaming audio is immediately below, followed by the transcript of our conversation.
Mitch Ashley: Hi, everyone. This is Mitch Ashley, with DevOps.com, and you’re listening to another DevOps Chat podcast. Today, I’m joined by Tobias Dunn-Krahn, who is CTO at xMatters, and also Gabrielle Gasse, who is lead engineer of a software project called Cthulhu, which is our topic today.
Now, if you’ve read the H. P. Lovecraft books, you know at least the fictional character that we’re talking about, but we’re gonna get into this software and find out a little bit more about what it does.
Tobias, Gabrielle—welcome to DevOps Chat.
Tobias Dunn-Krahn: Thank you.
Ashley: Well, let’s start by introducing yourselves. How about you, Tobias, if you would start, just introduce—tell us a little bit about yourself, what xMatters does, what you do there as CTO.
Dunn-Krahn: Sure. So, right, my name is Tobias Dunn-Krahn, CTO at xMatters. What that entails is purview for development operations, quality and product strategy. So, that’s my role.
What does xMatters do? xMatters is a digital services availability platform. What that means in a practical sense is that xMatters helps teams that are responsible for digital services provide a high level of up time and reliability for those services. So, what that means is filtering unwanted signals for those services, engaging resolvers when those signals are relevant, facilitating collaboration during the resolution of an incident and, as well, integrating tools in a tool chain to eliminate any manual effort that’s involved in resolving an incident so that it can be done in a very timely fashion.
So, that’s what xMatters does in a nutshell, and what I do there.
Ashley: Interesting, yeah. I’m interested to see how Cthulhu fits into that. Gabrielle, would you introduce yourself and tell us a little bit about what you do at xMatters?
Gabrielle Gasse: Yeah, of course. So, I’m the lead engineer working on Cthulhu, our chaos testing tool. On a day to day basis, I’m also just a Java developer working on the system itself.
When I came into the company about a year ago, I was tasked to verify the resiliency of our system as we moved it to Google Cloud. And so, I did some research on existing tools and realized that we needed something a bit different from what was available already. And so, that’s how I came to start building Cthulhu.
As this is not the core product of xMatters, and we do want to promote the use of chaos engineering as a day-to-day practice for everyone, we decided to release the tool itself as open source.
Ashley: Great. Maybe we should start with that, for folks that don’t know what chaos engineering or the theory behind it is, do you wanna say a little bit about that, Gabrielle?
Gasse: Of course. So, the idea behind chaos engineering, if you think of a distributed system made of multiple microservices, you may end up with a fairly large amount of small VMs or containers running in parallel. And, as things are deployed in the cloud, sometimes failure happens. It always happens.
And so, chaos engineering takes that as a premise, and it says that if you expect to fail all the time, then you will build your software with resilience in mind so that failure is not a problem. And so, this is where the idea of chaos engineering comes from. And so, a tool like Cthulhu is something that will run in our system and introduce outages so that we can verify that our system detects those issues and, in most cases, ideally, recovers programmatically or automatically from those failures so that we don’t need to page one of our engineers at 3 a.m.
Ashley: Tobias, is this a project or an idea that you came up with or you and Gabrielle did it together, Gabrielle came to you, or how did this all get started at xMatters?
Dunn-Krahn: No, I will claim no credit for this project. [Laughter] I will hand that completely to Gabrielle. But another way of looking at that is, as we decomposed our monolith over time into microservices, what we really wanted to do was reduce the scope of subject matter expertise and knowledge amongst our teams to focus that on a smaller part of our system. So, that reduces complexity for individual teams and being able to keep those services up and running and highly reliable.
But one of the things that it also introduces is, it chases that complexity into the space between the services—different failure modes of those services, different complexities around the dependencies.
So, during our digital transformation we recognized that early on, and Gabrielle came up with the idea of introducing chaos testing for this and also that the existing chaos testing tools were not suited for our purposes, and I’ll let her explain how Cthulhu was different.
Ashley: Okay, great. Gabrielle, why don’t you pick it up from there?
Gasse: Yes. So, a year ago, when we started working on chaos engineering, there were a few tools available. Chaos Monkey is a very popular one. At the time, Chaos Monkey worked primarily for Amazon Web Services, and it is also tied to deployments managed in Spinnaker. But we’re on Google Cloud, and so it didn’t work for us.
We came up with this tool that can connect to lots of different cloud services. Currently, we support Google Cloud and Kubernetes, but we’re adding other services like Amazon Web Services, for example. And so, it can run those failure scenarios in concert, and also, we wanted to have the ability to version control those tests so that when we find a particular scenario that does cause a failure, we could have a way to produce it as a file, if you like, that we could then attach to a bug that we could give to the engineer to work on.
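To make the idea of a scenario kept as a version-controlled file more concrete, here is a minimal Python sketch. The JSON layout, the field names (`at_seconds`, `platform`, `action`, `target`) and the function `load_scenario` are assumptions made up for illustration; they are not Cthulhu's actual file format or API, and Cthulhu itself is written in Java.

```python
import json
from dataclasses import dataclass

# Hypothetical scenario file content. In practice this text would live in a
# file under version control and could be attached to a bug report.
SCENARIO_JSON = """
{
  "name": "lose-one-api-pod-then-a-vm",
  "steps": [
    {"at_seconds": 30, "platform": "gcp", "action": "stop_vm", "target": "worker-1"},
    {"at_seconds": 0, "platform": "kubernetes", "action": "delete_pod", "target": "app=api"}
  ]
}
"""

# Platforms this sketch accepts (an assumption, mirroring the two mentioned above).
SUPPORTED_PLATFORMS = {"kubernetes", "gcp"}


@dataclass(frozen=True)
class Step:
    at_seconds: int  # offset from scenario start
    platform: str    # which platform the failure targets
    action: str      # what kind of failure to introduce
    target: str      # which resource to break


def load_scenario(text):
    """Parse a scenario and return (name, steps sorted by start time)."""
    doc = json.loads(text)
    steps = [Step(**raw) for raw in doc["steps"]]
    for step in steps:
        if step.platform not in SUPPORTED_PLATFORMS:
            raise ValueError(f"unsupported platform: {step.platform}")
    return doc["name"], sorted(steps, key=lambda s: s.at_seconds)


if __name__ == "__main__":
    name, steps = load_scenario(SCENARIO_JSON)
    print(name)
    for step in steps:
        print(f"t+{step.at_seconds}s {step.platform}: {step.action} {step.target}")
```

Because the scenario is plain data, the same file can be replayed by the engineer who receives the bug, which is the repeatability the version-control idea is after.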
Ashley: Let me ask, then, you mentioned Chaos Monkey, which to me, was sort of a natural comparison for what you’re doing—does Cthulhu run in the background, kinda running all the time, creating its havoc, much like Chaos Monkey does of getting things to go down and see if they recover in a production environment? Is it similar for Cthulhu or are you taking a little bit of a different approach?
Gasse: It is similar, although there is a different way to run it. In usual scenarios at xMatters, we are running it on an as-needed basis. So, we have a small scenario that we’ll run immediately. It’s also possible to run Cthulhu, as you mentioned, in the background and it will pick up tests and run them, essentially.
Ashley: Mm-hmm. So, run it in a targeted, controlled manner, or let it run in the background and do its thing, it can go either way?
Gasse: Yeah, exactly. The thought was very much to have this ability to run it constantly like Chaos Monkey, but also give a tool for an engineer to perform a scenario locally or in a development environment to try and reproduce and understand failures.
Ashley: And you mentioned, of course, Chaos Monkey was created by Amazon Web Services, functions very well within an AWS environment—was multiple cloud services the main differentiator for Cthulhu, or was also getting into Kubernetes, Docker, Containers something that you kinda took on as a special added functionality that Cthulhu does that maybe another tool doesn’t? What are all of the differentiators that you’ve created Cthulhu for?
Gasse: Chaos Monkey was actually created by Netflix, who have their stack on Amazon Web Services.
Ashley: Oh, you’re right. Yeah, my mistake—thanks for correcting me.
Gasse: But, as you mentioned, Chaos Monkey works on virtual machines in Amazon Web Services exclusively. There are other tools like Pumba that work at the Docker container level. PowerfulSeal came out last year around the time that we were looking at that, and PowerfulSeal works on Kubernetes deployments at the pod level, but also at the node level, which is pretty interesting.
But there’s no one system that will test across all of those platforms. You need to use PowerfulSeal in concert with Pumba, and there was nothing that was running directly on Google Cloud at the time. Now, Chaos Monkey does support Google Cloud.
So, the main value that Cthulhu brought for us was the ability to have a single tool that could perform complex failure scenarios across both our Kubernetes deployments and our VMs. And we also wanted the ability, as I mentioned earlier, to version control some of those tests so that we could pass them on to our developers. With Chaos Monkey and those other tools, you run them with some parameters to filter out certain machines that you don’t want them to break, and then it just rolls from there. There’s no way to build a scenario from it that is repeatable.
Part of what we test with it is that our system is able to detect failures. So, as we run a scenario, we’re expecting errors to be logged in Splunk and alerts to be sent out to us. And so, the tool itself simply introduces failures, and then it’s up to the engineer to look for what is broken.
Ashley: Mm-hmm, mm-hmm. Very good. Tobias, how does Cthulhu fit into the strategy of xMatters?
Dunn-Krahn: So, we would—you know, our personal experience or our corporate experience of going through our own digital transformation was—and a continuing process, I don’t wanna make it sound like we’re done, we’re never done. But it was very instructive, and we learned a lot, and we’d like to share that. And many of our customers are going through the same thing, so we would like to provide them the tools that they need to be successful in these transformations.
So, Cthulhu in particular aids in the transition process more than what you would—the situation you’d be in if you were a Cloud Native company. So, we wanted to provide that support to our customers and to the community at large. We have long been users of open source software, and we wanted to contribute back to the community as well as get all the advantages of having an extended developer community contributing to Cthulhu and we can all make it better together.
Ashley: Great. So, Cthulhu is an open source project. Do you have it hosted at GitHub or your own servers? Where do people find it?
Dunn-Krahn: Gabrielle, you can confirm, but I believe it’s on GitHub.
Gasse: That’s correct, yes.
Ashley: Great. You know, sometimes open source projects are open, but largely, most of the work is done by a lead developer or someone who created the idea. Sometimes, it’s a very vibrant community where people are testing or creating new features. You know, and part of that is also not only the style of the project but also the maturity of it.
Where are you in that? Is it mostly you, Gabrielle, who’s doing the work? Do you have—what kinda activities do you have from the outside from other people?
Gasse: Yeah, the publication of Cthulhu on GitHub is fairly recent. Since we promoted it, we’ve had a fair number of people local to Victoria following the project, but we have yet to see people really picking it up or starting to contribute modules for it.
I think one of the next additions that will really help motivate people to use Cthulhu but also contribute additional modules will be us supporting Amazon Web Services in particular.
Ashley: Mm-hmm. And how long has it been out? I should’ve asked that originally.
Gasse: It’s been out for a few months, only.
Ashley: Okay. So, still pretty early in its life. So, it’ll be interesting to see, as more people use it, that’s usually where someone will get interested and say, “Well, I wanna put some stuff in here. I wanna try to make some changes and contribute some things.” What’s Cthulhu written in?
Gasse: Cthulhu is written in Java. We’re using the Spring Boot platform, and as part of that, there is an easy way to write modules that plug into the platform and add functionality. So, it’s easy to add new ways to break things in Cthulhu without having to understand the entire code base.
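The general plug-in idea described here can be sketched in a few lines. This is a Python illustration only, not Cthulhu's Java or Spring Boot code; the registry `FAILURE_MODULES`, the decorator `failure_module`, and the two example actions are hypothetical names chosen for the sketch.

```python
from typing import Callable, Dict

# Registry mapping an action name to the function that performs it.
FAILURE_MODULES: Dict[str, Callable[[str], str]] = {}


def failure_module(action: str):
    """Decorator that registers a function as the handler for `action`."""
    def register(func):
        if action in FAILURE_MODULES:
            raise ValueError(f"duplicate module for action: {action}")
        FAILURE_MODULES[action] = func
        return func
    return register


@failure_module("stop_vm")
def stop_vm(target: str) -> str:
    # A real module would call a cloud provider API here; this only reports.
    return f"would stop VM {target}"


@failure_module("delete_pod")
def delete_pod(target: str) -> str:
    # Likewise a placeholder for a call to the Kubernetes API.
    return f"would delete pods matching {target}"


def run(action: str, target: str) -> str:
    """Dispatch to whichever module registered itself for `action`."""
    try:
        handler = FAILURE_MODULES[action]
    except KeyError:
        raise ValueError(f"no module registered for action: {action}") from None
    return handler(target)


if __name__ == "__main__":
    print(run("stop_vm", "worker-1"))
    print(run("delete_pod", "app=api"))
```

The point of the design is that `run` never changes when a new kind of failure is added: a contributor writes one function, registers it, and does not need to read the rest of the code base.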
Ashley: What are other things that we should know about Cthulhu?
Gasse: I can talk briefly about the next item on the roadmap, if—
Ashley: That’d be great, yeah.
Ashley: Sure—love to hear that.
Gasse: Mm-hmm. So, after the support for AWS is done, my next task, if you like, is to start adding functionality to run commands through SSH and send files through SCP to target hosts. And that opens the door to a whole new range of tests. For example, one could pause a process on the target virtual machine to simulate a deadlock or use tools like Traffic Control to introduce noise on the network, lost packets and all that.
And so, once we have that support, we’ll really be able to have lots of interesting types of failure that go beyond simply just, a virtual machine has shut down and is no longer available.
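As a rough illustration of the kinds of remote commands mentioned here, the Python sketch below only builds command lines and does not execute anything. It assumes a Linux target with the standard `kill` and `tc` (with the `netem` queueing discipline) utilities; the host name, interface name, and helper function names are hypothetical, and this is not how Cthulhu implements the feature.

```python
import shlex


def pause_process_cmd(pid: int) -> str:
    """SIGSTOP freezes the process without killing it, so it stops responding."""
    return f"kill -STOP {pid}"


def resume_process_cmd(pid: int) -> str:
    """SIGCONT lets a stopped process continue."""
    return f"kill -CONT {pid}"


def packet_loss_cmd(interface: str, loss_percent: int) -> str:
    """Use tc's netem qdisc to drop a percentage of outgoing packets."""
    if not 0 <= loss_percent <= 100:
        raise ValueError("loss_percent must be between 0 and 100")
    return f"tc qdisc add dev {shlex.quote(interface)} root netem loss {loss_percent}%"


def clear_netem_cmd(interface: str) -> str:
    """Remove the netem qdisc again to restore normal networking."""
    return f"tc qdisc del dev {shlex.quote(interface)} root netem"


def ssh_argv(host: str, remote_command: str) -> list:
    """Argument list for running `remote_command` on `host` through ssh."""
    return ["ssh", host, remote_command]


if __name__ == "__main__":
    # Hypothetical target; printing only, nothing is sent over the network.
    host = "chaos@10.0.0.5"
    for command in (
        pause_process_cmd(4242),
        packet_loss_cmd("eth0", 10),
        clear_netem_cmd("eth0"),
        resume_process_cmd(4242),
    ):
        print(ssh_argv(host, command))
```

Pairing every failure command with its undo command (`-CONT` after `-STOP`, `qdisc del` after `qdisc add`) keeps a scenario from leaving the target host permanently degraded.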
Ashley: Tobias, I appreciate very much you handing all the credit where credit is due to Gabrielle. Also appreciate your thoughts on where do you see this going from your perspective, either strategically or aligned with product plans, product strategy—what do you see as the future for Cthulhu?
Dunn-Krahn: Sure. Well, as I mentioned earlier, the xMatters product is built to help teams support digital services, and part of that is making sure not only that you plan for an incident but that you can simulate real world incidents and do some form of practice or maneuvers.
So, the way I see this being integrated into the product in the future is facilitating those types of practices and then evaluating the performance of the team or that all of the configuration in xMatters is correct to solve problems as quickly as possible, et cetera.
Ashley: Excellent. Well, you know, we’ve probably just barely scratched the surface, and I’m excited to see what you all do and what the community does with Cthulhu. I’d love to thank you both, Tobias Dunn-Krahn and Gabrielle Gasse from xMatters, for being on the podcast. Additional information might be available at xMatters, is that true, Tobias?
Dunn-Krahn: That’s a good question. Yes, there’s certainly some material on the website, or you can search for it on the web if you can figure out how to spell it.
Ashley: [Laughter] Well, let me give everybody a head start, it’s C-T-H-U-L-H-U, so, just the way H. P. Lovecraft spelled it. Well, I’d like to thank both of you, Tobias and Gabrielle, for joining us today on the podcast. Time has flown by again, and hopefully, we can have you back another time as this evolves and you can tell us some more stories about where you’ve taken it and what people have done with Cthulhu.
Dunn-Krahn: Alright. Thanks very much for having us.
Ashley: You bet. Thank you. This is Mitch Ashley with DevOps.com and you’ve listened to another DevOps Chat.