DevOps Chat: Monitoring Spinnaker on GKE with Miles Matthias

On a project to move one of Google’s Fortune 500 customers to Google Cloud, Google Kubernetes Engine (GKE) and Spinnaker open source, our DevOps Chat guest ran into the message “hang tight.” No scripts or documentation. Not to be delayed, Miles Matthias, Google Cloud Consultant with Container Heroes, filled the gap and contributed his work back to the Spinnaker community.

This episode of DevOps Chat features a preview of Miles’s talk, “Monitoring Spinnaker with Prometheus Operator on GKE,” which he will give on Saturday, November 16th, at 3:45 pm PT at Spinnaker Summit 2019. Miles also talks quite a bit about canary testing in this episode.

The transcript of our conversation follows.

Transcript

Mitch Ashley: Hi, everyone. This is Mitch Ashley with DevOps.com, and you’re listening to another DevOps Chat podcast. Today, I’m joined by Miles Matthias, he’s a Google Cloud consultant and he’s with Container Heroes—great company name. Our topic today is actually a preview of his talk at the Spinnaker Summit 2019. His topic is gonna be Monitoring Spinnaker with Prometheus Operator in a GKE Environment, Google Kubernetes Engine. His talk is on Saturday, November 16th at 3:45. Miles, welcome to DevOps Chat.

Matthias: Hey, Mitch. Great to be here. I’m really excited about the Spinnaker Summit.

Ashley: Well, tell us a little bit about you, what you do and a little bit about Container Heroes.

Matthias: Yeah, sure. So, like you said, as a Cloud consultant, I’ve done a whole bunch of software development and infrastructure setup and architecture design throughout my career and now I’m helping other companies set up their cloud infrastructure, make migrations, modernize their application development and all sorts of things like that. And I’m a partner at ContainerHeroes.com, and we’re a group of people like me who have been at previous startups, CTOs, who’ve done the whole VC route. We’ve built a bunch of custom applications for clients throughout our careers and now we enjoy consulting people that are using the cloud.

For the past year or so, especially I guess the last nine months or so, for one of our big clients, we’ve been helping Google with one of their new customers on a complete on-prem-to-cloud, Kubernetes containerization migration, and they utilized Spinnaker. So, I kinda was brought in as the Spinnaker expert to help them get it up and running, to help customize what they needed in their environment and contributed as much as they could back to the open source, including some stuff that I built for using Prometheus Operator, which we can touch on a bit. And so much so that then the team invited me to give a talk at the conference, so I’m excited about it.

Ashley: Sounds like a very—very relevant talk, especially sharing your experience from working with this Google customer.

Matthias: Yeah.

Ashley: So, tell us a little bit about when you came to this project and first started working with the customer. Had you worked with Spinnaker before, was that a new technology, or were any of these new technologies, or you’ve kinda done all this before?

Matthias: So, I’ve done a bunch of CI/CD before. Spinnaker is still in its early days, so I have very limited experience with Spinnaker as an application. It’s still—yeah, it’s still very nascent on the development scene. The entire kinda concept of continuous delivery is still very new in the industry, even though it’s been around for a little while.

And so, CI has been very well developed, everybody knows Jenkins, everybody has ____, a million CI tools, but the concept of continuous delivery, and having more thought-out, sophisticated delivery strategies that can help you do more automated continuous delivery, like canary analysis and different deployment types and automated rollbacks and all these sorts of things, is kind of brand new, and Spinnaker obviously really helps with a lot of that stuff.

Yeah, I had some experience, but was really excited to jump in and kinda get into something that’s really taking off in the industry.

Ashley: Tell us a little bit about the app that you were challenged to work on. I understand this was already, had been developed with Jenkins and Spinnaker before and then you were moving it to GKE, or what was it?

Matthias: No. So, the client had 50 different Java applications, a bunch of different things, and it was all running on prem. So, part of the big effort was to obviously move them onto Google Cloud, but also to help them develop a CI/CD process that allowed them to make a commit in their repo, have artifacts be built in CI, have tests run and then have those passed to Spinnaker and CD and then deployed to GCP.

They had no experience with Spinnaker and didn’t have any kind of CD solution set up, really, especially for the cloud. Like I said, they were completely new to the cloud, so this is all new stuff for them, so I helped them set it up, install it, configure all the options we wanted, manage it and then introduce more and more advanced features as the project got more and more migrated over to GCP.

Ashley: Mm-hmm. So, it sounds like maybe a more traditional Java application environment maybe is not even continuous integration yet. Were they down the path with that?

Matthias: Yeah, they had Jenkins on prem.

Ashley: They did?

Matthias: And so we, you know, yeah. So, the architecture was moving everything to Kubernetes, and so one of the GCP projects has a GKE cluster just for CI/CD, running Jenkins and running Spinnaker on it. So, they had Jenkins on prem, and moved that over to the cloud, hosted it on Kubernetes, and then we installed Spinnaker right alongside it on the same GKE cluster, right?

Ashley: Mm-hmm.

Matthias: And then used those tools as the basis of their CI/CD pipelines to deploy to other clusters within the cloud for them.

Ashley: How about the size of the application? How would you quantify how kind of big or extensive it was?

Matthias: Like I said, they had a—this was a true microservices organization. So, you know, they have 30 to 50 different applications as microservices, each individually being deployed. As far as traffic, I mean, so the applications, like I said, are microservices, so they’re pretty small in and of themselves individually.

As far as traffic and resource usage, it’s a very, very large, large client. Like, pushing the boundaries of some of the largest things we’ve seen.

Ashley: Hmm. Okay.

Matthias: So, that was really exciting—really cool.

Ashley: This was a Fortune 500 company, I understand.

Matthias: Yeah.

Ashley: We’re not talking about the specific company, but—

Matthias: Sure, yeah, yeah, yeah.

Ashley: Well, good. Well, what are the sort of things that you’re planning on talking about, then? I mean, I know this is about the monitoring aspect of it, so you know, are you touching on Prometheus Operator and probably how to configure it or how you decide it and what kinds of groupings to create, what kind of rules and alerts and those kind of things? Are you gonna be talking more fundamental architecture? What are your thoughts?

Matthias: Kind of a bit on everything, because this kind of—this is kind of some required setup in order to do monitoring and in order to do canary analysis, even, if you’re gonna use Prometheus for canary analysis metrics. And it’s also just fundamental to how you wanna have this setup if you’re running Spinnaker on Kubernetes in general.

So, a little background on Spinnaker as a project, which I’m sure other people that are familiar with Spinnaker definitely know, you probably know yourself. Spinnaker was originally created by Netflix. Netflix is still very much a VM shop, right? So, they don’t use Kubernetes, and everything runs on individual instances. And so, Spinnaker can be deployed to run on Kubernetes and that support is there, it has support to deploy applications to other Kubernetes clusters.

But there are still a few things like monitoring, like canary analysis where, in the Kubernetes world, we do things a little differently, and Prometheus is one of those examples, right? In the VM world, you spin up Spinnaker, you spin up one Prometheus instance—and Prometheus obviously is a metric store to collect metrics and then usually is paired with Grafana and ____ and things like that in order to—all CNCF open source projects in order to give you graphs on these metrics.

And so, there was previously, in the VM world of Spinnaker, running it as Netflix does, a microservice of Spinnaker that does monitoring, that connects to all of the different components of Spinnaker, listens for their metrics and then reports it to Prometheus. Cool. Like, it looks great. They even have some dashboards.

So, Grafana dashboards, if you’ve ever worked with those you can just upload some JSON files for your dashboard and then you click there and then you’re like, “Hey, look at these dashboards!” You can get real time dashboards based on the Prometheus metrics of Spinnaker. And Spinnaker, all the different components that it’s made up of and the different metric based on the things that it’s doing so that you can see in there when it’s processing your deployment, here are some of the metrics that are coming off, right?

Ashley: Mm-hmm.

Matthias: So, all of that support was there. When you want to run Spinnaker on Kubernetes, though, that support was a little…not so much. [Laughter]

Ashley: Mm-hmm.

Matthias: And that’s the trouble that we kinda ran into. So, the thing that was there was the concept of, “Hey, we have this monitoring component in Spinnaker,” and if you enable monitoring, that monitoring component will be enabled as a sidecar container to every single pod, every single microservice that Spinnaker is composed of, and it knows how to talk to that microservice and get the metrics from it and then report it somewhere.

Cool—good. Okay, I can get that. What about the monitoring service then having those metrics that are just collected from all the different components and putting it somewhere? Ideally Prometheus, you can also do Stackdriver or, I believe there might be Datadog support now, but mainly, the two that we see used a lot, Stackdriver and Prometheus.

Ashley: Mm-hmm.

Matthias: However, in the Kubernetes world again, especially when you have a bunch of different Kubernetes clusters, one of the tools that you end up using is called Prometheus Operator. And that is essentially a Kubernetes operator that, when you apply it to your cluster, will automatically install Prometheus, Grafana, and Alertmanager for you as pods and deployments running on your cluster. It’s a way to provision your infrastructure so that when you deploy a bunch of applications in these clusters, you already have some monitoring infrastructure set up, right? I can basically go to my organization and say, “Give me a new cluster,” and then I can deploy my application and start emitting Prometheus metrics, and there’s already a Prometheus instance on the cluster, so, like, I’m good to go.

In order to get the Prometheus instance that the Prometheus Operator creates to go and fetch the metrics from the Spinnaker monitoring component, which already existed and already knows how to get those metrics from the different Spinnaker components, you have to kinda connect those two ends, right? And in the Prometheus Operator, Kubernetes world, basically, you just apply a Kubernetes manifest, a custom resource (CRD) called a ServiceMonitor, and that basically tells your instance of Prometheus on the cluster, “Hey, go fetch the metrics from these places,” right? And so, it’s pretty easy, but there was no setup for it in Spinnaker. So, you would read the documentation in Spinnaker and they would have very detailed, very nice instructions, like, if you’re running Spinnaker on VMs and you want to do monitoring and you wanna use Prometheus, just use this fancy and nice, easy setup script, right?
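Editor’s note: As a rough illustration of the kind of manifest Miles describes here, the Python sketch below builds a ServiceMonitor as a plain dictionary and prints it as JSON, which kubectl accepts as well as YAML. The apiVersion and kind are the ones Prometheus Operator defines. The resource name, namespace, label selector, port name, metrics path, and the "release" label are assumptions for illustration and would have to match your own Spinnaker and Prometheus Operator install. This is not the setup script that was contributed to the Spinnaker project.

```python
import json


def build_service_monitor(name, namespace, match_labels, port_name, path,
                          interval="30s", labels=None):
    """Return a Prometheus Operator ServiceMonitor manifest as a dict.

    match_labels selects the Kubernetes Services to scrape; port_name is the
    *named* Service port that exposes the metrics endpoint.
    """
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Labels on the ServiceMonitor itself; a Prometheus resource may
            # only pick up ServiceMonitors whose labels match its selector.
            "labels": dict(labels or {}),
        },
        "spec": {
            "selector": {"matchLabels": dict(match_labels)},
            "endpoints": [
                {"port": port_name, "path": path, "interval": interval}
            ],
        },
    }


if __name__ == "__main__":
    # Every value below is a hypothetical placeholder, not a Spinnaker default.
    manifest = build_service_monitor(
        name="spinnaker-metrics",
        namespace="spinnaker",
        match_labels={"app": "spin"},
        port_name="metrics",
        path="/prometheus_metrics",
        labels={"release": "prometheus-operator"},
    )
    print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and running kubectl apply -f on it would be one way to try this against a test cluster, assuming the Services in front of the monitoring sidecars carry the selected label and a named metrics port.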

Ashley: Mm-hmm.

Matthias: Super nice. Great. And then literally in the documentation, it said, “If you’re running on Kubernetes. Hang tight, support is coming.” [Laughter]

Ashley: [Laughter] The “hang tight” documentation feature. Wonderful!

Matthias: Hang tight—yeah, right. So, none of this is extremely complicated. There are a lot of different components and you have to be careful about how you set them up, right? But the very nice setup script that was available for just running plain old Prometheus on VMs was very nice and like I said, it installed a bunch of Grafana dashboards for you and did a bunch of nice stuff for you. There was no such thing for Kubernetes, right?

Ashley: Mm-hmm, mm-hmm. Kinda starting from scratch. Not completely, but at least to take that next step.

Matthias: Exactly, exactly, exactly. And so, what I did and what I contributed to the project was creating a setup script for exactly this, saying, “Hey, you’ve got a cluster, you’ve got Spinnaker already installed on there, you’ve got Prometheus Operator installed on there and you’ve got Prometheus Operator installed on probably a bunch of other clusters, too. Here’s a setup script that connects the Prometheus instance to Spinnaker so that Prometheus can go and fetch those metrics about how Spinnaker is running and how Spinnaker is handling your deployments.”

Ashley: Mm-hmm.

Matthias: And installs a bunch of these Grafana dashboards that the open source project has already created for us to be able to say, “Here’s a Grafana dashboard about each microservice: Clouddriver, Gate, Orca, all the others. They each have their own dashboard in Grafana that you can go look at and they’re preconfigured and it’s really nice, it’s really neat.” Again, those just had to be kind of converted into the way you apply these in the Prometheus Operator world which, again, is applying a Kubernetes manifest to say, “Hey, here’s a dashboard,” and it’s actually just a config map with a certain label on it. When you apply that, the Prometheus Operator knows how to hook into the Kubernetes master and goes, “Hey, there’s a new config map that is a Grafana dashboard. Let me go fetch it and add it to Grafana for you,” right? That’s the whole purpose of Prometheus Operator.
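Editor’s note: The “config map with a certain label on it” can be sketched the same way. The Python below wraps a dashboard’s JSON in a ConfigMap manifest. The label key grafana_dashboard is an assumption: it is a convention used by some Grafana sidecar setups, and the key your install actually watches for depends on how Grafana was deployed. The dashboard content and all names are hypothetical placeholders, not the dashboards shipped with Spinnaker.

```python
import json


def build_dashboard_configmap(name, namespace, dashboard,
                              label_key="grafana_dashboard"):
    """Wrap a Grafana dashboard (a dict) in a labelled ConfigMap manifest."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # The label is what marks this ConfigMap as a dashboard.
            # The key is an assumption; check your own install.
            "labels": {label_key: "1"},
        },
        # ConfigMap values must be strings, so the dashboard is serialized.
        "data": {f"{name}.json": json.dumps(dashboard)},
    }


if __name__ == "__main__":
    # A stand-in dashboard; a real one would be the exported Grafana JSON.
    dashboard = {"title": "Orca (example)", "panels": []}
    manifest = build_dashboard_configmap("orca-dashboard", "monitoring", dashboard)
    print(json.dumps(manifest, indent=2))
```

One ConfigMap per Spinnaker microservice dashboard keeps each dashboard independently updatable, which fits the per-service dashboards Miles mentions.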

So, that’s kind of the thing that I set up, and then I PR’d it to the documentation. So, now, when you go look at the documentation, it says, “Hey, if you’re running Spinnaker and Prometheus Operator on Kubernetes, there’s also a fancy setup script for you, too!” [Laughter]

Ashley: There you go. Now people won’t have to start at the same place you did.

Matthias: Yeah, really.

Ashley: So, are you planning, in your talk, on actually walking through this, the process of the script you built and how to set all this up, get the dashboards to come up, et cetera, or have you kinda taken a little different approach?

Matthias: I don’t want to anger the demo gods, so I’m not sure if I’ll actually be, you know, typing it, running the actual scripts, but I can certainly have something set up that’s like, “Hey, this is the end result. Let’s step through what the script is actually doing to help you understand what components need to be matched up where, and then here’s the end result,” right?

The other kinda couple things that I’ll probably touch on in the talk are—which is why it’s important. So, all of that is monitoring Spinnaker as an application. So, like, your DevOps or your Release Engineering team, people that are involved in these things who are gonna monitor Spinnaker as an application and are gonna see, “Oh, Orca is overwhelmed and it has a bunch of tasks in its queue that it can’t pop off, so let’s add some more replicas of Orca” or something like that.

That’s what this script helps you set up is this dashboard to see the performance, to see those metrics, right?

Ashley: Mm-hmm, mm-hmm.

Matthias: And other reasons that it’s important are, and the other things I’ll touch on—well, first of all, the one thing I’ll probably touch on is that there’s kind of an effort currently in the works and I hope by the time of the talk, maybe there’s a little more effort built towards it, but there isn’t really a whole lot of guidance in the Spinnaker community about, “Okay, here are the metrics we emit. Here’s what they mean, and if they go outside normal ranges, here’s what you should do to remedy that situation.”

Ashley: Mm-hmm.

Matthias: There are no run books, as you might call them, for how to respond to a Spinnaker instance that is under a really extreme load or whose configuration has kinda gone sideways or something like that.

Ashley: Interesting.

Matthias: Rob Zienert kinda leads the Netflix effort. He’s done a great blog post about some of the metrics and things like that and what each of them means. But that’s kind of the only resource out there. So, there is kind of this effort in the community to say, “Hey, let’s put together an actual run book and match some of these dashboards that we have in Grafana and some of these metrics and be able to tell people, ‘Hey, if you run into this situation, here are steps 1, 2, 3 to check, and if these are the case, then here are the three options to alleviate that problem or address the problem,’” right?

Ashley: Mm-hmm.

Matthias: So, hopefully, by the talk in November, we might have some progress on that in the community.

The other thing that I’ll probably talk about on the topic and why the concept of the Prometheus Operator is so important in the Kubernetes world is for canarying.

Ashley: Say a little bit about what canarying is with your deployment processes.

Matthias: Yeah, sure. Sure, so, canary testing is—it’s a form of testing that evaluates one of your release candidates. It compares it to—what it does is, it actually runs your candidate, and it runs the version that is running in production, starts a new copy of it, and it runs it side by side. So, then it is—and then it diverts usually a small amount of traffic or just, you know, will have 100 pods behind a service and you’ll add 2 pods so they’ll get a proportional small amount of traffic. And you’ll run it and you’ll see how it performs. You’ll listen to the metrics and you’ll see, “Oh, this is how your candidate that you want to push performed.” And in canarying, in the configurations in Spinnaker, you can say, “These are the metrics I care about when evaluating, when deciding if it performed well and if it performed badly.”

For instance, let’s say you have a new version of the application that just, like, 500—500 errors out on every request, right? You run a bunch of unit tests, you run a bunch of integration tests and ideally, at that point, at one of those steps or earlier, you would’ve caught that, right? But tests are written by people and sometimes people don’t write them or sometimes some situation that depends on live traffic and live data actually exposes some types of problems, right?

So, let’s hypothetically pretend you have a new version that comes out and it’s just 500s all over the place. What canarying says, the kind of theory behind it is—let’s actually run it on a small amount of traffic, let’s see what it does. And then canarying would go, “Oh, hey, your release version that you wanna push out? It’s just doing a bunch of 500s, and you told me in your configuration, if you see so many, a spike in 500s, that that’s bad, that we don’t wanna actually then deploy that to all 100 instances running in production,” right?

That canarying says, “Hey, that’s not a good thing to promote,” right? Like, “I gave it a chance, I ran it on some things, I compared it to what the version that is currently running or our new instance of the version that’s currently running, and it didn’t do well. So, you know what? I’m gonna error out, I’m gonna show you this is how it behaved, you guys go fix it, and then when I come back, I’ll run those things again. And if it performs well, then great, we can release the new version.”
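Editor’s note: The decision Miles walks through can be sketched in a few lines of Python. This is a deliberately simplified threshold check for illustration only, not the statistical judgment Spinnaker’s canary analysis actually performs. The function names, the sample numbers, and the 1 percentage point tolerance are all assumptions.

```python
def canary_traffic_share(baseline_pods, canary_pods):
    """Fraction of traffic the canary pods receive behind one Service,
    assuming requests are spread evenly across all pods."""
    return canary_pods / (baseline_pods + canary_pods)


def error_rate(errors, requests):
    """Share of requests that returned an HTTP 5xx; 0.0 if no traffic yet."""
    return errors / requests if requests else 0.0


def judge_canary(baseline, canary, max_increase=0.01):
    """Pass the canary only if its 5xx rate exceeds the baseline's by no more
    than max_increase (an absolute difference; 0.01 is 1 percentage point).

    baseline and canary are (errors, requests) tuples collected over the
    same time window.
    """
    delta = error_rate(*canary) - error_rate(*baseline)
    return "pass" if delta <= max_increase else "fail"


if __name__ == "__main__":
    # 100 production pods plus 2 canary pods, as in the example above.
    print(f"canary traffic share: {canary_traffic_share(100, 2):.2%}")

    baseline = (5, 1000)       # fresh copy of the current production version
    bad_canary = (500, 1000)   # release candidate throwing 500s on half its requests
    good_canary = (6, 1000)    # release candidate behaving like the baseline

    print("bad candidate:", judge_canary(baseline, bad_canary))
    print("good candidate:", judge_canary(baseline, good_canary))
```

Comparing the candidate against a freshly started copy of the production version, rather than against the long-running production pods, keeps the comparison fair, since both sets of pods start cold at the same time.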

And that’s kind of one of the main components that you need for continuous delivery. Because you can imagine a world where developers are just committing, committing, committing, committing.

Ashley: Well, yeah.

Matthias: They do a bunch of tests, they run some tests, to them it all looks good. And then you just have this other automatic component over here that’s doing canary analysis and that can just test it, it can actually put it into production, give it a little bit of traffic, see how it performs, and if it performs well, then let it go, you know? Promote it, right?

Ashley: Yeah, that’s extremely useful.

Matthias: Deploy on a Friday, deploy on a Saturday—who cares, right? It’s actually seeing how it’s running, all the other tests before it’s even gotten to that point have passed, obviously, and if it runs well, then run. And this is what enables organizations like Google, Netflix, and others to deploy thousands of times a day, because it has this automated system to go in and say, “Give it some traffic, see how it runs, and if it runs fine because of the way we configured how we told the system what does it mean to run fine, right? Great—then let it go.”

Ashley: We’re kinda running up against our time, here. I do have one last question.

Matthias: Sure.

Ashley: What kind of folks should come to your talk? Obviously, it seems like developers, people who are working on the CI/CD pipeline, maybe DevOps engineers. Are there other folks you’d recommend, or are those the right folks?

Matthias: Those are definitely all the right folks. I think if you’re interested in Spinnaker or you’re currently using Spinnaker where you are, and you want to know about monitoring it as an application and managing it as an application when it experiences a bunch of load, and how to connect all the dots if you’re running it on Kubernetes and you’re running multiple other Kubernetes clusters that also are using Prometheus Operator, this is a great talk to go listen to.

Ashley: Again, Miles is a Google Cloud Consultant with ContainerHeroes.com. He’s speaking at the Spinnaker Summit 2019 in San Diego. Now, that conference is the 15th through the 19th of November. His talk, again, is on monitoring Spinnaker with Prometheus Operator on GKE on Saturday the 16th at 3:45. Thank you, everyone, for joining us. You’ve listened to another DevOps Chat. This is Mitch Ashley with DevOps.com. Be careful out there.

— Mitchell Ashley