DevOps Chats: Canary Deploys Using Istio, with Autodesk

Spinnaker Summit 2019 Preview: Canary deploys give us a window into how new code deploys perform in production on a limited basis. The technology holds great promise in helping us learn the positive or negative effects of deploys without putting the more extensive set of microservices, application functions and beyond at risk.

How do you route traffic? How do you apply a virtual service manifest? Is there a quick way to perform a smoke test to find server errors? When comparing metrics from new code to old, how do you know which results are good or bad? There is a lot to know and learn about how to deploy and use Canaries.

Our DevOps Chat guest Omar Al-Hayderi, engineering manager and former principal engineer at Autodesk, is giving a talk on “Canary Deploys with Istio: Lessons Learned” at the Spinnaker Summit 2019. His talk is on Sunday, Nov. 17, at 1:30 PM PT.

Transcript

Mitch Ashley: Hi, everyone. This is Mitch Ashley with DevOps.com and you’re listening to another DevOps Chat podcast. Today, I’m joined by Omar Al-Hayderi. He is engineering manager, formerly principal engineer at Autodesk. And Omar is gonna be speaking at the Spinnaker Summit 2019 in San Diego. His talk is on Canary deploys with Istio. It’s happening on Sunday, the 17th of November, 1:30 p.m.

Omar, welcome to DevOps Chat podcast.

Omar Al-Hayderi: Thanks for having me, Mitch.

Ashley: Awesome to have you here. I’m excited to hear about your talk. But first, tell us a little bit about yourself, introduce yourself, and what you do at Autodesk. I think we know about Autodesk, maybe what part of Autodesk that you work in.

Al-Hayderi: Right, yeah. So, I’ve actually gone through a lot of change recently. I was working at a startup called PlanGrid, and we were a small construction software company four years ago, grown a lot since then. I started off at that company as sort of a backend generalist. You know, there was like 25 engineers, really, at the company, and have grown a lot since then, since the acquisition. We got purchased at about 400 employees.

In that time, I’ve sort of evolved naturally into more of a DevOps infrastructury engineer role, and—yeah, it was kinda the best four years of my life. Really learned a lot there. Post-acquisition, I moved into a principal engineer role on the infrastructure team. We then went through sort of a re-org and I became the engineering manager of the back end platform team. And now, that team, we sort of specialize in building a lot of tooling around our sort of DevOps and infrastructure cloud.

Ashley: Mm-hmm, great. Well, great experience to bring with you to Autodesk. It sounds like you’re working on some fantastic things. How did you come to decide that you’re gonna talk about Canary using Istio at the Spinnaker Summit? Was that something you chose to kinda bring in and start up at Autodesk, you joined a team already using it? What’s kinda the background on how you kinda came to this place to wanna talk about it?

Al-Hayderi: Yeah. So, I was sort of, you know, dipping my toes into Spinnaker at the early stages of it, and I was really fascinated by the tool. I loved using all sorts of features and playing around with it. And at PlanGrid, actually, before we were even acquired, we did yearly hack weeks, and we got to sort of move off of our regularly scheduled program and work on whatever we felt like, just had fun.

And one of the things that recently came out for Spinnaker was this idea of Canary deployments. So, I experimented with it, never got it off the ground, fast forward one year later to the next hack week, Kayenta, the Spinnaker Canary component, was a lot more built out and user friendly. So, I managed to get it off the ground then.

We sort of have a monolith API at PlanGrid. And so, a huge problem we had was, we would do weekly releases, and after the release, stuff would break and people would get paged and we’d have to roll back, so.

Ashley: [Laughter] Vicious cycle.

Al-Hayderi: Common story I’m sure everyone’s familiar with.

Ashley: Mm-hmm.

Al-Hayderi: So, I really wanted to find a way to solve that, and Canary just seemed like a great way to do that. So, after the hack week, I sort of kept working on it full time, and I started doing, you know, presentations to the rest of the company. Because Canary is sort of this new idea that a lot of people aren’t familiar with. I started, you know, being active in the Spinnaker Slack channels, and it actually turns out that an ex-PlanGrid employee moved to Netflix to be a lead on the Spinnaker open source software from there.

Ashley: Ah, interesting.

Al-Hayderi: Yeah.

Ashley: Small world.

Al-Hayderi: Yeah. [Laughter] And he noticed I was talking a lot about it and he brought up the point that they’re running the Spinnaker Summit in November and he asked if I wanted to talk. And I’ve actually never given a talk at a conference.

Ashley: Oh, excellent.

Al-Hayderi: Yeah, and I do a lot of talks internally. So, this is something I was super excited about and, yeah, I jumped on it.

Ashley: Great thing to do for your career, too, and it sounds like those internal talks have set you up well to do this.

Al-Hayderi: Yeah.

Ashley: So, how did you go about deciding, is this gonna be kind of an overview of how to do it? Is it going to be, “Here’s the metrics to watch” and the value that comes out of it is a combination of that? Tell us a little bit about what part of this you’re—kinda what take you’re taking in your talk.

Al-Hayderi: Yeah. So, public speaking was honestly never my strong suit when I was younger. And some great advice I got was that, when you’re giving a speech to a bunch of people, you wanna tell it like a story.

Ashley: Mm-hmm.

Al-Hayderi: So, since then, I’ve really focused on storytelling for even these tech talks I give internally.

So, the story I’m trying to tell here is how we leveraged Istio to finally solve some of the deep technical problems of running Canaries.

Ashley: Mm-hmm.

Al-Hayderi: And once we got it out there, sort of the growing pains and, you know, first obstacles we hit and problems we’ve got. I’m hoping people come out of this with a better idea of how to actually get this running in your infrastructure and some common pitfalls to avoid.

Ashley: Mm-hmm. Great. Well, tell us a little bit of that story. Kinda get us started down that path.

Al-Hayderi: Totally. So, the first thing I did was, I went in and learned exactly what Canaries are. And the basic point of it is that you want to expose a small amount of traffic to something new. You want to analyze the deltas in some sort of metrics between the new and the old. And then you want to make a decision if this new is ready for the majority of traffic.

Ashley: Mm-hmm.

Al-Hayderi: So, there’s a lot that goes into that. Anyone that’s worked with Spinnaker on production scale knows that managing all these pipeline definitions is very cumbersome, and trying to, if you end up changing something, especially changing production pipelines, it can be really dangerous.

So, a lot of the things we learned about our, you know, how we tested this, even just testing the Canary process in our sort of R&D Dev staging environments and the problem there, even though we got confident with how the pipelines worked, it was very difficult to get a signal on how to actually tune your Canaries. And when I say tune, I mean, when you’re comparing metrics from the new and the old, how do you know if it’s bad or how do you know if it’s a good decision to move forward? What are the thresholds, what metrics are you looking at? A lot of our iterations were focused on that.

Ashley: Interesting. Now, I’m curious if—are you approaching this problem of using Canary to understand the service mesh or the services itself, or are you trying to also understand kinda Istio and it acting as a sidecar proxy and how that’s performing coordinating traffic load balancing across that? Is it one or the other or both, or what part of the problem were you looking to solve?

Al-Hayderi: Right. So, actually, we were kind of blocked on the fact that we couldn’t route traffic, you know, with small granularity coming in from the Internet. And we actually only use Istio for its API gateway feature at the moment.

Ashley: Oh, okay, okay.

Al-Hayderi: Yeah, there are some blockers right now, at least in our version of Istio that we’re running, around connecting over SSL to Redis. So, that’s blocked having sidecars on application pods. So, when we do canaries, we’re just canarying end user or ingress traffic into our monolith.

Ashley: Okay.

Al-Hayderi: Yeah.

Ashley: Interesting. So, tell us the story. How did you, you went about implementing it—what did you learn from it?

Al-Hayderi: So, I mean, first off, we had our sort of snowflake-y pipeline that would deploy our own little service and change traffic and compare metrics. So, I guess the first part about it was how do we route traffic, and what we actually do is, in a single pipeline, we’ll deploy containers to a new server group, and that’s using the V1 provider of Kubernetes.

Ashley: Mm-hmm.

Al-Hayderi: And then we actually used the V2 provider of Kubernetes to apply a virtual service manifest. Virtual services are just a kind of way of routing ingress traffic based on some set of rules. So, if the host header matches this and the path has this Regex, route it to this service.

What you can also do is, when you match a rule based on some ____ path prefix is route to several services based on weights. So, we first experimented with how are we gonna get a small amount of traffic, yet a meaningful enough amount of traffic to get a good enough sample size of data.

So, a lot of our, at the start, what we were doing was just sort of like load testing and seeing, “Okay, is 5 percent of the traffic to the Canary deployment enough? How about 10?” et cetera, et cetera. We also looked at some common best practices around Canaries and how they recommend running baseline deployments. So, what we would actually do is deploy two new deployments. So, one with the new code on a small sized server and one with the old code on a brand new, deployed small server. That way, you sort of take away a lot of variables from the experiment, like long running processes, et cetera.
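The weighted routing and baseline setup described above can be sketched as a short Python script that builds an Istio VirtualService manifest. This is only an illustration, not Autodesk's actual configuration: the host `api.example.com`, the gateway name, the three service names, the `/api` prefix, and the 90/5/5 split are all assumptions chosen for the example.

```python
import json


def build_virtual_service(canary_weight: int, baseline_weight: int) -> dict:
    """Build a VirtualService that splits matching traffic three ways by weight.

    All names below (hosts, gateway, services) are hypothetical placeholders.
    """
    production_weight = 100 - canary_weight - baseline_weight
    if canary_weight < 0 or baseline_weight < 0 or production_weight < 0:
        raise ValueError("weights must be non-negative and sum to at most 100")

    return {
        "apiVersion": "networking.istio.io/v1alpha3",
        "kind": "VirtualService",
        "metadata": {"name": "api-canary-routing"},
        "spec": {
            # Requests are selected by host header first...
            "hosts": ["api.example.com"],
            "gateways": ["api-gateway"],
            "http": [
                {
                    # ...then by a path rule (a prefix here; a regex also works).
                    "match": [{"uri": {"prefix": "/api"}}],
                    # Matching traffic is divided by weight; weights add up to 100.
                    "route": [
                        {"destination": {"host": "api-production"},
                         "weight": production_weight},
                        # Old code on a freshly deployed small server group.
                        {"destination": {"host": "api-baseline"},
                         "weight": baseline_weight},
                        # New code on an identically sized server group.
                        {"destination": {"host": "api-canary"},
                         "weight": canary_weight},
                    ],
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_virtual_service(canary_weight=5, baseline_weight=5), indent=2))
```

Giving the baseline and the canary the same weight keeps the two samples comparable, which is the point of deploying a fresh baseline rather than comparing against the long-running production deployment.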

So, we got there, got traffic routing correctly. Now, the next part we had to do was metrics, right? How do we analyze the delta between the two deployments? At this point, Kayenta only supported Datadog, at least out of the metrics providers that we used.

Ashley: Okay.

Al-Hayderi: So, what we really wanted to do, you know, going back to the whole point of getting Canaries out was stop deployments that are just plain broken, right? So, kind of getting, like, a cheap, easy way to get smoke tests. And so, we were just looking for server errors.

So, as we were trying that, the problem is, is that the server error metric that Istio gives you will only send metrics if a server error happens. So, between two deployments, if you only get two server errors on one, that will trip the Canary as a failure. It doesn’t give you the option to aggregate metrics, so we couldn’t do things like check error rates. So, that just really never worked.

And actually, in my talk, I go into sort of the—this was kind of a big, big problem the first time we actually released this into production is that Canaries would just fail a lot for no reason, right? And Canaries running for an hour long, you know, our release engineer is sitting there waiting for an hour and then has to trigger it again just because, you know, one server error happened on the Canary deployment.
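To see why a metric that is only emitted when an error happens is hard to judge, consider the simplified sketch below. It does not reproduce how Kayenta or Istio work internally; the function names, the request counts, and the 1 percent threshold are assumptions made up for illustration. It contrasts a comparison of raw error counts with a comparison of error rates, the kind of aggregation the team could not do at the time.

```python
def fails_on_raw_count(baseline_errors: int, canary_errors: int) -> bool:
    """Naive check: fail whenever the canary saw more errors than the baseline."""
    return canary_errors > baseline_errors


def error_rate(errors: int, requests: int) -> float:
    """Fraction of requests that ended in a server error."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return errors / requests


def fails_on_error_rate(
    baseline_errors: int,
    baseline_requests: int,
    canary_errors: int,
    canary_requests: int,
    max_increase: float = 0.01,  # hypothetical tolerance: one percentage point
) -> bool:
    """Fail only if the canary error rate exceeds the baseline rate by more than max_increase."""
    delta = error_rate(canary_errors, canary_requests) - error_rate(
        baseline_errors, baseline_requests
    )
    return delta > max_increase


if __name__ == "__main__":
    # Two stray server errors on the canary out of 10,000 requests each.
    print(fails_on_raw_count(baseline_errors=0, canary_errors=2))
    print(fails_on_error_rate(0, 10_000, 2, 10_000))
```

With these made-up numbers the raw-count check fails the canary, while the rate-based check lets it pass because two errors in 10,000 requests is far below the tolerance.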

Ashley: Well, it sounds like maybe one of the lessons you learned or concept you came up with is, don’t cast a big net. Start out with sort of higher, larger grain, most fundamental issues, kind of a ____ kinda concept. Start there, get those—get that quality fixed, improved, so you’ve got those things working better, probably tighten the net a little bit more and filter out some more specific things that are causing issues. Does that sound like the approach that you learned how to take?

Al-Hayderi: That’s exactly it, and actually, what I would recommend is, the first time you roll out Canaries, don’t have it actually fail or roll back the deployment if the Canary was unsuccessful.

Ashley: Mm-hmm, okay.

Al-Hayderi: What we actually ended up doing was actually running this for a month and just gathering data, seeing which metrics worked, which metrics didn’t, and matching up Canary reports with what we actually saw as problems in production. That way, that gave us more information on which metrics to use. We ended up going with average latency and p95 latency, and once that was out, we started actually aborting and rolling back deployments and then we started to get a lot of great value out of this tool.
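The approach described here, observing first and enforcing later on average and p95 latency, can be sketched roughly as follows. This is a deliberately simple ratio comparison and not Kayenta's statistical judge; the 1.2 ratio threshold, the `enforce` flag, and the action names are assumptions for illustration only.

```python
import statistics


def p95(samples):
    """95th percentile of the samples (needs at least two data points)."""
    return statistics.quantiles(samples, n=100, method="inclusive")[94]


def judge_canary(baseline_ms, canary_ms, max_ratio=1.2):
    """Compare average and p95 latency; pass if neither grew beyond max_ratio."""
    ratios = {
        "avg": statistics.fmean(canary_ms) / statistics.fmean(baseline_ms),
        "p95": p95(canary_ms) / p95(baseline_ms),
    }
    passed = all(ratio <= max_ratio for ratio in ratios.values())
    return passed, ratios


def next_action(passed: bool, enforce: bool) -> str:
    """In report-only mode a failed canary is recorded but never blocks the release."""
    if passed:
        return "promote"
    return "rollback" if enforce else "promote-and-record"


if __name__ == "__main__":
    baseline = [100.0] * 20               # made-up latencies in milliseconds
    canary = [100.0] * 18 + [500.0] * 2   # two slow requests on the canary
    passed, ratios = judge_canary(baseline, canary)
    print(passed, ratios)
    print(next_action(passed, enforce=False))  # the month of data gathering
    print(next_action(passed, enforce=True))   # once the metrics are trusted
```

Keeping the enforcement decision separate from the judgment makes it possible to run the same analysis for a while, compare its verdicts with real production incidents, and only then let it abort deployments.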

Ashley: Mm-hmm. Yeah, that’s actually very similar to kind of a quality process, right? You don’t solve all problems at once. You start, you prioritize, you filter, you work done—you know, improve it one step at a time, rather than just throwing a canary out there and seeing what happens, which sounds like you’ll get results, but you don’t know what or why, right?

Al-Hayderi: Yep, exactly.

Ashley: You’ve gotta dig into it and be a little more thoughtful about it, it sounds like.

Al-Hayderi: Yep.

Ashley: Good. Anything else? Any kinda other big things you’re expecting to talk about as part of this?

Al-Hayderi: So, another big thing we learned was, even after this, when we had, you know we were confident that our Canaries were failing when they should and passing when they should. The problem we got was, when a Canary failed, people didn’t really know what to do.

Ashley: Mm-hmm.

Al-Hayderi: Because, you know, when a deployment fails, we have a lot of data and we have a lot of observability into that. The Canary is getting a lot less traffic, and it’s hard to know exactly what failed just by looking at the Spinnaker UI Canary report.

Ashley: Mm-hmm.

Al-Hayderi: So, we then, you know, sort of invested more into our—we used New Relic for our APM and we piped our Canary into that so that we could actually see transaction data and the delta between it and the baseline metrics to see, you know, which end point, for instance, was misbehaving or was some new database core utilizing this—that helped us a lot, too.

Ashley: Mm-hmm. Interesting. Well, it sounds like you’re using—think about this in a systemic way, not just Canaries and Spinnaker, but also you mentioned Datadog, New Relic, a number of tools that you’re using as part of your environment, which all have to come together to figure it out, right? It’s not just one thing.

Al-Hayderi: Yep.

Ashley: Well, good. I wish you the best in your talk. The folks that would go to this talk, are they gonna be, tend to be software developers, engineers, architects, operations focused on the DevOps team? Who tends to be more interested in this than others?

Al-Hayderi: I would say more the DevOps operational people. People who work with the Spinnaker pipeline definitions at their company would get the most value out of this. But, you know, back when I was just a developer, I heard a talk about Canaries, and that sort of triggered me to bring this in. So, I mean, anyone who’s interested in resiliency and, you know, release confidence would get value out of this talk.

Ashley: Okay, excellent. Very good. Well, thank you for being on the podcast, Omar.

Al-Hayderi: Thank you very much, Mitch.

Ashley: It’s been great to have you, Omar Al-Hayderi, who is engineering manager, formerly a principal engineer, at Autodesk. He’s gonna be speaking at the Spinnaker Summit 2019 in San Diego. The dates for that is November 17th through the 19th, and Omar’s talk is on Sunday, the 17th, at 1:30 p.m.

So, thank you, all of you, for joining us and listening to this episode of DevOps Chat podcast. This is Mitch Ashley with DevOps.com. You’ve listened to another DevOps Chat. Be careful out there.

— Mitchell Ashley