DevOps Chats: Debugging Spinnaker Apps, With Salesforce

Spinnaker Summit 2019 Preview: Debugging production issues in any environment can be challenging, and Spinnaker has its production learning curve. Problems aren’t always replicable in a smaller environment, and debug messages can be verbose and confusing to triage what’s happening.

Our DevOps Chat guest Chuck Lane, Salesforce lead software engineer, is giving a talk on “Debugging and Profiling Spinnaker Applications Live” at the Spinnaker Summit 2019. In Chuck’s talk, you’ll learn skills like remote JVM debugging, custom profiling builds and the magic of figuring out what’s going on with a multithreaded microservice using htop.

Chuck’s talk is on Saturday, Nov. 16, at 3:40 PM PT. Spinnaker Summit 2019 is Nov. 15-19 in San Diego.

As usual, the streaming audio is immediately below, followed by the transcript of our conversation.

Transcript

Mitch Ashley: Hi, everyone. This is Mitch Ashley with DevOps.com and you’re listening to another DevOps Chat podcast. Today, I’m joined by Chuck Lane. He’s a lead software engineer with Salesforce.com. He works on Sales Cloud, which is the Salesforce you and I use, know and love, or our CRM capabilities. Chuck is talking at the Spinnaker Summit 2019 in San Diego, and his talk is about debugging and profiling Spinnaker applications live. And I think live means, like, live while they’re running, in production, and see what’s going on. We’ll explore what that is. This talk is on Saturday, November 16th, at 3:45 p.m. in San Diego. Chuck, welcome to DevOps Chat.

Chuck Lane: Thank you. It’s nice to be here.

Ashley: Excellent. Awesome to have you here. Would you start by just introducing yourself? Tell us a little about you and what you do at Salesforce.

Lane: Yeah. So, my name is Chuck Lane. I am, as you mentioned, a lead software engineer. And basically, my chief responsibility at Salesforce is to help bring Salesforce into using Spinnaker for our public cloud based deployments. So, as we make a transition to try to move away from first party architecture and move over into the public cloud, one of the technologies that we’re using to do that is Spinnaker and I’ve established myself as a subject matter expert in Spinnaker. And so, I kinda help to bridge the gap between the traditional development and deployment cycle and what that looks like using Spinnaker in a containerized world.

Ashley: Okay, great. You used a term there I don’t know if I’m familiar with. You mentioned something about moving from first party architecture to the public cloud. What does that mean about the environment you’ve been in and what you’re moving to?

Lane: Yeah. So, basically, what I mean there is that, historically, Salesforce has owned their own data centers, and we’ve deployed to those data centers. We are embracing public cloud architecture in a big way, whether that’s through AWS, whether that’s through GCP, or any number of other cloud offerings.

And so, as opposed to doing what some people would call a lift and shift, where you just take software that’s meant to be run on hardware that you own and move it into the cloud, we’re doing a fundamental re-architecture of the software to make use of all of the great things that cloud architecture allows us to take advantage of.

Ashley: Mm-hmm. Fantastic. That’s interesting to hear a little bit about your own evolution at Salesforce. So, your talk is about using Spinnaker, you know, really about debugging and finding out problems that are happening in production, but sometimes you might struggle with trying to replicate it into a smaller environment to get the same problems working.

Can you tell us a little bit about what sort of led you down this path to figure out how to do this? Was it a big problem that was happening or something you saw repeated over and over that there wasn’t a good solution to and you figured out how to do that? How did this come about?

Lane: Yes. So, basically, as you’re taking workloads and migrating them over, I mean, when—you know, when you just start with a small subset of workloads, you know, the path is relatively straightforward. And, you know—hopefully, anyways—and you don’t run into too many issues that can’t kind of easily be solved.

But the reality is, as you transfer more and more workloads over to the public cloud, that’s—you know, the devil’s in the details, right? So, that’s really where things can pop up where you’re hitting various limits or software isn’t performing in the way that you would expect it to perform. And these are the situations where we—where it can be very beneficial to jump into production software and really, you know, sometimes slap a debugger on it and see exactly what it is that it’s doing that differs from kinda what you expect. You know, and that can help you kinda tailor your workloads in such a way that you can make things run more smoothly.

Ashley: Mm-hmm. I know sometimes using debuggers, enabling them—that can be helpful, but it also can be too much information, trying to sort through what all is happening, trying to find what should you be looking at. What are some of the challenges you’ve found by turning on debugging or using debuggers?

Lane: Yeah. Well, so, I mean, there definitely is that problem that you’re describing, just as far as too much information coming out. One of the nice things about Spinnaker and the way the services are written is, you have very fine-grained control over which libraries inside of Spinnaker you tell to print out debug information.

Ashley: Mm-hmm.

Lane: So, if we know that there is a hiccup with one of the data binding layers or a hiccup with one of the authorization layers, then we can, through the config files, we can—and Java settings—we can really target that explicit directory, or I’m sorry, that explicit library and say, “Give me all the debug information that you have.”
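[Editor’s sketch: The idea of turning on debug output for one library while the rest stays quiet can be illustrated with Python’s standard `logging` module. This is only an analogy, not Spinnaker configuration; Spinnaker services run on the JVM and are configured through their own config files and Java settings, as Lane says. The logger names `myservice`, `myservice.authz` and `myservice.binding` are hypothetical.]

```python
import logging

# Attach a default console handler so emitted records are visible.
logging.basicConfig()

# Hypothetical logger hierarchy standing in for the libraries inside one service.
logging.getLogger("myservice").setLevel(logging.WARNING)      # quiet default for the whole service
logging.getLogger("myservice.authz").setLevel(logging.DEBUG)  # verbose for one library only

authz_log = logging.getLogger("myservice.authz")
binding_log = logging.getLogger("myservice.binding")

authz_log.debug("checking permissions for user=%s", "alice")  # emitted
binding_log.debug("binding request payload")                  # suppressed: inherits WARNING
```

[Only the targeted logger produces debug lines, which keeps the output small enough to triage.]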

Ashley: Mm-hmm.

Lane: But, you know, failing that, I mean, there are definitely times when it’s been advantageous and the best course of action is just to go down and actually look at the code and see what it’s doing. And, again there, you usually can’t do that in production, but what we can do is simulate a load that’s similar to what we would see in production and really, really take a look at what’s going on to help us identify those algorithms that might be O(n squared) instead of O(log n) or something like that.
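[Editor’s sketch: A toy illustration of why production-like load matters when hunting for quadratic algorithms. It counts steps instead of measuring time so the result is repeatable. The two functions and the input sizes are illustrative assumptions, not code from Spinnaker.]

```python
def quadratic_steps(items):
    """Compare every item with every later item: n*(n-1)/2 comparisons."""
    steps = 0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            steps += 1
    return steps


def binary_search_steps(sorted_items, target):
    """Count the loop iterations of a standard binary search, which grows like log n."""
    lo, hi, steps = 0, len(sorted_items), 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return steps


# Small input versus a larger, more production-like input (sizes are arbitrary).
for n in (15, 500):
    data = list(range(n))
    print(n, quadratic_steps(data), binary_search_steps(data, n - 1))
```

[At the small size both step counts are modest. At the larger size the pairwise count is over a thousand times bigger while the binary search count barely moves, which is the kind of gap that only shows up under realistic load.]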

Ashley: Mm-hmm. Now, are there some specific techniques? I know in the description of your talk, you mention using remote JVM debugging, custom profiling builds, even using htop, you know, a UNIX command or a Linux command, to help you figure out what’s happening with multi-threaded microservices. It sounds like a variety of different approaches that you’ve kind of uncovered and learned that you can use to figure out what’s happening.

Lane: Yeah. And so, I mean, in general, we like to run as closely to the open source build as we can, the ones that are provided by Spinnaker, which is ultimately provided by Google. But there definitely are circumstances where we need an extra tool, be it htop, be it something like Glowroot, which is a Java application profiler where what we’ll do, then, is we’ll go in and we’ll build a custom image using the Spinnaker images as a base and then tack on those additional libraries and put them into the Spinnaker ecosystem and then launch them up just to see what additional information we can get out of there.

And so, you know, one scenario where that was really useful to us involved Clouddriver, which is the main tool that talks back and forth to all of our different cloud infrastructures.

Ashley: Mm-hmm, mm-hmm.

Lane: Sometimes calls to Clouddriver were taking upwards of, well, two to three minutes to respond. Now, you know, timeouts are set in such a way that if it doesn’t hear anything back in about 30 or 60 seconds, then it just disregards that load and, you know, so that created quite a few problems for us.

By building a custom Clouddriver image that had htop in it, we were able to see that the majority of the processes that were running were actually running basically commands to reach out to the Kubernetes clusters and get a list of all of the namespaces that are in the Kubernetes cluster. And, talking it over with some of the Spinnaker developers, what we found is that if you have 15 or so clusters that you’re connecting to Spinnaker, then that’s not so much of a problem to go and query each of those to get the list of namespaces. But as you scale up, on the order of 400 or 500 different clusters the way that we do, then a lot of the delays and a lot of your time can be just essentially kubectl calls waiting to return to you the lists of namespaces that you need to go and scan.

So, we implemented a workflow based solution for that, basically, where we let our teams know that when they’re creating their clusters, they should use Terraform to go ahead and create the namespaces that they’ll be using. And then, if they need to use a namespace after the fact, then we provide a pipeline that they can use that will dynamically add in an additional namespace that their cluster will start to scan. So, that saves us from scanning all the clusters all the time, which was causing quite a few performance bottlenecks.
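[Editor’s sketch: A toy model of the trade-off Lane describes, discovering namespaces by querying every cluster versus working from a list declared up front. It is not Spinnaker code; `CountingLister`, `scan_all`, `scan_declared` and the cluster names are hypothetical, and the slow remote call is replaced by a stub that only counts invocations.]

```python
from typing import Callable, Dict, List


class CountingLister:
    """Stand-in for a slow remote 'list namespaces' call; it only counts invocations."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, cluster: str) -> List[str]:
        self.calls += 1
        return [f"{cluster}-default"]


def scan_all(clusters: List[str],
             list_namespaces: Callable[[str], List[str]]) -> Dict[str, List[str]]:
    """Discover namespaces by querying every cluster: one remote call per cluster per scan."""
    return {cluster: list_namespaces(cluster) for cluster in clusters}


def scan_declared(declared: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Use namespaces that teams declared up front: no discovery calls at all."""
    return {cluster: list(names) for cluster, names in declared.items()}


clusters = [f"cluster-{i}" for i in range(500)]
lister = CountingLister()

discovered = scan_all(clusters, lister)
print("remote calls with discovery:", lister.calls)

declared = {c: [f"{c}-default"] for c in clusters}
known = scan_declared(declared)
print("remote calls with declared namespaces:", lister.calls - len(clusters))
```

[With discovery, every scan costs one remote call per cluster, so the cost grows with the number of clusters. With declared namespaces, the scan itself makes no discovery calls, and the pipeline Lane mentions would be the place where the declared list gets updated.]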

Ashley: Yeah, I would imagine that would have some overhead, maybe a lot of overhead if you’re doing that frequently. So, it sounds like a way to both reduce overhead, but also to get the information faster.

Lane: Exactly, exactly—yep, yep.

Ashley: Mm-hmm. Very cool. So, in your talk, are you going to be doing any demos? I know there’s always the demo Gremlin, or are you gonna be just showing, you know, talking about what some of these techniques are?

Lane: So, I plan on doing some demos. As far as—

Ashley: That’s really cool.

Lane: Yeah. You know, I may do the, what’s the—the Easy Bake Oven a little bit as far as, “Here’s the behavior and here’s what we’ve found in code.” But from what I understand, a lot of that stuff can kinda be dependent on Internet connectivity at the actual site. So, if it’s something where we have good Internet connectivity, then by all means, I plan on walking through a couple of debugging scenarios, as close to what we would do in real time as possible.

Ashley: Yep. Well, you know, someone somewhere—it’s been a while ago—gave me the great advice of, you know, “Have your live demo and have your disconnected demo ready in the background.” [Laughter] So, you can always at least show something locally if you can’t get on the net. So, if you’re depending upon the network—I’m not sure if you are for your demo, but always a good lesson, right?

Lane: Sure, exactly.

Ashley: Great. Are there any other kind of lessons learned, common mistakes, or mistakes that you or others might have made along the way, kinda so there’s hard things you learned by trial and error that you plan on sharing?

Lane: I mean, well, there are—whew. You know, I’ve been working on Spinnaker for a couple of years. So, there’s definitely been a lot of hard lessons learned, here. But honestly, what I would do is, I would encourage anybody who wants to get into Spinnaker to not just hang around the Slack channel, because the Slack channel does have a tendency to get overrun with, you know, just kinda people posting their stack traces and just saying, “Hey, has anybody ever seen this before?”

And, you know, I’ve gotten a lot more success by going through the commits, looking at the people who actually authored the code, and then reaching out to them directly with more than, “Hey, can you explain this to me?” but rather, you know, “Hey, I see what you did here and I see what you did here. I’m running into problems with these lines. Do you have a different approach, or is there something that I can be doing differently?”

And the other thing that I just can’t overstate is the value of being a member of one of the special interest groups.

Ashley: Hmm, interesting.

Lane: So, yeah. So, I’m a member of the Kubernetes V2 special interest group that’s led by Eric Zimanyi and Ethan Rogers. Ethan’s from Armory, Eric’s from Google. And it has, you know, it has been just an absolute wealth of information and, you know, honestly, I don’t know if we ever would’ve gotten nearly as far as we have without those two people.

So, you know, yeah, I would just say that the community is really friendly and we always welcome new members who are ready to learn.

Ashley: You know, I think both of those are great suggestions, and I really appreciate that you made those. Because, one, things like the Slack channels on projects, open source, those can be a bit intimidating. Sometimes they’re not approachable, because there’s just so much noise, so much happening on it and people reaching out for help like you’ve talked about, “Here’s a stack trace I’m trying to figure out.”

But also, your recommendation of reaching out to the code authors, you know what, it’s—people like to help each other. And, you know, it might seem like, “Hey, the people who wrote this aren’t gonna have time to bother with me”—they love to hear from people that are using their stuff to talk about it and help them out, but also hear about how they’re using it, and of course, they’re always looking for ideas and feedback and stuff like that. It sounds like you’ve had that kind of an experience.

Lane: Yeah, absolutely. I mean, and you know, the big thing is just, you know, as a coder, it’s easy to tell the people who are coming to you and just kinda wanting you to fix it for them and the people that are coming to you that have really tried to tackle it themselves. And, you know, I can’t speak highly enough about the latter rather than the former. I mean, you know, just give it your best shot and when you get stuck, reach out to somebody and it’s, you know, it can be immensely valuable.

Ashley: It’s kinda like going to a foreign country. Wherever you are, if you’re American going somewhere else or vice versa, everybody appreciates you trying to speak the native language, and at some point, when they see you struggle enough and how far you can go, they’re glad to help you and, you know, speak in your language.

So, same kind of thing with helping people with code. If you’re gonna just fob your problem off onto the developer of it—not appreciated so much. But they appreciate that you tried to take it as far as you could. As you mentioned, go look at the code and figure out what’s going on.

Lane: Yeah.

Ashley: You’ll get immense respect, you know, even if you aren’t a coder at that level, at that level of software developer, it’ll mean a lot to the developers.

Lane: Absolutely, absolutely.

Ashley: Well, hey, I think you’re gonna have a fascinating talk, and I love that, you know, this idea of trying to figure out what’s happening in production and some of the techniques that you’ve come up with and developed and have experienced and the fact that you’re sharing those with others. I’m curious, do you contribute any code to the Spinnaker project in any areas? Are you primarily a practitioner user of it?

Lane: So, I have. I’ve got a few small PRs that have been pushed through, but really, the bulk of my commits right now have actually been to the Spinnaker website, the documentation. So, you know, I don’t know Java or Groovy or Kotlin, maybe, as well. Some of the patterns they use are a little bit foreign to me, since I come from a .NET world.

But yeah, I’ve definitely written a number of different documentation pages and I found that that’s a great way to kinda get in, and I’ve even got some PRs that are coming up soon that aren’t doc related. So, yeah, hopefully, you’ll see my name more.

Ashley: Excellent. Well, you know what, documentation is important, too. There’s some folks doing talks at Spinnaker Summit that, you know, kinda ran into that point where, “Hey, there’s documentation for doing it one way, but not under this set of configurations or software.” So, that’s a contribution, too, so congratulations for being a part of the community and for sharing, also, your experience at the Summit.

So, Chuck, appreciate you being on the podcast today.

Lane: Oh, thank you. It’s been an absolute privilege. Thank you so much.

Ashley: Absolutely. Fantastic. I wish you all the best with your talk, I’m sure it’ll be great, and hopefully folks listening to this podcast will draw some more interest and bring folks to listen to you.

So, I’d like to thank our guest today, Chuck Lane. He’s lead software engineer at Salesforce.com, so you can imagine the environment he’s working in. There’s some super good lessons that Chuck’s bringing to the table. He’s gonna be talking at Spinnaker Summit 2019, which is in San Diego, November 15th through the 19th. His talk is debugging and profiling Spinnaker applications live, and his talk is on Saturday, November 16th at 3:45 p.m.

I’d also like to thank you, our listeners, for joining us today. This is Mitch Ashley with DevOps.com. Have a great day and be careful out there.

— Mitchell Ashley