DevOps Chat: Service Mesh Tracing, from Envoy, Omnition to Splunk

As application functions get smaller, become containerized microservices and combine into service meshes, a new set of challenges crops up. What functions does each service perform? What state indicates that a service is in trouble? What are the dependencies between services across a complex service mesh? How can we instrument observability and tracing across service interactions?

Constance Caramanolis, software engineer at Omnition, joined us on DevOps Chat, recorded just before Splunk’s acquisition announcement. After working at Microsoft, Constance joined Lyft, in part to work with Envoy–an open source project created by Lyft that brings upstream and downstream tracing across a service mesh. Constance recently joined Omnition, while it was still in stealth, to help “flip tracing on its head.”

It’s a genuinely fascinating conversation that gives us a window into the challenges of, and solutions to, managing a service mesh at scale. Listeners should also check out our DevOps Chat episode about Splunk’s acquisition of Omnition with Rick Fitz, SVP and GM of Splunk’s IT Markets Group.

Transcript

Mitch Ashley: Hi, everybody. This is Mitch Ashley with DevOps.com and you’re listening to another DevOps chat podcast. Today I’m joined by Constance Caramanolis, software engineer with Omnition. Constance, welcome to DevOps Chat. Great to have you on the podcast.

Constance Caramanolis: Thank you so much for having me. I’m very excited.

Ashley: Oh, I’m excited too. I love having, pardon the term, but rock stars like you, developers that are doing some cool work on our podcast. The topic is service mesh at scale, but before we get into that, would you first just introduce yourself, maybe tell us a little bit of your background as a software engineer, what you do currently at Omnition, and if you could tell us a little bit about Omnition?

Caramanolis: Yeah. So I’ve been in the industry for several years. I first started off at Microsoft, so I got a lot of good experience in terms of building large components that need to run reliably. I worked on Windows and Windows Film there, and then I moved to Lyft three and a half years ago, where I purposely joined to work on Envoy with Matt Klein and Jose Nino and others within Lyft. At Lyft I worked on all aspects of Envoy, from configuration management to adding integral features for the open source community to just rolling it out within Lyft.

The last few months I was at Lyft I worked within our data platforms team, just building a variation of a workflow tool. So now I actually just joined Omnition not too long ago, and I’m gonna be focusing a lot on the OpenTelemetry component. We’re using OpenTelemetry because tracing provides so much valuable data that, instead of just looking at individual traces, you’re able to get a whole understanding of what an application does and what impacts _____ have and how it propagates using that tracing data.

Ashley: Easy to see why you made the transition from your work on Envoy at Lyft and now at Omnition. Well, let’s start out with talking about service mesh, microservices, of course containers, et cetera, et cetera. That’s certainly how we’re developing applications these days. Service mesh brings some other things to it, I think some more complexity, because you’re talking about a configurable, low-latency infrastructure layer that’s really designed to handle network-based interprocess communications at high volumes. So that right there tells you that there’s complexity involved, and any time you’re building software, creating applications, how you run that in a service mesh architecture has got to bring with it some challenges. In your experience, what are some of those challenges that you’ve seen?

Caramanolis: I think the big one, especially in how it ties back to Envoy’s main goal to provide observability, is that as you break components apart and put them in different parts of the internet, knowing what is still working versus what isn’t, and having a consistent definition of that, is very challenging. A good example that we love to use when we’re talking about Envoy, especially before people have adopted Envoy, is that there is no standard definition of what a failure is. Some people may say that a 503 is not an error. Please don’t do that.

Ashley: I’ll avoid that on this podcast.

Caramanolis: Thank you. But not having a standard definition of error across different languages and tools actually makes it hard for people from different teams, or just, say, higher-up VPs or directors, to know if things are working or not. Observability across any service mesh is pretty complicated. Another thing is definitely coordination in terms of topology, especially if we’re talking about coming from monoliths. One of the benefits of a monolith is that you know where all the code is; you can figure out where the ____ ’cause it’s usually within one repo and you can just read it.

But then the other–that’s definitely great ’cause you need to see where everything is, but when it gets really large it can take a long time to build and deploy. You have some pieces of code that run maybe once every ten minutes, so why does that need to be running alongside something that’s high priority and needs to be really efficient? When you’re breaking things apart it’s gonna be harder to actually correlate how things interact. As much as everyone tries to keep documentation up-to-date, sometimes you’ll forget, like, “Oh, I’m actually calling an API that just moved to service B and ___ service A.” So keeping that mental model of the service topology is very challenging, and it is so dynamic.
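Editor’s sketch: the point above about needing one shared definition of failure can be illustrated in a few lines of Python. The rule used here (every 5xx response counts as a failure, so a 503 is an error) and the names `is_failure` and `failure_rate` are assumptions for illustration only, not something described in the conversation or taken from Envoy.

```python
from collections import Counter

def is_failure(status: int) -> bool:
    """Assumed shared rule: any 5xx response counts as a failure."""
    return 500 <= status <= 599

def failure_rate(statuses) -> float:
    """Fraction of responses that count as failures under the shared rule."""
    counts = Counter(is_failure(s) for s in statuses)
    total = sum(counts.values())
    return counts[True] / total if total else 0.0

# Hypothetical response codes seen by two teams' services.
sample = [200, 200, 503, 404, 500]
print(failure_rate(sample))
```

If every team and dashboard imports the same rule, "is it working?" gets the same answer regardless of language or tool.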

Ashley: Especially with microservices architecture. We’re creating so many more of them too.

Caramanolis: Oh, my goodness. Yeah. Especially since it could be, say, one person’s start of day where it’s like, “Hey, just create a new service and deploy it and let it invoke a very low priority API call.” That’s–you immediately depend on how many people you hired, ____ example blowing up the number of services very quickly. I would say probably the third challenge of microservices is actually in operations: from an operations point of view, how do you remediate things when things go wrong? ’Cause unfortunately, things will go wrong, and usually the goal of any application is to stay up as long as possible and to serve customers, whatever your definition of customers is.

How do you build tools to handle things when things do go wrong, and how do you maybe standardize that, so that you don’t have to train a company’s engineers in ten different tools to, say, handle a certain error case or three different error cases, and standardize on that?

Ashley: Very interesting. It’s compelling to me why you might pursue a path of using a tool like Envoy. Tell us a little bit about how did you end up using Envoy.

Caramanolis: I joined Lyft to work on Envoy because I had actually met Matt almost a year before I ended up joining Lyft. When we were talking about it, it was just that there was a need for Lyft to scale, ’cause one of the big problems that was experienced is that we were not getting really good observability on our ELBs. So sometimes we could see things–for example, I was trying to ____ ride, but when we’d look at our metrics we wouldn’t be able to see things failing.

So that clearly led to an initiative: we saw that our observability along just our regular request path needed to be improved. So Matt and I–and that’s just one example. Matt definitely talks more about the motivation of why Lyft built Envoy, I think in one of his earlier talks. And also it was an area–I’ve always loved infrastructure and backend. And so being able to build something that the entire company is reliant on–Lyft not being able to scale is a roadblock to us being successful.

So Envoy allowed us to scale because we could add 10 or 100 services instantly and not have to worry about service discovery. So just being a part of something so critical and the opportunity to just learn so much and I’m gonna say I definitely–I was intimidated by the project. So I loved the idea of joining a project where I’m gonna get–mentally get my ass kicked in terms of I have no idea what this is. It’s swimming in the deep end with sharks and I can’t wait to see if I can learn how to swim and see how great–how this shapes me as a developer.

Ashley: You are a courageous person. Not everybody is willing to expose themselves that way to that much risk. So congratulations. That’s great.

Caramanolis: Thank you.

Ashley: Fantastic. More of us should do that. Well, you gave a talk just recently, I think it was December of ’18, at KubeCon, talking about reducing mean time to detection, specifically with Envoy for service mesh. Say a little bit about that.

Caramanolis: Yeah. So a lot of the talks–so I’m gonna say we, as a community. There’s a wonderful community of contributors and maintainers and also just people who write blogs around Envoy, and a large part of the community has talked about Envoy’s features and how it’s helped them from more of a technological point of view, how it’s helped them to scale and become reliable and just debug issues. One thing that we’ve definitely experienced internally is how do we translate that to the rest of the developers, and when I say the rest of the developers, I mean application developers. Not the DevOps or infrastructure engineers who are either enabling ongoing ____. How do the rest of the application developers use Envoy to their benefit day to day?

So I was trying to make a talk that would show how I use Envoy at Lyft to identify where issues are coming from and build that up to be more digestible to everyone else. I also tried to highlight that, with the amount of work I’ve done on Envoy, I know I have built my own blind spots about things that I’ve forgotten or that were critical. So I was using the example that I had a very loose definition of HTTP status codes before I joined Lyft. I’m sure there are so many application developers who don’t have a solid idea of what those mean.

So I would have to give presentations on what these meant or what these ____ meant and how they’re related to Envoy just so that way everyone would have the same set of tools going forward to better debug issues when they happen. So yeah. I was trying to make that digestible as a presentation.

Ashley: Interesting. Yeah. I heard the talk went over very well. I’m sure you had a lot of people come up to you after with some good questions or wanting more information. It’s a great way to share. When you and others contribute back to the community, not just code or other things, but also talks, people get to see real applications of Envoy and the technologies. Anything you’ve walked away with in terms of learning since giving that talk about either reducing mean time to detection or implementing Envoy for service meshes?

Caramanolis: I think some of the bigger learnings, or maybe the learnings that resonated more with me, is that these types of talks need to happen more, especially within Envoy, like giving, say, an intro debugging talk or even making it interactive. So say we’re able to set up a test environment where we had Envoys, like a mini microservice environment. Then you can do test requests and see how they fail. Giving people a hands-on, safe way to learn how Envoy works would be really great, because you can see everything on a slide, you can listen to it, it can resonate, but applying that day to day is a little hard without there being someone to bounce ideas off of.

Envoy as a project itself has been really, really successful and there’s been amazing contributions and hearing ____. I was very lucky to work with a lot of really smart people from Google and all these different companies. I’m just mentioning Google ’cause those are the people I worked with most closely. All these really, really smart people, I got to learn from them indirectly or directly, I should say. But also at conferences people would come up to us and say, “Oh, I was trying to do this with Envoy,” and you’re like, “Oh, I never thought of that.”

And at EnvoyCon, there were just like 30-minute talks where people would come up with, “Oh, we’re trying to do this one case here. So we built our own filter and we did that thing that way.” You’re like, “Woah, that is really cool.” Just seeing how other people think about a problem differently really could highlight either gaps in our own knowledge or like, “Oh, maybe we should try their approach.” So I think maybe that’s the most valuable thing about conferences. Maybe I could share with people, but hearing what other people have thought about this topic really teaches me a lot about different ways to solve a problem.

Ashley: It really is. A talk really is almost like initiating the two-way conversation that gets to happen. It kind of seeds it with lots of good information for people to come up and talk with you after. So that’s awesome.

Caramanolis: Oh, yeah. Actually that’s a really good way of saying it. Yeah.

Ashley: Well, what advice would you have for anyone who is maybe not experienced in using Envoy, if they’re just approaching it. How would you suggest that they get started and maybe are there some early lessons learned that you could share with them about how to more effectively use it?

Caramanolis: Usually when these types of questions come up I always ask people what problem are they trying to solve. At least one common question that was at KubeCon, and since this is the context of KubeCon, is “Should I migrate to Kubernetes or Envoy?” I always reply back with, “What is your most pressing problem? Are you having issues with network observability or standardization of error cases, or is it that your current deployment pipeline isn’t working as expected?”

So say they wanna focus more on the Envoy part; then it was like finding out more about their topology. So the Envoy configs. Anyone who has worked with Envoy and is listening to this will probably giggle at this. The Envoy configs are definitely overwhelming, especially for those who are uninitiated. And so start off really simple. I think Matt and myself, one of us had talked about how we had rolled out Envoy at Lyft. We started off with either doing it at edge, or doing it at ingress with no service mesh, or egress, and slowly built that up, building up one part of the interactions at a time instead of doing it all at once, because all at once there are so many components that can go wrong.

One misconfigured value, and it could be a technically valid value, but say you put the wrong port and you use it everywhere, then finding out where that wrong value is is very hard. So definitely my advice would be starting off small and setting a really clear scope, either ingress or egress out of one service or set of services, or just at your edge, and then building that trust within your developer community. Once you have that, definitely start educating the rest of the developers who are using it, saying, “This is what the errors look like and this is what that means,” and helping them see that value, ’cause once all the developers see the value it definitely, at least in my experience, simplifies their lives, ’cause they no longer have to spend time on, like, “I know this one request is failing, but which service is causing it, and is it a network issue or is it bad application code?”

At Lyft, Envoy was able to very much isolate it to “service B is having a bad day,” and we can see if it’s having any other impact anywhere else, but at least we know to focus on service B. If there are any questions, the Slack community, the Slack channel for Envoy, is very responsive if you do run into issues. Always ask questions there or post an issue on GitHub, and people do make a really concerted effort to reply as quickly as possible. I would say the community–

Ashley: That’s fantastic.

Caramanolis: I really respect them and love them and I do miss working with them on a more daily basis.

Ashley: Can you maybe translate or transition into a little bit of what you’ve been doing at Omnition? What kind of work are you doing now?

Caramanolis: Yeah. I’m actually gonna relate a little bit back to my KubeCon talk. Part of my KubeCon talk was talking about how I had this one issue. I think the example I used in the talk was “I can’t see photos.” So I know it hits the edge and it goes from service A to service B to service C. What allowed me to do that with Envoy is that Envoy has very clear metrics of your upstream and downstream callers. And to avoid defining that within here, ’cause some people have different definitions, pretty much Envoy tells you what services you’re dependent on.

So without having those metrics for knowing what your service is dependent on, sometimes it is hard to track down what piece of code is causing an issue. One way people actually do that is with traces. Say I know that this request is failing. I could look for its trace and then see that it’s going from service A to service B to service C, which is really valuable. So what Omnition is trying to do is actually flip tracing on its head, because usually the normal paradigm is to look at an individual trace and then try to correlate other data around it. Either it’s like an input value to a request.

Or it’s, “Oh, we know it’s always service B that’s having–after ____ is having a bad time.” And so you usually start from a really granular data point and build out the information. We take all the trace data and you’re able to see that everything went from service A, some things go to B, some things go to C, D. And then if something’s going wrong it’ll be a red line and you’ll just say, “Oh, I know that service B is causing this error. Let me look more into it.”
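Editor’s sketch: the idea of starting from all the trace data rather than a single trace can be illustrated by folding spans into a service graph and flagging the edges with a high error rate (the “red lines”). The span tuple layout, the service names, the 0.5 threshold and the function names below are hypothetical and chosen only for this example; this is not Omnition’s implementation.

```python
from collections import defaultdict

# Hypothetical span records: (trace_id, caller, callee, is_error)
spans = [
    ("t1", "edge", "A", False),
    ("t1", "A", "B", True),
    ("t2", "edge", "A", False),
    ("t2", "A", "C", False),
    ("t3", "edge", "A", False),
    ("t3", "A", "B", True),
]

def build_service_graph(spans):
    """Aggregate every span into per-edge call and error counts."""
    edges = defaultdict(lambda: {"calls": 0, "errors": 0})
    for _trace_id, caller, callee, is_error in spans:
        edge = edges[(caller, callee)]
        edge["calls"] += 1
        edge["errors"] += int(is_error)
    return dict(edges)

def red_edges(graph, threshold=0.5):
    """Edges whose error rate meets the (assumed) threshold."""
    return sorted(
        edge for edge, stats in graph.items()
        if stats["errors"] / stats["calls"] >= threshold
    )

graph = build_service_graph(spans)
print(red_edges(graph))  # the calls from A to B stand out
```

No single trace is inspected here; the failing dependency falls out of the aggregate, which is the inversion described above.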

Ashley: That seems extremely useful.

Caramanolis: Yeah. And it’s like–it’s actually–it almost would make my talk from KubeCon obsolete, ’cause I’m trying to teach the ____ with using Envoy, as in I follow the service graph, but I follow the metrics that Envoy produces.
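Editor’s sketch: following the metrics Envoy produces to find which upstream is “having a bad day” might look roughly like the Python below. It assumes plain-text counters in the `name: value` form with Envoy’s per-cluster `upstream_rq_total` and `upstream_rq_5xx` stat names; the cluster names and numbers are invented, and the helper functions are hypothetical.

```python
def parse_stats(text):
    """Parse 'name: value' lines, keeping only integer counters."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        if sep and value.strip().isdigit():
            stats[name.strip()] = int(value)
    return stats

def upstream_5xx_ratio(stats):
    """Per-cluster share of upstream requests that returned a 5xx."""
    prefix, suffix = "cluster.", ".upstream_rq_total"
    ratios = {}
    for name, total in stats.items():
        if name.startswith(prefix) and name.endswith(suffix) and total > 0:
            cluster = name[len(prefix):-len(suffix)]
            errors = stats.get(f"cluster.{cluster}.upstream_rq_5xx", 0)
            ratios[cluster] = errors / total
    return ratios

# Invented sample of the counters a caller's sidecar might report.
sample = """cluster.service_b.upstream_rq_total: 200
cluster.service_b.upstream_rq_5xx: 50
cluster.service_c.upstream_rq_total: 100
cluster.service_c.upstream_rq_5xx: 0
"""

ratios = upstream_5xx_ratio(parse_stats(sample))
worst = max(ratios, key=ratios.get)
print(worst, ratios[worst])
```

This is the manual, metric-by-metric walk of the dependency chain that an aggregated trace view would replace.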

Ashley: I wish you the best. It’s tools like that that really are essential to be able to grow the infrastructure the way that we’re building software now. Congratulations on the move and I’m excited for you, Constance. I wish you all the best. Thanks for being on the podcast.

Caramanolis: Mitch, thank you so much. It’s such an honor to be on the podcast. I had a lot of fun.

Ashley: Well, I did too and I’m honored to have you on here too. You’ve listened to another DevOps Chat podcast. I’d like to thank my guest, Constance Caramanolis, software engineer at Omnition, and thank you to, of course, our listeners for joining us. This is Mitch Ashley with DevOps.com. You’ve listened to another DevOps Chat podcast. Be careful out there.

— Mitchell Ashley