DevOps Chat: Shift Ops Left, Shift Dev Right with Kristian Stewart, IBM

In this DevOps Chat we speak with Dr. Kristian Stewart of IBM’s Cloud Event Management team. Kristian has some keen insights into the roles of dev and ops and how they interact. Kristian calls it “shift ops left, shift dev right,” an interesting way of looking at things.

As usual the streaming audio is immediately below and the transcript of our conversation is below that.

Audio

Transcript

Alan Shimel: Hello, everyone, this is Alan Shimel, DevOps.com, here for another DevOps Chat. Got a really hot chat today and my guest joining me is Kristian Stewart of IBM’s cloud event management team. Kristian, welcome to DevOps Chat.

Kristian Stewart: Thanks very much for having me, Alan.

Alan Shimel: And I just wanna make sure I got it right – it is Kristian Stewart, right?

Kristian Stewart: Last I checked.

Alan Shimel: Okay. Well, sometimes my funny French accent throws people off, but, anyway – and, Kristian, I mentioned you’re working with IBM in cloud event management, but give our audience a little bit behind that. What exactly does that mean?

Kristian Stewart: Well, let me give you a bit of a background ’cause there’s some history here, in particular with respect to the topic of the podcast, which talks about operation tools. So I’ve been with IBM for ten years. I’m one of the lead architects for a suite of IT service management tools, and my own specialization is in event management, as part of IBM’s Netcool brand, okay? So I’ve been in this industry for 18 years.

I worked for a small start-up back in the late ’90s – the company was called “Micromuse,” and we were focused on introducing tools for disciplined fault management into telcos, into communication service providers, during the dot-com boom of the late ’90s. And then, in the early ’00s, we pivoted and successfully sold into enterprise, finance, retail, as their IT infrastructures became larger and more complex and as they matured, from an IT operations perspective. So we were acquired in 2006 and we started to work towards applying interesting mathematical and analytical techniques, for increasing the value provided by our tools, and applying machine learning to problems in this space. So that was my background up until about three years ago.

So what do I mean by “cloud event management”? Well, more recently, as part of the movement of the industry that we serve, as they move into cloud and their investment in cloud has increased, I’ve been working with teams to build software-as-a-service offerings of capabilities we had, till that point, only available on-premise, so it includes alert notification tools, runbook automation tools, and cloud event management. And so our particular interest in DevOps is twofold, right? So our transition itself, from shrink-wrapped client-server software, that was originally deployed on-prem, has gone through to the development of cloud-native, cloud-scale, public cloud resident offerings, and with our guys embracing paradigms like micro-service architectures, 12-factor apps, and, of course, DevOps.

And one of the biggest challenges or one of the biggest changes our teams have had to endure is that they now have to operate these deployments themselves, right? So they’ve gone from shrink-wrapped, “Give it to the client. Now it’s their problem,” – well, kind of – through to “Now we’re hosting this for you.” So we’ve had folks from development and QA transitioning to operations roles, and they’re required to apply software engineering skills and work with app developers to ensure that we’ve got robust capabilities in private cloud. And we’re providing 24/7 ops to support these offerings.

But we find ourselves in a fairly unique position because we’re running operations for as-a-service products that are themselves designed to help people responsible for running operations for their own apps and services. So it’s given us sort of a renewed empathy for our end users, especially as we start to angle our offerings towards teams that are embracing DevOps. Does that answer your question?

Alan Shimel: Yeah, and then some. [Laughs] But good stuff, Kristian. Thank you very much. So wanted to talk a little bit today bringing this back home to DevOps, right? And one of the trends that we’ve been seeing over at DevOps.com recently is, really, around kind of putting the “ops” in DevOps, right? ’Cause we’ve spoken now to a few folks in the DevOps space from both the vendor and practitioner side, who are really emphasizing the role that ops plays in DevOps. And one of my things about it, though, Kristian, is I get the opportunity to speak to a lot of people like yourself, a lot of people who are heavily invested in DevOps, but I also get the chance to talk to a lot of people who aren’t and are new to DevOps or just dipping a toe in the water and planning their DevOps migration and transformation.

And one of the things that I try to remind people is that “Don’t leave your common sense at the door,” at the DevOps door, right? Bring it with you, number one. And, while you’re bringing something with you, it doesn’t mean everything you’ve done in ops to this point and all the lessons learned and best practices accumulated get thrown out – you know, the baby with the bathwater. With DevOps, we evolve. We’re not necessarily revolutionizing; we’re evolving. And so building on best practices and time-tested ops practices and methodologies and processes is probably a better solution than, you know, scrapping the whole thing and starting from scratch. And who has the opportunity to do that anyway? What do you say to that, Kristian?

Kristian Stewart: Oh, I agree completely. And, you know, amongst our clients, we’ve got a whole array of clients, from telcos to finance to retail, right? You name it. Across those clients, we have people on varying points on the curve of maturity, with respect to adopting DevOps. But what I say to those clients, as they look at those paradigms, is that the business objectives of operations teams haven’t changed substantially, certainly since I’ve been in the business. There’s a lot that has changed, but the business objectives continue to be measured by KPIs like mean time to repair, mean time between failures, and pressure on costs, right?
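The two KPIs named above can be made concrete with a small sketch. It is illustrative only: the `Incident` record and the observation window are assumptions, not anything described in the conversation, and MTBF is taken here as operational time in the window divided by the number of failures, which is one common definition among several.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record: when service broke and when it was restored."""
    started: datetime
    resolved: datetime


def mttr(incidents):
    """Mean time to repair: average outage duration across incidents."""
    total = sum((i.resolved - i.started for i in incidents), timedelta())
    return total / len(incidents)


def mtbf(incidents, window_start, window_end):
    """Mean time between failures: uptime in the window divided by failure count."""
    downtime = sum((i.resolved - i.started for i in incidents), timedelta())
    uptime = (window_end - window_start) - downtime
    return uptime / len(incidents)
```

Both functions assume at least one incident in the window; a real report would also have to decide how to treat overlapping or still-open incidents.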

So disciplines – so, within a, let’s say, a mature enterprise that has invested heavily in ITIL, they adopt disciplines like event and incident management, and that leans on a key set of capabilities, which are equally applicable when you transition to DevOps. You know, the proximity of your developers to your operations and the smoothing of the processes and the interactions between those organizations, if anything, are enhanced by some of the traditional methodologies and tools. So, you know, to list some of them, you need tools which can consume events from highly heterogeneous environments; you need to minimize the amount of management noise that’s presented to the people who have to respond to events; you need to be able to integrate with other systems; you need to be able to help pinpoint probable causes, drive efficiency in operations, do things like apply machine learning and advanced analytics to gain insight from your event and incident data.
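One simple way to reduce the noise mentioned above is to collapse repeated events from the same source into a single alert with a tally. The sketch below is a generic illustration under that assumption; the `Alert` type, the `(source, check)` key, and the field names are hypothetical and are not taken from any IBM product.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical alert shown to an operator."""
    source: str
    check: str
    severity: str
    count: int = 1


def deduplicate(events):
    """Collapse repeated (source, check) events into one alert with a tally."""
    alerts = {}
    for source, check, severity in events:
        key = (source, check)
        if key in alerts:
            alerts[key].count += 1
            alerts[key].severity = severity  # keep the most recent severity
        else:
            alerts[key] = Alert(source, check, severity)
    return list(alerts.values())
```

With this, three identical disk warnings from one host reach the operator as a single row with `count == 3` rather than three separate rows.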

So, if you look at the evolution of these practices, historically, right, ITIL emerged in parallel with the practice of event and incident management. It didn’t dictate it. You know, these standards emerged as – sorry, event and incident management emerged as de facto practices, driven by the availability of practical tools to solve real problems, and they still solve those problems. You know, if you look at the way that operations has evolved, look at telecom from the late ’80s and early ’90s, right?

So, in a telco network operations center back then, you might have had dozens of very highly-skilled people, _____ network engineers, with training on dozens of different management tools, and these guys and girls were expensive, right? I mean, their time cost a lot of money, and then the introduction of event and incident management tools helped transition from that to having perhaps the same number or even more operations staff of a substantially lower skill. And so the first thing event management and incident management did was to try and lower the cost of entry for running an efficient operations team.

But now, if you look at the changes in IT management now – so I was chatting to a large enterprise just the other day, and, ten years ago, they – so these guys have always been driven by those same metrics – MTTR, MTBF, cost, right? A few years ago, their job was to provision physical hardware, static middleware, and the management of that stuff from an operations perspective. Now these guys are having to provide platform-as-a-service tools, distributed runtime container and orchestration systems, and they still have responsibility for the existing infrastructure – the network, the physical servers, the legacy monolithic applications. And all that still requires management. So, mainly, they’ve got the challenge of managing hybrid cloud, coupled with extremely tight expectations on time-to-market that working practices like DevOps and Agile software development allow for.

So I’d say that event and incident management tools are as relevant now, if not more so, than they ever were because, when you shift to practices like DevOps and site reliability engineering, you know, back to the olden days where you have ops employing hundreds of, you know, low-skill operators, best practice is transitioning to one where you have a substantially lower number of substantially higher-skilled operations staff. Right? So these guys and girls are highly skilled, but they’re time-poor. You know, your typical operations professional, in an organization that’s embraced DevOps, wants to be concentrating on automation, on liaising with dev early on in the release life cycle, on prepping for new rollouts. They don’t wanna be grappling through log files and sifting through performance metrics to find out what happened just so they can restore a broken service.

And I think the eventual goal of the kind of tools that we provide – event and incident management – isn’t just to help the operator find the needle in the haystack; I think it’s to hand them the needle. And that’s useful, whether you’re a low-skilled junior operator in old-school operations or a time-poor, hyper-technical site reliability engineer with 20 years’ experience. Alan, you still there?

Alan Shimel: Yes. Kristian, I’m sorry. I had a little internet issue. So all really good, but, you know, one of the things I wanted to touch on, Kristian, is the subject of empathy, as part of kind of your DevOps mantra, right? Sometimes, people give cultural things short shrift, but empathy is important in DevOps. And what I wanted to ask about was the idea of “How do we get developers?” Right?

For a long time, the ops people have sort of had to empathize with what developers are doing and what they’re doing in terms of development to help in deployment and stuff, as part of DevOps, but now, putting that on the other foot, right, getting developers to empathize with what the ops folks are dealing with, as part of this DevOps equation. I mean, I’ve heard of some organizations where they’ll actually embed developers on the ops teams for short periods of time so that they get empathy the hard way, if you will, right?

Kristian Stewart: Yeah.

Alan Shimel: They gotta live it. What do you think about something like that?

Kristian Stewart: I think, in terms of getting the development professionals involved, I think – so you mention role rotation, right? And I think that’s one tactic. You know, I like to think of this as much as “shift-right dev,” as it is “shift-left ops,” right?

Alan Shimel: Yeah.

Alan Shimel: Good. Good.

Kristian Stewart: And, actually, you know, it reminds me of the situation in shrink-wrapped software development, maybe 20 or 25 years ago, between development, who were the coders, and QA. Right? So there was sort of a brick wall, prior to Agile software development, where it was throwing a pile of code over the wall. And I think Agile, with that shift-left testing, really helped to remedy that, to the extent that I see, today, developers and quality assurance engineers working really, really closely together and not depending on, you know, reams and reams of documentation to communicate with each other.

So, I think, with DevOps, I think a similar thing is happening, and I think you need shift-right dev, as well as shift-left ops. So operating an application or a service, I think, has to be a shared responsibility, and I do think that dev need to get skin in the game. So that’s gonna include dev as an escalation point, when an incident occurs. It can also include role rotation. But, for mutual benefit, I think they both need to use the same operations tools, right, so that they see what each other are doing, when an incident or a critical event occurs in the environment. And then they’re motivated to improve the use of those tools for themselves, as well as for the sake of operations.

I think, you know, a 3:00 AM incident notification, you know, like a page at 3:00 AM, is a great motivator. So, for ops people, it’s an incentive to get runbooks and automatic incident responses written so that service restoration is as quick as possible and, preferably, automatic. With dev, it’s an incentive to make their services or applications operable, right, so they need to get the right amount of instrumentation into the app so that ops can manage an incident or, better still, that some script can, so that neither of them get the 3:00 AM escalation. So I think there’s a – with all due respect to my dev and ops colleagues, there’s a carrot-and-stick approach there.
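The idea of a runbook that fires automatically, with a human paged only when it fails, can be sketched roughly as follows. Everything here is hypothetical: the alert shape, the `restart_service` runbook, and the `page` callback stand in for whatever tooling a team actually uses.

```python
def restart_service(alert):
    """Hypothetical runbook step; returns True if the service came back."""
    # A real runbook would call an orchestrator or init system here.
    return True


# Hypothetical registry mapping alert types to automated runbooks.
RUNBOOKS = {"service_down": restart_service}


def handle_alert(alert, runbooks, page):
    """Try the matching runbook first; page a human only if it is missing or fails."""
    runbook = runbooks.get(alert["type"])
    if runbook is not None and runbook(alert):
        return "auto-resolved"
    page(alert)
    return "escalated"
```

The design choice is that escalation is the fallback path: every alert type that gains a working runbook is one less reason for anyone, dev or ops, to be woken at 3:00 AM.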

Alan Shimel: Mm-hmm. Yeah, that makes perfect sense. Kristian, I want you to repeat what you said before. We need to shift ops left and shift devs right or did I have that backwards?

Kristian Stewart: Yeah, no, yeah, that’s what I said. Shift-right dev and shift-left ops, right?

Alan Shimel: Got it.

Kristian Stewart: So get them to have a degree of overlap with respect to their responsibilities. And, of course, what they do in response to those responsibilities can be quite different. You know, let me see if I can think of an example. So, for devs, under those circumstances, when they’re being involved in responding to an incident, it can be stuff as simple as being really clear in their application logs.

So, if you think about events that are generated by, say, a container-based app deployed on Kubernetes, okay, so the event and incident management system is gonna be presented with logs from the app code; they’re gonna be presented with logs from middleware; logs from Kubernetes itself; logs from, potentially, external technologies and monoliths – NFS mounts, databases, legacy apps, network; and they may also be getting alerts from monitoring systems that are watching performance metrics for the running container, so transaction request rate between micro-services, CPU utilization, response time.

So, if devs can make sure that something as simple as the concept of criticality or severity is included in the logs, that really helps with filtering _____ ops figure out whether or not something’s really a problem. You know, for example, is this event classed as a trace message? A debug message? A warning? An error? Is it known that this is service-affecting? Right?
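As a minimal sketch of what including severity in the logs might look like, the example below uses Python’s standard `logging` module to emit one JSON object per line with an explicit severity and a service-affecting flag, then filters on it. The field names (`severity`, `service_affecting`), the logger name, and the `actionable` helper are assumptions made for illustration.

```python
import io
import json
import logging

# Numeric ranks matching the standard logging levels.
SEVERITY_RANK = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40, "CRITICAL": 50}


class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON line with an explicit severity field."""

    def format(self, record):
        return json.dumps({
            "severity": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Hypothetical flag a developer sets via `extra=` when impact is known.
            "service_affecting": getattr(record, "service_affecting", False),
        })


def build_logger(stream, name="checkout"):
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def actionable(lines, minimum="WARNING"):
    """Keep only parsed events at or above the minimum severity."""
    threshold = SEVERITY_RANK[minimum]
    events = (json.loads(line) for line in lines)
    return [e for e in events if SEVERITY_RANK[e["severity"]] >= threshold]
```

A developer would then write something like `log.error("db down", extra={"service_affecting": True})`, and the ops side can drop trace and debug chatter with a one-line filter instead of guessing from free text.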

Alan Shimel: Yeah.

Kristian Stewart: And, you know, how better to motivate devs to implement those sorts of practices than to make them consume log messages that are completely uninterpretable, right?

Alan Shimel: Agreed. Agreed. Hey, Kristian, we’re over time already, but, as I told you before we started, the time here goes quickly. But, you know, boiling all of this down, the last – whatever – 18 minutes down, really, I wanna make the point of – and I loved what you said about ops shift left, dev shift right – that our ops and devs can work together. And, as you said, they each will have different reactions with different roles and responses to given incidents or given stimuli, but they can work together and we can build on what ops has built over the last 50 years of the IT industry. And –

Kristian Stewart: Mm-hmm.

Alan Shimel: Right? We don’t leave that at the door of DevOps; it’s part of it. And one thing I want – you know, if we can ask people to take that with them. And you agree that’s a key takeaway here for them?

Kristian Stewart: Absolutely. Couldn’t agree more.

Alan Shimel: Fantastic. Hey, Kristian Stewart, IBM, thank you for joining us on this episode of DevOps Chat. Love to hear more in the future about what you and the IBM team are doing around cloud and hybrid and some of the other stuff kicking around. And we’ll have you back again soon, but, for now, thanks for being our guest this episode.

Kristian Stewart: Alan, it was an absolute pleasure.

Alan Shimel: Great. Hey, this is Alan Shimel, everyone, for DevOps.com, DevOps Chat. Thanks for joining us and we’ll see you soon on the next chat.

— Alan Shimel