What are some of the metrics you should be monitoring to track your DevOps efforts, and which metrics should you abandon?
Last Tuesday I participated in an online panel on the subject of Measuring DevOps, as part of Continuous Discussions (#c9d9), a series of community panels about Agile, Continuous Delivery and DevOps. Watch a recording of the panel:
Continuous Discussions is a community initiative by Electric Cloud, which powers Continuous Delivery at businesses like SpaceX, Cisco, GE and E*TRADE by automating their build, test and deployment processes.
Below are a few insights from my contribution to the panel:
Dev: What Metrics Matter?
I think the fact that we’re talking about this in terms of the various stakeholders is actually a symptom of the problem. When you talk about measuring DevOps – for me you’re talking about measuring the efficiency of your delivery chain, which is not one set of metrics for each of these categories. I think it’s dangerous to think in separate categories – what does the Dev metrics fiefdom look like, what does the Ops metrics fiefdom look like. There are plenty of tools out there, and they’re great, they’re amazing, but if you allow those tools to guide what you’re doing, aren’t you just doing the same thing we were guilty of in the past, which is siloing things?
So when I think about measuring DevOps, I think about how successful your delivery chain is. And at the end of the day, is the pipeline in your delivery chain improving? Not just your application – is your pipeline improving? Because that is its own entity.
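One way to make “is your pipeline improving” concrete is to track a couple of simple delivery metrics over time, such as lead time (commit to production) and deployment frequency. The sketch below is only an illustration – the deploy records are invented, and real pipelines would pull these timestamps from a CI/CD system:

```python
from datetime import datetime, timedelta

# Hypothetical deploy records: (commit time, production deploy time) pairs.
deploys = [
    (datetime(2016, 5, 2, 9, 0), datetime(2016, 5, 3, 14, 0)),
    (datetime(2016, 5, 9, 11, 0), datetime(2016, 5, 9, 16, 30)),
    (datetime(2016, 5, 16, 10, 0), datetime(2016, 5, 18, 12, 0)),
]

# Lead time: how long a change waits between commit and production.
lead_times = [deployed - committed for committed, deployed in deploys]
avg_lead_time = sum(lead_times, timedelta()) / len(lead_times)

# Deployment frequency: releases per week over the observed window.
window_days = (deploys[-1][1] - deploys[0][1]).days or 1
deploys_per_week = len(deploys) * 7 / window_days

print(avg_lead_time, round(deploys_per_week, 2))
```

Plotting these two numbers month over month tells you whether the pipeline itself – not just the application – is getting better.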
Isn’t DevOps sold to us as the big green go-button? Most of the language used around DevOps is “More releases”, “Go forward”, “Go faster” – but “go forward, go faster” requires you to know where you’re going. The classic picture I’ve seen for this is the super-fast Ferrari that crashed into the tree. Yes, it’s great that you have the Ferrari, but how can you make sure that Ferrari is not going to blow up in your face in two years? Or what if they don’t even want the Ferrari? I like the focus that’s coming here – it’s all on the outcome, the result. And unfortunately there is an organizational problem here, going back to: what’s your bonus based on? How are you measured? How are you going to get a raise? If your promotion is based simply on having fewer bugs – I know how tricky people get with their bonuses. I mean, I’m guilty, I did a lot of funky things to get my bonuses. This is the famous cognitive bias of anchoring, and I think it’s happening especially in Ops.
Ops: What Metrics Matter?
Doing it reactively is much more difficult than eating the effort of doing it right now. We’re facing this right now with our own Dev and Ops work. Now granted, we want to use every tool out there – we have that mentality because DevOps is what I write and speak about – but we made sure that even though some of these things are kind of heavy upfront planning, we are using component monitoring, we are using APM, we’re even using exception monitoring. All of these great things, not because they are benefiting us today – they really aren’t, we’re so small and still in that MVP stage – but because they will benefit us later.
I think that we’ve really covered what it is, and we all know what Ops is monitoring: mostly what’s related to production, the standard data. With most of the people I interview, I’m seeing something like a more “lean”, “modern” NOC – they don’t want to call it a NOC because that’s old, but that’s what it is; you’ve got really pretty dashboards now instead of the ugly dashboards. One of the things happening in the vendor space – and this is not an attack on the vendors, it’s a symptom of companies leading with a tool and expecting magic – is the term “Machine Learning”, one of my favorite terms out there. Analytics with quote-unquote “Machine Learning” is not going to solve anything for you on its own. It will reduce the noise, and that is critical when we start talking about pinpointing what’s impacting you, so the technology will solve a real problem – but it’s not going to do it on its own. You have to know what you want to use those insights for. Relying on Machine Learning to solve the problem is different from deciding how you want to leverage the results, that outcome, which is absolutely critical.
The other thing I want to say here is that many of the organizations I talk to have the analytics person – the go-to person who has learned everything about their log analysis platform and knows how to write really awesome queries and make dashboards. The bad thing is that they are the person you have to go to – you really have no choice – so they become a bottleneck, and they’re most often very annoyed by this. Support needs a dashboard; eventually marketing gets wind of all the data you have and they want a dashboard too; and so this person drowns in that process. I think that’s an organizational problem.
And the other piece of analytics that we didn’t cover is Incident Management – not so much the alerting process, but the analytics around incidents: response time, how many incidents you have, and so on. My point is not an attack on Machine Learning – I used to work for an NLP company and my degree was focused on Data Mining – it’s an attack on acquiring technology just because the vendor said they used Machine Learning. Machine Learning has nothing to do with it; it’s the reduction – the log reduction, the data reduction – that supports how you make decisions. But if you don’t even know the process for your decision making in consuming that data, it’s worthless. It’s not going to answer questions for you; it’s going to make it easier for you to answer questions.
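The incident analytics mentioned here – response time and incident counts – reduce to a very small computation once you have the raw records. A minimal sketch, assuming a hypothetical incident log of (service, opened, resolved) tuples:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical incident log: (service, opened, resolved) records.
incidents = [
    ("checkout", datetime(2016, 5, 1, 3, 0),  datetime(2016, 5, 1, 3, 45)),
    ("checkout", datetime(2016, 5, 8, 22, 0), datetime(2016, 5, 9, 0, 30)),
    ("search",   datetime(2016, 5, 4, 12, 0), datetime(2016, 5, 4, 12, 20)),
]

counts = defaultdict(int)
total_minutes = defaultdict(float)
for service, opened, resolved in incidents:
    counts[service] += 1
    total_minutes[service] += (resolved - opened).total_seconds() / 60

# Mean time to resolve (MTTR) per service, plus the raw incident count.
for service in counts:
    mttr = total_minutes[service] / counts[service]
    print(f"{service}: {counts[service]} incidents, MTTR {mttr:.0f} min")
```

The hard part, as the panel notes, isn’t the arithmetic – it’s deciding what decisions these numbers will drive.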
Release Manager: What Metrics Matter?
I don’t really know, because my perception of who the Release Manager is, is much different in a DevOps environment. I actually feel like the Release Manager today is a steward of the automation – it could even be the DevOps role – so I don’t think the Release Manager is necessarily what he used to be, the gate you hit before releasing. I guess if you consider it, then you have to think: end-to-end testing results, and making sure that everything is green – not yellow, not red – before everything goes out; catching the red in a small slice first, that’s a canary release. So it’s very siloed, but I don’t know that Release Management is what it was in the days of what I call “really fast waterfall” – because honestly, that’s what was going on, it was really fast waterfall. That’s my point of view.
In modern applications you’re hinting at the idea of micro-services as well, where I don’t know that it would even be feasible to have a traditional Release Management process.
I’ve yet to interview a company who can tell me that that isn’t the case – and that’s part of the pipeline metrics: can you, at the end of the year, tell me how many times you actually did a release on time, or how many releases weren’t a big disaster? I think it comes down to the team. Most of the time this is – not the culture word again – a people problem, not a technology problem.
When I say “really fast waterfall” I’m not intending to bash Agile; I just know that for most organizations, including startups I’ve worked for, that’s what it is in reality.
Maybe that’s not a metric – I mean, releases are a metric, and they’re somehow correlated – but the fixed time period between releases, saying that we’re always going to release every two weeks, that we’re going to close out this sprint on this day, putting a date on your sprint… again, those are all waterfall things. It’s not six months, it’s two weeks.
But at the same time – and you guys know it, being in the Valley – you also don’t have to be the unrealistic startup that says Ops just flat out goes away. Nor do you want them to; I think teams have to be very careful about what they’re actually asking for here, because they’re taking on a lot of liability. Habits are always really hard to break…
A point raised here that is new to me, and refreshing, is that maybe your application doesn’t lend itself to micro-services, maybe it doesn’t lend itself to continuous delivery – I have a very strong opinion that to do continuous delivery you need a certain type of application, one with high transaction volume, geographic diversity, etc. – but you can still build with that as the goal in mind anyway. So in my view, DevOps is more of a methodology and a practice that never ends. It’s a journey: you always have your eye on the target, which is a better outcome than today’s outcome.
What Other Metrics Matter?
Just a quick response on lead time – I used to be a technical product manager, so how does the product manager fit in? I think that’s an interesting aspect, because obviously you’re not going to take every feature request and implement it.
So my top three metrics. The first one is a highly neglected area: QA functional testing. I think QA got the short end of the stick, but they have a unique, holistic point of view – so not only should they have their own measurements, they should be involved in the other measurements. There’s a great article out there by Greg Sypolt on the KPIs for this.
The next one is totally unique and I haven’t seen it done – I don’t know if it’s good or bad, but I’d like to see somebody do this: a scoring mechanism, over some period of time, for your delivery chain as a single unit. All the aspects we’ve talked about play into it, but at the end of the year, can you tell me that you have gotten better at releasing software? That you are a better software development shop? How you score that I don’t know – somebody needs to come up with that best practice – but I think it’s one number, over a long period of time, and you say “we have gotten better”.
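Since no best practice exists yet, here is one purely hypothetical way such a single score could work: normalize a few pipeline metrics to a 0–1 scale and combine them with weights. The metric names, weights, and target values below are all assumptions for illustration, not an established formula:

```python
# Hypothetical "delivery chain score": combine a few normalized pipeline
# metrics into one number so year-over-year comparison is possible.
# The components and weights are invented for illustration only.

def delivery_score(on_time_release_rate, change_failure_rate,
                   avg_lead_time_days, target_lead_time_days=2.0):
    # Higher is better for each normalized component.
    timeliness = on_time_release_rate                 # already in 0-1
    stability = 1.0 - change_failure_rate             # invert failure rate
    speed = min(target_lead_time_days / avg_lead_time_days, 1.0)
    weights = (0.4, 0.4, 0.2)
    return round(100 * (weights[0] * timeliness
                        + weights[1] * stability
                        + weights[2] * speed), 1)

# Compare year over year: did the shop get better at releasing software?
last_year = delivery_score(0.60, 0.25, 6.0)
this_year = delivery_score(0.80, 0.15, 3.0)
```

Whatever the formula, the point the panel makes stands: it has to be one number, tracked over a long period, so you can say “we have gotten better”.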
And one more, just because it’s been fascinating to me: Exception Monitoring. I don’t see this a lot, but the aspect that’s interesting to me is the amount of transparency it forces on the team. It becomes very personal once you start sharing your exceptions during the build process and everybody sees them on a single dashboard – which means you have to be comfortable with that transparency. I think that’s a great thing at the end of the day, but it’s going to be very hard to swallow for a lot of people.