DevOps Chat: 7 habits of successful DevOps

I recently had a chance to speak again with Sam Guckenheimer of Microsoft about his experience leading the transition from Microsoft Visual Studio to Visual Studio Online. In doing so, Sam and his team developed the 7 habits of successful DevOps. You can listen to my interview below or read the transcript that follows. Also, if you want to download the Microsoft paper on the 7 habits of successful DevOps, you can do so here.
Audio: https://api.soundcloud.com/tracks/234949072

Interviewer: Hi, everyone. Alan Shimel, DevOps.com, here for another DevOps Leadership Chat, and I’m joined today by Sam Guckenheimer of Microsoft. Sam, welcome.

Interviewee: Hey, Alan. How are you doing?

Interviewer: I’m doing well. It’s good to speak to you again. The last time we spoke was at the DevOps Enterprise Summit out in San Francisco.

Interviewee: Yep, that was a lot of fun. That was a great conference.

Interviewer: Yes, and you know what? I think—well, you contributed to making it great, Sam. It was a great presentation you made, talking about, kind of documenting the journey of Visual Studio from a COTS, you know, traditional kind of software model to a SaaS based program, Visual Studio Online. And, you know, I think that’s something that a lot of our listeners may be interested in, and how do they turn their current software offerings into SaaS based. And along the way, of course, you’ve come up with a, I guess you call it the Seven Successful Habits of DevOps?

Interviewee: Yeah, so we think about the seven habits that we’ve applied as we’ve done our transformation. We started around 2010, we were in the, if you will, the pinnacle or the leader’s quadrants on all the analyst assessments as a provider of software for on-prem use in what was then called ALM—application life cycle management. And we said, “Hey, we need to become a great SaaS provider as well.”

So we think of things in terms of these seven habits that we need to provide. Three of them you can think of as sort of Agile on steroids or second decade Agile. That’s team autonomy and enterprise alignment, is number one. We work in scrum teams or feature crews, think about 8 to 10 engineers and a product owner. They autonomously pull items from a common product backlog, and at the same time, they align to enterprise goals on a six month planning cycle, we call this a semester or a season, where we know, in that time frame, what are the needles we’re trying to move for the business? What are the things we’re trying to do at the epic level that count, and Sprintly we measure, how are we doing against that, and the things that each feature crew is doing are counting against that. They may move forward, they may move backward, they may decide, based on the data they’re collecting to do more or do something else. But you know, that’s in their control based on the things we’ve agreed for the six months.

The second habit we have is a rigorous management of technical debt. We’ve known about technical debt for a long time, but when you’re running a service, you really can’t afford to let any debt creep in, and you have to stay clean all the time. So we make sure that we’re constantly monitoring both what’s happening in the production service and what’s happening in the development process. So if, for example, there is anything on the live site going wrong, we attend to it immediately. If there’s anything happening in the development process, like a feature crew is letting its bug count start to ride up, they have to stop immediately, fix the bugs, and can’t do any new work until they get them down to zero, so we stay clean on tech debt as we’re going.
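[Editor's note: the bug-count rule described above can be sketched as a small gate. This is a minimal illustration, not Microsoft's actual tooling. The `BugGate` class and the `trigger_limit` parameter are hypothetical; the interview says only that a crew whose bug count starts to ride up must stop new work until the count is back to zero.]

```python
class BugGate:
    """Hypothetical gate: blocks new work once open bugs exceed a limit,
    and keeps blocking until the crew is back to zero open bugs."""

    def __init__(self, trigger_limit: int):
        # trigger_limit is an assumed per-crew threshold, not a figure from the interview.
        self.trigger_limit = trigger_limit
        self.blocked = False

    def update(self, open_bugs: int) -> bool:
        """Record the latest open bug count; return True if new work is allowed."""
        if open_bugs > self.trigger_limit:
            self.blocked = True
        elif open_bugs == 0:
            self.blocked = False
        # Counts between 1 and the limit leave the current state unchanged.
        return not self.blocked


gate = BugGate(trigger_limit=5)
for count in (3, 6, 2, 0):
    print(count, gate.update(count))
```

[Note that the gate does not reopen when the count merely drops back under the limit; it reopens only at zero, which matches the "stay clean" rule as stated.]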

The third habit is a focus on the flow of customer value. We’ve talked about this since early days of Agile. Now, with SaaS, we can measure it. So we have great telemetry in the service, and we look at how people use our software, both quantitatively as we’re going forward; that is, we actually observe what gets used, what doesn’t get used, what the scenarios are, and we reach out to customers qualitatively, where we engage with the high usage accounts and we say, you know, “Let’s talk to them.” And we encourage our engineers to buddy up and reach out and say, you know, what’s gone well, what’s gone not so well, what would you like us to do that you don’t see? Are there things you wish for that we could be doing for you, and so on.

So we reach out to customers, we listen to them through things like UserVoice and we observe their behavior through telemetry on a continual basis. That’s all kind of Agile on steroids.

Interviewer: Mm-hmm.

Interviewee: You know, those are the first three. Then you get to the stuff that’s really new with devops. The first of those, number four, is what I’d call hypothesis driven development, or a backlog that gets groomed with learning. This is what Eric Ries, in Lean Startup, calls “build, measure, learn.” So, in other words, the idea is that the backlog, instead of being just, you know, what an omniscient product owner thinks is right, is something we recognize as a set of hypotheses or beliefs. We think that these things will improve our service, but we’re gonna treat them as hypotheses and we’re gonna turn those hypotheses into experiments, and we’re gonna collect data against those experiments so that we can substantiate or diminish the hypothesis, and that data will be evidence gathered in production. That’s the fifth practice: everything is evidence that we need in order to build up or draw down against those hypotheses. We use instrumentation on everything, we use telemetry on everything, and we measure it against what’s important.

So, for example, if we think something’s gonna be a better sign of experience, we will run the new against the old in parallel, and we may, if we have multiple front doors, like in our case we have a web experience and an experience through the IDE, the integrated development environment, we’ll run those in parallel. So that would be a four cell test, you know? New and old, web and IDE, and we’ll see what the evidence shows us about how much better the new is than the old, and hopefully the new is better; it might not be. Hopefully we’re right, but sometimes we’re wrong. In any event, it’s the data that will tell us, and it’s a “bring your data” world.
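[Editor's note: the four cell test described above (new and old, web and IDE) can be sketched as follows. This is an illustrative sketch only. The event format, the function names, and the idea of a boolean "succeeded" outcome are assumptions, and the sample data is made up. It computes raw success rates per cell and does not test for statistical significance, which a real experiment would need.]

```python
from collections import defaultdict


def success_rates(events):
    """events: iterable of (front_door, variant, succeeded) tuples.
    Returns the fraction of successful events for each (front_door, variant) cell."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for front_door, variant, succeeded in events:
        cell = (front_door, variant)
        totals[cell] += 1
        if succeeded:
            wins[cell] += 1
    return {cell: wins[cell] / totals[cell] for cell in totals}


def lift_by_front_door(rates):
    """Difference (new rate - old rate) for each front door that has both variants."""
    doors = {door for door, _ in rates}
    return {
        door: rates[(door, "new")] - rates[(door, "old")]
        for door in doors
        if (door, "new") in rates and (door, "old") in rates
    }


# Made-up telemetry: two front doors times two variants gives four cells.
sample = [
    ("web", "old", True), ("web", "old", False),
    ("web", "new", True), ("web", "new", True),
    ("ide", "old", True), ("ide", "old", False),
    ("ide", "new", False), ("ide", "new", True),
]
rates = success_rates(sample)
print(lift_by_front_door(rates))
```

[A positive lift for a front door suggests the new experience did better there, and a zero or negative lift is the "sometimes we’re wrong" case.]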

So, evidence gathered in production is our fifth habit, and that is made possible by our sixth, which is what we call a production first mindset, or what Satya, our CEO, calls a live site culture. The idea of a live site culture is that the site status is always first priority, and with site status, you have two goals. If there is a live site incident, you want to, as quickly as possible, remediate so that customers aren’t affected, and you want to drive to a root cause analysis. Practice the 5 Whys like Toyota, get to the root cause, and identify the actions that you need to take in order to prevent a recurrence of such an incident or something similar, and those root cause actions that you take, you need to complete in one or two sprints so that you’re hardening the service as you go learning from the live site incidents so that you’re always getting better.

That’s number six, and finally, number seven is that we manage infrastructure as a flexible resource. We can do this because we use the public cloud, and of course, in our case, that’s Azure. We work as a customer of Azure just like our external customers would. We take things a little earlier, but otherwise, we’re a customer. When we need more resources for development or for load testing or to spin up more labs, we just do it.

And we continue to expand the number of data centers we run, and we now run our Visual Studio Team Services in eight data centers around the world; that’s out of 24 regions that are currently active in Azure, and we fully automate the deployment, but we start each Sprintly deployment with a canary instance, which is where we work. It’s our own one of those stamps, and the one where our closest early adopter customers are so that we know from the canary if there’s anything wrong with the deployment, and if there is, we then restart at the canary before we roll out to the subsequent deployment rings. And that practice of canary-ing, going big with load testing is part of the deployment pipeline, and then using the automation to roll out in subsequent rings is something that we’ve used to harden the service and let us get to a global scale on a regular basis.
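[Editor's note: the canary and deployment ring practice described above can be sketched like this. It is a simplified illustration, not the actual Visual Studio Team Services pipeline. The `roll_out` function, the ring names, and the `deploy` and `is_healthy` callbacks are hypothetical stand-ins for real deployment automation and health checks.]

```python
from typing import Callable, List


def roll_out(
    rings: List[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy ring by ring, treating rings[0] as the canary.
    Stops at the first ring that fails its health check and returns
    the rings that were deployed and verified healthy."""
    completed = []
    for ring in rings:
        deploy(ring)
        if not is_healthy(ring):
            # Halt here; later rings never receive this build.
            break
        completed.append(ring)
    return completed


# Example with made-up ring names and a canary that fails its health check.
deployed = []
result = roll_out(
    ["canary", "ring-1", "ring-2"],
    deploy=deployed.append,
    is_healthy=lambda ring: ring != "canary",
)
print(result, deployed)
```

[When the rollout stops, the caller would fix the problem and call `roll_out` again from the start, which corresponds to restarting at the canary before moving on to the later rings.]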

Interviewer: Got it. So, Sam, these are obviously lessons learned the hard way, right, [Laughter] by doing and paying the idiot tax and doing that.

Interviewee: Yes.

Interviewer: For our leaders who are listening to you now, where to begin? Where to start? Obviously you can’t do all seven at once. Some of them are dependent on others. Where do they begin?

Interviewee: Well, I think that the first thing is to make sure you have Agility down. I think you cannot do this if you are a command and control organization that doesn’t really let teams learn. You need to recognize that you do need to measure your time to learning along with your other measurements, and you also need to totally automate your release pipeline.

So, the notion that you can have manual steps somehow between a developer pushing code or checking in and the release of that code is a thing of the past. You know, start with continuous integration and then quickly go to continuous deployment and put the telemetry in place so that you can see the evidence from production that gives you the feedback on everything that’s getting deployed.

If you don’t have those things in place, the rest will crumble.

Interviewer: Yeah, agreed.

Interviewer: Sam, we unfortunately are coming up near the end of our time. I want to ask one more question of you, though, and it’s along the same vein. So, you’ve laid out the steps here for us. How did you get your team’s buy-in on this, or did you?

Interviewee: Well, we certainly did. I don’t think it was very hard. I think that we all wanted to move to this world. I think that there was a broad recognition that we needed to become a first class SaaS provider. I think that there were some additional quirks in our case of being a great SaaS provider while we continue to be a great provider of on-prem software, and maintaining that engineering discipline.

The teams were all for it. I think the leadership needed to recognize that we were investing in new practices and that we needed to maintain the investment in engineering systems and engineering practices so we have a rule of thumb that about 20 percent of our investment goes toward non-customer visible engineering, things like live site work, and we don’t quarrel about that. If we need to do live site improvements, we do them.

So I would say there was a little bit more work on management to accept the change in practices than there was on the ground with the engineering teams who were sort of chomping at the bit to get going.

Interviewer: Well, there often is in devops, right? You have that “from the bottom up” and getting true top down sponsorship and cover is a problem, right? Not a problem that cannot be overcome, but an issue that we’ve explored previously in our leadership suite here.

Interviewee: Yeah. Yeah, it helps that right now our CEO, Satya Nadella, comes from this world, so he is pushing everyone in the company to make the change and to learn the new practices. There were some folks in the middle who were not comfortable, necessarily, but you know, they’ve come around.

Interviewer: As they often do.

Interviewer: Sam, anyway, we are probably over our time and I apologize, but thank you very much for sharing your Seven Habits for Successful DevOps. Continued success with Visual Studio Online and the rest of your endeavors at Microsoft, and we hope to have you on again some time in the future and we can talk more about leading the transformation at companies around the globe with these new ways of doing things.

Interviewee: Thanks, Alan. I’ll do one commercial, which is that there’s a URL, microsoft.com/devops, where you can download an e-book about our journey and a devops self-assessment if you want to read more.

Interviewer: You know what, we’ll include that link in the article, too, in the transcript so people can get that. Sam Guckenheimer, Microsoft—thank you so much for being our guest today.

Interviewee: Thank you very much, Alan.

Interviewer: Okay, this is Alan Shimel for DevOps.com. Thanks, and have a great day.