Talk by Jesse Robbins (@jesserobbins), Chairman and CEO of OpsCode
Blogged live from DevOpsDays Boston 2011 by @martinjlogan
Tricks for getting DevOps to work in your company, from the technical to the social, taken from experiences at Amazon.
Jesse Robbins was the “Master of Disaster” at Amazon. As an ops guy, he worked multiple 72-hour sessions, took site uptime very seriously, and eventually turned into the stereotypical ops guy who said “no” all the time. What’s worse, he discovered that he was proud of it. Jesse started to notice that he was taking site outages personally. Amazon at the time, 2002-2003, was doing standard ops. They deployed in large, monolithic releases, which is an absolutely painful and error-prone process. Managing such a process while being so emotionally invested in the work is not a productive situation to be in.
A turning point for Jesse in terms of moving from an obstacle in the way of change to someone that really knew how to add value with ops practice stemmed from a battle he got into with the “VP of Awesome” at Amazon. This was the nickname of this particular VP because it seemed that pretty much any highly interesting project at Amazon was under this man’s purview. What happened was that Jesse did not want to let out a piece of software because he knew, for sure, that it would bring the site down. The VP overrode him by saying that the site may go down, but the stock price will go up. So, the software went out, and it brought the site down. Two days of firefighting and the site came back up, and so did the stock price, and so did the volume of orders.
The dev team went on to have a party; they were rewarded for a job well done, for releasing new and profitable functionality. At the end of the year, Ops got penalized for the outage! Amazon rewarded development for releasing software and providing value, and operations was not a part of that. Ops was in fact penalized for something that was out of its control.
This of course did not sit well and as a result of this and other similar situations Jesse actually got famous for saying no. Who in their right mind would want to release software and go through that over and over? When the business would put up a sign advertising new functionality that was to be developed, something they were presumably excited about, Jesse would write the word “No” on it.
Operations naturally wanted to protect itself and came up with all kinds of artifice in order to do so: root cause analysis, so that blame could be assigned efficiently, and software freezes that prevented software from being delivered to the site during peak times of the year. This seemed like progress but, looking back on it, clearly was not.
(This all sounded fairly familiar to the folks in the room)
On to DevOps
Now to talk about DevOps. We have at least the beginnings of an idea of what to do about the situation described above. Prior to addressing this situation correctly, or at least more correctly, Ops looked at their output as a waste. The best thing they could do was to cost the site $0. Instead we need to be looking at this another way, bringing value through the function of ops. DevOps is about creating a competitive advantage around the things Ops does every day.
Why does the break occur? Historically Ops creates value by reducing change and getting paged when things break. Dev is about value creation and Ops is about protecting that value. This creates a “misalignment of incentives” meaning that different organizations are rewarded for different behaviors. This creates something called local optimization. Knowing these terms will help you talk to MBAs about DevOps!
We have a fundamental misalignment of incentives, and in fact a conflict of incentives. Development is exclusively aligned to releasing software and not at all focused on maintaining it. Ops is the opposite. Each group optimizes locally around its own incentive, which creates conflict. Operations is focused on minimizing change because that reduces outages, whereas Dev is entirely focused on maximizing change.
Solving this problem is what DevOps is all about.
The unproductive way of thinking mentioned earlier came about in an environment with 4000 devs and significantly fewer ops folks. The attempt to alleviate the problems caused by misaligned incentives and local optimization was to come up with those punitive changes that are incredibly satisfying to Ops folks but really don’t help solve the problem. Those changes take the form of meetings and review boards that exist to punish people into releasing software the way you, the Ops person, want. These are the kind of measures that control-oriented people gravitate towards, and it feels like progress to them. To be more DevOps, don’t try to fight them on this: “Don’t fight stupid, make more awesome”.
One initial thing that changed, and that started the real progress, was aligning dev and ops in a way that prevented local optimization: putting devs on call for their own software. This started to shift ops from being the people who just dealt with all the problems to people who were experts on all the services that allow the software to run. Ops started to become tier 2, the escalation point for devs. The way to get there was to offer devs deployment options and permissions if they passed some training and were willing to be on call. Initially this caused a fair amount of chaos. Devs had a load of pagers and got messages that confused the heck out of them. There was pain and frustration. This pain and frustration, and the fact that devs were now playing with tools in actual production environments, really changed the culture quickly.
Through trial and error, top-down fiat, audits, and every carrot-and-stick approach, the formula for this class of organizational change was developed. This is what Jesse uses even today to accomplish these changes, and it is what we will concern ourselves with for the rest of this talk.
- Start small and build on trust and safety.
- Create champions
- Use metrics to build confidence
- Celebrate success
- Exploit compelling events
Start Small and Build on Trust and Safety
We tend to want to take on the entire org up front. We want to throw everything out, starting at the top and working down. This simply does not work; Jesse failed multiple times before realizing that this was a failing strategy. Continuous deployment, for example, seems like something you want to roll out widely. Instead you should start with a small, motivated team and build some success there.
Another thing to consider when starting small and building trust is that, when introducing disruptive change in an organization, you should lead with questions to garner buy-in instead of just telling people that you have the solution and what they should do. Don’t even use the word DevOps; just focus on the problems and get permission to start changing them.
You have to make the experiment of changing things safe. Jesse tells people that he will take 100% of the blame if things go wrong, in exchange for the space needed to make the desired changes. Creating safety is critical in pushing through organizational change. Crucial Conversations by Kerry Patterson is a book that covers this really well, and Jesse thinks it is one of the most critical books to read if you want to create organizational change.
Create Champions
“You can accomplish anything you want so long as you don’t require credit or compensation”. It is amazing what you can accomplish when you give away the part of yourself that requires recognition. You must shine the spotlight on those in your organization who get it. Get the people who recognize the need, and who act on what you are pushing forward, to talk. Make your boss a champion; this is critical. It is really important that your boss can explain what you are doing and why, or can at least provide air cover for you.
The second part of this is to give people status; special status. SREs (site reliability engineers) walk around Google in leather bomber jackets. They get hazard pay, they have special parties, and they are considered to have elevated status around the organization. When you find your champions, do something that makes them stand out. Wikis are quite powerful for this. Write down and explain in very powerful language what these champions do.
At Amazon Jesse created this thing called the call leader program. They trained people to handle high impact events. After a while there evolved a pressure to join. Eventually you become the person that people have to go to in order to get a certain status which gives you personally more organizational power – not the point but helpful in furthering the change you want to further.
Use Metrics to Build Confidence
Get as many metrics as you can. Begin to look through them for KPIs (key performance indicators). These are the things you will use to prove your case. What you are looking for is a story, and a set of metrics to prove you have one. John Allspaw talks about things like MTTR, mean time to recover. These are great for storytelling. You want to capture metrics early and use them to tell your story. “Having devs on call will be a great thing for us, and oh, by the way, here is the data that proves it!”
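As a rough illustration of turning incident records into an MTTR figure, here is a minimal Python sketch. It is not from the talk: the function names (`mean_time_to_recover`, `percent_change`) and the incident timestamps are made up for the example, and it assumes each incident is recorded as a pair of detection and recovery times.

```python
from datetime import datetime, timedelta


def mean_time_to_recover(incidents):
    """Average recovery duration over (detected_at, recovered_at) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    durations = [recovered - detected for detected, recovered in incidents]
    return sum(durations, timedelta()) / len(durations)


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent (negative = faster)."""
    return (after - before) / before * 100.0


# Hypothetical data: two incidents before the change, two after it.
before = [
    (datetime(2011, 1, 3, 9, 0), datetime(2011, 1, 3, 10, 0)),    # 60 minutes
    (datetime(2011, 1, 17, 14, 0), datetime(2011, 1, 17, 16, 0)),  # 120 minutes
]
after = [
    (datetime(2011, 3, 2, 9, 0), datetime(2011, 3, 2, 9, 45)),     # 45 minutes
    (datetime(2011, 3, 20, 14, 0), datetime(2011, 3, 20, 15, 48)),  # 108 minutes
]

mttr_before = mean_time_to_recover(before)
mttr_after = mean_time_to_recover(after)
change = percent_change(mttr_before, mttr_after)

print(f"MTTR before: {mttr_before}, after: {mttr_after}, change: {change:.1f}%")
```

The point of keeping the calculation this simple is that the same two numbers, before and after, can be captured early and shown to anyone; how incidents are actually detected and timestamped will differ from one organization to the next.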
Make sure that you are good at telling your story with the data that you capture. Jesse recommends that everyone watch Hans Rosling's TED talk on how to tell a story with data. This can be used for inspiration. Ideally, have your champions tell your story.
Celebrate Success
This comes back to “you can accomplish anything you want so long as you don’t require credit or compensation”. Create moments in time where you celebrate the success of the change you created. Have parties when you reduce MTTR by 15 percent. Give people a moment where they recognize the change they created and that the change was good. This gives them a point in time to look back on and judge progress against. This is of critical importance.
Exploit Compelling Events
Compelling events are company-wide issues; whether they are big or small does not actually matter. They are the events that create cultural permission to make important change. An example is the executive mandate toward cloud computing. This is a compelling event that allows you to make a whole bunch of procedural changes. Big outages are compelling events too; they give you cultural permission to make significant change that would be impossible in normal times.
When you have a compelling event, you don’t encounter resistance but, instead, permission to make large changes. If you don’t have such an event, then create it. Jesse had something called “game day” at Amazon, which meant creating outages to test failure recovery. Non-recovery became a compelling event. Big deployment pushes are also examples of compelling events. If you are in the middle of a serious problem with a process, it is hard to propose chucking it out in flight, but if you offer to own the postmortem process you can direct it toward the change you want to make. Even then, start small, as indicated by the first point in this list.
The next time you want to create change in your org, particularly DevOps change, keep in mind:
- Start small and build on trust and safety.
- Create champions
- Use metrics to build confidence
- Celebrate success
- Exploit compelling events
What are your thoughts on DevOps culture hacks? Anything to add to the list?