This post is based on a new talk by @jesserobbins at QConSF 2012 (slides). Jesse is a firefighter, the former Master of Disaster at Amazon, and the Founding CEO of Opscode, the company behind Chef.
photo credit: John Keatley
Operations at web scale is the ability to consistently create and deploy reliable software to an unreliable platform that scales horizontally. Jesse created the Velocity conference to explore how to do this, learning from companies that do it well. Google, Amazon, Microsoft, and Yahoo built their own automation and deployment tools. When Jesse left Amazon he was stunned at the lack of mature tooling elsewhere. Many companies considered their tools to be “secret sauce” that gave them a competitive advantage. Opscode was founded to provide Cloud infrastructure automation. Jesse’s experience helping other companies down this road led to a set of culture hacks that will help you adopt Continuous Delivery.
Continuous Delivery is the end state of thinking about and approaching a wide array of problems in a new way. Big changes to software systems that build up over long periods of time suck. The longer the gap between releases and the more code changes pile up, the harder the resulting breakage is to track down. The Continuous Deployment way means small amounts of code deployed frequently. Awesome in theory, but it requires organizational change. The effort is worth it, however, as the benefits include faster time to value, higher availability, happier teams, and more cool stuff. Given this, it is surprising that Continuous Delivery has taken so long to be accepted.
Teams that do Continuous Delivery are much happier. Seeing your code live is very gratifying. You have the freedom to experiment with new things because you aren’t stuck dealing with large releases and the challenge of getting everything right in one go.
Learning about Continuous Delivery is very exciting, but the reality is that back at the office things are challenging. Organizational change is hard. Let’s consider a roadmap for cultural change. The first problem is “it worked fine in test, it’s Ops’ problem now.”
Ops likes to punish dev for this.
Tools are not enough (even really great tools like Chef!). In order to succeed you have to convince people that you can be trusted and that you want to work together. The reason is well understood; see Conway’s law, for example. Teams need to work together continuously, not just at deploy time.
The choice: discourage change in the interest of stability, or allow change to happen as often as it needs to. Asking people which one they would choose works better than just making a statement.
Common Attributes of Web Scale Cultures
- Infrastructure as Code. This is the most important entry point, providing full-stack automation. Commodity hardware can be used with this approach, as reliability is provided in the software stack. Datacenters must have APIs; you can’t rely on humans to take action. All services including things like DNS have to follow this model. Infrastructure becomes a product, and the app dev team is the customer.
- Applications as Services. This means SOA with things like loose coupling and versioned APIs. You must also design for failure, and this is where a lot of teams struggle. Database/storage abstraction is important as well. Complexity is pushed up the stack. Deep instrumentation is critical for both infrastructure and apps.
- Dev / Ops as Teams. Shared metrics and monitoring, incident management. Sometimes it is good to rotate devs through the on-call duties so everyone gets experience. Tight integration means a set of tools shared by all of the teams. This leads to Continuous Integration, which leads to Continuous Delivery. The Site Reliability Engineer role is important in this model so you have people who understand the system from top to bottom. Finally, thorough testing is important, e.g. GameDay.
None of this is new; consider Theory of Constraints, Lean/JIT, Six Sigma, Toyota Production System, Agile, etc. You need to recognize, however, that it takes a cultural change to make it work. Every org will say “we can’t do it that way because…” They’re trying to think about where they are and extrapolate to this new state. It’s like an elephant (Enterprises) trying to fly. You have to give them a way to think about making incremental, evolutionary changes toward the goal.
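To make the "Infrastructure as Code" attribute above a little more concrete, here is a minimal sketch of the underlying idea: the desired state of a machine is declared as data, and an automated routine converges the machine toward it without a human taking action. This is not Chef's API. The `Package` type, the `converge` function, and the in-memory `node` dictionary are all hypothetical stand-ins chosen for illustration, and the package versions are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Package:
    """Desired state for one package (hypothetical resource type)."""
    name: str
    version: str


def converge(desired, installed):
    """Bring `installed` (a name -> version dict) in line with `desired`.

    Returns the list of actions taken. When nothing has drifted, the
    list is empty, so the routine is safe to run over and over.
    """
    actions = []
    for pkg in desired:
        current = installed.get(pkg.name)
        if current == pkg.version:
            continue  # already in the desired state, nothing to do
        verb = "install" if current is None else "upgrade"
        installed[pkg.name] = pkg.version  # stand-in for a real API call
        actions.append(f"{verb} {pkg.name} {pkg.version}")
    return actions


# Assumed example data: the desired state and one simulated machine.
desired = [Package("nginx", "1.2.4"), Package("ntp", "4.2.6")]
node = {"nginx": "1.0.15"}

first_run = converge(desired, node)
second_run = converge(desired, node)
print(first_run)   # actions needed to reach the desired state
print(second_run)  # nothing left to do on the second pass
```

The design choice worth noticing is that the description says what the machine should look like, not which commands to run. That is what lets the same code run repeatedly across a large fleet of commodity machines.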
Cultural change takes a long time. This is the hardest thing. Jesse’s Rule: Don’t Fight Stupid, Make More Awesome! Pick your battles and do these 5 things:
- Start small and build on trust and safety. The machinery will resist you if you try sweeping change.
  - Small change isn’t a threat and it’s easy to ignore. Too big a change will meet resistance, so start small.
  - Just call it an experiment. Don’t present the change as an all-or-nothing commitment.
  - Get executive sponsors, starting with your boss.
- Create champions. Attack the least contentious thing first.
  - Give everyone else the credit. When people around you succeed, celebrate it.
  - Give “Special Status”. This is magic. Special badges, SRE bomber jackets at Google… these things are cool and you’re giving people something they want.
  - Have people with “Special Status” talk about the new awesome. Make them evangelists and create mentor programs to build an internal structure of advocates.
- Use metrics to build confidence. Create something you can point to that gets people excited. Time to value is a good one.
  - Find KPIs that support change. Hacking metrics is important to drive change. Having KPIs around things like time to value is compelling. Relate shipping code to revenue.
  - Track and use KPIs ruthlessly. First you show value, then you show what it costs the laggards not to make the change. This is the carrot and stick approach.
  - Tell your story with data. Hans Rosling has a great TED talk on this topic. This is the most powerful hack. Include stories about what your competitors are doing. There’s no other way to make this work.
- Celebrate successes. This builds excitement, even for trivial accomplishments. The point is to create arbitrary milestones where you can look back and see progress.
  - Tell a powerful story.
  - Always be positive about people and how they overcame a problem. This is especially important with Ops people who tend to be grumpy.
  - Never focus on the people who created the problem. Focus instead on the problem itself.
  - Leave room for people to come to your side. Otherwise you’ll make enemies. Don’t fight stupid.
- Exploit Compelling Events. When something breaks it is a chance to do something different. “Currency to Make Change” is made available, as John Allspaw puts it.
  - Just wait, one will come. Things are never stable. Exploit challenges like compliance or moving to Cloud.
  - Don’t say “I told you so”, instead ask “what do we do now?” Make it safe for people to decide to change.
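As a rough illustration of the "time to value" KPI mentioned in the list above, the sketch below computes how long each change took to get from commit to production and reports the median. The talk does not define the metric precisely, so treating commit-to-deploy lead time as a proxy for time to value is an assumption here, and the timestamps are invented sample data.

```python
from datetime import datetime
from statistics import median

# (commit time, deploy time) pairs. The values are made up for illustration;
# in practice they would come from version control and deployment logs.
changes = [
    (datetime(2012, 11, 5, 9, 0), datetime(2012, 11, 5, 15, 0)),
    (datetime(2012, 11, 6, 10, 0), datetime(2012, 11, 8, 10, 0)),
    (datetime(2012, 11, 7, 14, 0), datetime(2012, 11, 7, 16, 0)),
]


def lead_times_hours(changes):
    """Return the commit-to-deploy time of each change, in hours."""
    return [
        (deployed - committed).total_seconds() / 3600
        for committed, deployed in changes
    ]


hours = lead_times_hours(changes)
median_hours = median(hours)
print(f"median lead time: {median_hours:.1f} hours")
```

The median is used rather than the mean so that one slow release does not dominate the number. Tracking it week over week gives you a simple trend line to point to when telling your story with data.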
Remember, don’t fight stupid, make more awesome!