By Seth Thomson and Chris Read @cread given at Camp DevOps 2011
This post was live blogged by @martinjlogan so expect errors.
This talk is about how to overcome organizational hurdles and get DevOps humming in your org. This illustrates how we did it at DRW Trading.
DRW needed to adjust. The problem was that we were not exposing people to problems up front. Everyone was exposed only to their local problems and optimized only locally. We looked, and continue to look, at DevOps as our tool to change this.
Cultural lessons
[Seth is talking a bit about the lessons that were learned at DRW that can really be applied at all levels in the org.]
The first thing you need to do if you are introducing DevOps to your org is define what DevOps is to you. Gartner has an interesting definition; I am not sure it reflects our opinions, but at least they are trying to figure it out. At DRW we use the words “agile operations” and DevOps interchangeably. We are integrating IT operations with agile and lean principles: fast iterative work, embedding people on teams, and moving people as close as possible to the value they are delivering. DevOps is not a job, it is a way of working. You can have people in embedded positions using these practices as easily as you can for folks in shared teams.
The next thing you need to do is focus on the problem that you are trying to solve. This is obvious but not all that simple. Here is an example. We had a complaint from our high frequency trading folks last year saying that servers were not available fast enough. It took on average 35 days for us to get a server purchased and ready to run. Dan North and I were reading the book “The Goal”, a really good read that I highly recommend. In the book the author talks about the theory of constraints and applying lean principles to repeatable processes. We applied a technique called value stream mapping to our server delivery process. People complained that I [Seth] was a bottleneck because I had to approve all server purchases. It turned out that approval took me only 2 hours. The real problem lay elsewhere. The value stream mapping showed us where our real bottlenecks were so that we could focus on them and not waste cycles on less productive areas. We zeroed in accurately and reduced the time from 35 to 12 days.
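As a rough illustration of what a value stream map makes visible, here is a minimal Python sketch. It is not DRW's actual tooling or data: the stage names and every number except the 2-hour approval figure are made up for the example. The idea is simply to record hands-on time and waiting time per stage, then see which stage dominates the total lead time.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    work_hours: float  # time someone is actively working on the request
    wait_hours: float  # time the request sits idle in this stage

def summarize(stages):
    """Return (total lead time in hours, slowest stage, flow efficiency)."""
    total = sum(s.work_hours + s.wait_hours for s in stages)
    hands_on = sum(s.work_hours for s in stages)
    slowest = max(stages, key=lambda s: s.work_hours + s.wait_hours)
    return total, slowest, hands_on / total

# Hypothetical stages and durations, for illustration only.
stages = [
    Stage("approval", work_hours=2, wait_hours=0),
    Stage("procurement", work_hours=4, wait_hours=236),
    Stage("racking", work_hours=8, wait_hours=64),
    Stage("os_build", work_hours=6, wait_hours=18),
]

total, slowest, efficiency = summarize(stages)
print(f"lead time: {total} h, slowest stage: {slowest.name}, "
      f"flow efficiency: {efficiency:.1%}")
```

With numbers like these, the approval step that everyone blamed is a tiny slice of the lead time, and most of the elapsed time is waiting rather than work.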
The third cultural lesson, and an important one, is keep your specialists. One of the worst things that can happen is that you introduce a lot of general operators and then the network team, for example, says “wow, you totally devalued me”, and they quit. That way you lose a lot of expertise that turns out to be quite useful. Keep your specialists in the center. You want to highlight the tough problems to the specialists and leverage them for solving those problems. Introducing DevOps can actually open the floodgates for more work for the people in the center. We endeavored to distribute Unix system management to reduce the amount of work for the Unix team itself. This got people all across the org a bit closer to what was going on in this domain. What actually happened is that the Unix team was hit harder than ever. As we got people closer to the problem, demand that we had not previously seen or been able to notice increased quite a bit. This is a good problem to have because you start to understand more of what you are trying to do and you get more opportunities to innovate around it.
If you are looking at a traditional org, oftentimes these specialist teams are spending time justifying their own existence. They invent their own projects and they do things no one needs. These days at DRW we find that we have long shopping lists of deep Unix things that we actually need. The Unix specialists are now constantly working on key useful features. We are always looking for more expert Unix admins.
The last lesson learned, a painful lesson, is that “people have to buy in”. The CIO can’t just walk in and say you have to start doing DevOps. You can’t force it. We made a mistake recently and we learned from it and turned it into a success. A few months ago we were looking at source control usage. The infrastructure teams were not leveraging this stuff enough for my taste among other things. I said, we need to get these guys pairing with a software engineer. I forced it. It went along these lines: the person doing the pairing was not teaching the person they were pairing with. They were instead just focused on solving the problem of the moment. The person being paired with was not bought in to even doing the pairing in the first place. People resented this whole arrangement.
We took a hard retrospective look at this and in the end we practiced iterative agile management and changed course. I worked with Dan North, who came from a software engineering background and also had a lot of DevOps practice. A key thing about Dan is that he loves to teach and coach other people, and that was a huge help. Dan sat with folks on the networking team and got buy-in from them. He got them invested in the changes we wanted to make. The head of the networking team is now learning Python and using version control. The network team is now standing up self-service applications that are adding huge value for the rest of the organization and making us much more efficient.
Some lessons learned from the technology
Ok, so Seth has covered a lot of the cultural bits and pieces. Now I [Chris Read] will talk about the technical lessons, or at least lessons stemming from technical issues. What follows are a few examples that have reinforced some of the cultural things we have done. The first one is the story of the lost packet. This happened within the first month or 2 of me joining. We had an exchange sending out market data, through a few hops, to a server that every now and again lost market data. We knew this because we could see gaps in the sequence numbers.
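Spotting loss from sequence numbers can be sketched in a few lines of Python. This is only an illustration, not the code used at DRW, and it assumes a feed whose sequence numbers increase by exactly 1 per message; the function name `find_gaps` is made up for the example.

```python
def find_gaps(sequence_numbers):
    """Yield (first_missing, last_missing) for each hole in the stream.

    Assumes numbers normally increase by 1. Duplicates and late,
    out-of-order arrivals are ignored, so a message that shows up
    late is still reported as missing.
    """
    previous = None
    for seq in sequence_numbers:
        if previous is not None and seq > previous + 1:
            yield (previous + 1, seq - 1)
        if previous is None or seq > previous:
            previous = seq

received = [1, 2, 3, 6, 7, 10]
print(list(find_gaps(received)))  # [(4, 5), (8, 9)]
```

Running the same check at each hop along the path is one way to narrow down where the messages disappear.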
The first thing we did was check the exchange to see if it was actually mis-sequencing the data. Nope, that was not the problem. So then the dev team went down to check the server itself. The Unix team looked at the machine, the IP stack, the interfaces, etc. They declared the machine fine. Next the network guys jumped in and saw that everything was fine there. The server, however, was still missing data. So we jumped in and looked at the routers. Guess what, everything looked fine. This is where I [Chris Read] got involved. This problem is what you call the call center conundrum. People focus on small parts of the infrastructure, and with the knowledge that they have, things look fine. Luckily, in previous lives I have been a network admin and a Unix admin. I dug in and could see that the whole network up to the machine was built with high availability pairs. I dug into these pairs. The first ones looked good. I looked into more and finally got down to one little pair at the bottom, and there was a different config on one of the machines. A single-line problem. Solving this fixed it. It was only through having a holistic view of the system, and having the trust of the org to get onto all of these machines, that I was able to find the problem.
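A check for this kind of drift between the two members of a high availability pair can be sketched with Python's standard `difflib`. This is an assumption-laden illustration: the talk does not say what the offending line was, so the config lines below are invented, and `config_drift` is a hypothetical helper name.

```python
import difflib

def config_drift(primary_lines, secondary_lines):
    """Return only the lines that differ between two configs.

    Lines prefixed "- " exist only on the primary, "+ " only on the
    secondary. ndiff's "? " hint lines and unchanged lines are dropped.
    """
    return [
        line
        for line in difflib.ndiff(primary_lines, secondary_lines)
        if line.startswith(("- ", "+ "))
    ]

# Invented config content, for illustration only.
primary = ["interface eth0", "mtu 1500", "spanning-tree on"]
secondary = ["interface eth0", "mtu 9000", "spanning-tree on"]

for line in config_drift(primary, secondary):
    print(line)
```

Run across every pair in the path, a comparison like this turns a hunt through individual devices into a short list of differences to review.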
The next story is called “monitoring giants”. This also happened quite early in my dealings at DRW, and it taught me a very interesting lesson. I had been in London for 6 weeks and lots of folks were talking about monitoring. We needed more monitoring. I set up a basic Zenoss install and other such things. I came to Chicago and my goal was to show the folks here how monitoring was done, as a means to inspire them. I went to show them things about monitoring and was met with a fairly negative response. The guys perceived my work as a challenge to their domain. My whole point in putting this together was lost. I learned the lesson of starting to work with folks early on and being careful about how you present things. It was also a lesson on change. It is only in the last couple of months that I have learned how difficult change can be for a lot of people. You have to take this into account when pushing change. Another bit of this lesson is that you need to make your intentions obvious: over-communicate.
We actually think it is ok to recreate the wheel if you are going to innovate. What is not ok is to recreate it without telling the folks that currently own it. – Seth Thomson
The next lesson is about DNS. This one was quite surprising to me. It is all about unintended consequences. Our DNS services used to handle a very low number of requests. As we started introducing DevOps there was a major ramp-up in DNS requests per second. We were not actually monitoring it, though. All of a sudden people started noticing latency. People started to say “hey, why is the Internet slow?”. Network people looked at all kinds of things and then the problem seemed to solve itself. We let it go. Then a few weeks later, outage! The head of our Windows team noticed that one host was doing 112k lookups per second. Some developers had written a monitoring script that did a DNS lookup in a tight loop. We have now added all this to our monitoring suite. Because the Windows team had been taught about network monitoring and log file analysis, because they had been exposed to it, they were able to catch and fix this problem themselves.
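The kind of log analysis that surfaces a host like this can be sketched as follows. This is a minimal example, not the monitoring DRW added: it assumes the DNS query log has already been parsed into `(timestamp_in_seconds, client)` pairs, and the function name, threshold, and addresses are all hypothetical.

```python
from collections import Counter

def noisy_clients(queries, max_per_second):
    """Return {client: peak queries per second} for clients over the limit.

    `queries` is an iterable of (unix_timestamp, client) pairs.
    """
    # Count queries per client within each one-second bucket.
    per_second = Counter((client, int(ts)) for ts, client in queries)
    peaks = {}
    for (client, _second), count in per_second.items():
        if count > peaks.get(client, 0):
            peaks[client] = count
    return {c: peak for c, peak in peaks.items() if peak > max_per_second}

# Made-up sample: one client bursts to 5 queries within a single second.
sample = [(100.1, "10.0.0.5")] * 5 + [(100.2, "10.0.0.9")] + [(101.0, "10.0.0.5")] * 2
print(noisy_clients(sample, max_per_second=3))  # {'10.0.0.5': 5}
```

A script stuck in a tight lookup loop would stand out immediately in a report like this, long before users start asking why the Internet is slow.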
Quick summary of the lessons
Communication is key. You must spend time with the people you are asking to change the way they are working.
Get buy-in, don’t push. As soon as you push something onto someone, they are going to push back. Something will break, someone will get hurt. You need to develop a pull: they must pull change from you, they must want it.
Keep iterating. Keep getting better and make room for failure. If people are afraid of mistakes they won’t iterate.
Finally, change is hard. Change is hard, but it is the only constant. As you are developing you will constantly change. Make sure that your organization and your people are geared toward healthy attitudes about change.
Question: Can you talk a little bit more about buy-in?
Answer: One of the most important things about getting buy-in is to prove your changes out for them. Try things on a smaller scale, prototypes of process or technology, get a success, and hold it up as an example of why it should be scaled out further.