We recently spoke with Nigel Kersten, CIO at Puppet Labs, about high-performance IT and the trends he sees in continuous integration. Nigel joined Puppet Labs from Google HQ, where he was “Puppet Master” and responsible for the design and implementation of one of the world’s largest Puppet deployments. He is also an experienced Linux and Mac deployment sysadmin.
In case you aren’t familiar, Puppet Labs provides IT automation and configuration management software, used by everyone from start-ups to very large enterprises. Here is our discussion with Nigel, in which we talk about the recent Puppet Labs DevOps report and high-performance organizations.
DevOps.com: In the 2014 State of DevOps Report, there seems to be a relationship between high-performance IT and high-performance companies. Can you tell yet whether that’s causation or just correlation? Is it high-performance companies driving high-performance IT, or is it high-performance IT driving the high-performance company?
Nigel: At the moment, it’s a correlation in terms of the amount of data we have. What the data actually show us from the survey is that adopting DevOps practices leads to a higher performing IT organization. This means you’re releasing code more frequently, your recovery time for failures is much, much faster, and all those sorts of things that we know go into what makes a great IT organization.
We’ve actually been able to correlate that those organizations with higher performing IT teams are outperforming their competitors in the market, and they’re exceeding their own corporate goals. Now, at the moment, we’re just being a little bit tentative and calling that a correlation rather than causation.
To be clear, statistically we have a really high degree of confidence that it’s actually a causal relationship, not just a correlation. But we’re not quite ready to make that claim yet.
DevOps.com: We wonder how you’re seeing continuous integration change the relationships among security teams, application teams, operations, and QA. What are you seeing, broadly, when it comes to how these relationships are changing from the way they were several years ago?
Nigel: I think that a couple of things are changing those relationships. One has been the slow death, I would say, in some ways, of waterfall approaches in favor of agile. That’s not to instantly label agile as good and waterfall as bad, because I actually think there are some development environments and problem spaces where a really strict waterfall-based approach is the right way to do things.
But we’re definitely seeing a mass of organizations that are moving to a more agile setup. I think what this has meant is that the demands on those continuous integration systems and QA teams have changed quite a lot. Developers expect much more self-service, and they expect to be able to operate with a much tighter feedback loop. They don’t want to have to commit code and wait till the next day – until a 12-hour set of tests has finished running – before they find out whether their code is any good.
Prior to this, I think what we saw for a couple of years was developers starting a sort of shadow QA on their own using Vagrant, using local virtualization. But definitely in the last year or two, we’re seeing a trend where QA teams are being forced to provide faster feedback and more self-service to the development organization.
DevOps.com: That must give heartburn to those who are used to having strict, steady, sequential processes in place.
Nigel: It’s interesting, isn’t it? Some of this is a little speculative, because a lot of security folks tend to be very forward-thinking people, though I don’t think that’s necessarily true of everyone in the security industry, just like in any industry. But I think it’s forcing people to look at their current security practices and work out what is just security theater and what actually improves the security of the product, or whatever it is that they’re shipping.
So how can we automate the things that actually improve security, so that security doesn’t become the bottleneck in these fast feedback loops? I agree that it can definitely give security people heartburn, especially people who are really used to rigorous code change approvals and the sort of segregation of duties that, as you put it, you described in one of your pieces.
What I see is a common compromise here, and I actually think this is a really effective way to do it. For certain environments, you really want to maintain control over what goes into production and have a strict change approval process. But it should be a fast and appropriately lightweight process.
We’re seeing people giving developers the ability to run tests, change the testing environment, and have a significant degree of autonomy over the whole stack in pre-production continuous integration. People are focused on giving developers the ability to spin up classes of machines, modify the tests that are running on them, and change all of those things so that they get a really fast feedback loop. But then when it comes to the actual merge to production, that’s something your existing change control process can take care of.
This means that we see a lot of people segregating their infrastructure into dev, test, and prod, as far as continuous integration goes.
And the developers do a lot of dev work locally on VMs. When it comes to test, that’s often on shared infrastructure that, using something like Puppet, can be configured to be exactly the same as the production system. And that’s good enough for the developers. They’re getting their tight feedback loop. They’re testing code in pretty much the same way as it’s going to be tested in the official continuous integration pipeline, and yet you can still maintain a tight change control process around the merge from test to production.
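To make that pattern concrete, here is a minimal sketch of the kind of Puppet profile that might be applied unchanged in both the test and production environments (the class, package, and service names are hypothetical). Because both environments apply the same code, typically by tracking different branches of the same control repository, promoting a change reduces to a single gated merge from the test branch to the production branch.

```puppet
# Illustrative sketch only: a hypothetical profile class applied identically
# in the 'test' and 'production' Puppet environments. Promoting a change is
# then just a controlled merge between the branches backing those environments.
class profile::app_server {
  # Install the application package (hypothetical name).
  package { 'myapp':
    ensure => installed,
  }

  # Keep the application service running and enabled at boot.
  service { 'myapp':
    ensure  => running,
    enable  => true,
    require => Package['myapp'],
  }
}
```

In an arrangement like this, developers can iterate freely against the test environment, while the merge into the production branch remains the one tightly controlled step Nigel describes.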
DevOps.com: And if it works as planned like that, you would think you’d get improved code outcomes because the developers are working much closer to any defects.
Nigel: Absolutely. And I think that’s the whole promise that this sort of system is meant to provide. I think it does fall down if you don’t have a very tightly disciplined development organization, and it’s just open season on the production infrastructure. In that case, it actually becomes counterproductive, because the feedback loops become longer and longer. Then, if the test fails, you’re never quite sure if it’s due to someone else messing with the system that’s telling you whether it passed or not.
DevOps.com: It sounds like these are self-correcting problems over time, because security often uncovers “hidden” or latent problems that you don’t see until something bad happens. But if it were a full onslaught on production, then I would imagine you would have other issues with availability and performance, and related quality issues would mount up very quickly.
Nigel: Absolutely. I think this is one of the reasons why people drive a lot of this work to the cloud, and I’m speaking from experience here, because the sysops team I manage actually manages the infrastructure for the continuous integration system here at Puppet Labs. We have a pretty complicated testing matrix, and clearly security is a really big concern for us, because Puppet in many ways holds the keys to the kingdom when it’s deployed on someone’s infrastructure.
So it’s really hard for a small company to provide enough internal hardware infrastructure to allow for that sort of elastic peak consumption. I think this is one of the big drivers toward people, at the very least, using the cloud to smooth out their internal hardware requirements so that, at times of peak capacity, they can essentially burst out to the cloud.
DevOps.com: Organizations that go this route, I assume, need to get to know their own processes very well before they move from what they were doing before – more manually driven processes – to continuous integration?
Nigel: I think what we tend to see is that it ends up being on a project-by-project basis. If you’re a company that sells on-premises software or software as a service, you’ve got a reasonable amount of investment there, and it’s rare that companies can take the time to just pause the delivery of software. Usually, they need to keep making money; they need to keep shipping releases.
So, what we tend to see is that people start experimenting with green field environments. This, according to my gut, is often enabled by people moving to more service-oriented architectures, where rather than shipping large, monolithic applications, they’re shipping components that have well-defined boundaries with each other.
As that refactoring goes on and they build a new component, they might do the testing and continuous integration for that component in a fundamentally different manner, and sort of step through their code base that way.
Eventually, you have to deal with the brown field deployment. But, ideally, by that point you’ve gotten a whole bunch of huge wins from the green field continuous integration systems that make that sort of return on investment obvious.
DevOps.com: I would imagine that knowledge gained from those groups is then shared with other small workgroups that experiment, and there’s a lot of knowledge sharing and organic expansion from within.
Nigel: Yes. And with people being people, this is often where turf wars spring up. There have been a number of times when we’ve walked into organizations and talked to the QA team and the developers and the operations people – and dev and ops have started working together and can ship something much more quickly by doing things that way – and yet somehow the QA team feels slighted because it’s been somewhat bypassed by this whole process. That can be a rather complicated conversation to be in the middle of.
DevOps.com: Do you see companies making common missteps when they move toward continuous integration and deployment?
Nigel: One of the more conceptual mistakes I see people make when they’re considering it is that they confuse continuous deployment with making lots and lots of public releases, and they’re not necessarily the same thing. I often try to tell people that continuous deployment is really about enabling the ability to choose to ship something. It should be a shippable unit that shouldn’t necessarily be shipped, if that distinction makes sense.
A lot of companies will start stressing out. They’ll wonder about their existing processes and about educating their channel resellers. They’ll believe they simply can’t afford the cost of a release every two weeks, or every week. But that’s not necessarily the whole point of continuous delivery. By having artifacts that are shippable, you enable testing internally, you enable easier experimentation, and you can still choose to ship your product only once every six months if you want. But the fact that you might have 20 releases in between those public releases is important. Any of them could be shipped, whether as a bug fix for a certain customer or to enable certain types of development in the future. These are really huge wins, even if end users never actually see those specific releases.