DevOps Needs Collaboration and a Safe Place for Success

When it comes to DevOps adoption, “[many organizations] tell a similar story,” said Scott Van Kalken, systems engineer at F5 Networks. It takes more time and effort to develop the right culture than you probably expect.

Van Kalken has a long history in tech, including development and operations roles, and has spent several years practicing DevOps. He is part of the open source and meetup communities, and organizes some of the largest DevOps events in Australia.

“DevOps typically starts in a corner of the organization, where people want to deliver high-quality software really quickly,” said Van Kalken.

When that team gets good results, the organization decides to scale up its DevOps efforts, but tradition and silos get in the way.

“At least two things are needed to overcome those obstacles and achieve DevOps success,” suggested Van Kalken.

Collaboration and Safe Places

The first thing you need to achieve DevOps success is a collaborative team environment. One of the central ideas of DevOps is that you don’t throw software over the wall between functions, such as development and testing, so everyone needs to understand and accept that the whole team is on the same side.

The second is a safe place to fail, along with the realization that failure is just an iterative step on the path to implementing the best ideas.

A financial services company that Van Kalken works with was adopting GitOps but needed to convince the service management team this was the right move.

“Initially, it was quite challenging because service management thought it [GitOps] was uncontrolled chaos, even though that isn’t the case,” he said.

The service management team’s support was won through a series of workshops showing that the underlying processes were actually the same, even though the tools used to implement them were changing.

Service management is still the final gatekeeper, but it now approves releases in the repository rather than on paper.

Maturity

“There is a maturity in accepting change does involve risk,” said Van Kalken.

Practices such as GitOps provide a quick and easy way of reverting to a previous release if something goes wrong. So the “safe place to fail” idea can be implemented in a way that protects the organization in a practical sense as well as staff members in a psychological (and career-protecting) sense.

Another way DevOps can protect the organization is that frequent releases generally involve relatively minor changes, which in turn have a smaller blast radius if something goes wrong. Again, this is about having the maturity to recognize risk and deal with it, rather than trying to avoid it in the first place.

A sometimes overlooked aspect of increasing the release cadence is the effect on users, who are being asked to cope with frequent changes.

Van Kalken pointed out that increasing the frequency does not necessarily mean multiple releases a day. It could mean going from two releases a year to three or four, if that’s what suits the organization.

Not all changes have a direct impact on users. Some are under the hood, improving performance or addressing rarely encountered bugs. But when a change alters how users interact with the system, he suggested canary deployments as a way of checking that the new approach is acceptable to a larger pool of users than those brought into the development process, before it is released to the entire user population.

“This approach also has a place in DevSecOps as another way of limiting the blast radius,” he said.
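As a rough illustration of how a canary deployment can limit exposure, the sketch below routes a fixed percentage of users to the new release by hashing their user ID. It is not a description of any particular tool or of the approach Van Kalken's customers use; the function names, the salt string and the percentages are all assumptions made for the example.

```python
import hashlib


def in_canary(user_id: str, percent: int, salt: str = "new-release") -> bool:
    """Return True if this user falls inside the canary cohort.

    `salt` is a hypothetical label for the release being trialled, so that
    different rollouts do not always pick the same users.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Map the hash to a stable bucket in the range 0..99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


def choose_release(user_id: str, percent: int) -> str:
    """Pick which release should serve this user."""
    return "canary" if in_canary(user_id, percent) else "stable"


if __name__ == "__main__":
    users = [f"user-{n}" for n in range(10)]
    for user in users:
        print(user, choose_release(user, percent=20))
```

Because the bucket depends only on the salt and the user ID, a given user sees the same release on every request, and raising the percentage only adds users to the canary cohort. Setting the percentage back to zero returns everyone to the stable release.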

Accepting Failure

Perhaps the biggest challenge to an organization’s culture is the adoption of chaos engineering. If you’re going to kill a container or flood a network with data to check that the wider system can cope, you absolutely need to be in a safe place to fail.

You also need to realize that everyone involved, including developers, security people and those on the business side, wants to achieve high availability, and that means designing systems that can handle (partial) failures.

It isn’t enough to do this in pre-deployment testing. Ongoing testing provides confidence that production systems will continue to cope with such failures.
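The idea of deliberately killing a component and checking that the wider system copes can be sketched as a small simulation. This is a toy model under stated assumptions, not a real chaos engineering tool: the `Replica`, `Service` and `chaos_experiment` names are hypothetical, and the "failure" is just a flag on an in-memory object rather than a killed container.

```python
import random


class Replica:
    """A stand-in for one running instance of a service."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True

    def handle(self, request: str) -> str:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class Service:
    """Tries each replica in turn and fails only if none can respond."""

    def __init__(self, replicas: list) -> None:
        self.replicas = list(replicas)

    def handle(self, request: str) -> str:
        for replica in self.replicas:
            try:
                return replica.handle(request)
            except ConnectionError:
                continue  # fail over to the next replica
        raise RuntimeError("no healthy replicas")


def chaos_experiment(service: Service, rng: random.Random, requests: int = 20) -> bool:
    """Take down one random replica and report whether requests still succeed."""
    victim = rng.choice(service.replicas)
    victim.alive = False
    try:
        for n in range(requests):
            service.handle(f"req-{n}")
        return True
    except RuntimeError:
        return False
    finally:
        victim.alive = True  # always restore the replica after the experiment


if __name__ == "__main__":
    rng = random.Random(42)
    redundant = Service([Replica("a"), Replica("b"), Replica("c")])
    single = Service([Replica("only")])
    print("redundant service survived:", chaos_experiment(redundant, rng))
    print("single-replica service survived:", chaos_experiment(single, rng))
```

Running the same experiment on a schedule, rather than once before deployment, is what gives ongoing confidence that a single failure does not take the whole service down.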

“It’s pretty cool when you get it right,” said Van Kalken. However, “your culture has to be OK with this iterative approach.”

“DevOps is really just a methodology to achieve a better outcome,” stated Van Kalken. That involves collaboration, iteration and embedding security and other considerations up front. But the wider organization needs to understand and tolerate this, and the risk that goes with it.

“Education, awareness and experience can all contribute to dispelling the traditional view that risk and failure are inherently bad,” he suggested.

Are the Metrics Appropriate?

A related point is that whether one outcome is better than another depends on the metrics adopted. “[Metrics are] probably the biggest things organizations need to change,” said Van Kalken.

For example, a customer who ordered a pizza doesn’t care whether a particular system involved in taking the order, cooking the pizza and then delivering it, achieved a certain uptime target. They just want the pizza to arrive promptly.

When that doesn’t happen, he believes a collaborative approach is needed to determine why the delivery was late, exactly what caused the problem and how to quickly roll out a fix to stop it recurring.
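To make the pizza example concrete, the sketch below measures what the customer actually experiences (whether the order arrived within a target time) instead of the uptime of any single system. The 30-minute target, the `Order` fields and the sample figures are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Order:
    order_id: str
    minutes_to_deliver: float  # time from order placed to pizza delivered


def on_time_rate(orders: list, target_minutes: float = 30.0) -> float:
    """Fraction of orders delivered within the target time."""
    if not orders:
        raise ValueError("at least one order is required")
    on_time = sum(1 for o in orders if o.minutes_to_deliver <= target_minutes)
    return on_time / len(orders)


def late_orders(orders: list, target_minutes: float = 30.0) -> list:
    """IDs of the orders that missed the target, as a starting point for review."""
    return [o.order_id for o in orders if o.minutes_to_deliver > target_minutes]


if __name__ == "__main__":
    sample = [
        Order("a", 25),
        Order("b", 31),
        Order("c", 30),
        Order("d", 50),
    ]
    print("on-time rate:", on_time_rate(sample))
    print("late orders:", late_orders(sample))
```

The list of late orders is the useful part for a collaborative review: it points the teams at specific deliveries to investigate, whichever system turns out to have caused the delay.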

When it comes to practical advice, Van Kalken emphasized the need to get people together early in a project; otherwise, there is a risk that sub-groups will demonize each other. Getting together early also produces more buy-in, and everyone accepts that they have to help with the pedaling, not just sit back and be passengers.

“If people aren’t talking, you have a bad culture,” he said. But he acknowledged it can be difficult to get past existing adversarial relationships. Breaking the ice between teams that haven’t previously engaged with each other can also be tricky.

Returning to an earlier theme, he stressed the importance of adopting an analytical, rather than punitive, approach when failure occurs. Everyone involved must know the organization accepts risk and understands the consequences.

— Stephen Withers