DevOps Best Practices: Break Down Silos, Avoid Drift and Optimize for Flow

By: John J Balena on May 16, 2014

In organizations trying to achieve agility with stability and quality, silos (of teams, tools, and processes) create major stumbling blocks to managing configuration drift between pre-production and production systems. Breaking down those silos, improving visibility into configuration drift and facilitating collaboration for improved flow are the keys to overcoming this challenge.


The challenge of staying in sync when both sides are moving

Production environments aren’t static. Whether you are applying patches for improved security, optimizing settings for availability and performance, or making changes for a technology refresh, production systems are continuously evolving. As new releases are developed, tested, and staged, pre-production environments are continuously evolving too. Ideally you want to develop with production in mind, but that gets harder and harder when both sides are changing and different teams are responsible for different environments, each living in its own world (e.g., the common split between QA, release, and production operations teams).

Even within pre-production, staying in sync is a significant challenge. For example, because applications ride on middleware and middleware settings are unique to each application, keeping systems in sync from QA to staging is difficult.
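
To make the comparison concrete, here is a minimal sketch of an environment-to-environment diff, assuming each environment’s middleware settings can be exported as flat JSON key/value files (the file names and keys below are hypothetical):

```python
# Minimal sketch: diff middleware settings between QA and staging.
# Assumes each environment's settings are exported as flat JSON
# key/value files; the file names and keys are hypothetical.
import json

def load_settings(path):
    """Load an environment's exported middleware settings."""
    with open(path) as f:
        return json.load(f)

def diff_settings(qa, staging):
    """Return settings that differ or exist in only one environment."""
    drift = {}
    for key in sorted(set(qa) | set(staging)):
        qa_val = qa.get(key, "<missing>")
        st_val = staging.get(key, "<missing>")
        if qa_val != st_val:
            drift[key] = (qa_val, st_val)
    return drift

if __name__ == "__main__":
    qa = load_settings("qa_middleware.json")            # hypothetical export
    staging = load_settings("staging_middleware.json")  # hypothetical export
    for key, (qa_val, st_val) in diff_settings(qa, staging).items():
        print(f"{key}: qa={qa_val} staging={st_val}")
```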

QA and staging environments are regularly reconfigured to support multiple releases that are in process at the same time. A costly workaround, often employed to minimize changes, is to “protect” QA environments for a particular application or release, which leads to low utilization and bottlenecks. The rapid proliferation of mobile applications that front-end legacy systems magnifies the challenge by adding another dependency layer.

Disconnected teams, information, and perspectives (those pesky silos)

Why are these challenges so hard to overcome if we know the issues? Many IT teams operate in their “own worlds” with tailor-made but disconnected tools and information resources. Silos evolved because they permit the specialization required to deal with the immense, growing complexity IT deals with every day, but they have now become the biggest impediment to agility and quality.

For example, the operations team has access to systems management alerts that most others don’t, engineers have access to build systems and scripts that others don’t, developers have access to functional requirements and stories that others don’t, and change approvals are stored in a system of record that only certain people (often managers) have access to – just to name a few silos.

Coordinating work and sharing information across these silos of teams often equates to lots of meetings, conference calls, and email threads.  This is not the most efficient way to get work done and it’s not particularly effective in sharing information to keep environments in sync.

“Didn’t you see my email from yesterday with the attachment listing all the middleware settings required for today’s test?”  “Weren’t you in the meeting three weeks ago when I said a Unix server was probably going to be needed in addition to a Windows machine?”  “I told so and so in an instant message on Friday to run this script before he went out on PTO, but he must have forgotten. Who else has root permission to run it while he’s out?”

Sadly, these kinds of statements are the norm, not the exception, in most IT shops.

The reality is that coordination of all the steps and handoffs between application owners, development, QA, release and production functions is inefficient. Instead, I recommend thinking about the software development and deployment lifecycle as an “uber change” and “uber collaboration” process that must be viewed and managed end to end: from business requirements to development, QA, staging and production. All of the different silos of teams must have a way to work together in a unified way, with access to critical information from upstream and downstream teams, to ensure a smooth lifecycle from end to end.

Automation ≠ DevOps

The heart of DevOps is collaboration and managing change, yet the conversation often seems to start with automation. The fallacy here is the belief that we can eliminate the need to collaborate and manage change by removing people from the process and relying on the machine. This out-of-balance perspective between the machine and people creates more problems than it solves.

Automation tools, while important and necessary, are not all that’s needed to truly solve these synchronization and deployment issues. Moreover, expecting environments to be changed only by automation tools, and never by people during the firefighting of incidents, is unrealistic. You need your people armed with intelligence and tribal knowledge to drive your automation, but all too often we rely on individuals with disconnected perspectives and limited awareness, so things get missed or overlooked.

Automation does what it’s supposed to do, but it makes a critical assumption: that what’s being changed is configured the way it’s supposed to be configured. If drift has occurred because critical settings were modified to remediate an incident or compliance issue without the automation engineer being aware of the change, the automated deployment will result in you hearing a loud “boom!”
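
One way to make that assumption explicit rather than implicit is to gate every deployment on a drift check. Here is a minimal sketch; the baseline file, the live-settings lookup and the settings themselves are hypothetical placeholders:

```python
# Minimal sketch: gate a deployment on a drift check against a recorded
# baseline. The baseline file, the live-settings lookup and the settings
# themselves are hypothetical placeholders.
import json
import sys

def gather_live_settings():
    """Placeholder: in practice, query the target systems for their
    current configuration (via config management, APIs, etc.)."""
    return {"max_connections": "512", "tls": "enabled"}

def main():
    with open("baseline.json") as f:  # hypothetical recorded baseline
        baseline = json.load(f)
    live = gather_live_settings()
    drifted = {key: (baseline.get(key), live.get(key))
               for key in set(baseline) | set(live)
               if baseline.get(key) != live.get(key)}
    if drifted:
        for key, (expected, actual) in drifted.items():
            print(f"DRIFT {key}: expected={expected} actual={actual}")
        sys.exit("Aborting deployment: environment has drifted from baseline.")
    print("No drift detected; proceeding with deployment.")

if __name__ == "__main__":
    main()
```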

As one popular DevOps Twitter handle put it, “To make a configuration error is human – to do it across a thousand servers is DevOps.” While that may seem like an extreme statement designed for comedic effect, it isn’t extreme to say that thousands of deployments go south every day because systems aren’t configured the way we expect them to be, given how many ways the complex software stacks that multi-tiered applications depend on are touched. Automation alone doesn’t solve the DevOps challenge.

Negative consequences for the business

Of course, the issues I’m raising here aren’t just IT issues; they are real business issues. Time to market is typically the number one impacted area. For example, if a two-week development sprint is used to address a business-critical need, but it then takes weeks and weeks to get the sprint through the quality cycle and into production due to resource constraints and configuration issues, what have we achieved?

Stability and quality are the next big area of impact. When change is rapidly introduced through new releases and unexpected drift occurs in environments, performance and availability suffer. It is one thing for IT to be perceived as slow and unresponsive to the business, but it is quite another when IT is perceived as slow and quality is poor, resulting in business impact.

Just when IT thinks it can’t get any worse, the business asks a presumably simple question: “What happened?” IT’s typical response: “Let me get back to you on that.” And then begins a scramble of hours, days or weeks as IT’s best people desperately try to figure out what went wrong. This is obviously not the best confidence builder for the business.

Compliance is typically the other casualty: there is often very little awareness of operational policies on the development side of the house, so deployment packages and instructions aren’t sensitive to compliance issues. Building and testing an application without full awareness of policy governance across the whole multi-tier stack is bound to introduce delays and errors when attempting to deploy in production.
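
One way to give the development side that awareness is to express operational policies as code and run them against deployment configurations while still in pre-production. A minimal sketch, with hypothetical rules and configuration keys:

```python
# Minimal sketch: express operational policies as code and check a
# deployment configuration against them before it leaves pre-production.
# The rules and configuration keys are hypothetical examples.
POLICIES = [
    (lambda cfg: cfg.get("tls") == "enabled",
     "TLS must be enabled on every tier"),
    (lambda cfg: all(port not in (21, 23) for port in cfg.get("open_ports", [])),
     "FTP/Telnet ports must not be exposed"),
]

def check_policies(config):
    """Return a human-readable message for every violated policy."""
    return [message for check, message in POLICIES if not check(config)]

deploy_config = {"tls": "disabled", "open_ports": [443, 23]}  # example input
for violation in check_policies(deploy_config):
    print("POLICY VIOLATION:", violation)
```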

Visibility, configuration transparency and collaboration: a better way!

  1. Understand drift for the entire application service. Rather than everyone having their own disjointed perspectives on databases, app middleware, network or firewall, and server configurations, give team members an understanding from the business service (or application) perspective, with direct visibility into the differences between how things should be configured and how they are actually configured across the entire multi-tier application stack (a minimal sketch of this rollup follows this list). This enables you to create smarter automations and initiate change processes to bring configurations back in line when necessary, rather than waiting for a failed deployment to occur, identifying the root cause, and correcting it while the business waits.
  2. Optimize for flow, not individual processes. As noted earlier, making one process faster (e.g., development) without considering its impact on QA, staging and production will generate new constraints downstream. Instead, view DevOps as one continuous process, from goals and requirements all the way through to deployment into production, and define success for the team in those terms. When you optimize for flow instead of silos, you will be amazed at how much more teams share and how much better they collaborate.
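
Here is the drift rollup sketch mentioned in point 1: a minimal example that compares desired versus actual configuration tier by tier and reports every mismatch from the application-service perspective (tier names and settings are illustrative):

```python
# Minimal sketch: roll up desired-vs-actual configuration tier by tier
# and report drift from the application-service perspective. Tier names
# and settings are illustrative.
DESIRED = {
    "database":   {"version": "12.4", "max_connections": "512"},
    "middleware": {"heap_mb": "4096", "tls": "enabled"},
    "web":        {"workers": "8"},
}

ACTUAL = {
    "database":   {"version": "12.4", "max_connections": "256"},  # drifted
    "middleware": {"heap_mb": "4096", "tls": "enabled"},
    "web":        {"workers": "8"},
}

def service_drift(desired, actual):
    """Yield (tier, setting, desired_value, actual_value) per mismatch."""
    for tier, settings in desired.items():
        for key, want in settings.items():
            have = actual.get(tier, {}).get(key)
            if have != want:
                yield tier, key, want, have

for tier, key, want, have in service_drift(DESIRED, ACTUAL):
    print(f"[{tier}] {key}: desired={want} actual={have}")
```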

The benefits of breaking down silos through better visibility, understanding drift and collaboration are significant:

  • faster time to market;
  • quicker ability to react to competitive pressures;
  • deployment with fewer errors;
  • better compliance with policies; and
  • greater productivity for everyone.

To sum it up: if you break down silos, identify and resolve drift, and manage for flow (while avoiding islands of automation, information, and people), you can unlock agility, maintain stability and ensure the quality your business demands.

Filed Under: Features Tagged With: best practices, drift, flow, process, silos
