DevOps creates value through a constant stream of deployed changes. Done right, more frequent deployments let teams continuously improve their products and services while making the effects of each change easier to manage. One way to achieve this is with continuous integration and continuous deployment (CI/CD), which reduces the risk carried by each build and lets teams release valuable features quickly and safely. But CI/CD alone is not sufficient for understanding change and its impact on services in production. The recent DEJ State of IT Performance Report found that change is the largest source of production issues: a full 76% of performance problems can eventually be traced back to changes in the environment, yet 67% of organizations can't identify which changes in their environment caused a performance issue.
In a modern IT environment, different teams are constantly integrating new features, adopting new technologies and renewing their stacks. With so many moving parts, it's impossible to fully mimic the production environment when testing new features in pre-production. By the time you've implemented and tested new code, your branch will already differ from what is running in production.
Here’s the reality we’re working in:
● Integrating testing into CI/CD pipelines will not keep every bug out of production. And some organizations don't practice true CI/CD, meaning some processes are never tested or automated in the pipeline.
● User configuration errors are common. The industry has compiled entire lists of configuration errors that brought down production.
● Change anxiety is real: Ops folks get nervous when hitting the deploy button, even for small changes.
Changes are inevitable in DevOps, but organizations can protect themselves from performance incidents and outages by integrating changes into production safely with a few simple steps.
Build Data Connections
Today, data is kept in silos, which makes it time-consuming and challenging to draw associations across it and use them to find the root causes of performance issues. DevOps teams need to connect their data in meaningful ways to troubleshoot complex issues in distributed systems. Building this foundation of connections is critical to understanding the state of a system and how its pieces interconnect and interact.
How well an application performs in production depends on a combination of code, infrastructure and operations. When an incident occurs, teams should quickly know whether the issue stems from code, infrastructure or operational activity. Understanding these connections and correlating the data cannot be done manually; humans have hit a wall in linking data silos together because we simply don't scale the way software does. Teams need a monitoring and analytics approach that connects operational data from the start and models the causal links between the data for real-time root cause diagnosis.
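As a minimal sketch of the idea (the events, field names and `deploy_id` key below are illustrative, not any particular product's schema), correlation becomes straightforward once every silo stamps its data with a shared key:

```python
from collections import defaultdict

# Sample events from three silos: logs, metrics, and change records.
# The shared `deploy_id` field is an assumption for this sketch; any
# common correlation key (trace ID, commit SHA, ticket number) works.
events = [
    {"source": "logs",    "deploy_id": "d-42", "msg": "500s spiking on /checkout"},
    {"source": "metrics", "deploy_id": "d-42", "msg": "p99 latency up 3x"},
    {"source": "changes", "deploy_id": "d-42", "msg": "rolled out checkout-svc v2.3.1"},
    {"source": "metrics", "deploy_id": "d-41", "msg": "baseline nominal"},
]

# Group by the correlation key so an engineer sees code, infrastructure
# and operational activity for one change in a single view.
by_deploy = defaultdict(list)
for event in events:
    by_deploy[event["deploy_id"]].append(event)

for deploy_id, related in by_deploy.items():
    print(deploy_id)
    for event in related:
        print(f"  [{event['source']}] {event['msg']}")
```

The point is not the grouping itself but the discipline behind it: if data is tagged with common keys at the moment it is emitted, causal links can be modeled automatically instead of reconstructed by hand during an incident.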
Understand Change Impact to Avoid Outages
How well can your DevOps team identify the key changes made to your production environment? If the answer is "not very well," you're missing critical data that speeds up troubleshooting. Without visibility into changes and how they affect production, operations staff cannot identify the cause, the very thing SREs care most about.
Changes are made to code or to cloud infrastructure, either directly or through infrastructure-as-code (IaC) templates, for a variety of reasons. The risk of misconfiguration and performance issues grows when these operational activities are not continuously monitored and correlated across the stack. One lightweight way to start is to emit every change, whatever its origin, into a single feed, as in the sketch below.
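Here is a hedged sketch of what that could look like; the endpoint URL, payload fields and helper function are hypothetical, standing in for whatever change-tracking system you actually use:

```python
import json
import time
import urllib.request

CHANGE_FEED_URL = "https://changes.example.com/events"  # hypothetical endpoint

def record_change(source: str, component: str, actor: str, summary: str) -> None:
    """Post one change event to the shared change feed.

    `source` distinguishes the stream: "ci-cd" for code deploys,
    "iac" for Terraform/CloudFormation applies, "manual" for console edits.
    """
    event = {
        "timestamp": time.time(),
        "source": source,
        "component": component,
        "actor": actor,
        "summary": summary,
    }
    req = urllib.request.Request(
        CHANGE_FEED_URL,
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)

# Called from a deploy script and from an IaC pipeline's post-apply hook:
# record_change("ci-cd", "checkout-svc", "deploy-bot", "v2.3.1 rolled out")
# record_change("iac", "payments-db", "terraform", "instance class upgraded")
```

Whether the hook fires from a CI/CD pipeline, an IaC apply or a manual console edit, every change lands in the same stream and can be correlated with performance data later.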
Metrics don't tell the whole story. Logs don't pinpoint where to act. Simply collecting these data types leaves many companies struggling to separate what matters from what doesn't during real-time troubleshooting.
A shared "system of record" for changes across all the different streams of activity increases visibility and control over every change in real time. Teams can spot pre-incident warning signs and be aware of elevated risk when changes are made, especially during high-velocity change periods. By seeing how a code change can impact things down the line, developers can proactively troubleshoot and debug their code.
Implement Real-Time Root Cause Diagnosis
In the same DEJ report, organizations shared that 66% of MTTR is spent on the "diagnose" stage. This tracks with the 67% of teams that are unable to identify the changes in their environment that caused performance issues.
With a shared "system of record" for changes, DevOps teams can troubleshoot incidents in real time, dramatically cutting the time spent on the "diagnose" stage of MTTR. This is especially effective when teams can see all the changes and other events happening to the system in a single, searchable and filterable real-time feed; that timeline surfaces the potential causes of issues.
The best tool lets you start from an effect and work backward, chaining multiple events overlaid with observability data to determine the causes quickly. Building the causal chain from the end helps teams home in on the connected change and answer each "why" in a straightforward way. This makes it easier and faster to prioritize relevant changes and resolve performance degradations or outages with actionable context.
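A toy version of that backward walk, assuming you have a change timeline and a service dependency map (both hardcoded here purely for illustration), might look like this:

```python
from datetime import datetime, timedelta

# A toy dependency map and change timeline; in practice both would come
# from the monitoring platform. All names here are illustrative only.
depends_on = {"checkout-svc": ["payments-db", "auth-svc"]}

timeline = [
    {"time": datetime(2023, 5, 1, 14, 2), "component": "payments-db",
     "summary": "parameter group changed"},
    {"time": datetime(2023, 5, 1, 13, 10), "component": "search-svc",
     "summary": "index rebuilt"},
]

def suspect_changes(effect_component, effect_time, window=timedelta(hours=2)):
    """Work backward from an observed effect: return recent changes to the
    affected component or anything it depends on, newest first."""
    scope = {effect_component, *depends_on.get(effect_component, [])}
    candidates = [
        e for e in timeline
        if e["component"] in scope and effect_time - window <= e["time"] <= effect_time
    ]
    return sorted(candidates, key=lambda e: e["time"], reverse=True)

# Latency spiked on checkout-svc at 14:30; which recent changes could explain it?
for change in suspect_changes("checkout-svc", datetime(2023, 5, 1, 14, 30)):
    print(change["time"], change["component"], "-", change["summary"])
```

Scoping the search to the affected component and its dependencies is what turns an undifferentiated event feed into a short, ranked list of suspects.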
Communicate Everything
During diagnosis and resolution, it's important to see what everyone is doing without having to ask on Slack, wait on each other or unknowingly interfere with remediation actions. The best option is to leverage the same timeline to capture the full context of an incident at the moment it occurs, and at any point during diagnosis and resolution, so responders don't revert each other's changes.
While an incident postmortem brings teams together to gather data and discuss the details and lessons learned, the process is often inefficient and time-consuming. Each team member involved in the incident response has to recollect what they observed, what actions they took and what they communicated to other team members. What if a system could capture all the actions and communications during an incident response effort? Wouldn't it be nice to summarize your thoughts in one sentence and be done with the reporting?
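As one hedged sketch of how such capture could work, assuming remediation steps are driven from Python scripts (the decorator, log structure and action names below are all invented for illustration), each action can record itself as it runs:

```python
import functools
import time

incident_log = []  # in practice, persisted alongside the incident record

def record_action(func):
    """Decorator that captures every remediation action as it happens,
    so the postmortem timeline is assembled automatically rather than
    reconstructed from memory afterward."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        incident_log.append({"time": time.time(), "action": func.__name__,
                             "args": args, "kwargs": kwargs})
        return func(*args, **kwargs)
    return wrapper

@record_action
def rollback(service: str, version: str) -> None:
    print(f"rolling {service} back to {version}")

@record_action
def scale_out(service: str, replicas: int) -> None:
    print(f"scaling {service} to {replicas} replicas")

rollback("checkout-svc", "v2.3.0")
scale_out("checkout-svc", 6)
print(f"{len(incident_log)} actions captured for the postmortem")
```

With the mechanical record assembled for you, the postmortem reduces to adding the human judgment on top: what you were thinking and what you would do differently.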
When preparing for planned and unplanned changes that can impact the bottom line, DevOps teams should take these steps to gain true observability of their operations: focus on finding the connections in the data to explain cause and effect; understand change impact, as the changes are being made, to avoid outages; capture and communicate activities as seamlessly as possible; and identify and capitalize on opportunities to automate the workflow.