The operations team receives a page at 2:00 a.m. Something isn’t working correctly. Do they know how to fix it? Perhaps more importantly, do they know whether it matters to customers or how it impacts the business? Companies implement different techniques to maximize the quality of their software: shifting left, shifting right, chaos engineering and others. While each takes its own approach to improving quality, they share an underlying goal: increasing the team’s knowledge of the system and its ability to resolve problems quickly.
Here, we’ll walk through the software life cycle and discuss two common challenges to software quality: running complex systems and managing dependencies between teams. The common thread is a proper observability strategy, which can improve software quality, smooth hand-offs between teams and help resolve problems quickly by ensuring whole-system resiliency from design to maintenance.
Techniques for Improving Quality
Shift left is a concept that has attracted much attention recently. The practice revolves around the idea that detecting and correcting problems earlier in the development cycle minimizes their impact. As the thinking goes, problems caught early are easier to fix, cost less and leave fewer ongoing issues in the code.
The rise of DevOps and DevSecOps has propelled the popularity of shift left. Yet there’s a fact that development teams often miss: No matter how early an organization spots errors, some problems will inevitably remain. Early testing can’t catch everything. As a result, some organizations have begun to rebel and move to a shift-right framework, which normalizes errors and aims to resolve issues across the whole development process, including post-production.
Shift right is perhaps the less extreme cousin of its rebellious relative, chaos engineering. With chaos engineering, team members cause problems on purpose to keep everyone on their toes and test the limits of the system. They shut down servers, disconnect wires, delete files, etc. The goal is to confirm system resilience: From a user’s perspective, there should be no impact even amid all of this chaos. In a resilient system, teams can fix failed components at their leisure during the day, critical failures are less likely to happen and handling them becomes second nature when the dreaded 2:00 a.m. page does occur. Unlike chaos engineering, shift right doesn’t explicitly break things; it simply assumes that broken things will exist, so teams need to be ready to handle them.
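To make the idea concrete, here is a minimal sketch of a chaos experiment in Python: It stops one replica on purpose and then checks whether the user-facing endpoint still responds. The health URL, the replica names and the use of `docker stop` are hypothetical assumptions for illustration, not a prescription of any particular tool or the article’s own setup.

```python
"""A minimal chaos-experiment sketch: inject one failure, then confirm the
user-facing endpoint still works. All names and URLs are hypothetical."""
import random
import subprocess
import time
import urllib.request

TARGET_URL = "https://example.com/health"   # hypothetical user-facing health check
REPLICAS = ["web-1", "web-2", "web-3"]      # hypothetical container names


def user_facing_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False


def stop_random_replica() -> str:
    """Deliberately stop one replica (here via `docker stop`)."""
    victim = random.choice(REPLICAS)
    subprocess.run(["docker", "stop", victim], check=False)
    return victim


if __name__ == "__main__":
    assert user_facing_ok(TARGET_URL), "system unhealthy before the experiment"
    victim = stop_random_replica()
    time.sleep(10)  # give failover / load balancing a moment to react
    # A resilient system should still serve users with one replica down.
    print(f"stopped {victim}; user-facing check passed: {user_facing_ok(TARGET_URL)}")
```

The point of an experiment like this isn’t the script itself; it’s that the team learns, before 2:00 a.m., whether the system really tolerates the failure it claims to tolerate.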
Both concepts (and yes, even chaos engineering) are valid, but each company needs to carefully analyze which is right for them. One variable to consider is the complexity of the system you’re building. Working on a rocket launch or a surgical robot? You should probably shift left as much as possible. Changing the color of a button? Maybe it’s not the end of the world to verify the fix in production. For most of us, projects fall somewhere between these two extremes, so sometimes shifting left is necessary and other times shifting right is pragmatic. Perform a careful analysis of the risks to determine what kind of testing is needed.
Although teams tend to be pretty good at analyzing the complexity of their own software and making development process decisions, they often forget the complexity of the other companies’ code they depend on. Even a simple web application deployed today has tons of dependencies: the server it’s running on, the local ISPs, transit ISPs, load balancer, DNS provider, TLS certificate, CDN and browser, just to name a few! Once you consider the complexity of all those dependencies across the internet, it becomes clear that finding every possible issue that might impact an application is a daunting task.
The goal, again, is to find any issues before they reach production. Or, if they reach production, fix them before they impact customers. Or, if they impact customers, fix them as soon as possible after that. To do any of these things quickly, teams need a high level of observability into the overall system. For simple systems, or even complex systems that are entirely under one company’s control, tools such as application performance monitoring (APM) are common. But when the scope of the system is as big as the internet, internet performance monitoring (IPM) tools are necessary instead.
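As a rough illustration of what outside-in, IPM-style measurement looks like, the sketch below times each layer a real user depends on: DNS resolution, the TCP and TLS handshake, and the full HTTP response. The hostname is a placeholder, and a real IPM product would run checks like this continuously from many vantage points around the internet rather than from a single script.

```python
"""A minimal sketch of an outside-in check: time each dependency layer a
user passes through. The hostname is a hypothetical placeholder."""
import http.client
import socket
import ssl
import time

HOST = "example.com"  # hypothetical user-facing hostname


def timed(fn):
    """Run fn and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn()
    return result, (time.perf_counter() - start) * 1000


# DNS resolution (depends on the DNS provider and resolvers along the path)
addr, dns_ms = timed(lambda: socket.gethostbyname(HOST))


# TCP connect + TLS handshake (depends on ISPs, load balancer, TLS certificate)
def tls_handshake():
    ctx = ssl.create_default_context()
    with socket.create_connection((HOST, 443), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
            return tls.version()


tls_version, tls_ms = timed(tls_handshake)


# Full HTTP request (adds the CDN / origin response on top of the layers above)
def http_get():
    conn = http.client.HTTPSConnection(HOST, timeout=5)
    conn.request("GET", "/")
    status = conn.getresponse().status
    conn.close()
    return status


status, http_ms = timed(http_get)

print(f"{HOST} -> {addr}: dns={dns_ms:.0f}ms tls={tls_ms:.0f}ms ({tls_version}) "
      f"http={http_ms:.0f}ms status={status}")
```

Breaking the measurement down by layer is what lets a team say “the slowdown is in DNS, not our code” instead of guessing.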
With both APM and IPM tools, the goal is to know as much about the system as possible, enabling teams to identify problems quickly, pinpoint the root cause and fix it. When an organization is equipped to monitor development and operational processes effectively, from application to delivery, there’s a shift toward higher-quality code, fewer errors, improved security and lower costs for the enterprise. There are also fewer disruptions and distractions for engineers, DevOps teams and others.
Dealing With Disconnects
Whether shift left or shift right is used, some problems will reach production. There is an invisible wall in many organizations, where engineering teams “throw code over the wall” to the team that becomes responsible for running it. In some companies, this is the operations team; in others, it’s site reliability engineering or production engineering. The common trend is that in most companies, it’s not the same team that created the software.
These operations teams often lack the nuanced information they need to resolve production issues. They might spend hours or even days attempting to track down the source of the problem. There’s often finger-pointing between different teams who created or own different portions of the system. Things can quickly devolve into chaos—and tempers can fray. After all, these kinds of problems always happen at 2:00 a.m., right?
That’s where end-to-end observability enters the picture. Imagine a framework that provides proof the development team has performed sufficient knowledge transfer to the operations team, and that the operations team has asked all of the relevant questions about running and maintaining the application. Using the right tools, engineering and operations teams need to monitor every aspect and every component of the system. Creating this monitoring strategy means that everyone involved also builds an understanding of the system; everyone is on the same page. Suddenly there is a common language to discuss problems, and the teams operating the software in production can take debugging a lot further before they need to ask for help.
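As one hypothetical illustration of that common language, the sketch below registers each monitored component together with its owning team, runbook and customer impact, so a page already carries the context an on-call engineer needs before asking the authors for help. The component names, URLs and fields are assumptions made up for this example, not part of any specific product.

```python
"""A minimal sketch of a shared check registry: developers and operators
describe each component in one place. All names and URLs are hypothetical."""
from dataclasses import dataclass


@dataclass
class MonitoredComponent:
    name: str
    check_url: str         # what gets probed
    owner_team: str        # who to page at 2:00 a.m.
    runbook_url: str       # how to start debugging without the authors
    customer_impact: str   # why the page matters to the business


CATALOG = [
    MonitoredComponent(
        name="checkout-api",
        check_url="https://example.com/checkout/health",
        owner_team="payments",
        runbook_url="https://wiki.example.com/runbooks/checkout-api",
        customer_impact="customers cannot complete purchases",
    ),
    MonitoredComponent(
        name="cdn-static-assets",
        check_url="https://cdn.example.com/ping",
        owner_team="platform",
        runbook_url="https://wiki.example.com/runbooks/cdn",
        customer_impact="pages load without styling or images",
    ),
]


def page_text(component: MonitoredComponent) -> str:
    """Render the context an on-call engineer needs when this check fails."""
    return (f"[{component.owner_team}] {component.name} failing: "
            f"{component.customer_impact}. Runbook: {component.runbook_url}")


if __name__ == "__main__":
    for c in CATALOG:
        print(page_text(c))
```

However a team chooses to record it, the value is the same: Whoever gets paged can see what is broken, who owns it and why it matters without waking anyone else up first.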
Ultimately, shifting left or shifting right is a choice each organization has to make for itself. But even if every possible test case is executed before deployment to production, issues will still happen, so take the opportunity to implement a robust system-level monitoring strategy that can be used to answer technical questions during that 2:00 a.m. call.