We live in a world in which software is increasingly dominant. Not only do we rely ever more on software, but its complexity keeps growing. And with increased complexity, bugs are more likely to be intermittent, making failures difficult to reproduce. Sometimes this non-determinism means failures never get fixed. Any unfixed bug is potentially a security breach or catastrophic customer outage waiting to happen. With such risks at stake, ensuring your software is of the highest quality is essential to running a successful operation.
To cope with this complexity while still providing some assurance of delivered software quality, the DevOps industry has embraced a variety of automated testing regimes. A development team of a given size can now run far more tests than it could before, and must triage, and attempt to resolve, a correspondingly larger number of test failures. Diagnosing the cause of each test failure can still take a lot of time, particularly if the problem is intermittent or otherwise hard to reproduce. In fact, developers can spend, on average, as much as 50 percent of their programming time debugging rather than coding. Think of all the innovation that could occur if that time were reallocated.
Traditional QA
When it comes to Quality Assurance (QA), developers and testers alike know the traditional sequence of deliverables: development hands an untested product to QA, and QA releases the tested, approved product to the end user. When a bug is found, however, QA and development must communicate with one another to resolve the issue quickly, and the effectiveness of that channel hinges on the quality of the information captured about the failure.
Some common challenges raised by inefficient QA processes include:
- Complex control flow – Difficult to make inferences about how a failure unfolded
- Non-deterministic failures – Hard to reliably reproduce a failure in order to investigate its root cause (see the sketch after this list)
- International R&D – Collaboration is made more difficult by geographic separation
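As a rough illustration of the second challenge, the hedged C sketch below (entirely hypothetical, not drawn from any real codebase) shows how a non-deterministic failure arises: two threads increment a shared counter with no synchronization, so the final value depends on how the threads happen to interleave.

```c
/* race_test.c - hypothetical example of a non-deterministic test failure.
 * Two threads increment a shared counter without a lock; the lost-update
 * race means the final value depends on thread scheduling, so the test
 * may pass on one run and fail on the next.
 * Build: gcc -pthread race_test.c -o race_test
 */
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

#define ITERATIONS 1000000

static long counter = 0;              /* shared state, deliberately unprotected */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++)
        counter++;                    /* unsynchronized read-modify-write */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("counter = %ld (expected %d)\n", counter, 2 * ITERATIONS);
    assert(counter == 2 * ITERATIONS);    /* trips only when updates are lost */
    return 0;
}
```

Run the binary several times in a row and the assertion trips only on some of them, which is precisely the behaviour that makes such failures so expensive to triage.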
Conventional test systems report that a bug exists but offer limited information about its cause. Workflows such as continuous integration (CI) often offer clues, such as identifying which commit to the codebase triggered the failure. However, there are several cases in which CI cannot directly identify the problematic code change. For example, the triggering change may be entirely correct yet cause a pre-existing bug to manifest; the developer cannot simply stare at that change and expect the source of the bug to reveal itself. Alternatively, the failure may not reproduce for the developer even when the same test is run again. In short, CI is often very good at making the easier bugs easy to fix, but it is typically of little help with the more difficult ones, particularly intermittent failures.
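The first of those cases is easy to picture with a contrived sketch. In the hypothetical C program below (the names are invented for illustration), an off-by-one bug has existed for years; whether its out-of-bounds read misbehaves visibly depends on what happens to sit next to the array in memory, so a perfectly correct commit elsewhere can be the change that finally makes a test fail, and that is the commit CI will point at.

```c
/* latent_bug.c - contrived sketch of a pre-existing bug exposed by an
 * unrelated, correct change. get_label() has always read one element past
 * the end of labels[] when given id == 3; whether that stray read returns
 * something harmless or a garbage pointer depends on what the linker
 * places after the array, so an innocent commit that merely shifts the
 * memory layout can turn a long-dormant bug into a failing test. */
#include <stdio.h>

static const char *labels[3] = { "low", "medium", "high" };

const char *get_label(int id)
{
    if (id > 3)                 /* long-standing off-by-one: should be id >= 3 */
        return "unknown";
    return labels[id];          /* out-of-bounds read when id == 3 */
}

int main(void)
{
    printf("label: %s\n", get_label(3));   /* undefined behaviour */
    return 0;
}
```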
The Next Evolution of QA
To bridge the gap between identifying a failure in the QA process and resolving it in development, DevOps teams have adopted a variety of technologies to complement their automated testing setups. New to this scene is technology capable of recording all or part of a program's execution for subsequent replay and analysis, capturing test failures as they happen. The recording contains all the information a developer needs to understand the exact set of conditions that led to the test failure. It can be analyzed on machines other than the one on which the failure occurred, letting colleagues on separate machines collaborate on the same issue, a particularly useful benefit for large software companies with geographically distributed teams. By debugging an exact replica of the failure, developers immediately reduce the time and effort required to diagnose and fix it, thereby speeding up the development cycle.
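To make the idea concrete, here is a deliberately minimal sketch of the record-and-replay principle in C. It is not how any commercial recorder works (real tools capture execution at a much lower level and record far more, such as system calls, signals and thread scheduling); it simply logs the program's one source of non-determinism during a "record" run and feeds the log back during a "replay" run, so the same failure can be re-executed exactly, including on another machine that has a copy of the log file.

```c
/* replay_sketch.c - minimal sketch of the record-and-replay principle.
 *   ./replay_sketch record    # run normally, logging non-deterministic input
 *   ./replay_sketch replay    # re-run from the log, reproducing the run exactly
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static FILE *log_file;
static int replaying;

/* single choke point for the program's non-determinism */
static int next_value(void)
{
    int v;
    if (replaying) {
        if (fscanf(log_file, "%d", &v) != 1) {
            fprintf(stderr, "recording exhausted\n");
            exit(2);
        }
    } else {
        v = rand();
        fprintf(log_file, "%d\n", v);     /* record the value for later replay */
    }
    return v;
}

int main(int argc, char **argv)
{
    if (argc != 2 || (strcmp(argv[1], "record") && strcmp(argv[1], "replay"))) {
        fprintf(stderr, "usage: %s record|replay\n", argv[0]);
        return 2;
    }
    replaying = (strcmp(argv[1], "replay") == 0);
    log_file = fopen("run.log", replaying ? "r" : "w");
    if (!log_file) { perror("run.log"); return 2; }
    if (!replaying)
        srand((unsigned)time(NULL));

    /* the "test": fails for roughly one recorded run in ten */
    int sample = next_value() % 10;
    printf("sample = %d\n", sample);
    if (sample == 7) {
        fprintf(stderr, "intermittent failure\n");
        return 1;
    }
    puts("test passed");
    return 0;
}
```

Once every non-deterministic input has been preserved in this way, an intermittent failure becomes perfectly repeatable, which is what makes the recording such a useful artefact to hand from QA to development.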
The ability to capture recordings of failed tests, especially the intermittent issues that might otherwise never get fixed, coupled with the productivity gains of closer collaboration between development and QA, allows companies to maintain a competitive advantage in a volatile marketplace.
About the Author / Greg Law
Greg is the co-founder and CEO of Undo. Greg has 20 years' experience in the software industry and has held development and management roles at companies including the pioneering British computer firm Acorn, as well as the fast-growing startups NexWave and Solarflare. Greg left Solarflare in 2012 to lead Undo as CEO and has overseen the company as it transitioned from the shed in his back garden to a scalable, award-winning business. Greg holds a PhD from City University, London, which was nominated for the 2001 British Computer Society Distinguished Dissertation Award. He lives in Cambridge, UK with his wife and two children.