The effort to speed up continuous testing will be wasted if the underlying system that continuous testing depends on is not stable. In my prior blog, Continuous Test Results Analysis – at the Speed of DevOps, I discussed the importance of, and practices for, speeding up continuous test results analysis. However, unless the continuous test system that the test results are derived from is rock solid, the results will be unreliable at best – and at worst, no results will be generated at all!
One of the first rules of DevOps is to “keep going and don’t stop”. Stopping is the equivalent of pushing the red button on a factory assembly line, shutting the entire line down. Panic ensues, because every second the line is down affects all of production and drives up costs – not to mention that the cost of restarting the line is substantial as well.
Many techniques are employed to minimize the need to stop testing despite reported product test failures. However, if the underlying continuous testing system itself is not stable, the entire DevOps system, including continuous testing, is interrupted until the system recovers. Development team members are waiting on results, so every minute of results-analysis time saved on each CI/CT cycle – in both faster availability and less total analysis work – is a big win. Aggregated over multiple code branches, a few minutes saved per test result can add up to man-years of time saved over a year!
So what can be done to ensure the continuous test system itself is stable and produces reliable results, even in the face of inevitable occasional system failures? After all, there is no such thing as a perpetual machine that never breaks down. Below are some suggestions, in checklist format, that have proven useful in high-performance DevOps deployments.
- Build a fast-response multi-discipline team: There are a wide variety of ways any system of interconnected computers and applications can fail, whether its components are virtualized, on-premises, or running remotely in a cloud data center. The following suggestions for the team will help ensure fast recovery when the system does experience a failure.
• Assign a continuous testing system architect with responsibility to define system stability SLAs, design high-reliability system architectures, and define team workflows for disaster prevention and recovery that are consistent with recovery-time SLAs.
• While team members may be insourced or outsourced, full- or part-time, there must be clearly responsible first responders and second-line escalation team members, with well-defined workflows, who handle failure events 24/7 for every hardware, software, tool, network, and firewall component the system depends on.
• Assign product team leads to plan changes needed for new products and services.
• Systems designed for 24/7 availability are expensive to build, expand, and maintain, so finance and accounting team members need to be part of the team to keep clear, current cost data for the continuous testing system.
• A senior budget manager approves expenses needed for stability without delay.
- Anticipate failures – prepare, prepare, prepare!: Don’t wait for a failure before testing system recovery procedures! Documented disaster prevention and recovery plans or high-availability tools won’t save the continuous test environment if the team is not ready to react during a failure event.
• Run failure simulations for each defined failure scenario, at a frequency that matches the expected probability of each failure – see the scheduling sketch after this list.
• Cross-train team members on system recovery best practices so the skills are ready when most needed – ensure no one person is the sole holder of an essential recovery skill.
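To make “frequency that matches the expected probability” concrete, here is a minimal Python sketch that derives a drill cadence from each scenario’s estimated mean time between failures. The scenario names and MTBF figures are hypothetical placeholders; substitute estimates from your own failure history.

```python
from dataclasses import dataclass

@dataclass
class FailureScenario:
    name: str
    mtbf_days: float  # mean time between failures, estimated from your own history

    def drill_interval_days(self, rehearsals_per_failure: float = 2.0) -> float:
        # Rehearse roughly twice per expected failure interval, so the team
        # has practiced recently whenever the real event finally arrives.
        return self.mtbf_days / rehearsals_per_failure

# Hypothetical scenarios and MTBF estimates -- substitute real data.
scenarios = [
    FailureScenario("VM host outage", mtbf_days=90),
    FailureScenario("Test-tool license server down", mtbf_days=180),
    FailureScenario("Network partition to cloud lab", mtbf_days=365),
]

for sc in sorted(scenarios, key=lambda sc: sc.drill_interval_days()):
    print(f"{sc.name}: run a recovery drill every {sc.drill_interval_days():.0f} days")
```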
- Choose test tools designed with high-stability system administration capabilities: Look for the following features, designed into the test tools, to ensure the continuous test environment can be configured and managed for high reliability (a monitoring sketch follows this list).
• Tools are implemented as autonomous services, controlled through RESTful APIs, that can be separately invoked and managed.
• Test system service monitoring is built in, including up-time, wait-time, and usage metrics for each use case of each test system service.
• Diagnostic capabilities are built into each service. When a failure occurs, the failure log includes helpful descriptions, including causal information.
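As an illustration of monitoring autonomous services over RESTful APIs, here is a minimal Python sketch that polls each test service for health and usage metrics. The service names, URLs, and the /health endpoint with its JSON fields are assumptions made for illustration; real tools will expose their own endpoints and schemas.

```python
import requests

# Hypothetical service names and base URLs -- replace with your own.
SERVICES = {
    "topology-manager": "http://ct-lab.example.com:8081",
    "traffic-generator": "http://ct-lab.example.com:8082",
    "results-collector": "http://ct-lab.example.com:8083",
}

def check_service(name: str, base_url: str) -> bool:
    """Poll one service's (assumed) /health endpoint and report its metrics."""
    try:
        resp = requests.get(f"{base_url}/health", timeout=5)
        resp.raise_for_status()
        health = resp.json()
    except requests.RequestException as exc:
        print(f"{name}: UNREACHABLE ({exc})")
        return False
    print(f"{name}: status={health.get('status')}, "
          f"uptime={health.get('uptime_s')}s, "
          f"avg wait={health.get('avg_wait_ms')}ms")
    return health.get("status") == "up"

if __name__ == "__main__":
    results = [check_service(n, url) for n, url in SERVICES.items()]
    print("CT environment healthy" if all(results) else "CT environment degraded")
```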
- Engineer the system for stability: Typical continuous testing systems have many “moving parts,” and many of them must work in concert for the system to perform properly. It is common engineering knowledge that the reliability of a system is determined by the reliability and configuration of its parts. Here are some suggestions to improve overall system reliability (a short availability calculation follows this list).
• Choose powerful servers and virtual machines (VMs) with ample processor speed, fast I/O, and large, fast memory and data storage. This minimizes the chance of system failures due to overloading of these resources.
• Deploy multiple servers and VMs for each system component, with enough redundancy that no single component failure will interrupt the entire continuous testing system.
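The series-versus-redundant point can be made concrete with a little reliability arithmetic: components in series multiply their availabilities, while N redundant replicas of a component are down only when all N fail at once. The component names and availability figures in this Python sketch are illustrative only.

```python
from math import prod

def parallel(a: float, replicas: int) -> float:
    """Availability of `replicas` redundant copies of one component."""
    return 1 - (1 - a) ** replicas

# Illustrative per-component availabilities -- not measured figures.
components = {"scheduler": 0.995, "test-executor": 0.99, "results-db": 0.999}

# No redundancy: every component is a single point of failure (series).
series = prod(components.values())

# Duplicate only the weakest link, the test executor.
redundant = prod({**components, "test-executor": parallel(0.99, 2)}.values())

print(f"single-instance system availability: {series:.4f}")    # ~0.9841
print(f"with a redundant test executor:      {redundant:.4f}")  # ~0.9939
```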
- Configure the test workflow to avoid false negatives: There is no point in reporting product test failures to users if the reason for the failure was an underlying continuous testing system failure rather than a failure of the product itself. Such “false negatives” are a big waste of time. Here are some suggestions to reduce the chance of false-negative test reports (a verdict-classification sketch follows this list).
• Assign an application to continuously verify, in real time, that all the system components and interconnections are operating, and report these results to the continuous testing report aggregation tool. These results are called CT system environment test results.
• Use the CT system environment test results within the continuous test results analysis tool to determine whether any test failure coincides with an environment failure. If it does, mark the verdict “inconclusive” instead of “fail” and include a description of the environment failure in the verdict log.
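Here is a minimal Python sketch of that verdict-downgrade rule: a “fail” that overlaps an environment-failure window becomes “inconclusive,” with the environment fault noted in the verdict log. The data shapes are assumptions; map them onto whatever schema your results-analysis tool actually uses.

```python
from dataclasses import dataclass

@dataclass
class EnvFailure:
    start: float        # epoch seconds when the environment fault began
    end: float          # epoch seconds when it cleared
    description: str

@dataclass
class TestResult:
    name: str
    verdict: str        # "pass" or "fail" as reported by the test runner
    start: float
    end: float
    log: str = ""

def classify(result: TestResult, env_failures: list[EnvFailure]) -> TestResult:
    """Downgrade a fail to inconclusive if it overlaps an environment fault."""
    if result.verdict != "fail":
        return result
    for fault in env_failures:
        # Interval overlap: the test ran while the environment was broken.
        if result.start <= fault.end and fault.start <= result.end:
            result.verdict = "inconclusive"
            result.log += f"\nenvironment failure: {fault.description}"
            break
    return result

# Example: a test that failed while the lab network was down.
fault = EnvFailure(start=100.0, end=200.0, description="lab switch rebooting")
run = TestResult(name="throughput_test", verdict="fail", start=150.0, end=180.0)
print(classify(run, [fault]).verdict)  # -> inconclusive
```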
- Maintain the system to assure long-term stability: Every system needs maintenance at some point. Hardware components wear out and software components become obsolete. Each time a system component is replaced or upgraded to a new version, there is a risk of instability. Here are some recommendations to reduce the risk of introducing instabilities into the continuous testing system when changes are implemented (a promotion-gate sketch follows this list).
• Create a maintenance environment that mirrors the test and production environments.
• Define a test suite that will exercise all system components and topologies.
• Test changes in the maintenance environment before deployment.
• Occasionally run failure/recovery simulations in each production environment.
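To show how these maintenance recommendations fit together, here is a Python sketch of a simple promotion gate: a change reaches production only after the full test suite passes in the mirrored maintenance environment. The deploy-tool and ct-suite commands are hypothetical stand-ins for your own deployment and test-suite tooling.

```python
import subprocess
import sys

def run(cmd: list[str]) -> bool:
    """Run one shell command; True if it exited successfully."""
    print("+", " ".join(cmd))
    return subprocess.run(cmd).returncode == 0

def promote_change(change_id: str) -> bool:
    # 1. Deploy the change to the mirror (maintenance) environment.
    if not run(["deploy-tool", "--env", "maintenance", "--change", change_id]):
        return False
    # 2. Exercise every system component and topology before promoting.
    if not run(["ct-suite", "--env", "maintenance", "--suite", "full-coverage"]):
        return False
    # 3. Only then deploy the same change to production.
    return run(["deploy-tool", "--env", "production", "--change", change_id])

if __name__ == "__main__":
    sys.exit(0 if promote_change(sys.argv[1]) else 1)
```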
The above is a partial list of suggestions for continuous test system stability that have proven to yield good results in DevOps deployments. At Spirent, we think testing has a bright future in DevOps. You can read more about our views at Spirent.com/solutions/devops
What do you think of these suggestions, and do you have others that should be mentioned?