Injecting failure into your infrastructure to test your service’s resilience has been gaining popularity. Most people think of Netflix and its “Chaos Monkey”; in fact, Netflix has an entire “Simian Army” of tests and drills it runs to make its systems more reliable. At PagerDuty, we couldn’t just copy what has been published about Netflix’s practices; we needed an approach that would work for us. Our infrastructure is relatively small compared to Netflix’s, and we wanted to develop a practice suited to a team of our size. In response to this need, we created Failure Friday.
The idea behind it is simple: we block off an hour of time and deliberately shut down certain pieces of our infrastructure to see how our service holds up. For example, our first attack tests process failure. We’ll typically run a command like “sudo service cassandra stop” and keep the service down for about five minutes. During this time, our service should be resilient enough to continue processing traffic; if not, immediate action is required. The main goal of injecting failure into our systems, of course, is to establish what will happen in the event of a node, datacenter, or network outage. We can then use this knowledge to make our infrastructure more fault-tolerant.
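As a rough illustration, a single process-failure attack could be scripted along the following lines. This is only a sketch: the host name, service name, and ssh-based invocation are assumptions for illustration, not our exact tooling.

```bash
#!/bin/bash
# Sketch of a process-failure attack (illustrative only; the host and
# service names below are assumptions, not PagerDuty's actual tooling).

HOST="cassandra-node-1"   # hypothetical target host
SERVICE="cassandra"
DOWNTIME=300              # keep the process down for roughly 5 minutes

echo "$(date -u) stopping ${SERVICE} on ${HOST}"
ssh "${HOST}" "sudo service ${SERVICE} stop"

# Watch dashboards and alerts while the process is down; the service
# as a whole should keep processing traffic.
sleep "${DOWNTIME}"

echo "$(date -u) restarting ${SERVICE} on ${HOST}"
ssh "${HOST}" "sudo service ${SERVICE} start"
```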
Benefits of Breaking Your Systems… On Purpose
On the surface, injecting failure into your systems seems counterintuitive. Why break your systems, especially on a Friday, just before everyone is about to go enjoy their weekend? Since beginning our weekly failure testing in June 2013, we have uncovered a number of benefits, which we believe give every team a reason to develop failure testing scenarios of their own.
- Uncovers implementation issues that may reduce resiliency. We have the opportunity to fix these issues before they cause an outage and an alert outside of business hours.
- Proactively discovers deficiencies in our infrastructure. These deficiencies may eventually become the root cause of an outage; finding them first lets us fix them before they break in the wild.
- Builds a strong team culture. By coming together as a team once a week, we share knowledge. Our Ops team learns how the development teams debug production issues, while Devs gain a better understanding of how their software is deployed.
- Helps us onboard new employees. Failure Friday allows us to walk new hires through actual failure and alerting scenarios, so they can shadow and learn to handle outages at 11 AM instead of 3 AM.
- Helps us tune alerting and on-call schedules. Team members participating in failure testing should receive all the alerts they would during an unexpected outage. This is a great opportunity to make adjustments that limit noise while ensuring all critical alerts still fire. If we get an alert but no action is required, we know we need to adjust our monitoring thresholds. If we need to take action but receive no alert, we need to adjust our thresholds so we don’t miss a similar alert next time.
- Reminds us that failures are inevitable. Failure is not just a random freak occurrence; it can and will happen. All code that our engineering teams write is now tested against how it will survive a Failure Friday scenario.
Develop Your Own Failure Friday
When you consider putting together your own version of Failure Friday, remember that communication between teams is key. Failure Friday is unlikely to succeed if individual teams run it in isolation.
Step 1: Bring your teams together
Get key stakeholders from each of your teams together to discuss testing needs and decide what to test. It will not be possible to test every scenario, so putting together a list of priorities that is meaningful for everyone will help you accomplish quite a bit in a short session.
Step 2: Find a time that works best for your team
At PagerDuty, we chose Friday for a one-hour session because there tend to be fewer deploys on Friday, so our testing practices don’t create roadblocks for the business. Since it’s a recurring event, everyone in the company knows to expect Failure Friday and can prepare. Though Friday can seem like a risky time, failure testing may catch potential threats early, leading to a quieter weekend.
Step 3: Set up a war room
Having a common place where all your teams can sit together allows them to work together seamlessly. Team members also benefit from observing how members of other teams resolve incidents. At PagerDuty, we set up camp in the corner of our dining area. You don’t need anything fancy.
Step 4: Prepare for your first attack
Before injecting failure, we disable any non-critical cron jobs scheduled to run during the hour. The team whose service we’re going to attack will be ready with their monitoring dashboards to track changes and take action if necessary. (Note: cron jobs are disabled because we don’t want them to interfere with the attack. For instance, we wouldn’t want a Cassandra repair running in the middle of taking a service down.)
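One simple way to quiet scheduled jobs for the hour is to stop the cron daemon on the hosts under attack and restart it afterwards. A minimal sketch, assuming ssh access and a Debian-style cron service (the host names are hypothetical):

```bash
#!/bin/bash
# Illustrative only: silence scheduled jobs on the hosts under attack
# for the duration of the exercise, then restore them afterwards.

HOSTS="cassandra-node-1 cassandra-node-2"  # hypothetical host list

for host in ${HOSTS}; do
  ssh "${host}" "sudo service cron stop"
done

# ... run the hour of failure attacks ...

for host in ${HOSTS}; do
  ssh "${host}" "sudo service cron start"
done
```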
Step 5: Set up your communication channels
Because communication is imperative during Failure Friday, use the exercise as a way to test your communication channels as well. A chat room is an essential way to exchange information non-verbally. Chat rooms also provide timestamps, so you can log and review the actions taken and correlate them with any metrics you capture during the testing scenario.
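For example, if your chat tool supports incoming webhooks, each attack can be announced programmatically so every action gets a timestamped entry in the channel. A sketch, assuming a Slack-style incoming webhook (the SLACK_WEBHOOK_URL variable and the messages are hypothetical):

```bash
#!/bin/bash
# Illustrative only: post timestamped announcements to a shared chat
# channel via an incoming webhook. SLACK_WEBHOOK_URL is hypothetical
# and must point at a webhook you have configured yourself.

notify() {
  curl -s -X POST -H 'Content-Type: application/json' \
    --data "{\"text\": \"$1\"}" \
    "${SLACK_WEBHOOK_URL}" > /dev/null
}

notify "Failure Friday: stopping cassandra on cassandra-node-1 for 5 minutes"
# ... run the attack ...
notify "Failure Friday: cassandra restarted; verifying dashboards"
```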
It’s All About Reliability
Failure Friday doesn’t end when we resume normal operations. From what we learn, we assign action items to team members in our ticketing system (just as we’d do if a piece of our system went down for real), allowing us to take precautions that prevent these scenarios from occurring in the wild. By putting people in one room and throwing problems at them, we make both our team and our product stronger.