For those unfamiliar with the term, a GameDay is a dedicated day for running chaos engineering experiments on your systems with your team. The goal is cooperative, proactive testing of the system to enhance reliability. By getting the team together and thinking through your system architecture, you can test hypotheses and evaluate whether your systems are resilient to various kinds of failures (and if they are not, fix those weaknesses before they impact customers).
But how can we have productive GameDays in a distributed, remote context? This article highlights the differences that exist between running a GameDay with everyone in the same room at the office or data center versus running a GameDay with everyone working remotely. This post is meant to be a resource now that we’re all working from home as a result of COVID-19. First I outline the differences, then I talk about how to be successful despite these differences, and in some cases enhance your potential for success because of them.
What Makes a Remote GameDay Different?
Because all of the participants are spread out, communication will be a challenge at first, especially for teams that have little to no experience working remotely. The team has to learn how to hold standup meetings, communicate about system changes and updates, and coordinate work such as deployments without being in the same room.
Distractions can also be a big deal. When everyone is locked in a conference room it’s easy to run a distraction-free GameDay. It’s more difficult when we are remote and have kids and pets running around, FedEx deliveries and random family noises. Have a plan to mitigate distractions as best you can. Some find headphones helpful, while others move into the basement or a bedroom and lock the door. Find something that works for you.
You also need to plan for someone’s internet connection going down or their PC or laptop crashing. Create alternate ways to dial in to conference calls, and make sure every required role has a backup person standing by, ready to take over if the primary drops off. The backup can be a team member who already has a role and is capable of taking on another concurrently, or someone who was only participating as an observer.
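One way to check the contingency plan before the GameDay starts is a small script that flags roles with no usable backup. This is a minimal sketch, not a prescribed tool: the `RoleAssignment` type, the role names and the people in the roster are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RoleAssignment:
    """One GameDay role, its primary owner and an optional backup (hypothetical model)."""
    role: str
    primary: str
    backup: Optional[str] = None


def roles_without_backup(assignments: List[RoleAssignment]) -> List[str]:
    """Return roles that have no backup, or whose backup is the primary themselves."""
    return [
        a.role
        for a in assignments
        if not a.backup or a.backup == a.primary
    ]


# Hypothetical roster. A backup may already hold another role (Ben does here).
roster = [
    RoleAssignment("test leader", primary="Ana", backup="Ben"),
    RoleAssignment("metrics observer", primary="Ben"),
    RoleAssignment("scribe", primary="Chen", backup="Chen"),
]

for role in roles_without_backup(roster):
    print(f"Needs a backup before the GameDay: {role}")
```

Running a check like this while preparing the agenda leaves time to recruit an observer as a backup instead of discovering the gap when a connection drops.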
For teams that are remote-first and have been working this way for a long time, much of what we outline here may be old hat. But anyone can fall into bad habits, so if your team was used to running GameDays weekly in the office, it’s important to keep that muscle memory even in the new paradigm.
How to Prepare for a Remote GameDay
Here are the ways you must prepare.
Define
Define the GameDay. Specify the exact chaos experiments that the team will run, and exactly which systems the team will target. Define the goal. Is it to:
- Replay/reproduce circumstances that led to a past outage?
- Retest something that caused a problem in a previous test, to confirm that the fix you made actually works?
- Ensure that a new system has the proper monitoring set up and working with alerts and metrics?
- Something else?
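Writing the definition down in a structured form makes it easy to share in the chat channel and to spot what is still missing. The sketch below is one possible shape, assuming a hypothetical `GameDayDefinition` record; the goal names mirror the list above, and the system and experiment names are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Goal(Enum):
    REPRODUCE_PAST_OUTAGE = "reproduce the circumstances of a past outage"
    CONFIRM_FIX = "confirm a fix for a problem found in a previous test"
    VERIFY_MONITORING = "verify monitoring, alerts and metrics on a new system"
    OTHER = "something else"


@dataclass
class GameDayDefinition:
    """Hypothetical record of what the GameDay will test and why."""
    goal: Goal
    target_systems: List[str]
    experiments: List[str]

    def problems(self) -> List[str]:
        """List what is still undefined; an empty list means the definition is complete."""
        issues = []
        if not self.target_systems:
            issues.append("no target systems listed")
        if not self.experiments:
            issues.append("no experiments listed")
        return issues


# Invented example values.
definition = GameDayDefinition(
    goal=Goal.CONFIRM_FIX,
    target_systems=["checkout-service"],
    experiments=[],
)
print(definition.problems())
```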
Time
Set the date with a beginning and end time. This will likely require coordination with people. Be sure to account for time zones if your team is spread out nation- or worldwide. Set a clear agenda for the GameDay that includes:
- Assigning which tasks will be done by which team member and the timing involved. This orchestration is especially useful with team members working from a distance because it frees up communication channels for use in dealing with problems or for the leader to give direction without interruption.
- Whiteboarding of the current system architecture to paint a clear picture of what you are going to attack and what you expect might be impacted. Doing this remotely requires the use of some imagination and forethought. One option is to use a diagramming app while sharing your screen over a video conference.
- A short team debate on assumptions to get everyone on the same page.
- Clearly defined chaos experiment test cases and scoping to limit test magnitude, length of time and blast radius. To prevent problems, make sure you have a halt or abort plan and can stop the experiment easily if unexpected problems arise.
- The execution of the chaos experiment. Ask questions like, “What changes are observed in monitoring data?” and “Is this the behavior we expected?” Collect and document the information.
- A recap at the end to discuss what worked, what didn’t and how to use what was learned.
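The halt or abort plan in the agenda can be made concrete as a loop that bounds the experiment's length and stops it as soon as a metric crosses an agreed limit. The following is a sketch under stated assumptions, not the API of any chaos engineering tool: `start`, `stop` and `read_error_rate` are hypothetical callables you would wire to your own tooling and monitoring, and the error-rate threshold is only one example of an abort condition.

```python
import time
from typing import Callable


def run_with_abort(
    start: Callable[[], None],
    stop: Callable[[], None],
    read_error_rate: Callable[[], float],
    max_error_rate: float,
    max_seconds: float,
    poll_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Run an experiment until the time limit or the abort threshold is hit.

    stop() is always called, even if reading the metric raises an exception.
    """
    start()
    deadline = clock() + max_seconds
    try:
        while clock() < deadline:
            if read_error_rate() > max_error_rate:
                return "aborted"  # unexpected impact: halt right away
            sleep(poll_seconds)
        return "completed"  # ran for the full, pre-agreed length of time
    finally:
        stop()


# Simulated run with invented readings; no real system is touched.
readings = iter([0.01, 0.02, 0.20])
outcome = run_with_abort(
    start=lambda: print("attack started"),
    stop=lambda: print("attack stopped"),
    read_error_rate=lambda: next(readings),
    max_error_rate=0.05,
    max_seconds=60,
    sleep=lambda seconds: None,  # skip real waiting in the simulation
)
print(outcome)
```

Putting the stop call in a `finally` block is a deliberate choice: in a remote setting the person running the script may lose their connection to the video call, but the experiment still ends on its own.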
People
Identify which team(s) will participate and make sure everyone on those teams is ready. Make sure all stakeholders are available and will be present in the appropriate communications venues.
- Identify the incident manager on-call (IMOC).
- Define the on-call rotation.
- Notify everyone on the on-call team(s).
Comms
Set up dedicated, real-time communication venues for the duration of the event. This is where all activities will be coordinated and where all reports will be sent.
- Create a dedicated instant messaging channel using Slack, Hangouts, Skype or whatever your company currently uses. Test and make sure all involved are invited and can access the channel.
- Create a dedicated video conference using Zoom, BlueJeans, Webex or whatever your company currently uses. Test and make sure everyone involved can access it.
Final Steps
The agenda will be the same as in an onsite GameDay. Communicate it in advance. Include information about the schedule, the activities and the roles, specifying who is responsible for what. Typically two hours is more than enough time for a bit of an introduction, a discussion of assumptions and the day’s test design.
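When the team spans time zones, it helps to include each participant's local start time in the announcement. This is a small sketch using only Python's standard library; the participant names, date and offsets are invented, and fixed UTC offsets are used purely to keep the example self-contained. In practice you would likely pass `zoneinfo.ZoneInfo` objects (Python 3.9+) so daylight saving time is handled.

```python
from datetime import datetime, timedelta, timezone, tzinfo
from typing import Dict


def local_start_times(
    start_utc: datetime, participant_zones: Dict[str, tzinfo]
) -> Dict[str, datetime]:
    """Map each participant to the GameDay start time in their own time zone."""
    if start_utc.tzinfo is None:
        raise ValueError("start_utc must be timezone-aware")
    return {
        name: start_utc.astimezone(zone)
        for name, zone in participant_zones.items()
    }


# Hypothetical participants with fixed offsets (assumption: no DST handling).
zones = {
    "Ana": timezone(timedelta(hours=-7)),
    "Ben": timezone(timedelta(hours=1)),
    "Chen": timezone(timedelta(hours=8)),
}
start = datetime(2020, 5, 14, 16, 0, tzinfo=timezone.utc)

for name, local in local_start_times(start, zones).items():
    print(f"{name}: {local:%Y-%m-%d %H:%M} (UTC{local:%z})")
```

A listing like this makes it obvious when a proposed slot lands outside someone's working day, or on a different calendar date, before the invitation goes out.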
During the introduction, the designated test leader will make sure everyone understands what will be tested, the parameters for how the blast radius is set, the magnitude of the attack to be run and the expected outcome. Another team member is assigned to monitor the metrics. Others may be assigned to monitor other systems that may end up being involved.
Communication is key, especially if the test causes unexpected problems. Everyone needs to know immediately so that the test can be stopped and any damage dealt with right away. It’s actually a good thing if you find unexpected issues because then you can fix them.
It’s also a good thing to take time to celebrate when an experiment yields an expected result. Rejoice together that your team’s implementation of self-healing or recovery mechanisms kicked in and worked as expected. Allow yourself to take the opportunity to call out the immediate wins and not just the issues you find.
Conclusion
Running your first GameDay can be intimidating, but with good preparation, any team can be successful. Doing the same thing with everyone working remotely can be even more intimidating. It doesn’t have to be! Keep up good habits: set clear agendas, have concrete start and end times, go in with a plan and execute. Your customers will be happier if your team figures out problems before customers ever have the chance to experience them.