Every now and then, engineering teams will get into trouble for any number of reasons. Sometimes, explosive growth catches the team by surprise. Or, on-call has gotten out of control, with alerts going off every five minutes. Or, development and operations teams simply have stopped seeing eye to eye. Regardless of why, the team is in a bad spot and something needs to be done to resolve it.
If it’s a new problem, the fix might be easy: Add a few more servers, roll back to a known good application version or get everyone together over pizza and beers to clear the air. Often, however, the problem creeps up on you over time and suddenly the hole is so deep you can’t find the way out. At LinkedIn, a team that has gotten to this point will often declare a state known as “Code Yellow.”
Some people assume the name Code Yellow is based on traffic lights, but its origin is geekier: It comes from your favorite “Star Trek” series, where it’s how the crew of the Starship Enterprise indicates their current defensive condition. Either way, the definition is clear: Something is wrong, and we need to move forward with caution. True to both metaphors, we also have a Code Red. That’s better described as an immediate crisis, with everyone working 24 hours a day until it’s resolved. The Code Yellow is slightly more laid-back: This is everyone’s primary focus, but during business hours only. Code Yellows also tend to last on the order of months, while a Code Red should last on the order of days.
Other companies may use a different term than Code Yellow, but the effect is the same: The team is communicating to the rest of the company that they’ve identified a serious problem, and that fixing it is a priority for the success of the team and, therefore, the company. The ability to do this is an important aspect of open and honest communication, a value that is critical to a healthy culture and is often overlooked. Talking about our problems is just as important as celebrating our successes, if not more so. Teams can learn more from fixing a problem than they can from a total success.
This is Not a Failure
The first step in starting a Code Yellow is to understand this is not a failure. There is no shame in admitting that the team has a problem that needs to be fixed. Bugs happen, despite our best efforts to avoid them. The only thing we can do is diagnose the problem and remediate it. The only time we fail is when we turn a blind eye to these problems. This applies to how our engineering teams interact with each other just as much as it does to the software and systems we produce and run.
It is critical, however, that the right problems are addressed. Most of the time, we’ve gotten into the current situation because of a slow boil (increasing technical debt, many small issues or breakdowns in a process) that eventually built up into a crisis. The goal of the Code Yellow must be not only to remediate the current problems (the reactive component), but also to make sure they are not repeated in the future (the proactive component).
Planning for a Successful Code Yellow
There are several components required for a successful Code Yellow, as well as necessary pieces to get buy-in from the rest of the organization:
- Problem Statement: There must be a clear and agreed-upon statement of the problems facing the team that prompted the Code Yellow. This should include not only what the current problems are, but also what the current understanding of their root cause is.
- Exit Criteria: Next, you need to have specific goals that the team will work toward to exit the Code Yellow. These should be traditional SMART goals: specific, measurable, achievable, relevant and time-bound. These goals are what make it possible for the team to enter a Code Yellow in the first place, because they give the effort a fixed scope rather than leaving it open-ended.
- Communication: All information about the Code Yellow, including the announcement (which includes the problem statement and exit criteria), periodic status updates and the successful conclusion, should be sent to the larger organization. This may be your department, or it may be the entire engineering organization. It may even be the entire company, depending on the nature of the problems.
- Project Management: As with all large projects, someone needs to be responsible for organizing the work and communicating information. As this represents an “all hands on deck” scenario for the impacted team, it is usually helpful to have a dedicated project manager (PM) to help with this. This is typically a PM who is knowledgeable about the team and the work, but not directly involved with the execution. This frees up the managers and individual contributors to focus on the work at hand.
Once each of these aspects has been thought through, and the decision has been made to enter Code Yellow, the team’s first act is to reorganize their priorities around the exit criteria. This often means putting quarterly goals on the shelf. It may also be necessary to establish a dedicated meeting for discussing the status of the exit criteria.
Space to Breathe
It is all well and good for a team to enter Code Yellow and work with a single focus on the goals that have been set to make things right, but this is not enough for the team to succeed. For true success, everyone surrounding the team must understand the situation and give them the space to do their work. This is the place where a healthy engineering culture rises to meet the challenge.
- Expect Delays: The most common way that a tangential team will be affected is through delays. They must expect that any requests that have been made of the impacted team may be delayed if they are not within the scope of the exit criteria. The Code Yellow involves, at its core, a reordering of priorities to address the stated problem. Outside teams need to factor this in and understand that their own project timelines may need to be adjusted.
- Minimize New Requests: Other teams should also refrain from asking the impacted team for new things that are outside the scope of the defined exit criteria. Minimizing these requests, in addition to accepting delays on any existing requests, allows the impacted team to spend their limited engineering hours on getting to the other side of the Code Yellow.
- Requests for Assistance from Other Teams: The team in Code Yellow may find they need outside help to reach their goals. For example, if there is sudden, explosive growth in traffic, they may need to accelerate the provisioning of new hardware. Finding yourself on the receiving end of a request such as this may require shifting your own priorities. Always remember that all of the teams are part of the same company, and as such, everyone succeeds or fails together.
Engineering teams rarely stand alone, and it is important that everyone understand the value in having those teams working well, and working together well. A short, temporary delay in goals to ensure that this is the case is well worth it.
Light at the End of the Tunnel
Code Yellow represents a significant amount of high-priority work, and working through it will often be stressful for the team. Saying “no” to coworkers who are making a reasonable request is hard, and the work that is in scope rarely involves spending time on interesting new features. In addition, if the problems being addressed include communication issues between groups, there are going to be some difficult conversations that need to happen. However, as the team approaches the end of the work, it will be much easier to see through to what lies on the other side of the exit criteria.
The ultimate goal of the Code Yellow is to get the team out of a reactive mode where they are running from crisis to crisis and into a proactive state where they are able to work on the right big projects. Achieving the exit criteria will mean the engineers are more effective and able to work proactively. This is a stronger team: Engineers are happier because they’re not under heavy stress from operations work, the team is working well because they’re talking to each other effectively, and customers are pleased because requests are handled either through automation or in a reasonable amount of time.
Does your company have an internal process that’s similar to how LinkedIn does a Code Yellow?