As any team charged with keeping systems up and running will tell you, having one person on call isn’t enough. What happens if your on-call engineer sleeps through an alert? What if their phone’s battery dies without them noticing, or an alert arrives at an inconvenient moment, like when they’re stuck on a bus or in traffic? Sooner or later, it will happen.
In today’s always-on world, companies can’t afford for that to happen without a backup plan. One or more people should always be waiting in the wings, ready to spring into action whenever the primary on-call engineer is unable to respond.
Engineers on backup duty don’t need to be as “on-call” as the primary engineer (no pagers needed here), but what backups lack in readiness they can make up for in numbers: it often makes sense to have several backups, or even an entire team of them.
Operationally mature companies often group the current on-call engineer and all of their backups into an escalation policy, which sets the order in which each person is alerted and the delay before each subsequent alert. There are many ways to organize an escalation policy; the patterns below are among the most common.
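Concretely, and independent of any particular tool’s API, an escalation policy can be modeled as an ordered list of alert targets, each with a timeout before the incident escalates to the next level. Here is a minimal Python sketch; the level names and timeout values are purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class EscalationLevel:
    target: str           # the person or on-call schedule alerted at this level
    timeout_minutes: int  # how long to wait for an acknowledgement
                          # before escalating to the next level

# An escalation policy is simply the ordered list of levels.
policy = [
    EscalationLevel("primary on-call schedule", timeout_minutes=15),
    EscalationLevel("secondary on-call schedule", timeout_minutes=15),
    EscalationLevel("engineering manager", timeout_minutes=30),
]
```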
Primaries Aren’t Enough; Designate a Secondary, Too
Most companies have a primary on-call engineer, usually determined by a calendar that designates when members of an ops team are on call.
Operationally mature companies take their on-call practice one step further and supplement this rotation with a rotation of secondary on-call engineers. Why a secondary rotation instead of just paging the whole team when the primary falls through? If a large team is paged, you get a “tragedy of the commons” effect: it’s urgent that somebody respond to the issue, but no individual feels that the urgency falls on them in particular. Having a designated secondary removes that ambiguity.
The secondary rotation is set up to shadow the primary rotation and generally has the same members. The difference is that the order of engineers in the secondary rotation is usually offset from that of the primary rotation: if your primary and secondary are the same person, you’re going to have a bad time. For example, if the primary rotation is Alex, then Bob, then Charlie, the secondary rotation might be Bob, then Charlie, then Alex. Companies often find that making last week’s primary this week’s secondary helps ensure there’s somebody with recent context on hand if the primary runs into trouble. Having both a primary and a secondary on-call rotation is becoming more commonplace, and in an environment where customers expect zero downtime, secondary rotations are no longer an optional tool.
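As a quick illustration of that offset (using the names from the example, which are of course hypothetical), a secondary rotation is just the primary roster shifted by some number of positions:

```python
def shifted_rotation(primary: list[str], offset: int) -> list[str]:
    """Return the same roster with its order shifted by `offset` positions."""
    offset %= len(primary)
    return primary[offset:] + primary[:offset]

primary = ["Alex", "Bob", "Charlie"]

print(shifted_rotation(primary, 1))   # ['Bob', 'Charlie', 'Alex'], as in the text
print(shifted_rotation(primary, -1))  # ['Charlie', 'Alex', 'Bob'], so last week's
                                      # primary becomes this week's secondary
```

Any nonzero offset avoids the same person holding both roles at once; an offset of minus one is the variant that keeps someone with fresh context as the backup.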
Enable a Strong Support System
Larger companies, and companies with a significant volume of operations tasks, often deploy a separate support team that handles basic operations tasks before they are escalated to an on-call engineer. This team, the first-tier support team, usually knows how to handle the smaller, repetitive problems that crop up often and have clear resolution procedures. First-tier teams often support multiple engineering teams and are placed first in an escalation policy.
If the first-tier team cannot resolve an issue, the escalation policy typically passes it to the primary and then the secondary on-call engineer. However, there is one more support network operationally mature organizations must implement: management. Management is ultimately responsible for a company’s systems and needs to be in the know when severe problems happen, both to hold their teams accountable and to muster additional support. At PagerDuty, many of our escalation policies share a rotating manager schedule as their final level, and many of our customers use this pattern as well.
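As a sketch of that pattern, here is one rotating manager schedule shared as the final level of several teams’ policies. All names, the weekly rotation, and the structure are hypothetical, not any particular product’s API:

```python
# One rotating manager schedule, shared across several escalation policies.
manager_schedule = ["Dana", "Evan", "Fran"]  # rotates weekly, like any schedule

def on_call_manager(week: int) -> str:
    """Who is on the shared manager rotation in a given week."""
    return manager_schedule[week % len(manager_schedule)]

# Each team's policy lists its own tiers first, then the shared rotation.
escalation_policies = {
    "Database Ops": ["first-tier ops", "primary DBA", "secondary DBA", "manager"],
    "Payments":     ["primary payments", "secondary payments", "manager"],
}

# The final "manager" level resolves to whoever is on the shared
# rotation that week, e.g.:
print(on_call_manager(week=1))  # Evan
```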
Put Your Tiered Teams to Work
In practice, an escalation policy for a hypothetical “Database Ops” team might look like this:
1. Assign the incident to the user who is on call in the First-Tier Ops Team schedule
2. Assign the incident to the user who is on call in the Primary DBA schedule
3. Assign the incident to the user who is on call in the Secondary DBA schedule
4. Assign the incident to the team lead
5. Assign the incident to the dev manager
This escalation policy would unfold over time, with a timeout of 10 to 30 minutes before each subsequent escalation, depending on the team’s needs; the sketch below shows how it might play out.
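To make the timing concrete, here is a hypothetical walkthrough of the policy above. The page() and acknowledged() helpers are stand-ins for a real alerting integration, and the timeout values are illustrative picks from the 10 to 30 minute range:

```python
import random

def page(target: str, incident_id: str) -> None:
    """Stand-in for a real alerting integration (SMS, push, phone call)."""
    print(f"[{incident_id}] paging {target}...")

def acknowledged(incident_id: str, timeout_minutes: int) -> bool:
    """Simulate waiting up to `timeout_minutes` for an acknowledgement."""
    return random.random() < 0.5  # 50% chance this tier responds in time

# The hypothetical "Database Ops" policy above, with illustrative timeouts.
ESCALATION_POLICY = [
    ("First-Tier Ops Team schedule", 10),
    ("Primary DBA schedule", 15),
    ("Secondary DBA schedule", 15),
    ("Team lead", 20),
    ("Dev manager", 30),
]

def escalate(incident_id: str) -> None:
    for target, timeout_minutes in ESCALATION_POLICY:
        page(target, incident_id)
        if acknowledged(incident_id, timeout_minutes):
            print(f"[{incident_id}] acknowledged by {target}")
            return
    # Every level timed out; a real policy would typically repeat
    # from the top rather than give up.
    print(f"[{incident_id}] unacknowledged after all levels")

escalate("DB-1042")
```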
Implementing an escalation policy with multiple tiers of resolution ensures that devops teams address every alert, whether business-critical or minor, in a timely manner. Limiting downtime is paramount in today’s fast-paced business environment: customers expect their applications to run smoothly, and they will switch providers if issues aren’t resolved fast enough. Organizations that implement a multi-tiered escalation policy are far less likely to see their processes fail, and far more likely to see happy customers.