Whether you’re an operations engineer or a customer, scheduled maintenance is often treated as a given when delivering software as a service. While far better than surprise outages, planned downtime is still tough for everybody involved: customers will always have off-hours access needs, and the low-traffic windows when you want to perform maintenance are often the same hours your operations engineers and admins would like to be sleeping. On top of this, scheduled maintenance implies your system is less reliable than you think, because you’re afraid to change it during the workday.
How do we make scheduled maintenance better for everyone involved?
Planned maintenance is a non-starter at PagerDuty, since our customers don’t know when their monitoring tools will fire alerts. We’ve had to focus hard and innovate to deliver continuous uptime for our service. Doug Barth, my colleague and a member of our operations team, recently gave a talk at DevOps Days Chicago on the strategies we used to ditch scheduled maintenance altogether and replace it with iterative maintenance that doesn’t put your entire system at risk.
1. Deploy in stages
Your deployments need to be rock-solid. They should be scripted, fast, and quick to roll back, and you should exercise rollbacks periodically to make sure they still work when you need them.
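As a rough sketch of that shape, here is what a scripted deploy with a fast rollback path might look like. The `fetch_build`, `activate`, and `health_check` callables are hypothetical stand-ins for whatever your tooling actually does (an rsync, a symlink flip, a load-balancer check):

```python
def deploy(host, version, current_version, fetch_build, activate, health_check):
    """Deploy `version` to `host`; roll back to `current_version` on failure.

    `fetch_build`, `activate`, and `health_check` are injected so the
    same fast path is used for both deploy and rollback.
    """
    fetch_build(host, version)           # stage the new build alongside the old one
    activate(host, version)              # atomic switch, e.g. a symlink flip
    if not health_check(host):
        activate(host, current_version)  # rollback reuses the same fast path
        raise RuntimeError(f"{version} failed health check on {host}; rolled back")
    return version
```

Because rollback is just another call to `activate`, it stays as fast as a deploy, and running it regularly (as suggested above) keeps it from bit-rotting.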
They also need to be forward and backward compatible by one version, because stopping the presses every time you push out a new version is not an option. Red-blue-green deployments are crucial here: by splitting your fleet into three groups, they ensure only a third of your infrastructure undergoes changes at any given time.
Lastly, stateless apps must be the norm. You should be able to reboot an app server without any customer-visible effect (like forced logouts or lost shopping carts).
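One common way to get there is to keep session state out of app-server memory entirely. A minimal sketch, where a plain dict stands in for a shared store such as Redis or a database (all names here are hypothetical):

```python
# The shared store lives outside any single app instance; in production
# this would be Redis, memcached, a database, etc.
SHARED_STORE = {}

class StatelessApp:
    """An app instance that holds no per-session state of its own."""

    def add_to_cart(self, session_id, item):
        cart = SHARED_STORE.setdefault(session_id, [])
        cart.append(item)
        return cart

# Any instance, including one booted mid-session, sees the same cart,
# so rebooting instance `a` costs the customer nothing:
a, b = StatelessApp(), StatelessApp()
a.add_to_cart("sess-1", "widget")
b.add_to_cart("sess-1", "gadget")  # cart is now ["widget", "gadget"]
```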
2. Avoid knife-edge changes
Use canary deploys judiciously to test rollouts, judge their integrity and compare results. These test deployments affect only a small segment of your system, so bad code or an unexpected error doesn’t spell disaster for your entire service.
During his talk, Doug suggested a few practical ways to accomplish this:
- Gate features so you can put out code dark and slowly apply new features to a subset of customers.
- Make sure you’re testing the changes on a representative sampling of your customer base: different account sizes, usage patterns, and configurations.
- Find ways to slowly bleed traffic over from one system to another, to reduce risk from misconfiguration or cold infrastructure.
- Run critical path code on the side. Execute it and log errors, but don’t depend on it right away.
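The feature-gating idea in the first bullet can be sketched with a deterministic hash bucket, so a feature ships dark (0%) and then ramps to a growing subset of customers without flapping between requests. This is an illustrative sketch, not PagerDuty’s actual gating code:

```python
import hashlib

def feature_enabled(feature, customer_id, rollout_pct):
    """Return True if `customer_id` is in the first `rollout_pct` percent
    of buckets for `feature`. Hashing makes the assignment stable: the
    same customer stays in or out as the percentage only ever grows."""
    digest = hashlib.sha256(f"{feature}:{customer_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_pct
```

Ship the code with `rollout_pct` at 0 (dark), then raise it for a representative slice of customers while you compare results.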
3. Make retries your new best friend
Your system should be loaded with retries. Build them into all service layer hops, and use exponential backoffs to avoid overwhelming the downstream system. The requests between service layers must be idempotent, so that you can reissue requests to new servers without double-applying changes.
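A minimal sketch of such a retry wrapper, assuming the wrapped request is idempotent (a timeout may mean the downstream system actually applied the change, so reissuing must be safe):

```python
import time

def call_with_retries(request_fn, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call an idempotent `request_fn`, retrying with exponential backoff.

    The delay doubles each attempt (0.1s, 0.2s, 0.4s, ...) to avoid
    hammering a downstream system that is already struggling.
    """
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            sleep(base_delay * (2 ** attempt))
```

In a real system you would also add jitter to the delay and cap it, so a fleet of retrying clients doesn’t synchronize into waves.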
Use queues to decouple the client from the server wherever you don’t need the response. If you’re stuck with a request/response flow, use a circuit breaker approach, where your client library returns minimal results if a service is down, reducing front-end latency and errors.
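A toy circuit breaker along those lines might look like this (a sketch, not any particular library’s API): after a few consecutive failures it "opens" and returns the fallback immediately, then lets a trial request through once a cool-off period has passed.

```python
import time

class CircuitBreaker:
    """After `threshold` consecutive failures, return `fallback` without
    calling downstream until `reset_after` seconds have passed."""

    def __init__(self, threshold=3, reset_after=30.0, clock=time.monotonic):
        self.threshold, self.reset_after, self.clock = threshold, reset_after, clock
        self.failures, self.opened_at = 0, None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback  # fail fast: minimal result, no added latency
            self.opened_at, self.failures = None, 0  # half-open: try again
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # trip the breaker
            return fallback
        self.failures = 0
        return result
```

The fallback is whatever minimal result your front end can render gracefully: an empty list, a cached value, a "temporarily unavailable" stub.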
4. Don’t put all of your eggs in one basket
Distribute your data to many servers, so that no one server is so important you can’t safely work on it.
At PagerDuty, we use multi-master MySQL clusters, which help with operations and vertical scaling. We also use decentralized, linearly scalable databases like Cassandra for our alerting pipeline: since transactions aren’t gated on the behavior of a single node, we can do operational work during the day.
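To make the distribution idea concrete, here is a toy placement sketch (not PagerDuty’s or Cassandra’s actual logic, which uses consistent hashing with virtual nodes): each key is hashed to several replica nodes, so any single server can be taken down for daytime maintenance without making its data unavailable.

```python
import hashlib

def replicas_for(key, nodes, n_replicas=3):
    """Pick `n_replicas` distinct nodes for `key` by hashing the key to a
    starting node and walking the ring. With replicas on several servers,
    no one server is so important you can't safely work on it."""
    start = int(hashlib.sha256(key.encode()).hexdigest(), 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(n_replicas)]
```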
Put together, these four strategies set an organization up to ditch annoying weekly downtime and help admins and operations engineers sleep more, worry less, and maintain better, all ahead of schedule. Not only is this great for your team’s quality of life, it also helps you deliver a great customer experience, no matter where your users are or what time it is.