The Amazon S3 storage service recently experienced a widespread five-hour outage in its East Coast region. Many next-generation consumer and business applications rely on cloud storage, so the S3 outage quickly cascaded and temporarily crippled organizations from Netflix to Slack.
Cloud outages like this are normal, but they are notable because they affect so many businesses. The last significant AWS outage was in August 2016. While Amazon S3 is architected for data durability, durability doesn’t equal fast recoverability during an outage. As good as cloud platforms are, they still leave a few gaps:
- Availability zones don’t equal recoverability. S3 is designed to withstand the loss of a single site within a region, but as the most recent outage shows, networking issues can lead to a widespread outage across an entire region.
- Data still needs to be backed up. As we noted in a previous blog post, even Amazon recommends backing up data.
- Recovery can be slow and tedious. It’s one thing to back up data. It’s another thing entirely to recover it. It can take hours or days to recover data after a failure—especially for hyperscale applications and databases.
- Data is often in one “basket.” If backup data is stored in the same cloud service as the primary data, in the same availability zone, there’s no way to recover data during a widespread outage.
- Data can get compromised or enter an inconsistent state. The cloud itself doesn’t protect data from application- or database-level corruption, or human error.
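The "one basket" gap above is easy to check for mechanically. The following is a minimal sketch, assuming you can list where each dataset and its backup copies live. The `Location` type, the `at_risk` function, and the inventory entries are hypothetical names made up for illustration, not part of any AWS API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Hypothetical description of where a copy of the data lives."""
    provider: str  # e.g. "aws"
    region: str    # e.g. "us-east-1"


def at_risk(inventory: dict[str, tuple[Location, list[Location]]]) -> list[str]:
    """Return datasets that have no backup outside the primary's provider and region.

    A dataset with no backups at all is also reported, because all() over an
    empty list is True.
    """
    risky = []
    for name, (primary, backups) in inventory.items():
        if all(backup == primary for backup in backups):
            risky.append(name)
    return risky


# Illustrative inventory; the dataset names and regions are made up.
inventory = {
    "orders-db": (Location("aws", "us-east-1"), [Location("aws", "us-east-1")]),
    "user-media": (Location("aws", "us-east-1"), [Location("aws", "us-west-2")]),
    "audit-logs": (Location("aws", "us-east-1"), []),
}
print(at_risk(inventory))  # ['orders-db', 'audit-logs']
```

Only `user-media` passes, because it is the only dataset with a copy that a regional outage of the primary would not also take down.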
A data backup, recovery and continuity strategy needs to be designed with the cloud in mind. To make sure you can recover quickly, even from a cloud outage:
- Keep backup data in another service or region. Failures like this one often affect an entire region. A backup and recovery strategy needs to include the ability to recover in another region, cloud service or even a private cloud.
- Have a fast recovery process. Traditional backup solutions and scripting-based approaches can’t recover data quickly, particularly if the application needs to be recovered to a different topology.
- Have point-in-time recovery. Since data can get compromised at the early stages of an outage, being able to restore applications quickly to a point in time is important as well.
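The first and last points in this list can be combined into a single selection rule: restore from the most recent backup taken at or before a chosen point in time, ignoring any copy stored in the region that is down. Here is a minimal sketch of that rule. The `Snapshot` type and `pick_restore_point` function are hypothetical names for illustration, and a real tool would also have to handle consistency across nodes, which this sketch does not attempt.

```python
from __future__ import annotations

from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical record of one backup copy."""
    taken_at: datetime
    region: str


def pick_restore_point(
    snapshots: list[Snapshot], target: datetime, failed_region: str
) -> Snapshot | None:
    """Return the latest snapshot at or before `target` that is outside `failed_region`."""
    # Keep only copies that are still reachable, ordered oldest to newest.
    usable = sorted(
        (s for s in snapshots if s.region != failed_region),
        key=lambda s: s.taken_at,
    )
    times = [s.taken_at for s in usable]
    # bisect_right gives the count of snapshots taken at or before the target.
    idx = bisect_right(times, target)
    return usable[idx - 1] if idx > 0 else None
```

If data started going bad at 11:30, calling `pick_restore_point(snapshots, datetime(2017, 3, 1, 11, 30), "us-east-1")` returns the newest copy from before that moment that lives somewhere other than `us-east-1`, or `None` if no such copy exists. The date and region here are placeholders.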
The cloud’s as-a-service architecture is more resilient than traditional infrastructure and gives you agility. But when failures do happen, they are entirely out of your control. Don’t ignore the recoverability and resiliency of your data just because it’s in the cloud, and don’t expect the same recovery processes and tools to work on next-generation, hyperscale applications.