Business Continuity in the Azure Cloud: Understanding the Options

There are numerous ways to assure business continuity for applications running in the Azure cloud through its various high availability and disaster recovery provisions. But selecting the best and most cost-effective provisions for each application can be extraordinarily difficult owing to the myriad choices available. At best, a poor choice made during development wastes money. At worst, it causes failover provisions to fail when they are needed in operation.

All business continuity provisions involve hardware and software redundancy, data replication and some means of failover and failback. Purpose-built failover clustering software has long been among the most popular choices based on its proven dependability and cost-effectiveness. The clusters are relatively easy to deploy in an enterprise data center using shared storage. But with no shared storage available in the public cloud, configuring failover clusters in Azure becomes considerably more challenging.

This article examines the options available for high availability (HA) and disaster recovery (DR) provisions within and for the Azure cloud. Special emphasis is given to SQL Server as a particularly popular application for Azure.

Options Available Within the Azure Cloud

The Azure cloud offers redundancy at three levels: within data centers, within regions and across multiple regions. Within data centers, Availability Sets distribute redundant servers across different fault domains located in different racks, which protects against failures at the server and rack levels. This provides no protection, however, during a sitewide failure, such as the one that occurred in September 2018 in Azure’s South Central US Region. The 99.95 percent Service Level Agreement (SLA) guarantees only that, in an Availability Set with two or more servers, at least one will have external connectivity; it does nothing to assure availability at the application level.
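To make the fault-domain idea concrete, here is a minimal sketch of how spreading servers across fault domains limits the damage from a single rack failure. The round-robin placement, the function names and the server names are all assumptions made for illustration; the article does not describe how Azure actually places servers.

```python
from collections import defaultdict


def place_round_robin(vm_names, fault_domain_count):
    """Assign each VM to a fault domain index in round-robin order (assumed policy)."""
    placement = defaultdict(list)
    for i, vm in enumerate(vm_names):
        placement[i % fault_domain_count].append(vm)
    return dict(placement)


def survivors(placement, failed_domain):
    """Return the VMs still running after every VM in one fault domain is lost."""
    return [
        vm
        for domain, vms in placement.items()
        if domain != failed_domain
        for vm in vms
    ]


# Hypothetical three-server Availability Set spread over two fault domains.
placement = place_round_robin(["sql-1", "sql-2", "sql-3"], fault_domain_count=2)
for domain in placement:
    print(f"fault domain {domain} fails -> still running: {survivors(placement, domain)}")
```

Whichever single fault domain fails, at least one server survives. A sitewide failure takes out every fault domain at once, which is why this level of redundancy cannot help in that case.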

To protect against sitewide failures, Azure now offers Availability Zones (AZs). Regions with AZs have at least three data centers interconnected via a high-bandwidth, low-latency network that supports synchronous replication. Azure offers a 99.99 percent SLA for AZs, but again guarantees only that at least one server will have external connectivity, nothing less and nothing more.
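For a sense of scale, the two SLA percentages can be converted into the downtime they permit. The sketch below is only arithmetic. The one-year window is an assumption chosen for illustration, since the SLA itself defines the actual measurement period, and the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(sla_percent, period_minutes=MINUTES_PER_YEAR):
    """Maximum downtime that still satisfies the SLA over the given period."""
    return period_minutes * (1 - sla_percent / 100)


for label, sla in (("Availability Set", 99.95), ("Availability Zones", 99.99)):
    minutes = allowed_downtime_minutes(sla)
    print(f"{label}: {sla}% allows about {minutes:.0f} minutes of downtime per year")
```

Under these assumptions, 99.95 percent allows roughly 263 minutes per year and 99.99 percent roughly 53 minutes. Both figures refer to server connectivity only, not to the application.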

For redundancy during major disasters, Azure offers Region Pairs, in which each region is paired with another in the same geography (e.g., United States or Europe). The paired regions are typically separated by at least 300 miles to protect against widespread disasters that might impact an entire region, including all of its AZs. By pairing regions, Microsoft is able to apply updates to one region at a time to prevent the “update gone bad” scenario, and it will prioritize the recovery of at least one region in each pair during an Azure-wide outage. But again, Azure guarantees only “dial tone” for the servers, leaving it to customers to ensure availability at the application level.

Options Available in OS and SQL Server Software

Windows Server Failover Clustering (WSFC) is a standard operating system feature that is utilized by many applications to provide HA protection in enterprise data centers. However, WSFC requires some form of shared storage, which historically has not been available in any public cloud, including Azure’s.

Microsoft addressed this problem in the Datacenter Edition of Windows Server 2016 by adding Storage Spaces Direct (S2D), a software-defined, virtual storage area network. But because the cluster must reside entirely within a single data center, S2D is incompatible with Availability Zones. Applications that require multisite HA/DR protection will, therefore, need to use third-party failover clustering software, log shipping or some other additional provision(s).

With no equivalent to WSFC or S2D for Linux, HA/DR protection requires either the use of open source software, such as Pacemaker, or a third-party failover clustering solution. Because supporting open source software requires a substantial and ongoing commitment, only the largest organizations have the wherewithal to even consider this do-it-yourself option.

SQL Server, whether for Windows or Linux, offers two of its own HA/DR features: Failover Cluster Instances (FCIs) and Always On Availability Groups. FCIs afford two major advantages: inclusion in the Standard Edition, and protection for the entire SQL Server instance, including system databases. A notable disadvantage is the need for cluster-aware shared storage, including the virtual variety with S2D, which is only supported for SQL Server 2016 and later.

Always On Availability Groups is SQL Server’s more robust HA/DR offering, capable of delivering recovery times of 5-10 seconds and recovery points of seconds or less. Its disadvantages include protection that stops short of the entire SQL Server instance and the need to license the more expensive Enterprise Edition, which can be cost-prohibitive for many applications.

A significant disadvantage of all application-specific options is that DevOps staff must use different HA and/or DR solutions for different applications. Maintaining multiple HA/DR solutions inevitably increases complexity and costs, which is another reason application-agnostic third-party solutions are so popular.

The Third-party Failover Clustering Software Option

Being agnostic with respect to both applications and platforms enables purpose-built failover clustering software to provide a complete HA/DR solution for virtually all Windows and Linux applications. Application-agnosticism eliminates the need to have different HA/DR solutions for different applications. Platform-agnosticism makes it possible to leverage, while not being dependent upon, various capabilities and services within the Azure cloud, making this option suitable for use in private, public and hybrid cloud environments.

Failover clustering solutions include, at a minimum, real-time data replication, continuous monitoring for detecting failures at the application level and configurable policies for failover/failback. All are designed to satisfy mission-critical recovery time and recovery point objectives, and most also offer a variety of value-added capabilities to simplify implementation and management.
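The interplay between application-level monitoring and a configurable failover policy can be sketched in a few lines. This is a simplified illustration, not how any particular clustering product works. The `FailoverPolicy` class, its `failure_threshold` setting and the list of check results are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FailoverPolicy:
    """Hypothetical policy: fail over after this many consecutive failed checks."""
    failure_threshold: int = 3


def first_failover_index(check_results: Iterable[bool], policy: FailoverPolicy) -> Optional[int]:
    """Return the index of the health check that triggers failover, or None.

    Each item in check_results is True for a healthy application-level
    check and False for a failed one.
    """
    consecutive_failures = 0
    for index, healthy in enumerate(check_results):
        if healthy:
            consecutive_failures = 0  # a single success resets the count
        else:
            consecutive_failures += 1
            if consecutive_failures >= policy.failure_threshold:
                return index
    return None


# Simulated check history: one transient failure, then a sustained outage.
history = [True, False, True, False, False, False, True]
print(first_failover_index(history, FailoverPolicy(failure_threshold=3)))
```

Requiring several consecutive failures keeps a single transient glitch from triggering an unnecessary failover, at the cost of a slightly longer detection time. A lower threshold makes the opposite trade-off, which is why such policies are typically left configurable.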

Clustering with Confidence

Whether used individually or in various combinations, all of these options can have a role to play in making HA and DR protections more effective and more affordable for all applications—from those that can tolerate some downtime, to those that demand five 9s of uptime. But be sure that the options chosen afford protection at the application level for all likely failure scenarios.

— Dave Bermingham