Everyone wants their application SLA to be 99.99999%. For some applications, however, that is not a desire but a requirement. The consequences of failure in mission-critical applications are drastic, and for commercial applications a failure damages reputation and drives up internal support costs.
Just ask the administration of London’s Heathrow Airport, the busiest airport in Europe, which a few weeks ago suffered a large computer meltdown. This led to flight scheduling chaos worldwide, at a level never seen before. The reason was the inability of the Swanwick workstations, used by air traffic controllers, to load balance.
High availability can only be achieved with comprehensive planning and careful system design. It can be a complex undertaking with far-reaching operational implications.
You can oversee high availability planning at two levels:
The system level:
- Capacity Planning: Anticipating the number of users and requests at different times and dates is not an easy task. Regular reviews of traffic loads and event logs help to create a utilization baseline from which future trends can be predicted and analyzed. Identify the infrastructure resources (network bandwidth, memory, processors, number of nodes in a cluster etc.), then measure their utilization or performance and compare it to the maximum available capacity. This tells you how much headroom each resource has. By projecting the utilization trend forward and comparing it to the maximum available capacity, you can estimate when a given resource is likely to exhaust its spare capacity. Capacity planning is a continuous activity and should be a regular part of the software delivery pipeline run by the DevOps team.
- Redundancy Planning: This means duplicating system components so that the failure of no single component (power, servers, storage, networks etc.) can bring the whole application down. Having only one of a component is just asking for trouble, and a single point of failure (SPOF) should be avoided at all costs. In a perfect world, each system component would have a redundant one, in case the primary died or stopped responding … but we live in the real world.
Also, if the infrastructure is out of your control and you are subject to the Cloud vendor’s SLA, consider using multiple Cloud service providers, or investing in flexible SDN technology that makes your networking portable, so that you can move entire multi-tier environments between Clouds without reconfiguration.
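The forecasting step described under Capacity Planning can be sketched in a few lines. This is a minimal illustration, assuming utilization is sampled at even intervals and grows roughly linearly; the function name `periods_until_exhaustion` and the sample numbers are hypothetical, and real traffic usually calls for a model that handles seasonality.

```python
from typing import Optional, Sequence


def periods_until_exhaustion(utilization: Sequence[float],
                             capacity: float) -> Optional[float]:
    """Fit a least-squares line to evenly spaced utilization samples and
    return how many periods after the last sample the trend reaches capacity.

    Returns None when the trend is flat or falling (no exhaustion predicted).
    """
    n = len(utilization)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")

    # Sample indices 0..n-1 act as the time axis.
    mean_x = (n - 1) / 2
    mean_y = sum(utilization) / n
    slope_num = sum((x - mean_x) * (y - mean_y)
                    for x, y in enumerate(utilization))
    slope_den = sum((x - mean_x) ** 2 for x in range(n))
    slope = slope_num / slope_den
    if slope <= 0:
        return None

    intercept = mean_y - slope * mean_x
    crossing = (capacity - intercept) / slope  # index where the line hits capacity
    return max(0.0, crossing - (n - 1))        # periods remaining after the last sample


if __name__ == "__main__":
    # Hypothetical weekly peak bandwidth in Mbit/s on a 100 Mbit/s link.
    weekly_peaks = [50, 55, 60, 65]
    print(periods_until_exhaustion(weekly_peaks, capacity=100))
```

Running the same calculation for each resource (bandwidth, memory, processors, node count) gives a rough ranking of which one will run out of headroom first.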
The failure protection level:
Anticipating system issues in advance, a pre-mortem, is an important effort. Assume something will break, and reverse engineer where and how. Issues can be caused by a variety of hardware, storage and software problems such as:
- Incorrect configuration of clusters. Is the configuration file/repository on one of the nodes different from the others?
- Failure to monitor the cluster state, which can lead to unnoticed degradation in some of the components (for example “noisy” Ethernet adapters).
- Mismatch of a cluster resource to a physical resource (e.g. a non-existent mount point in Unix, an incorrect Ethernet adapter, etc.).
- Networked storage access issues (critical networked file systems that are not accessible by all nodes).
- Missing redundant SAN I/O paths, or a hidden single point of failure (e.g. public or private links on the same switch/VLAN) etc.
- Software issues, for example different installed software versions/patches, license expiration, incompatible configuration files etc.
Paying attention to these common issues and to those unique to your platform, and understanding the weak points, will help you identify a response method for each. This saves time, and it is also an exercise that helps you identify issues before they happen.
Using configuration management tools also helps you standardize environments and, in case something does happen, allows you to spin up alternate infrastructure on demand.
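The first item in the list above, a configuration file that differs on one node, is easy to check mechanically. The sketch below assumes the configuration text of each node has already been collected into a dictionary (how it is collected is left open); the function name `find_config_drift` and the node names are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def find_config_drift(configs: Dict[str, str]) -> List[str]:
    """Return the sorted names of nodes whose configuration text differs
    from the most common version in the cluster.

    If two versions are equally common, the one seen first is treated as
    the reference.
    """
    groups = defaultdict(list)
    for node, text in configs.items():
        # Hashing lets us compare large files cheaply and log a short fingerprint.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        groups[digest].append(node)

    if len(groups) <= 1:
        return []  # every node has identical configuration

    majority = max(groups.values(), key=len)
    return sorted(node
                  for members in groups.values()
                  if members is not majority
                  for node in members)


if __name__ == "__main__":
    # Hypothetical cluster where one node carries a different setting.
    cluster = {
        "node-a": "heartbeat_interval=2\n",
        "node-b": "heartbeat_interval=2\n",
        "node-c": "heartbeat_interval=5\n",
    }
    print(find_config_drift(cluster))
```

The same grouping idea extends to installed package versions or patch levels: anything that can be rendered as text per node can be fingerprinted and compared.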
With the interactivity of DevOps you can avoid the dreaded phone call that tells you the system went down because of one of the above issues. You should, because of great processes integrated with great tools, get automated warnings before a performance issue occurs.
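One simple way to get a warning before an issue occurs is to set a warning threshold below the critical one, so that there is lead time to react. The sketch below is illustrative only: the metric names and threshold values are assumptions, and a real monitoring tool would supply its own alerting rules.

```python
from typing import Dict, Tuple

# Hypothetical thresholds as fractions of capacity: (warning, critical).
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "cpu": (0.75, 0.90),
    "memory": (0.80, 0.95),
}


def classify(metric: str, value: float) -> str:
    """Map a utilization reading to 'ok', 'warning' or 'critical'."""
    warning, critical = THRESHOLDS[metric]
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


def alerts_for(readings: Dict[str, float]) -> Dict[str, str]:
    """Return only the metrics that need attention."""
    levels = {metric: classify(metric, value)
              for metric, value in readings.items()}
    return {metric: level for metric, level in levels.items() if level != "ok"}


if __name__ == "__main__":
    print(alerts_for({"cpu": 0.80, "memory": 0.50}))
```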
Standardize
DevOps is all about automation, but you cannot automate what you can’t standardize.
Minimize the number of High Availability configurations, and aim to standardize and re-use the same design pattern. Use the same storage architecture, with all clusters running the same software versions and patch levels.
Hidden software variables are hard to spot, but easy to avoid.
Pick the best tool for the job, but try to limit the variation across infrastructure vendors. For example, replicating across technologies from different vendors can make interruption-free failover harder to achieve.
Document and distribute the standards in order to facilitate coherence across your organization for future systems or current system changes.
Use (and test before using) automatic failover
Automatic failover is the most suitable option for organizations needing a zero-downtime environment, but crossing your fingers isn’t enough: you need to test it before you rely on it.
First of all you need a plan. The plan should list all the types of failures that you expect to handle, but start with limited, low-impact tests and gradually ramp up. After simulating a device or link failure, start looking at what other services need to work in order for your applications to continue to function in a failure situation.
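The idea of a plan that starts with low-impact tests and ramps up can be sketched as a small in-memory simulation. It is not a real failover test: an actual test stops services or disconnects links. It only shows the ordering and the pass/fail bookkeeping. All names (`Scenario`, `elect_active`, `run_plan`) and the impact scale are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class Scenario:
    name: str
    impact: int                   # 1 = low impact; higher = more disruptive
    failed_nodes: FrozenSet[str]  # nodes taken down in this scenario


def elect_active(priority: List[str], healthy: Dict[str, bool]) -> Optional[str]:
    """Pick the first healthy node in priority order, or None if all are down."""
    for node in priority:
        if healthy.get(node, False):
            return node
    return None


def run_plan(priority: List[str],
             scenarios: List[Scenario]) -> List[Tuple[str, Optional[str]]]:
    """Run scenarios from lowest to highest impact and record which node
    would serve traffic in each. Stops as soon as a scenario leaves no
    node available, so the cause can be investigated before ramping up."""
    results = []
    for scenario in sorted(scenarios, key=lambda s: s.impact):
        healthy = {node: node not in scenario.failed_nodes for node in priority}
        active = elect_active(priority, healthy)
        results.append((scenario.name, active))
        if active is None:
            break
    return results


if __name__ == "__main__":
    plan = [
        Scenario("both nodes down", 3, frozenset({"primary", "standby"})),
        Scenario("standby down", 1, frozenset({"standby"})),
        Scenario("primary down", 2, frozenset({"primary"})),
    ]
    for name, active in run_plan(["primary", "standby"], plan):
        print(f"{name}: active node = {active}")
```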
Automatically Audit
Auditing your High Availability configurations is the most important factor in ensuring successful recovery. This involves either using a dedicated tool, such as ScriptRock, or a set of custom, in-house scripts. Automated audit tests can and should be run every day.
Some ground rules for a successful auditing environment: at a minimum, automate the collection of important configuration elements (OS and software configuration, cluster configuration, networking configuration, storage allocation etc.), and collect that configuration data regularly. Otherwise it is difficult to carry out a post-mortem analysis when downtime occurs. Keep in mind that automatic data collection reduces the time and effort involved in auditing, testing and anticipating future downtime.
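Once configuration data is collected regularly, comparing two collections is what makes a post-mortem tractable. The sketch below assumes each collection has been flattened into a dictionary of setting names to values; the keys shown and the function name `diff_snapshots` are hypothetical, and it stands in for whatever a dedicated audit tool or in-house script would produce.

```python
from typing import Any, Dict, List, Tuple

MISSING = "<missing>"  # marker for a key absent from one of the snapshots


def diff_snapshots(previous: Dict[str, Any],
                   current: Dict[str, Any]) -> List[Tuple[str, Any, Any]]:
    """Return (key, old_value, new_value) for every setting that was
    added, removed or changed between two configuration snapshots."""
    changes = []
    for key in sorted(set(previous) | set(current)):
        old = previous.get(key, MISSING)
        new = current.get(key, MISSING)
        if old != new:
            changes.append((key, old, new))
    return changes


if __name__ == "__main__":
    # Hypothetical snapshots taken on consecutive days.
    yesterday = {"os.kernel": "3.10.0", "cluster.nodes": 3, "net.mtu": 1500}
    today = {"os.kernel": "3.10.0", "cluster.nodes": 2, "storage.lun_count": 8}
    for key, old, new in diff_snapshots(yesterday, today):
        print(f"{key}: {old} -> {new}")
```

Storing each day's snapshot alongside its diff gives a timeline of changes to consult when downtime has to be explained.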
Create a Collaborative Work Environment
A successful implementation requires correct configuration of network, storage, servers etc. Practical experience shows that deployments done by just one team are usually more error prone. Several people must be educated and engaged, otherwise suboptimal or even incorrect configurations may result. Assemble a High Availability team as well as a High Availability environment. Include members from all relevant departments with experience in high availability principles and technical requirements. The team should plan periodic reviews of the infrastructure and configurations, and define auditing and testing goals together.
The DevOps tooling market brings great opportunities to this classic challenge: solutions that can automatically distribute loads and warn of failures, analytics platforms that help you anticipate issues or track them down after they happen, and load testing tools like BlazeMeter and Visual Studio Online that test your application at peak loads before it gets deployed.
There is no such thing as 100%. However, building some flexibility into your environment and anticipating issues can prevent more of them, and it also lets you respond faster, to the point that failover is automatic.