DevOps resilience: going active-active with an existing application

What does it take to run an active-active architecture? It boils down to data, and how you interact with it. The most common disaster recovery strategies I encounter with folks are either “hope and pray” on a single location for running an application, or “we have a backup facility we can switch to in a disaster” with an active-passive configuration. This is a shame, as running an active-active architecture is not science fiction, and while it takes some work, for many business use cases the pros greatly outweigh the cons.

The Devil’s in the Data

If you work your architecture from the outside in, an active-active architecture is relatively easy to understand: you have more than one physical datacenter or cloud region where you run your application, you have your application running in each, and you have traffic management to load balance users to the closest, fastest and most available location for them. Where folks get stuck is in the database. How do you make sure a change made in one location gets propagated to the others? What happens if a user “bounces” from one facility to the next and attempts to read back data they just wrote? How do you avoid race conditions where user A at location 1 and user B at location 2 make a conflicting change to the state of the application?

The key to handling the scenarios above is understanding the business requirements of your data, and treating each type of data appropriately for an active-active architecture. If you try to “make all data available everywhere”, you’ll quickly run into either CAP theorem constraints or financial constraints on the high costs of this approach. Instead, let’s break the bigger problem down into smaller, more tractable problems.

Three Classes of Data

For each type of data your application accesses or produces, let’s classify it according to one of three types, using a banking metaphor for simplicity (and for easier explanation to your business peers):

1) Is this data like a bank account balance? It changes frequently, and it’s incredibly important to make sure two conflicting changes don’t “break” the balance. We don’t want user A and user B making a withdrawal from two locations at the same time, thereby potentially overdrawing the account.

2) Is this data like the address and mailing information your bank keeps for your account? It changes, but not frequently, and during a change, it’s OK if there’s some short period of time where the multiple locations that host your application are out of sync for this data. They’ll eventually gain consistency.

3) Is this data like your banking statements? Once produced, they don’t change. They can be archived.

Account balance. Mailing address. Historic statement. Three categories, and for each, we have a data replication strategy that can support active-active. By categorizing the data and treating each with a different replication strategy, we’re accomplishing two very important things:

1) We’re being thoughtful about what data REALLY needs near realtime replication and race condition protection. By doing so, we’re reducing the amount of overall data that needs high priority replication. The less data to replicate, the better and more efficiently replication works.

2) Rather than try to replicate all data with the same strategy, we’re willing to use the right tool in the toolbox for each type of data, which means we’ll save costs by using archival and eventually consistent strategies where they make sense.

For each class, we’ll use an appropriate replication strategy:

1) For “account balance” information, we’ll replicate changes as quickly as possible, and make sure our application is aware and capable of resolving change conflicts. This is certainly one of the hardest pieces to get right, but by reducing the scope of how much data we need to address with this strategy, the problem is much easier to solve than by trying to apply this strategy to all three categories of data.

2) For “mailing address” information, we’ll use an eventually consistent replication strategy.

3) For “historic statements” information, we may choose NOT to replicate this to all of our facilities. We may revisit some old assumptions on what’s important to save and what’s not, and decide to discard some of this information. We may choose to simply regenerate the statements should they be lost. We may choose to bulk replicate this data at certain times of the day.
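To make the three strategies above a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the dataset names, the `Balance` record, the version-check approach to conflict detection for “account balance” data, and the last-writer-wins merge for “mailing address” data. These are common techniques, not the only options, and in a real deployment the version check has to be enforced by whichever data service is authoritative for the record.

```python
from dataclasses import dataclass
from enum import Enum


class DataClass(Enum):
    ACCOUNT_BALANCE = "replicate immediately, detect conflicts"
    MAILING_ADDRESS = "eventually consistent"
    HISTORIC_STATEMENT = "bulk replicate, regenerate, or discard"


# Hypothetical dataset names, each assigned to one of the three classes.
DATASETS = {
    "balances": DataClass.ACCOUNT_BALANCE,
    "customer_profiles": DataClass.MAILING_ADDRESS,
    "monthly_statements": DataClass.HISTORIC_STATEMENT,
}


class ConflictError(Exception):
    """Raised when a write was based on a stale version of the record."""


@dataclass
class Balance:
    amount: int       # stored in cents to avoid floating point rounding
    version: int = 0  # incremented on every successful change


def withdraw(balance: Balance, cents: int, expected_version: int) -> Balance:
    """Apply a withdrawal only if the caller read the latest version."""
    if balance.version != expected_version:
        raise ConflictError("balance changed since it was read; re-read and retry")
    if cents > balance.amount:
        raise ValueError("insufficient funds")
    return Balance(balance.amount - cents, balance.version + 1)


def merge_address(local: tuple[float, str], remote: tuple[float, str]) -> tuple[float, str]:
    """Last-writer-wins merge of (timestamp, address) pairs."""
    return max(local, remote)
```

If user A and user B both read the balance at version 0, the first withdrawal succeeds and bumps the version to 1; the second is rejected with `ConflictError` and has to re-read before retrying, which is what prevents the overdraft.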

The Top Three Benefits

After we migrate an application to active-active, we can expect three key benefits:

1) Faster and more reliable failover in the event of a disaster. As opposed to an active-passive disaster recovery strategy, which depends on idle infrastructure coming up to full production speed in the event of a disaster, we avoid the risk of “we thought it was ready to handle production but something broke in between disaster recovery drills”. In an active-active scenario, we’re sending production traffic to each location all of the time, so we don’t have an “idle to full capacity” ramp-up problem; instead, each surviving location’s share of traffic goes from “50% to 100%” (for two locations going to one) or “33% to 50%” (for three locations going to two). We’ll need to make sure we always have spare capacity in an active-active configuration to handle the additional load during a failure.

2) More flexibility for making changes to applications and infrastructure. During a site maintenance event, we can shift all traffic away from our facility, perform our maintenance, and then pull the traffic back, without downtime.

3) Improved user experience. By connecting users with an application instance that is near where they are in the world, they’ll receive a better and faster user experience than if they were to be all sent to a single location.

There are others, but those are the biggest three.
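The spare-capacity requirement behind the first benefit is simple arithmetic, sketched below. The function names are made up for illustration, and the sketch assumes traffic is spread evenly across identically sized locations, which real traffic rarely does exactly.

```python
def load_after_failure(locations: int, load_per_location: float, failed: int = 1) -> float:
    """Utilization of each surviving location after `failed` locations go down.

    `load_per_location` is the fraction of one location's capacity in use
    before the failure (0.75 means 75%). Assumes evenly spread traffic.
    """
    survivors = locations - failed
    if survivors <= 0:
        raise ValueError("no surviving locations")
    return locations * load_per_location / survivors


def max_safe_load(locations: int, failed: int = 1) -> float:
    """Highest per-location utilization that still survives `failed` failures."""
    if failed >= locations:
        raise ValueError("no surviving locations")
    return (locations - failed) / locations


# Two locations at 75% each: the survivor would need 150% of its capacity.
print(load_after_failure(2, 0.75))
# With two locations, each must stay at or below 50% to absorb a failure.
print(max_safe_load(2))
```

With three locations the ceiling rises to about 67% per location, which is one reason adding a third site can make the spare capacity cheaper to carry.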

The Top Three Drawbacks

Of course, it’s not all sunshine and kittens, and it’s not without work and investment:

1) You’re going to need spare capacity available, and be diligent and have conviction about maintaining a safe buffer for spare capacity. If you run two facilities each at 75% load, you’re going to have a big problem when one fails. Know your limits, and stick to them. Don’t fall behind in investing in this capacity.

2) You’re going to need to change your application. If you were looking for a “drop in solution” to active-active, this is not it. The most common scenario is you’re going from a single database instance on a single database technology to three separate data services that treat each type of data with the appropriate data management strategy. The most common problem is “I used to be able to JOIN these two tables to get a result, but one was an ‘account balance’ type, and the other was a ‘mailing address’ type, and now they’re stored separately”. In this case, your application will need awareness to retrieve the data from each appropriate location, and “JOIN” in software.

This sounds hard and time consuming, but there’s a great strategy for how to implement these changes. Presumably, your application accesses data using a core set of libraries (often an object relational mapping or ORM approach). In these libraries, you’ll want to make the changes for “what type of data is retrieved from where”. This will help minimize the impact to the rest of the application in terms of the interfaces to retrieve data, and keep the “data load balancing” functionality in one location in your codebase. Add in helper functions as appropriate to implement the JOIN functionality you need that previously was done in the database. Keep in mind, if you have a LOT of JOINs across categories of data, this strategy as a whole may not be viable for you.
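As a rough sketch of that idea, the hypothetical `CustomerRepository` below is the one place that knows which store holds which class of data, and it offers a helper that performs the JOIN in software. The two stores are plain dictionaries standing in for whatever data services you actually choose; the class and method names are illustrative assumptions, not part of any particular ORM.

```python
class CustomerRepository:
    """Single place in the codebase that knows where each data class lives."""

    def __init__(self, balance_store: dict[str, int], address_store: dict[str, str]):
        self._balances = balance_store    # "account balance" class of data
        self._addresses = address_store   # "mailing address" class of data

    def get_balance(self, customer_id: str) -> int:
        return self._balances[customer_id]

    def get_address(self, customer_id: str) -> str:
        return self._addresses[customer_id]

    def balances_with_addresses(self, customer_ids: list[str]) -> list[dict]:
        """Software stand-in for an inner JOIN of balances and addresses."""
        rows = []
        for cid in customer_ids:
            # Inner-join semantics: skip customers missing from either store.
            if cid in self._balances and cid in self._addresses:
                rows.append({
                    "customer_id": cid,
                    "balance": self._balances[cid],
                    "address": self._addresses[cid],
                })
        return rows


repo = CustomerRepository(
    balance_store={"c1": 12500, "c2": 300},
    address_store={"c1": "1 Main St", "c3": "9 Elm Ave"},
)
print(repo.balances_with_addresses(["c1", "c2", "c3"]))
```

Only `c1` appears in the result, because it is the only customer present in both stores. The rest of the application calls the repository and never needs to know that the data now lives in two places.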

It’s highly recommended you take a dark architecture approach to these modifications. For some period of time after migrating data to a new strategy, have your application perform reads and writes to BOTH the legacy and new implementations for a single data interaction use case. Compare the values and results, and log/alert when different, while returning the legacy result. This gives you operational experience with the new data strategy while minimizing risk, since the application is still using the legacy implementation for all functionality. After you have confidence that the two approaches are functionally equivalent, you can start using the values from the new implementation, and turn down the legacy implementation. For more details on dark architecture, see here: http://gigaom.com/2013/06/20/making-it-change-less-scary-using-dark-architecture/
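A minimal sketch of the read side of that comparison might look like the following. The `dark_read` helper and its parameters are hypothetical names chosen for illustration; the same pattern applies to writes, with the extra care that the new implementation must never be allowed to fail the user’s request.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("dark_read")


def dark_read(legacy: Callable[[], T], candidate: Callable[[], T], label: str) -> T:
    """Call both implementations, log any difference, return the legacy result."""
    legacy_result = legacy()
    try:
        candidate_result = candidate()
        if candidate_result != legacy_result:
            logger.warning(
                "dark read mismatch for %s: legacy=%r new=%r",
                label, legacy_result, candidate_result,
            )
    except Exception:
        # A failure in the new implementation must not affect the user.
        logger.exception("dark read failed for %s", label)
    return legacy_result


# The caller always gets the legacy value while the new path is being evaluated.
address = dark_read(
    legacy=lambda: "1 Main St",    # stands in for the legacy database query
    candidate=lambda: "1 Main St.",  # stands in for the new data service
    label="get_address(c1)",
)
print(address)
```

Once the mismatch log stays quiet for long enough to give you confidence, the helper can be changed to return the new implementation’s result, and later removed along with the legacy path.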

3) You’re going to be running multiple database and data storage implementations. There’s a definite cost here in training your team, licensing, infrastructure, and having the monitoring and associated operational infrastructure required to run a dedicated strategy for each type of data. You’ll have to understand for your business whether the benefits outweigh the costs.

There are others, but those are the biggest three.

Your Next Step

How important are uptime, infrastructure agility and user experience to your business? Are these investments worthwhile? Only you will be able to tell for your business. Start by quantifying the value of uptime (e.g., how much money do you lose, and how much damage does your brand suffer, during an outage?), the value of agility (e.g., how much faster could your team evolve your application if they had more flexibility to make changes anytime to production without affecting customers?), and the value of user experience (e.g., how much faster could your user experience be by splitting your infrastructure into multiple active-active locations, and how much would your users appreciate that improved experience?). Come up with those figures, and compare them to the technical investments required above, and you’ll have a pretty clear GO/NO-GO for evolving your application to active-active.