Avoiding Database Migration Pitfalls with Continuous Software Delivery

When people talk about continuous software delivery, they often focus on the delivery mechanism itself, such as automation tooling or branching strategies. What is frequently overlooked is the simple truth that continuous software delivery is worthless without continuous service availability. In other words, if you’re going to frequently update your service, your customers shouldn’t frequently experience disruptions.

Despite all good intentions, the cause of many unintentionally-disruptive deployments are database migrations. This makes sense, as all other components of a service infrastructure are fairly easy to swap out or to roll back. Software directory symlinks can be easily changed, EC2 instances can be easily started, DNS entries can be easily updated, but performing a breaking change on a multi-million row table in a database that’s under heavy load? Not so easy.

Successful Migrations

There are three necessary requirements for a database migration to be considered a success:

The delivery of the service was not disrupted or degraded
The database migration must be repeatable (re-entrant)
The software version can be rolled back to the previous release if needed

The first seems obvious, but can be surprisingly difficult to achieve. The chief reason for this lies in the vagaries of how different database technologies, and even versions of the same technology, handle things like table locking during an ALTER statement. The other reason is a hidden complexity within your application code. For example, a JOIN statement that causes a frequent query to be blocked by a seemingly unrelated table change.

The second criteria means that if a migration fails for some reason, it cannot leave the data in such a state that the migration cannot be run again – essentially, running the same migration twice should not result in a negative outcome.

The third criteria absolutely precludes ever doing a “one way” release where rollback is made impossible due to data destruction. It’s okay to have a release that rules out rolling back several releases; without that, the database would become a giant pile of discarded data. That said, it’s never a good idea to paint yourself into a corner with a release that you cannot possibly revert.

Avoiding Service Disruption

The simplest way to avoid service disruptions is to never perform a database migration. But, that just isn’t feasible, so how do we avoid stepping on a landmine during a migration?

First and foremost, implement a policy of strict database migration reviews. This doesn’t need to be burdensome or time consuming. It just means that someone who is best equipped to understand the impact to the database technology actually gets eyes-on the migration before it is run. This is a role often served by your database administrator, chief architect or most senior developer.

Another good idea is to set up a hook in your code versioning repository, or a test that’s part of your build process, that examines database migrations and looks for certain keywords – and the lack of other keywords. For example, it might “fail the build” if it finds an ALTER TABLE statement that is not accompanied by the word CONCURRENT.

If you’re dealing with a lot of legacy code, or a monolithic application with significant complexity, then there are two very important safeguards you can put in place: load testing and SQL statement logging.

Load testing means having a development environment setup where you can apply a database migration to a database that is under load, i.e., experiencing a level of usage that roughly approximates your production database load and which contains an amount of data similar to production. Then, while your database migration is running, you can run an application smoke test or similar to see if production-like load combined with the migration results in any unexpected slowness or errors.

Remember, all database migrations are fast when they are the only thing running on your database, and faster yet when the testing database is a tiny fraction of the size of the real thing. Going into production feeling confident that a migration took only ten milliseconds to complete on a half-empty database with zero active connections is a fool’s errand.

When you have a poorly-understood or monolithic application, it can be a very enlightening experience to enable SQL statement logging on your production database server. Remember that MyISAM table that never seems to change and against which it should be safe to run a giant blocking migration? SQL statement logging might reveal that every single page load executes a long-forgotten JOIN statement that scans that entire MyISAM table. Or that the user login process looks in that table for reasons nobody can remember. This is also an area where using a solid application performance tool like New Relic can make a very big difference.

Repeatable Database Migrations

This one sounds simple but can actually be a bit tricky to implement. Consider the following scenario, you have a table that contains three columns: id (integer), username (string) and password (string).

As it currently stands, the passwords are stored in plaintext in the database, and your application does a simple string comparison between what the user enters and the results of a SELECT statement. The next version of your application is going to hash the password entered by the user, and compare the hash against a hash returned from the database by SELECT.

There are two possible ways of getting from where you are now, to where you want to be – one is dangerous, and the other is safe.

The dangerous way would be to create a database migration that iterates through the password column and replaces each plaintext value with a hash of that value. Basically a SELECT; md5(); UPDATE; loop. This is dangerous for a number of reasons:

If the migration stops before completing, the database will be left in an unusable state where neither the old nor new code can make use of the entire table.

If the migration fails to complete for any reason, it cannot be run again without corrupting all the previously-migrated values.
If the new code has to be reverted to the old code for any reason, the table is now in a state that cannot be used by the old code.

The safe, and repeatable way of performing this migration actually involves three releases:

Version 2.0 migration adds a new column (pass_hash) and migrates password to it, and new application code must INSERT and UPDATE to both

Version 2.1 application code must now use only the pass_hash column, and ignore the password column.
Version 2.2 migration removes the password column, and new application code continues to use only the pass_hash column.

You can rollback from the first release (2.0), as all the original, unencrypted password data still exists in the database, and the previous code release won’t care about the new column (pass_hash) being present. You can also run that first migration multiple times – just have it remove the pass_hash column if it exists prior to creating it.

Next, you deploy code (2.1) that makes use of the new column only. You might be tempted to bundle the database cleanup migration with it, but don’t – if you need to rollback for any reason, you won’t be able to if you’ve removed the password column. It’s always a good idea to delay breaking changes, like a column deletion or dropping a table, by at least one release.

And finally, run a migration (2.2) that drops the password column, if it exists. That latter condition is so you can run the overall migration more than once if needed. If the password column exists, remove it; otherwise, merrily continue with the rest of the release. Note that at this point in time, the live application code is ignoring the password column entirely. This is important, since the current application code will be using the database while your migration is running. In other words, any migration must be safe to run under both the current and the about-to-be-deployed code version.

In case you hadn’t already guessed, this approach to migrations does take a bit more conscious thought and effort than the old, dangerous way. You’ll need to keep track of “deprecated” bits of your database for future removal. And you’ll need to occasionally write code to intelligently update two columns or tables in lockstep, and then remember to remove that code in the future. That said, it’s definitely worth it in the long run when you’re able to release frequently with zero downtime.

Release Rollback

Nothing produces longer outages and greater negative customer impact than being unable to rollback or revert a release. If you’ve accidentally deployed a major bug into production, you want to quickly revert to the previous version that didn’t have that bug, and then reproduce and fix the bug at your leisure in your development environment. Being unable to reproduce the bug in your development environment is not a valid reason to fail to rollback a release, and it’s an entirely separate discussion apart from migrations.

The essence of rolling back is speed. It must be a fast and reliable process. The number one way to ensure it is anything but fast and reliable is to have a breaking or one-way database migration in the release you wish to roll back. Thus, anytime you are creating a new database migration, ask yourself two simple questions:

What would happen if I ran this migration with the current application version?
What would happen if I ran this migration twice?

If it would change the database in a way that would cause the current application version to explode, then it needs fixing. And if it would change the database in such a way that it can’t be run again, then it needs fixing.

The End Results?

In a nutshell, this approach to migrations means that each migration will be very small and very simple. Expect to get from state A in your database to state B via many safe steps, instead of one giant, dangerous step. Your customers, and your application uptime, will thank you!

About the Author/Nate Lindstrom, VP solutions engineering

Nate Lindstrom is the VP of solutions engineering for NS1, an intelligent DNS and traffic management platform with data driven architecture and a unique Filter Chain routing engine. NS1 is purpose-built for the most demanding, mission-critical applications on the Internet. NS1’s comprehensive platform technology leverages infrastructure, application, and network data to make intelligent routing decisions in real time, ensuring optimal application performance and reliability.