One of the biggest pains of automating the deployment process is that final step, when everyone agrees it is time to take the accumulated changes and expose them to the public. This is the point at which systems go down and users find themselves suddenly cut off. Scheduling such changes for off-peak periods helps, but an “off period” is rarely a period in which no users are impacted. One of the goals of DevOps should be to minimize impact to users, since the point is to make releases more repeatable and consistent. Blue/green deployment is one of the processes introduced to make the final rollout step of continuous delivery (CD) smoother.
This process, first proposed (to my knowledge) by Jez Humble and David Farley in the book “Continuous Delivery,” calls for two environments that mirror each other: one serving production, and one holding the new release for final testing. Then, when it is time to release, the router is changed to send users to the “test” environment, and it becomes, for all intents and purposes, production. The problem with this simplistic description is that everyone on the system at the time of “cutover” is “cut off,” since the new system doesn’t know what they were doing. That’s not the best outcome for users; in fact, it’s downright disruptive.
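The simple cutover can be pictured with a small sketch. Everything here is an assumption made for illustration: the Router class is a hypothetical in-memory stand-in, not any real router's or load balancer's API, and “blue” and “green” are just labels for the two mirrored environments.

```python
class Router:
    """Toy router that sends every request to whichever environment is live."""

    def __init__(self, live, idle):
        self.live = live  # environment currently serving users
        self.idle = idle  # mirror environment holding the new release

    def route(self, request):
        # Every request goes to the live environment, whatever the user was doing.
        return self.live

    def cut_over(self):
        # Swap roles in one step: the idle environment becomes production.
        self.live, self.idle = self.idle, self.live


router = Router(live="blue", idle="green")
before = router.route("GET /cart")  # served by blue
router.cut_over()
after = router.route("GET /cart")   # now served by green
```

Note that the sketch does nothing for a user who was mid-task on the old environment when cut_over() ran, which is exactly the disruption described above.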
Like Load Balancing Quiesce
This is not that far off from the type of thing I’ve talked about using load balancing for, and F5 Networks (where I got the idea) has talked about it for a while. You can tell a good load balancer to stop offering new connections to server X in the pool, let it settle, and then take it out of service for maintenance. The same technology could be used to say, “Let existing users finish off on their existing connections, but do not allow new connections to these servers,” directing all new connections/users to the new deployment while not dropping the connections of existing users who presumably are in the middle of doing something.
Now in load balancing, you have to consider how long to allow users to stay connected before arbitrarily ending their session, but in the world of REST APIs, where each request is short-lived, this is not as important as it is with longer-lived, client-server style connections. In the end, though, there must be a defined point where everyone is forced to the new systems. At that time, merely making the switch on the load balancer and taking the old production systems out of the pool will force users to connect to the new systems.
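A minimal sketch of that drain-then-force behavior follows. It is a toy model under stated assumptions, not how any particular load balancer implements it: the DrainingBalancer class, the session IDs, the pool names, and the 300-second drain window are all hypothetical, and time is passed in as a plain number of seconds so the example is easy to follow.

```python
class DrainingBalancer:
    """Toy load balancer that drains the old pool before a forced cutover."""

    def __init__(self, old_pool, new_pool, drain_seconds):
        self.old_pool = old_pool
        self.new_pool = new_pool
        self.drain_seconds = drain_seconds
        self.sessions = {}         # session id -> pool serving that session
        self.drain_started = None  # stays None until the new release goes live

    def start_drain(self, now):
        # From this moment, the old pool accepts no new sessions.
        self.drain_started = now

    def route(self, session_id, now):
        draining = self.drain_started is not None
        if draining and now - self.drain_started >= self.drain_seconds:
            # Deadline reached: everyone is forced to the new systems.
            self.sessions[session_id] = self.new_pool
        elif session_id not in self.sessions:
            # New session: goes to the new pool once draining has begun.
            self.sessions[session_id] = self.new_pool if draining else self.old_pool
        # Existing sessions inside the drain window keep their current pool.
        return self.sessions[session_id]


lb = DrainingBalancer("blue", "green", drain_seconds=300)
first = lb.route("alice", now=0)         # existing user lands on blue
lb.start_drain(now=10)
during = lb.route("alice", now=20)       # still on blue while draining
newcomer = lb.route("bob", now=20)       # new user goes straight to green
after_cutoff = lb.route("alice", now=310)  # deadline passed, moved to green
```

The length of the drain window is the judgment call: too short and users are cut off mid-task, too long and the old systems stay in service indefinitely.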
It Works for A/B Testing, Too
A similar process can be used to try out new features on a subset of users. Once there is a load balancer in place, and you know what you want to measure to determine whether users like a new feature, an instance or two running the new code can be spun up behind the load balancer and start accepting connections. The load balancer will naturally start putting those servers to use, and soon you’ll have a subset of users seeing the new features, while most blithely continue on with the old. When testing is done, you can decide whether to withdraw the servers/instances with the new code or move all servers in the pool to the new code.
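To see why the subset appears “naturally,” here is a rough sketch that assumes plain round-robin distribution. The server names, the pool of eight old and two new instances, and the 1,000 users are made up for illustration; a real load balancer may use a different algorithm and different weights.

```python
from collections import Counter
from itertools import cycle

# Hypothetical pool: eight servers on the old code, two on the new feature.
old_servers = [f"old-{i}" for i in range(8)]
new_servers = ["new-0", "new-1"]
pool = cycle(old_servers + new_servers)  # simple round-robin over the pool

# Assign each arriving user to the next server and remember the choice,
# so a given user keeps seeing the same version for the whole test.
assignments = {f"user-{n}": next(pool) for n in range(1000)}

# Count how many users ended up on each version of the code.
share = Counter(server.split("-")[0] for server in assignments.values())
```

With two of the ten servers running the new code, round-robin sends one user in five to the new feature. Adding or removing new-code instances is then the knob that controls the size of the test group.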
Overall, reducing deployment disruption becomes a requirement as your organization moves to more frequent releases under CD. Look for solutions like blue/green deployment to make those releases as non-disruptive as possible.