We’ve come a long way. Just five years ago, with virtual machines (VMs) and decent monitoring, I would never have considered active automated system repairs for the average organization.
There were many reasons for that. Monitoring was scattered across tools and took a lot of work to pull together. Even once logging was consolidated, the volume of monitoring logs, which were the best source of information for automated recovery, was generally huge. And the time it took a VM to start up, plus the disk space a fully installed VM consumed, both worked against automated response.
But that was then, this is now. And the world has changed. A lot.
Most companies using cloud or container-based architectures with DevOps are doing monitoring. Tools such as DataDog and ExtraHop between them cover the application side of performance and the network-wide responsiveness of the application infrastructure. Most container-based architectures use one or both, with alerting, to keep track of their systems’ and instances’ performance. Add in container management systems, and the infrastructure to support system-wide automation is starting to look pretty darned complete.
Beyond Monitoring
The most advanced DevOps and automation shops have gone a step further and are using these tools to do automated management on top of monitoring. While tooling varies, and I do not mean to present the above two as the only solutions, they illustrate the two sides of monitoring well: application and environment (in the case of ExtraHop, networking). Most organizations have a wider selection, including in-house tooling and other monitoring systems. But a few have stepped up to basing the decision to auto-correct on these tools. If ExtraHop is showing extreme response times from one instance while other instances are responding just fine, that scenario can trigger spinning up a new instance and taking the old one down. The same is true if DataDog shows one instance taking forever to process requests while other instances are performing admirably. And if overall system response time is lagging, new instances can be added automatically, either by spinning them up directly or by calling the container management system to do so.
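To make that concrete, here is a minimal sketch of the decision logic in Python, assuming you can already pull per-instance latency numbers out of your monitoring tools. The function name, thresholds and data shape are illustrative assumptions, not any vendor’s real API; the replace/scale-out actions would be wired to your orchestrator or container management system.

```python
# Minimal sketch: replace a single slow outlier, scale out when the whole
# fleet is lagging. Thresholds and the shape of the latency data are
# assumptions; the returned actions would be wired to your orchestrator.
from statistics import median

OUTLIER_FACTOR = 3.0   # an instance is "failing" if it is 3x slower than the fleet median
FLEET_SLO_MS = 500.0   # overall response-time target (assumed value)

def plan_actions(latencies_ms: dict[str, float]) -> list[tuple[str, str]]:
    """Return (action, target) pairs from one round of per-instance latency samples."""
    actions: list[tuple[str, str]] = []
    fleet_median = median(latencies_ms.values())

    # One instance far slower than its peers: take it out and bring up a replacement.
    for instance, latency in latencies_ms.items():
        if latency > OUTLIER_FACTOR * fleet_median:
            actions.append(("replace", instance))

    # Everyone is slow: the fleet is undersized, so add capacity instead.
    if not actions and fleet_median > FLEET_SLO_MS:
        actions.append(("scale_out", "fleet"))

    return actions

if __name__ == "__main__":
    samples = {"web-1": 120.0, "web-2": 135.0, "web-3": 950.0}
    print(plan_actions(samples))  # [('replace', 'web-3')]
```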
In the case of the failing machine, companies that have been doing this for any length of time will happily tell you to keep the instance you took offline so you have a record of what went wrong and can fix it. Heartbleed would have been a different beast had this practice been followed, and others will give you the same story: “It’s hard to troubleshoot when the evidence is destroyed.” So while many companies tend to think, “If we pulled it down and replaced it, we don’t need it anymore,” the better approach is to set things up, from the start, to keep the container and take the time to examine it, so you can determine what went wrong before it goes wrong again on a different instance.
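Here is a small sketch of that “quarantine, don’t destroy” idea, using an in-memory stand-in for whatever your load balancer or orchestrator actually exposes; the Registry class and its methods are hypothetical, not a real API.

```python
# Sketch: pull the suspect instance out of rotation and tag it for
# investigation instead of terminating it. Registry is a stand-in for the
# load balancer / orchestrator; names here are illustrative, not a real API.
from dataclasses import dataclass, field

@dataclass
class Registry:
    serving: set[str] = field(default_factory=set)
    quarantined: dict[str, str] = field(default_factory=dict)

    def quarantine(self, instance: str, reason: str) -> None:
        """Stop routing traffic to the instance but keep it (and its evidence) alive."""
        self.serving.discard(instance)
        self.quarantined[instance] = reason
        # In a real setup: deregister from the LB, snapshot disks/logs, notify the team.

if __name__ == "__main__":
    reg = Registry(serving={"web-1", "web-2", "web-3"})
    reg.quarantine("web-3", "p99 latency 3x the fleet median")
    print(reg.serving)       # traffic now goes only to web-1 and web-2
    print(reg.quarantined)   # web-3 is preserved for the post-mortem
```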
No Muss, No Fuss
The idea of the “self-healing network” is certainly nothing new. The difference is that machine speed, monitoring tools, management tools and platforms have all evolved to the point where all of us can implement it. That doesn’t eliminate the work of tracking down responsiveness problems, but it does offer a stopgap that keeps many applications up and responsive while you figure out why they’re having issues.
One key is to have an alert system that lets the responsible team know what has happened. If automation replaces an instance, the team needs to know which instance it took down so it can grab the files and figure out what happened. If three of those notices come across in 10 or 15 minutes, you have a definite problem that needs direct, human intervention.
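One way to encode that rule of thumb is a simple sliding-window counter over replacement notices. The 15-minute window and the threshold of three come from the paragraph above; everything else here is an assumed shape for the alerting hook, not a specific tool’s interface.

```python
# Sketch: every automated replacement records a notice; three notices inside
# a 15-minute window mean the automation is papering over a real problem and
# a human should be paged. Window and threshold match the rule of thumb above.
from collections import deque
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=15)
THRESHOLD = 3

class ReplacementAlerts:
    def __init__(self) -> None:
        self.events: deque[datetime] = deque()

    def record(self, when: datetime) -> bool:
        """Log a replacement notice; return True if the team should be paged."""
        self.events.append(when)
        # Drop notices that have aged out of the window.
        while self.events and when - self.events[0] > WINDOW:
            self.events.popleft()
        return len(self.events) >= THRESHOLD

if __name__ == "__main__":
    alerts = ReplacementAlerts()
    start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    for minutes in (0, 6, 12):  # three replacements in 12 minutes
        escalate = alerts.record(start + timedelta(minutes=minutes))
    print(escalate)  # True: page a human instead of letting automation churn
```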
For clarity, my caveats every time I say “this works” are simple: network issues (inside or outside your network) may be causing perceived performance problems, and these systems can’t do much about that; hardware failures are not resolved by dropping an instance and spinning up a new one; and so on. But until monitoring and management alerts reach an alarming stage, the application is likely humming along, and the team can focus on enhancements rather than on spinning up replacement instances. The larger the server base, the more important this is.
New products, enhancements to existing products, bug fixes, improved architectures: concentrating on these instead of on keeping the system alive is always a win.
So if you’re not using automated management, it is certainly worth a look. Every environment is unique, and this approach fits some better than others, so I’ll leave it to you to evaluate it in light of your architecture. But if it can free up your team’s time, I highly recommend you give it that look.