The “pets vs. cattle” style of server infrastructure management was originally coined by Bill Baker of Microsoft, and Randy Bias of Cloudscaling further popularized it in 2012, revising and clarifying it in this presentation. Essentially, in the original “pets” style, servers are mostly hand-built, named, and individually configured, and are heavily cared for because so much time has gone into their setup. In the newer cloud style of “cattle” management, servers are often merely numbered, created and scaled via automation, and simply removed from service when any issue arises. Colloquially, the removal is termed “shoot the offending node in the head” (STONIH).
In the past, due to poor practices or OS/service instability (NT4 comes to mind), sysadmins and developers often did not bother to figure out the root cause of issues and simply made reboots part of standard operating procedure. In the times before 24/7 global customer bases, always-on broadband, and smartphones, that may not even have been a big deal.
On the surface, cavalierly destroying and replacing misbehaving machines seems eerily similar to those reboots. It’s different this time. Really.
Virtualization itself can cause a special class of issues that are out of your control. Noisy-neighbor problems and underlying VM infrastructure or network issues can often only be resolved by starting a fresh node (possibly in a different logical area).
High Availability (HA)
HA has become more ingrained in systems architecture. In most cases, without inordinate effort, you should be able to smoothly take a bad server out of a cluster or load balancer with little or no downtime. You’re not making your users suffer; in fact, you’re giving them a better experience than potentially landing on a server with intermittent (or worse) issues.
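As a minimal sketch of the idea, the snippet below drains a misbehaving node from a pool so it stops receiving new traffic while the rest keep serving. The `Pool` class and node names are invented for illustration; real load balancers (HAProxy, nginx, cloud LBs) expose equivalent drain/disable operations.

```python
# Hypothetical pool model: drain a bad node out of rotation with no downtime.

class Pool:
    def __init__(self, nodes):
        # node name -> currently in service?
        self.nodes = {n: True for n in nodes}

    def drain(self, node):
        """Stop routing NEW requests to a node; the rest of the pool keeps serving."""
        self.nodes[node] = False

    def in_service(self):
        return [n for n, up in self.nodes.items() if up]

pool = Pool(["web-01", "web-02", "web-03"])
pool.drain("web-02")       # the misbehaving node stops receiving requests
print(pool.in_service())   # -> ['web-01', 'web-03']
```

Users never see the bad node again; you can then stop it, snapshot it, or recreate it at your leisure.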
Following a few proper procedures provides important differentiation from simple reboot techniques, along with further opportunities to investigate.
Automation and Configuration Management
Solid automated provisioning allows you to bring up the EXACT same machine configuration, place it in a different network subnet, or tweak it to have newer OS patches or software versions for testing. Recreation is not a binary choice: you have options in how you do it and opportunities to learn from it.
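The options above can be sketched as cloning a machine spec with targeted overrides. The `base_spec` fields, values, and `respawn` helper here are all hypothetical; in practice this role is played by your configuration management or IaC tooling (Terraform, Ansible, CloudFormation, etc.).

```python
import copy

# Invented example spec -- stand-in for whatever your IaC tool records.
base_spec = {
    "image": "ubuntu-22.04-v41",
    "subnet": "prod-a",
    "packages": {"nginx": "1.24.0"},
}

def respawn(spec, **overrides):
    """Clone a machine spec, optionally tweaking fields (subnet, patch level, ...)."""
    new = copy.deepcopy(spec)
    new.update(overrides)
    return new

exact_copy = respawn(base_spec)                  # the EXACT same configuration
test_copy = respawn(base_spec, subnet="prod-b")  # same box, different subnet
print(test_copy["subnet"])                       # -> prod-b
```

The point is that “recreate the node” can mean a faithful clone, a clone moved elsewhere, or a clone with one deliberate change to test a theory.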
Monitoring has come a long way from simple CPU/disk/memory and service-uptime checks. Many vendors offer deep views into your application through application performance management (APM). It’s also easier to dig into history and see not just that a CPU spike occurred but which processes were hogging cycles at the time, to get at a root cause. You can more readily identify whether an issue lies with your application or with supporting/OS services, and with good automation, simply spinning up a new server to test a theory is easier than excessive pondering.
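The triage step above can be sketched as a query over historical per-process samples: given what was running during the spike, rank the offenders. The sample data below is fabricated; a real APM or metrics backend would supply these numbers.

```python
# Fabricated per-process CPU samples captured during a spike.
samples = [
    {"pid": 4120, "name": "java",   "cpu_pct": 81.5},
    {"pid": 1733, "name": "nginx",  "cpu_pct": 2.1},
    {"pid": 9004, "name": "backup", "cpu_pct": 64.0},
    {"pid": 2210, "name": "sshd",   "cpu_pct": 0.3},
]

def top_offenders(samples, threshold=50.0):
    """Processes above the CPU threshold during the spike, worst first."""
    hogs = [s for s in samples if s["cpu_pct"] >= threshold]
    return sorted(hogs, key=lambda s: s["cpu_pct"], reverse=True)

for proc in top_offenders(samples):
    print(f"{proc['name']} (pid {proc['pid']}): {proc['cpu_pct']}%")
```

Here the spike points at the application (`java`) plus a scheduled `backup` job, not the OS itself, which tells you what to test on the replacement node.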
Centralized logging allows further log inspection even after the node is removed, with much better search, analysis, and correlation capabilities than opening each file in a text editor on the machine itself.
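A small sketch of why that correlation matters: with all logs in one place, one request can be traced across several nodes in a single query instead of grepping files machine by machine. The log format and request IDs below are invented for illustration.

```python
# Invented centralized log store: (node, request_id, message) tuples.
logs = [
    ("web-01", "req-7f3", "GET /checkout 200"),
    ("web-02", "req-9aa", "GET / 200"),
    ("app-01", "req-7f3", "charge card: timeout"),
    ("db-01",  "req-7f3", "slow query: 4.2s"),
]

def correlate(logs, request_id):
    """All lines for one request, across every node, in ingest order."""
    return [(node, msg) for node, rid, msg in logs if rid == request_id]

for node, msg in correlate(logs, "req-7f3"):
    print(f"{node}: {msg}")
```

Even after `db-01` is shot in the head, its slow-query line survives in the central store and still correlates with the failed checkout.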
See you later, not farewell…
The node often doesn’t have to be destroyed. It can merely be pulled out of service or stopped, and it (or its virtual disks) can be examined later if needed. Exercise caution when spinning the node up again if it will auto-join a cluster/farm or start processing jobs, sending emails, etc. on its own. A practiced, isolated network environment can be handy for such further analysis and can double as a test bed for potential security-breach investigation.
By following the guidelines above, STONIH will minimize the time any misbehaving node affects your user experience. Just don’t forget to follow up on the root cause, especially if patterns or recurrences emerge!