Nagios is not a monitoring strategy

When I visit clients to talk about DevOps, I usually ask them what their monitoring strategy is. Too often, the answer I hear is “We use Nagios”. I think Nagios is a great tool, but it sure is not a strategy. Nagios does a good job of monitoring infrastructure. It will alert you when you are running out of disk, CPU, or memory. I call this reactive monitoring. In other words, Nagios is telling you that your resources are getting maxed out and you are about to have issues. Proactive monitoring focuses more on the behavior of the applications and attempts to detect when metrics are starting to stray from their normal baseline numbers. Proactive monitoring alerts you that the system is starting to show symptoms that can lead to degraded performance or capacity issues, which is preferable to Nagios telling you that you are about to be screwed. With reactive monitoring, it is not uncommon for customers to start complaining at about the same time that the Nagios alerts start going off. The goal of proactive monitoring is to head off issues so that customers don’t even notice.
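To make "straying from a baseline" concrete, here is a minimal Python sketch. It assumes you already collect a window of recent samples for some application metric; the function names, the sample response times, and the threshold of three standard deviations are all hypothetical choices for illustration, not a recommendation for any particular tool.

```python
from statistics import mean, stdev


def deviation_from_baseline(history, latest):
    """Return how many standard deviations `latest` sits from the mean of `history`."""
    if len(history) < 2:
        raise ValueError("need at least two baseline samples")
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return 0.0 if latest == baseline else float("inf")
    return (latest - baseline) / spread


def is_straying(history, latest, z_threshold=3.0):
    """True when the latest sample is far enough from the baseline to investigate."""
    return abs(deviation_from_baseline(history, latest)) >= z_threshold


# Hypothetical response times in milliseconds (illustrative data only).
baseline_ms = [120, 118, 125, 122, 119, 121, 123, 120]

print(is_straying(baseline_ms, 124))  # within normal variation
print(is_straying(baseline_ms, 180))  # well outside the baseline
```

The point of the sketch is that the alert is driven by how the application normally behaves rather than by a resource hitting its ceiling. A real system would likely account for seasonality and use a rolling window, which this example leaves out.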

The next question I ask is “What things are you monitoring?” A typical answer revolves around various infrastructure assets and databases. That’s a good start, but there is much more to consider. But first, let’s talk about why proactive monitoring is so critical. In the pre-cloud days, we shipped software to our customers, who would install it, perform capacity planning, manage the infrastructure, and handle day-to-day operations. Once we shipped the code, we were done. In today’s world, we are no longer shipping product. Instead, we are delivering services that are always on. The customer no longer owns and operates the infrastructure and the software. Instead, they pay for a service and expect that service to run reliably all the time. To meet those expectations, we need a more robust monitoring strategy. We need to monitor more than just the infrastructure.

A good monitoring strategy starts by identifying all of the actors who need access to data and all of the categories of data that need to be tracked. Some metrics are monitored in real time while others are mined from log data. Every good monitoring strategy is accompanied by a sound logging solution. In order to perform analytics that predict trends within the data, one must collect a range of data points, including customer usage activity, security controls, deployment activities, and much more. The following presentation goes into much more detail about the different areas that should be monitored and why different actors need these data points to perform their jobs.

The bottom line is, before building in the cloud, it pays to invest some time in a sound monitoring strategy. Too often I have seen teams that don’t think through how to support these highly distributed, always-on SaaS solutions and end up delivering software that does not meet the reliability and quality expectations of customers. Monitoring provides feedback to developers, product owners, operators, and even customers so that systems can be continuously improved. Nagios is great, but there is no single monitoring solution that can be implemented to effectively operate today’s always-on services.

About the author  ⁄ Mike Kavis

Mike is a VP/Principal Architect for Cloud Technology Partners and heads up their DevOps practice. Mike was the CTO for MDot Network, which won the 2010 AWS Global Startup Challenge. Mike is also the author of "Architecting the Cloud: Design Decisions for Cloud Computing Service Models (IaaS, PaaS, SaaS)".

  • LJ

    Nagios alone is not a monitoring strategy, true. But if your Nagios alerts are only going off after your customers are crying, you have two big problems: 1) Badly set thresholds – your alerts need to go critical before your service degrades; 2) Badly written checks – if all you do is monitor basics, you are doing it wrong.

    Both of these indicate a lack of understanding of how your system works, and a possible architectural issue. If your system only has two states, perfect and disaster, you will be down more than up. Build a fault-tolerant application stack, monitor for the warning signs of degradation with Nagios, and use other metrics (log parsing, graphing of metrics, etc.) to help you find what might become a problem (or what caused an existing problem).

    Monitoring (alerting, really) and metrics are two sides of the system health care coin. You need them both, but you can’t just pull a bunch of canned plugins in and call it complete or effective. Someone will have to configure, write scripts, set up graphs, and refine it.
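
The threshold point raised in the comment above can be sketched in Python. The sketch follows the standard Nagios plugin convention of return codes 0 (OK), 1 (WARNING), 2 (CRITICAL), and 3 (UNKNOWN); everything else, including the queue-depth metric, the function names, and the threshold values, is a hypothetical example rather than a real plugin.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Map a metric value to a Nagios status, warning well before critical."""
    if value is None:
        return UNKNOWN  # the metric could not be read
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def run_check(queue_depth, warn=500, crit=2000):
    """Hypothetical application-level check on a work queue's depth."""
    status = classify(queue_depth, warn, crit)
    print(f"QUEUE {LABELS[status]} - depth={queue_depth}")
    return status


# A real plugin would end with sys.exit(status) so Nagios sees the code.
status = run_check(queue_depth=750)  # prints "QUEUE WARNING - depth=750"
```

The design choice worth noting is the gap between the two thresholds: the warning level is assumed to sit well below the point where the service degrades, so operators have time to act before customers notice.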