DevOps Tools for the Monitoring Ecosystem

DevOps is a cultural and process transformation whose goal is a higher rate of change and higher-quality software. Several categories of tools facilitate this transformation through automation and increased measurement and visibility. You can’t make or observe progress without measurement, and throughout your transformation you need to maintain a high degree of visibility in order to move quickly and safely. The idea of cultural and process transformation is daunting for many organizations, and adopting the tools to help you on that journey can be even more so, as it requires a great deal of learning.

An integrated set of DevOps tools for monitoring has the power to improve visibility and productivity, achieve higher-performing systems and establish cross-functional collaboration. The right toolset is more than the tools themselves—it’s about developing the culture, discipline and practices that come to define your product/service and your workplace.

This article outlines some of the best DevOps tools and practices in the monitoring ecosystem, with a focus on those that help development and operations teams work together effectively.

Monitoring Tools

A good monitoring platform lets you monitor infrastructure and application performance, whether on-prem, in the cloud or across containerized environments—so you have complete visibility into every system, all the time. Whether you want to monitor Kubernetes, IoT devices or bare metal, the right monitoring tool helps make it possible.

An effective monitoring tool improves system performance and productivity, and helps you reduce downtime. You can adequately plan for upgrades and new projects, and better allocate your time and resources. You can detect problems—and solve them—before they impact users.

There are a ton of great monitoring tools out there; I won’t go into them all here, but I’ll highlight the top three in terms of popularity.

Love it or hate it, Nagios is still very widely used and formed the foundation for monitoring as we know it today. Plus, the Nagios service check is actually awesome (and underappreciated). Where Nagios falls short is monitoring at scale, in any environment, and especially in ephemeral ones such as Kubernetes and Docker. Enter Prometheus, an open source monitoring tool (and, like Kubernetes, a part of the CNCF) that uses a pull-based architecture: instead of services pushing metrics to the monitoring tool, Prometheus pulls (scrapes) metrics from the services. Because Kubernetes components expose metrics in a format Prometheus can scrape, Prometheus is often the go-to for monitoring Kubernetes, but it falls short when it comes to monitoring older infrastructure (see more pros and cons of Prometheus here).
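To make the pull model concrete, here is a minimal sketch using only the Python standard library. It is not how Prometheus itself is implemented: the metric names, the `render_metrics`, `scrape` and `demo` helpers and the tiny in-process server are all assumptions for illustration. The service exposes its current values at `/metrics` as plain `name value` lines, and the monitoring side fetches them on its own schedule.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical metric values; a real service would update these as it runs.
METRICS = {"app_requests_total": 42, "app_queue_depth": 3}


def render_metrics(metrics):
    """Render metrics as 'name value' lines, one sample per line."""
    return "".join(f"{name} {value}\n" for name, value in sorted(metrics.items()))


class MetricsHandler(BaseHTTPRequestHandler):
    """Service side: expose current metric values over HTTP."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def scrape(url):
    """Monitoring side: pull the metrics page and parse it into a dict."""
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as response:
        text = response.read().decode("utf-8")
    samples = {}
    for line in text.splitlines():
        if line and not line.startswith("#"):
            name, value = line.rsplit(" ", 1)
            samples[name] = float(value)
    return samples


def demo():
    """Start the service on a free local port, scrape it once, shut down."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        return scrape(f"http://127.0.0.1:{port}/metrics")
    finally:
        server.shutdown()
        server.server_close()
```

The point of the design is that the service only has to publish its state; how often to collect it, and from which targets, is decided entirely by the scraper.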

That brings me to Sensu, which offers the best of both worlds: you can reuse Nagios service checks in Sensu and monitor ephemeral environments at scale. If you have a strictly container-based environment, then Prometheus is a good choice; if you have a mix of multi-generational infrastructure (which we’re finding more often than not with our customers), then Sensu is the way to go.
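Part of what makes the Nagios service check reusable is how small its contract is: a check is any program that prints a line of status text and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The sketch below is an illustrative disk-usage check written to that convention; the thresholds and the `evaluate`, `check_disk` and `main` names are assumptions, not part of any particular tool.

```python
import shutil

# Exit codes used by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(percent_used, warn=80.0, crit=90.0):
    """Map a usage percentage onto a check status (thresholds are illustrative)."""
    if percent_used >= crit:
        return CRITICAL
    if percent_used >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (exit_code, message) for disk usage at the given path."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # The check itself could not run, which is different from a bad result.
        return UNKNOWN, f"UNKNOWN - cannot read {path}: {exc}"
    percent = usage.used / usage.total * 100
    status = evaluate(percent, warn, crit)
    return status, f"{STATUS_NAMES[status]} - {path} is {percent:.1f}% full"


def main():
    """Print the status line and return the exit code."""
    status, message = check_disk()
    print(message)
    return status

# As a standalone check script, finish with: sys.exit(main())
```

Because the interface is just an exit code and a line of output, any scheduler that understands the convention can run the same script unchanged.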

Configuration Management Tools

Configuration management tools allow you to automate the provisioning and deployment of systems, enforce desired configurations and remediate configuration drift. By modeling your infrastructure as code, you can apply software delivery practices such as version control, automated testing and continuous delivery to infrastructure and applications.
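As a rough illustration of enforcing a desired configuration and remediating drift, the sketch below compares a desired state with the actual state, plans the changes and applies them. The package-to-version dictionaries and the `plan` and `apply_plan` helpers are hypothetical, and real tools manage far more than package versions; the aim is only to show why a second run against a converged system does nothing.

```python
def plan(desired, actual):
    """Return the actions needed to bring the actual state in line with the desired state."""
    actions = []
    for name, version in sorted(desired.items()):
        if name not in actual:
            actions.append(("install", name, version))
        elif actual[name] != version:
            actions.append(("set_version", name, version))
    # Anything present but not declared counts as drift and is removed.
    for name in sorted(set(actual) - set(desired)):
        actions.append(("remove", name, actual[name]))
    return actions


def apply_plan(actions, actual):
    """Apply planned actions to a copy of the actual state and return the result."""
    state = dict(actual)
    for operation, name, version in actions:
        if operation == "remove":
            state.pop(name, None)
        else:
            state[name] = version
    return state


# Hypothetical example: one package has drifted, one is missing, one is unmanaged.
desired = {"nginx": "1.24", "openssl": "3.0"}
actual = {"nginx": "1.18", "telnet": "0.17"}
converged = apply_plan(plan(desired, actual), actual)
```

Running `plan(desired, converged)` afterwards returns an empty list, which is the idempotence that makes repeated, automated runs safe.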

Automating work that used to be manual, repetitive and error-prone results in greater speed, predictability and scalability, along with the assurance of standardized configurations across test, development and production environments. Eliminating snowflake servers saves time (and headaches) and lets you deploy software faster and more reliably.

Popular configuration management tools include Ansible, Puppet and Chef. Ansible is written in Python, is agentless and is generally described as taking a procedural (imperative) approach, as opposed to a declarative one. Puppet relies on a declarative approach to configuration management and uses its own domain-specific language and an agent/master architecture. Finally, there’s Chef, which is written in Ruby and Erlang and is modeled similarly to Puppet.

Alerting Tools

Alerting tools provide both actionable and informational system alerts, and can be customized to fit the complexities of your systems. For example, your alerting system needs to be sensitive enough to catch an outage, but not so sensitive that it fires on frequent, intermittent problems that users aren’t going to see and that would inundate you with needless alerts.
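One common way to tune that sensitivity is to alert only after several consecutive failed checks, so a single blip never pages anyone. The sketch below shows the idea; the `ConsecutiveFailureFilter` class and its default threshold of three are assumptions for illustration, not the behavior of any specific alerting product.

```python
from collections import deque


class ConsecutiveFailureFilter:
    """Raise an alert only after `threshold` consecutive failed checks."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.recent = deque(maxlen=threshold)  # most recent check results
        self.alerting = False

    def observe(self, healthy):
        """Record one check result; return 'alert', 'resolve' or None."""
        self.recent.append(healthy)
        window_full = len(self.recent) == self.threshold
        if not self.alerting and window_full and not any(self.recent):
            self.alerting = True
            return "alert"
        if self.alerting and healthy:
            self.alerting = False
            return "resolve"
        return None


# Two isolated failures are ignored; a sustained failure alerts, then resolves.
checks = [True, False, True, False, False, False, False, True]
alert_filter = ConsecutiveFailureFilter(threshold=3)
events = [alert_filter.observe(result) for result in checks]
```

The trade-off is detection delay: with checks every minute and a threshold of three, a real outage takes about three minutes to notify, so the threshold should reflect how long users can tolerate a problem going unnoticed.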

Alerting tools help lay the foundation for your alerting policies, so you can determine who to notify, how to track issues and outcomes and how to prioritize remediation. Popular tools include PagerDuty, which offers an on-call management platform with add-ons for analytics, event intelligence and automated incident response. There’s also ServiceNow, which utilizes automated workflows for ITSM as well as customer service and business processes. You can also route alerts into Slack, consolidating them in the same platform you use for collaborating with your team.

Metrics Storage

Once you’ve automated configuration management, alerting and monitoring, you’ll have a whole lot of data at your disposal to learn from. The challenge: How do you securely store and analyze it? You need a storage system that lets you aggregate and learn from system capacity, user behavior, service levels, security risks and more.

The insights you gain from your metrics inform decisions across all layers of your business, improving your ability to meet SLAs, satisfy customer expectations and make the case for new strategic investments. Data-driven decisions promote a culture of continuous learning and improvement.

Popular tools for storing metrics include InfluxDB and TimescaleDB, time-series databases (TSDBs) that are well suited for long-term storage, and Splunk, which uses a search engine database model to store and query your data. Amazon Web Services (AWS) supports a wide range of storage purposes, including relational and non-relational databases, a data warehouse for analytics, a TSDB, a ledger database to store transactions and more.

Visualization Tools

A visualization tool might be considered the pièce de résistance of your DevOps toolchain for monitoring: You get to combine all of your data, sort and visualize it, and display it on customizable dashboards.

Visualization tools provide context and meaning, allow you to track changes and improvements over time and give management a real-time view that helps guide strategic decisions. Customization options make it easy for team members to design and share their own dashboards.

Grafana is a popular open source visualization tool and can be used on top of a variety of different data stores, including Graphite, InfluxDB and Elasticsearch. For more on how Grafana can integrate with your monitoring and TSDB, check out this use case.

Evaluating Your DevOps Tools

No matter where you are in your DevOps journey, it’s wise to re-evaluate the tools you’re using and identify where you can fine-tune. Think about the DevOps tools in the monitoring ecosystem as more than their capabilities. How you use them begins to define your habits, values and work culture and, accordingly, the quality of your product or service and the value you bring to your users.

— Sean Porter