DevOps.com


Proactive Monitoring

By: Thomas Theakanath on July 22, 2015

I think most DevOps discussions center on testing a code change, packaging it and deploying it to production in an automated fashion. However, stinging site-down issues eventually force any nascent production engineering team to make monitoring a high priority.


If functional, integration and acceptance tests are run to verify a code change before it is deployed to production then, in my opinion, similar checks should be part of monitoring to make sure that the application also works as designed in production. In production, a code change is not the only factor that can affect the application. OS updates, network configuration changes, automatic upgrades of third-party tools, the availability of system resources and so on can also affect it. Yet I have still to come across a framework that integrates testing and monitoring efforts.


In a traditional sysadmin-centric environment, the scope of monitoring doesn’t go beyond the infrastructure layer. The operations team that owns monitoring might not have visibility into what needs to be monitored beyond infrastructure. For infrastructure itself, out-of-the-box solutions are readily available on popular monitoring platforms like Nagios.

Here is an attempt to classify the monitoring requirements in production and to document how each type can be implemented. The requirements are generic and apply to any system, whether hosted in-house, in a data center or in the cloud. A systematic and proactive plan for testing code changes and monitoring production helps to prevent incidents in production. Classifications of tests, and the tools to implement them, have been around for a while and hardly need additional attention, so this article concentrates on monitoring.

Monitoring Infrastructure

The infrastructure that hosts an application environment is made up of multiple components: servers, storage devices, load balancers and so on. Checking the health of these devices is the most basic requirement of monitoring. The popular monitoring platforms support this out of the box. Very little customization is required beyond setting the right alerting thresholds on those metrics.
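As a rough illustration of threshold-based alerting, the sketch below classifies disk usage against warning and critical thresholds and returns the exit codes that Nagios-style plugins conventionally use (0 for OK, 1 for WARNING, 2 for CRITICAL). The 80 and 90 percent thresholds, the function names and the choice of disk usage as the metric are assumptions made for the example.

```python
import shutil

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(used_percent, warn=80.0, crit=90.0):
    """Map a usage percentage to a plugin exit code (thresholds are assumed)."""
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (exit_code, message) for the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    used_percent = 100.0 * usage.used / usage.total
    code = classify(used_percent, warn, crit)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[code]
    return code, f"DISK {label} - {used_percent:.1f}% used on {path}"
```

A plugin wrapper would typically print the message and exit with the returned code, so that the monitoring platform can decide whether to alert.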

Monitoring Platform

An application would typically be built using multiple third-party tools: databases, both RDBMS (MySQL, Postgres) and NoSQL (MongoDB, Couchbase, Cassandra); full-text search engines (ElasticSearch); BigData platforms (Hadoop, Spark); messaging systems (RabbitMQ); memory object caching systems (Memcache, Redis); and BI and reporting tools (Microstrategy, Tableau). Checking the health of these application components is important too. Most of these tools provide some interface, mainly a REST API, that can be leveraged to implement plugins on the main monitoring platform.
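To make the plugin idea concrete, here is a minimal sketch that polls the Elasticsearch cluster health endpoint (`/_cluster/health`, which reports a `status` of green, yellow or red) and translates the answer into Nagios-style exit codes. The base URL, the timeout and the decision to treat an unreachable endpoint as CRITICAL are assumptions; other platform components would need their own mapping.

```python
import json
import urllib.request

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Elasticsearch reports cluster status as green, yellow or red.
STATUS_TO_CODE = {"green": OK, "yellow": WARNING, "red": CRITICAL}


def evaluate(health):
    """Translate a parsed cluster-health document into (exit_code, message)."""
    status = health.get("status")
    code = STATUS_TO_CODE.get(status, UNKNOWN)
    return code, f"cluster status is {status}"


def check_cluster(base_url="http://localhost:9200", timeout=5):
    """Fetch the health document over REST; failures are reported as CRITICAL."""
    url = f"{base_url}/_cluster/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return evaluate(json.load(resp))
    except (OSError, ValueError) as exc:
        # URLError is an OSError; malformed JSON raises a ValueError subclass.
        return CRITICAL, f"health endpoint unreachable or invalid: {exc}"
```

Keeping `evaluate` separate from the HTTP call means the status mapping can be tested without a running cluster.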

Monitoring Application

A healthy infrastructure and platform are not enough for an application to function correctly. Buggy code from a recent deployment, third-party component issues or incompatible changes in external systems can cause application failures. Application-level checks can be implemented to detect such issues. As mentioned before, a functional or integration test would unearth such issues in a testing/staging environment, and an equivalent of that test should be implemented in the production environment as well.

Application-level monitoring can be simplified by building hooks or API endpoints into the application. Monitoring is usually an afterthought, and the need for such instrumentation is overlooked during the design phase of an application. The participation of the DevOps team in design reviews improves the operability of a system. Planning for application-level monitoring in production is one area where DevOps can provide input.
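One possible shape for such a hook is a small health endpoint that runs a set of named checks and reports the result as JSON. The sketch below uses only the Python standard library; the `/health` path, the check names and the 200/503 convention are assumptions for illustration, and the two checks are placeholders for whatever an application actually depends on.

```python
import json
from http.server import BaseHTTPRequestHandler


def run_checks(checks):
    """Run named check callables; a check passes if it returns truthy."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "fail"
        except Exception as exc:  # a raising check counts as a failure
            results[name] = f"error: {exc}"
    healthy = all(value == "ok" for value in results.values())
    return healthy, results


# Hypothetical checks; a real application would test its own dependencies.
CHECKS = {
    "config_loaded": lambda: True,
    "arithmetic_sanity": lambda: 1 + 1 == 2,
}


class HealthHandler(BaseHTTPRequestHandler):
    """Serves GET /health with 200 when all checks pass and 503 otherwise."""

    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        healthy, results = run_checks(CHECKS)
        body = json.dumps({"healthy": healthy, "checks": results}).encode()
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

The handler could be served with `http.server.HTTPServer(("127.0.0.1", 8080), HealthHandler).serve_forever()` (the address and port are arbitrary), after which the monitoring platform only needs an HTTP check against `/health`.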

Monitoring Business

Applications run in production to meet certain business goals. An application can run flawlessly on a healthy infrastructure and the business might still miss its goals. It is important to give the business that feedback as early as possible so that it can take corrective action, which might mean enhancing application features and/or changing the way the business is run using the application. These efforts should only complement the more complex BI-based data analysis methods that can provide deeper insight into the state of the business. Business-level monitoring can be based on the transactional data readily available in the data repositories and on the data aggregates generated by the BI systems.

Both application-level and business-level monitoring are company specific, and plugins have to be developed for such requirements. Implementing a framework that gives the monitoring platform access to standard sources of information, such as databases and REST APIs, can minimize the need to build plugins from scratch every time.
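A very small version of such a framework might consist of one helper that turns any single-value SQL query into a metric and another that compares the metric with an expected minimum. The sketch below uses an in-memory SQLite database purely for demonstration; the `orders` table, its rows and the minimum of 5 are made up, and a real deployment would point the helper at its own data repository.

```python
import sqlite3


def metric_from_sql(conn, query, params=()):
    """Run a query that returns a single numeric value and return it."""
    row = conn.execute(query, params).fetchone()
    return row[0] if row and row[0] is not None else 0


def check_minimum(value, minimum, label):
    """Return (ok, message); the check fails when value is below minimum."""
    ok = value >= minimum
    state = "OK" if ok else "ALERT"
    return ok, f"{state} - {label}: {value} (minimum expected {minimum})"


def demo():
    """Build an in-memory table with made-up rows and evaluate one check."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany(
        "INSERT INTO orders (status) VALUES (?)",
        [("completed",), ("completed",), ("failed",)],
    )
    completed = metric_from_sql(
        conn, "SELECT COUNT(*) FROM orders WHERE status = ?", ("completed",)
    )
    return check_minimum(completed, 5, "completed orders")
```

With helpers like these, adding a new business check is mostly a matter of supplying a query and a threshold rather than writing another plugin.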

Last-Mile Monitoring

A monitoring platform deployed in the same cloud or data center as the applications it watches cannot check on the end-user experience. To address that gap, several SaaS products are on the market, such as Catchpoint and Apica. These services are backed by actual infrastructure that monitors applications from specific geographical locations. For example, if you are keen to know how your mobile app performs on iPhones in Chicago, that can be tracked using the service provider’s testing infrastructure in Chicago.

Alerts are set up on these tools to notify the site reliability team if the application is not accessible externally or if it has performance issues.

Log Aggregation

In a production environment, a huge amount of information is written to various log files by the operating system, the platform components and the application. These logs get some attention when issues happen and are normally ignored otherwise. Traditional monitoring tools like Nagios could do little with constantly changing log files beyond alerting on some patterns.

The advent of log aggregation tools like Logstash, Loggly and Splunk changed that. Using aggregated and indexed logs, it is possible to detect issues that would earlier have gone unnoticed. Alerts can be set up based on the information available in the indexed log data. For example, Splunk provides a custom query language for searching the index for operational insights. Using the APIs these tools provide, the alerting can be integrated with the main monitoring platform.

To leverage the aggregation and indexing capabilities of these tools, the application or supporting scripts can generate structured output that the log aggregation tool then indexes. The aggregated data can also feed reporting applications. For example, if storage usage on a set of computing nodes has to be tracked, a daily storage usage report can be generated on those nodes for the aggregation tool to track. Weekly and monthly aggregates can be computed once the daily aggregates are available.
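The storage example could be sketched as a script that emits one JSON line per node per day, a format that log aggregation tools can generally parse into fields. The field names and the `daily_storage_usage` event label below are illustrative assumptions, not a schema required by any of the tools mentioned.

```python
import json
import shutil
from datetime import date


def storage_record(node, path, used_bytes, total_bytes, day):
    """Build one structured record; field names are illustrative."""
    return {
        "event": "daily_storage_usage",
        "date": day.isoformat(),
        "node": node,
        "path": path,
        "used_bytes": used_bytes,
        "total_bytes": total_bytes,
        "used_percent": round(100.0 * used_bytes / total_bytes, 1),
    }


def to_log_line(record):
    """Serialize a record as a single JSON line with stable key order."""
    return json.dumps(record, sort_keys=True)


def report_local(node, path="/"):
    """Measure the local filesystem and return the line to append to a log."""
    usage = shutil.disk_usage(path)
    record = storage_record(node, path, usage.used, usage.total, date.today())
    return to_log_line(record)
```

Run daily from a scheduler and appended to a file the aggregation tool already collects, each line becomes one indexed event from which weekly and monthly figures can be derived.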

Monitoring the Monitoring

It is important to make sure that the monitoring infrastructure itself is up and running. Disabling alerting during a deployment and forgetting to enable it afterwards is one of the common oversights I have seen in operations. Such missteps are hard to monitor, and only an improvement in the deployment process can address them.

Pinging hosts

If there are multiple instances of the monitoring application running, or if there is a stand-by node, then cross checks can be implemented to verify the availability of hosts used for monitoring.
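A cross check between monitoring nodes can be as simple as each node testing whether its peers accept connections. The sketch below uses a TCP connection attempt rather than an ICMP ping, since ICMP usually needs elevated privileges; that substitution, the timeout and the idea of passing peers as (host, port) pairs are assumptions for the example.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable_peers(peers, timeout=3.0):
    """Return the (host, port) pairs from `peers` that could not be reached."""
    return [peer for peer in peers if not is_reachable(peer[0], peer[1], timeout)]
```

Each monitoring node would run `unreachable_peers` against the other nodes on a schedule and raise an alert through an independent channel when the list is not empty.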

In AWS, CloudWatch can be used to monitor the availability of an EC2 node.

Health-check for monitoring

Checking the availability of the monitoring UI and the activity in the monitoring application’s log files ensures that the monitoring system itself is fully functional and continues to watch for issues in the production environment. If a log aggregation tool is used, tracking the monitoring application’s log files there is the most effective way to check for activity. The same index can also be queried for potential issues using standard keywords like “Error” and “Exception”.
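Where no log aggregation tool is available, a similar health check can be approximated directly against the log file. The sketch below treats the log as active if it was modified recently and scans it for the keywords mentioned above; the five-minute freshness window and the function names are assumptions, and the log path would depend on the monitoring application in use.

```python
import os
import time

KEYWORDS = ("Error", "Exception")


def log_is_active(path, max_age_seconds=300, now=None):
    """Return True if the file was modified within the last max_age_seconds."""
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) <= max_age_seconds


def find_problem_lines(path, keywords=KEYWORDS):
    """Return the lines in the log file that contain any of the keywords."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [
            line.rstrip("\n")
            for line in handle
            if any(keyword in line for keyword in keywords)
        ]
```

A stale log suggests the monitoring application has stopped working even if its process is still running, which is exactly the case a simple process check would miss.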

Conclusion

Monitoring efforts are normally a response to issues that happen in production. A systematic approach to rolling out monitoring can minimize the reactive mode it is normally associated with. Proactive monitoring contributes to a better user experience, and it avoids costly data reprocessing and rollbacks in production.

