I think most DevOps discussions center on testing a code change, packaging it, and deploying it to production in an automated fashion. However, stinging site-down issues eventually force any nascent production engineering team to make monitoring a high priority.
If functional, integration, and acceptance tests are run to verify a code change before it is deployed to production, then in my opinion similar checks should be part of monitoring, to make sure that the application also works as designed in production. In production, a code change is not the only factor that can impact the application. OS updates, network configuration changes, automatic upgrades of third-party tools, availability of system resources, and so on can also affect it. Yet I have not come across any framework that integrates testing and monitoring efforts.
In a traditional sysadmin-centric environment, the scope of monitoring doesn’t go beyond the infrastructure layer. The operations team that owns monitoring might not have visibility into what needs to be monitored beyond infrastructure. For the infrastructure layer itself, out-of-the-box solutions are readily available on popular monitoring platforms like Nagios.
Here is an attempt to classify the monitoring requirements in production. The requirements are generic and apply to any system, whether hosted in-house, in a data center, or in the cloud. A systematic and proactive plan for testing code changes and monitoring production helps prevent incidents in production. The classification of tests and the tools to implement them have been around for a while and hardly require additional attention. This article therefore focuses on classifying the monitoring types and documenting their implementation methods.
Monitoring Infrastructure
The infrastructure that hosts an application environment is made up of multiple components: servers, storage devices, load balancers, and so on. Checking the health of these devices is the most basic requirement of monitoring. The popular monitoring platforms support this out of the box. Very little customization is required beyond setting the right alerting thresholds on those metrics.
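To make the threshold idea concrete, here is a minimal sketch of a disk usage check. It assumes the Nagios-style plugin convention of exit codes 0, 1, and 2 for OK, WARNING, and CRITICAL. The function names and the 80/90 percent thresholds are my own illustrative choices, not defaults of any platform.

```python
import shutil

# Nagios-style plugin exit codes (a widely used convention).
OK, WARNING, CRITICAL = 0, 1, 2


def classify(used_percent, warn=80.0, crit=90.0):
    """Map a usage percentage to an exit code using two thresholds.

    The 80/90 defaults are illustrative assumptions; tune them per metric.
    """
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def disk_used_percent(path="/"):
    """Return the percentage of space used on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total
```

A plugin wrapper would print a one-line status message and exit with `classify(disk_used_percent("/"))`, which is how the monitoring platform learns the state.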
Monitoring Platform
An application is typically built using multiple third-party tools: databases, both RDBMS (MySQL, Postgres) and NoSQL (MongoDB, Couchbase, Cassandra); full-text search engines (ElasticSearch); big data platforms (Hadoop, Spark); messaging systems (RabbitMQ); in-memory object caching systems (Memcache, Redis); and BI and reporting tools (Microstrategy, Tableau). Checking the health of these application components is important too. Most of these tools provide some interface, mainly a REST API, that can be leveraged to implement plugins on the main monitoring platform.
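As an illustration of a REST-based platform check, the sketch below queries the Elasticsearch cluster health endpoint (`/_cluster/health`), which reports a `status` of green, yellow, or red. The base URL, the timeout, the function names, and the mapping of yellow to a warning are assumptions for this example, and the exit codes again follow the Nagios plugin convention.

```python
import json
import urllib.request

# Nagios-style plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Elasticsearch reports cluster status as green, yellow, or red.
# Treating yellow as a warning is a policy choice, not a rule.
STATUS_TO_CODE = {"green": OK, "yellow": WARNING, "red": CRITICAL}


def code_for_health(payload):
    """Translate a cluster health response (parsed JSON) into an exit code."""
    return STATUS_TO_CODE.get(payload.get("status"), UNKNOWN)


def check_cluster(base_url="http://localhost:9200", timeout=5):
    """Query /_cluster/health and return (exit_code, message).

    base_url is an assumption; point it at your own cluster.
    """
    url = f"{base_url}/_cluster/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError) as exc:
        # Connection failures and unparsable responses are both treated as critical.
        return CRITICAL, f"cluster health unavailable: {exc}"
    return code_for_health(payload), f"cluster status is {payload.get('status')}"
```

Keeping the status mapping separate from the HTTP call makes the policy easy to test and to reuse for other tools that expose a similar status field.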
Monitoring Application
Having a healthy infrastructure and platform is not enough for an application to function correctly. Buggy code from a recent deployment, third-party component issues, or incompatible changes in external systems can cause application failures. Application-level checks can be implemented to detect such issues. As mentioned before, a functional or integration test would unearth such issues in a testing or staging environment, and an equivalent should also be implemented in the production environment.
The implementation of application-level monitoring can be simplified by building hooks or API endpoints into the application. Monitoring is usually an afterthought, and the need for such instrumentation is overlooked during the design phase of an application. The participation of the DevOps team in design reviews improves the operability of a system. Planning for application-level monitoring in production is one area where DevOps can provide input.
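A hook of this kind can be as small as a `/health` endpoint that runs a few internal checks and reports the result. The sketch below uses only the Python standard library. The endpoint path, the check names, and the port are hypothetical, and a real application would replace the placeholder lambdas with calls into its own dependencies.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_checks(checks):
    """Run each named check; a check passes if it returns a truthy value without raising."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return all(results.values()), results


# Hypothetical checks; replace the lambdas with real probes of your dependencies.
CHECKS = {
    "database": lambda: True,
    "payment_gateway": lambda: True,
}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        healthy, results = run_checks(CHECKS)
        body = json.dumps({"healthy": healthy, "checks": results}).encode("utf-8")
        # 503 lets the monitoring platform alert on the status code alone.
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Serve the health endpoint until interrupted (port is an assumption)."""
    HTTPServer(("", port), HealthHandler).serve_forever()
```

The monitoring platform then only needs a generic HTTP check against `/health`, and the knowledge of what "healthy" means stays with the team that owns the application.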
Monitoring Business
Applications run in production to meet certain business goals. You can have an application that runs flawlessly on a healthy infrastructure while the business still fails to meet its goals. It is important to give that feedback to the business as early as possible so that it can take corrective action, which might trigger enhancements to application features and/or require changes to the way the business is run using the application. These efforts should only complement the more complex BI-based data analysis methods that can provide deeper insights into the state of the business. Business-level monitoring can be based on transactional data readily available in the data repositories and on the data aggregates generated by the BI systems.
Both application-level and business-level monitoring are company specific, so plugins have to be developed for these requirements. Implementing a framework that accesses standard sources of information, such as databases and REST APIs, from the monitoring platform could minimize the need to build plugins from scratch every time.
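One way to approach such a framework is to describe each check as data (a query plus thresholds) and keep a single generic runner. The sketch below is an assumption-laden illustration: `sqlite3` stands in for the production database, and the `orders` table, the check definition, and the threshold values are all hypothetical.

```python
import sqlite3

# Nagios-style plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def evaluate(value, warn_below, crit_below):
    """Alert when a business metric falls below its expected floor."""
    if value < crit_below:
        return CRITICAL
    if value < warn_below:
        return WARNING
    return OK


def run_sql_check(conn, check):
    """Run a check described as data: a query returning one number, plus thresholds."""
    row = conn.execute(check["query"]).fetchone()
    value = row[0] if row and row[0] is not None else 0
    return evaluate(value, check["warn_below"], check["crit_below"]), value


# Hypothetical check definition; in practice these would live in a config file.
ORDERS_CHECK = {
    "name": "orders_placed",
    "query": "SELECT COUNT(*) FROM orders",
    "warn_below": 5,
    "crit_below": 2,
}
```

Adding a new business metric then means adding another dictionary rather than writing another plugin, and the same pattern extends to checks that read a number from a REST API.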
Last-Mile Monitoring
A monitoring platform deployed in the same cloud or data center environment where the applications run cannot check on the end-user experience. To address that gap, there are several SaaS products on the market, such as Catchpoint and Apica. These services are backed by actual infrastructure that monitors applications from specific geographical locations. For example, if you are keen on knowing how your mobile app performs on iPhones in Chicago, that can be tracked using the service provider’s testing infrastructure in Chicago.
Alerts are set up on these tools to notify the site reliability team if the application is not accessible externally or if it has performance issues.
Log Aggregation
In a production environment, a huge amount of information is logged in various log files by the operating system, platform components, and the application. These logs get some attention when issues happen and are normally ignored otherwise. Traditional monitoring tools like Nagios couldn’t handle constantly changing log files, except for alerting on some patterns.
The advent of log aggregation tools like Logstash, Loggly, and Splunk changed that scenario. Using the aggregated and indexed logs, it is possible to detect issues that would have gone unnoticed earlier. Alerts can be set up based on the information available in the indexed log data. For example, Splunk provides a custom query language to search the index for operational insights. Using the APIs provided by these tools, the alerting can be integrated with the main monitoring platform.
To leverage the aggregation and indexing capabilities of these tools, the application or scripts can generate structured data outputs that the log aggregation tool indexes later. Such aggregation can also be used to generate data for reporting applications. For example, if storage usage on a set of computing nodes has to be tracked, a daily storage usage report can be generated on the related nodes for the aggregation tool to track. Weekly and monthly aggregates can then be computed once the daily aggregates are available.
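The storage usage example could be sketched as follows: each node emits one JSON line per day, and weekly figures are derived from the daily records. In practice the aggregation tool would usually do the roll-up with its own query language, so the Python aggregation here only illustrates the idea. The field names (`event`, `node`, `date`, `used_gb`) are assumptions, not a format any tool requires.

```python
import json
from collections import defaultdict
from datetime import date
from statistics import mean


def storage_record(node, day, used_gb):
    """Format one daily storage reading as a single JSON log line."""
    return json.dumps(
        {"event": "storage_usage", "node": node,
         "date": day.isoformat(), "used_gb": used_gb},
        sort_keys=True,
    )


def weekly_average(lines):
    """Average used_gb per (node, ISO year, ISO week) from daily JSON lines."""
    buckets = defaultdict(list)
    for line in lines:
        record = json.loads(line)
        if record.get("event") != "storage_usage":
            continue  # ignore unrelated log lines
        iso_year, iso_week, _ = date.fromisoformat(record["date"]).isocalendar()
        buckets[(record["node"], iso_year, iso_week)].append(record["used_gb"])
    return {key: mean(values) for key, values in buckets.items()}
```

One self-describing record per line keeps the output easy for an aggregation tool to index, since every field can be extracted without custom parsing rules.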
Monitoring the Monitoring
It is important to make sure that the monitoring infrastructure itself is up and running. Disabling alerting during a deployment and forgetting to enable it later is one of the common oversights I have seen in operations. Such missteps are hard to monitor, and only improvements in the deployment process can address them.
Pinging hosts
If multiple instances of the monitoring application are running, or if there is a stand-by node, cross-checks can be implemented to verify the availability of the hosts used for monitoring.
In AWS, CloudWatch can be used to monitor the availability of an EC2 node.
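A cross-check between monitoring hosts could look like the sketch below, which each monitoring node would run against its peers. It tests TCP reachability of a service port rather than sending an ICMP ping, because raw ICMP sockets usually need elevated privileges. The function names are mine, and the peer list would come from your own inventory.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable_peers(peers, probe=is_reachable):
    """Return the (host, port) peers that fail the probe.

    probe is injectable so the logic can be tested without a network.
    """
    return [peer for peer in peers if not probe(*peer)]
```

If `unreachable_peers` returns a non-empty list, the node that ran the check raises the alert, since the peer that is down obviously cannot.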
Health-check for monitoring
Checking the availability of the monitoring UI and the activity in the monitoring application’s log files ensures that the monitoring system itself is fully functional and continues to watch for issues in the production environment. If a log aggregation tool is used, tracking the monitoring application’s log files is the most effective way to check for activity in them. The same index can also be queried for potential issues using standard keywords like “Error” and “Exception”.
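Where no log aggregation tool is available, the same two ideas (is the log still being written, and does it contain error keywords) can be approximated directly on the file. This is a minimal sketch: the five-minute idle limit and the function names are assumptions, and the keyword match is a plain case-sensitive substring search.

```python
import os
import time


def seconds_since_modified(path, now=None):
    """Return how long ago the file was last written, in seconds."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def is_stale(path, max_idle_seconds=300, now=None):
    """A log with no writes for longer than max_idle_seconds suggests the monitor has stalled."""
    return seconds_since_modified(path, now) > max_idle_seconds


def count_keywords(path, keywords=("Error", "Exception")):
    """Count the lines containing each keyword (case-sensitive substring match)."""
    counts = {keyword: 0 for keyword in keywords}
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            for keyword in keywords:
                if keyword in line:
                    counts[keyword] += 1
    return counts
```

A stale log file is often the earliest visible sign that the monitoring application has hung, even while its process and UI still appear to be up.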
Conclusion
Monitoring efforts are normally a response to issues that happen in production. A systematic approach to rolling out monitoring can reduce the reactive mode it is normally associated with. Proactive monitoring contributes to a better user experience and avoids costly data reprocessing and rollbacks in production.