Troubleshooting “silent failures” in an OpenStack cloud environment has been a tough problem to solve for many production users. eBay’s chief engineer of cloud, Subbu Allamaraju, whose team has been operating PayPal on an OpenStack-based private cloud since March 2015, has publicly defined one of his ongoing challenges as: “Too much noise, too many alerts, and still working on how to detect troubles before customers find out.”
For enterprises and service providers in various stages of rolling out OpenStack, tackling silent failures invariably starts with two questions:
- Why and where do silent failures occur?
- Can my Ops team receive push notifications of warnings earlier and proactively remediate them before silent failures materialize?
Silent failures happen because OpenStack Ops teams often don’t know which service domains to tune into for early-stage warnings. Even if they do, the warnings get drowned out by noise. For instance, your metrics and logs for the compute domain show a successful VM creation. Yet a tenant is still not able to use her requested VM instance – a silent failure – while the warnings actually show up in metrics and logs outside the compute domain, such as a DHCP agent failure and an identity database query timeout.
To catch a situation like this, OpenStack Ops teams need a real-time, data-driven approach that automatically detects early warnings in the DHCP agent and the identity service, then correlates them with VM creation, and finally push-notifies only the relevant domain experts so they can collaboratively remediate “tenant can’t use requested VM.” By the time the notification arrives, noise has been removed and related warning signals have been clustered and put in context together, all automatically.
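The correlation step can be illustrated with a minimal sketch. It is not the machine learning approach the article advocates; it is a plain time-window heuristic used only to show the idea of clustering warnings from different domains into one “situation.” The `Event` type, its fields, the `tenant` correlation key and the 60-second window are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Event:
    """Hypothetical normalized event; field names are illustrative only."""
    timestamp: float  # seconds since some reference point
    domain: str       # e.g. "compute", "dhcp-agent", "identity"
    tenant: str       # assumed correlation key
    message: str


def correlate(events: List[Event], window: float = 60.0) -> List[List[Event]]:
    """Cluster events per tenant into situations.

    An event joins the current cluster when it arrives within `window`
    seconds of the previous event for the same tenant. Only clusters that
    span more than one domain are returned, since single-domain clusters
    carry no cross-domain context.
    """
    by_tenant: Dict[str, List[Event]] = {}
    for ev in sorted(events, key=lambda e: e.timestamp):
        by_tenant.setdefault(ev.tenant, []).append(ev)

    situations: List[List[Event]] = []
    for tenant_events in by_tenant.values():
        current = [tenant_events[0]]
        for ev in tenant_events[1:]:
            if ev.timestamp - current[-1].timestamp <= window:
                current.append(ev)
            else:
                situations.append(current)
                current = [ev]
        situations.append(current)

    return [s for s in situations if len({e.domain for e in s}) > 1]


# Example input mirroring the scenario described above (made-up values).
events = [
    Event(100.0, "dhcp-agent", "tenant-a", "DHCP agent failure"),
    Event(110.0, "identity", "tenant-a", "database query timeout"),
    Event(120.0, "compute", "tenant-a", "VM creation succeeded"),
    Event(500.0, "compute", "tenant-b", "VM creation succeeded"),
]
situations = correlate(events)
for situation in situations:
    print([f"{e.domain}: {e.message}" for e in situation])
```

In this toy input, the DHCP and identity warnings end up in the same cluster as the apparently successful VM creation for the same tenant, which is the context a domain expert would need. A production system would need a far richer notion of similarity than a shared key and a fixed window.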
Operators shouldn’t have to rely on writing rules to model silent failures, either. Writing rules is too time-consuming, and rules cannot cover all failure scenarios. There are too many service domains – at least 13 of them (e.g. compute, identity, dashboard and storage) – plus their service daemons and the virtual switch tunnels that stitch them all together.
Real-time, data-driven management has recently proven successful at resolving OpenStack silent failures. For example, a large OpenStack production shop was able to proactively detect OpenStack cell failures by receiving early notifications from outside the cell domain: Ceph and JBOD I/O failures.
By becoming data-driven and feeding all events and status messages from across the entire OpenStack environment into real-time machine learning algorithms, OpenStack Ops teams automatically receive cleaned and contextualized alerts. They can then use modern social collaboration technologies to troubleshoot issues rapidly, capturing tribal knowledge for the future. The new mantra for these teams should be “clean, contextualize and collaborate,” the three pillars of real-time situation management.
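As a rough illustration of the “clean” pillar only, the sketch below suppresses repeats of an alert that has already been forwarded recently. It is a simple deduplication heuristic, not the machine learning algorithms referred to above; the tuple layout, the `suppress_repeats` name and the 300-second quiet period are assumptions made for the example.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical alert shape: (timestamp in seconds, source, message).
Alert = Tuple[float, str, str]


def suppress_repeats(alerts: Iterable[Alert], quiet_period: float = 300.0) -> List[Alert]:
    """Forward an alert only if the same (source, message) pair has not
    been forwarded within the last `quiet_period` seconds.

    Assumes `alerts` is already ordered by timestamp.
    """
    last_forwarded: Dict[Tuple[str, str], float] = {}
    kept: List[Alert] = []
    for ts, source, message in alerts:
        key = (source, message)
        previous = last_forwarded.get(key)
        if previous is None or ts - previous >= quiet_period:
            kept.append((ts, source, message))
            last_forwarded[key] = ts
    return kept


# Made-up input: the same Ceph alert fires repeatedly.
raw = [
    (0.0, "ceph", "I/O error"),
    (30.0, "ceph", "I/O error"),
    (60.0, "identity", "query timeout"),
    (400.0, "ceph", "I/O error"),
]
cleaned = suppress_repeats(raw)
print(cleaned)
```

The repeat at 30 seconds is dropped, while the occurrence at 400 seconds is forwarded again because the quiet period has elapsed. Exact-match deduplication like this is brittle when messages vary slightly, which is one reason a rule-free, learned approach is attractive.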
But real-time situation management doesn’t stop here. It scales across the entire ecosystem that lives in and around OpenStack. Data sources include the ever-pervasive RabbitMQ message bus across all services, metrics and logs from the diverse tenant operating systems, and metrics and logs from all technologies and vendors. In essence, monitoring the entire technology- and vendor-independent reference architecture can be rolled up and simplified as “situation management,” rather than reacting to many seemingly unrelated “domain silo alerts.”
Real-time situation management can also sit above multiple OpenStack instances deployed for load sharing and DR purposes, and cover their interactions with public clouds. For instance, Amazon CloudWatch can be quickly incorporated as a data feed, giving OpenStack Ops teams a “manager of managers” view across OpenStack hybrid clouds.
Real-time situation management essentially provides Ops teams with a single pane of glass to see and resolve failures earlier, with amplified signals and suppressed noise. Doing so without relying on extensive rules offers much-needed flexibility for Ops teams to keep up with the many “unknown unknowns” as the OpenStack community continues to evolve its entire reference architecture and ecosystem.
About the Author/Feng Meng
Feng Meng is a Senior Director, Product Marketing at Moogsoft. His expertise lies in IT service assurance, with a 15-year technical marketing background in Application Performance Management (Bytemobile, acquired by Citrix), Network Performance Management (Cisco), Infrastructure Monitoring (VSS Monitoring), and Server Management (HP). Twitter: @fengmengmoog @moogsoft LinkedIn: https://www.linkedin.com/in/fengmeng1970