Monitoring Your Data Center Like a Google SRE

Back in the early ’00s, when Google was beginning to expand its portfolio of services beyond search, it encountered a combination of challenges. Some of these emerged from familiar, classic disconnects between developers and operations folks or between IT services and line-of-business owners. Others were brand new, never-before-seen failure modes that arose from providing services on novel cloud platforms—and doing so at planetary scales.

To confront these challenges, Google began evolving a discipline called Site Reliability Engineering (SRE), about which the company published a useful and fascinating book in 2016. SRE and DevOps (at least the contemporary version of DevOps that has expanded into a vision for how IT operations should work in the era of cloud) share a lot of conceptual DNA and an increasing amount of practical DNA. This is particularly true now that cloud software and tooling have evolved to let ambitious folks begin emulating parts of Google’s infrastructure using open source software such as Kubernetes. Google has used the statement “Class SRE implements DevOps” to title a new (and growing) video playlist by Liz Fong-Jones and Seth Vargo of Google Cloud Platform, showing how and where these disciplines connect and nudging DevOps to consider some key SRE insights.

Understanding Google SRE Principles

First, some basic principles:

Failure is normal – Achieving 100 percent uptime for a service is impossible, expensive or pointless (e.g., because error rates among your service’s dependencies mask any further improvement in your own availability).

Agree on SLIs and SLOs across your organization – Since failure is normal, your entire organization needs to agree on what availability means, which specific metrics are relevant in determining availability (called SLIs, or service level indicators) and what acceptable availability looks like, numerically, in terms of these metrics (called the SLO, or service level objective).
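
As a rough illustration of the difference between an SLI and an SLO, the sketch below computes a request-based availability SLI and compares it with a target. The function names, the choice of "successful requests divided by total requests" as the indicator and the 99.9 percent target are all assumptions made for this example; your organization may agree on different indicators entirely.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return the fraction of requests served successfully (the SLI).

    Hypothetical helper: assumes availability is measured per request.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= good_requests <= total_requests:
        raise ValueError("good_requests must be between 0 and total_requests")
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """Return True when the measured SLI is at or above the agreed SLO."""
    return sli >= slo_target


if __name__ == "__main__":
    # Illustrative numbers only: 99,950 good requests out of 100,000.
    sli = availability_sli(good_requests=99_950, total_requests=100_000)
    print(f"SLI = {sli:.4%}, meets 99.9% SLO: {meets_slo(sli, 0.999)}")
```

The point of keeping the two functions separate is that the SLI is a measurement, while the SLO is a negotiated threshold that can change without touching how the measurement is taken.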

Use agreed-upon SLOs to calculate an “error budget” – The SLO is used to define what SREs call the “error budget,” which is a numeric line in the sand (e.g., minutes of service downtime acceptable per month). The error budget is used to encourage collective ownership of service availability and to resolve disputes about balancing risk and stability blamelessly. For example, if programmers are releasing risky new features too frequently and compromising availability, this will deplete the error budget. SREs can point to the at-risk error budget and argue for halting releases and refocusing coders on efforts to improve system resilience.
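
The arithmetic behind an error budget is simple enough to sketch. The example below assumes a time-based availability SLO and a 30-day month; the function names are hypothetical, and the idea of reporting the consumed fraction is one possible way to show how close a team is to the line, not a rule taken from Google.

```python
def error_budget_minutes(slo_target: float, period_days: int = 30) -> float:
    """Minutes of downtime an availability SLO allows over the period."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    total_minutes = period_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_consumed_fraction(
    downtime_minutes: float, slo_target: float, period_days: int = 30
) -> float:
    """Fraction of the error budget already spent (1.0 means fully spent)."""
    budget = error_budget_minutes(slo_target, period_days)
    if budget == 0:
        # A 100 percent SLO leaves no budget: any downtime overspends it.
        return float("inf") if downtime_minutes > 0 else 0.0
    return downtime_minutes / budget


if __name__ == "__main__":
    budget = error_budget_minutes(0.999)
    spent = budget_consumed_fraction(21.6, 0.999)
    print(f"Budget: {budget:.1f} minutes, consumed: {spent:.0%}")
```

Under these assumptions, a 99.9 percent SLO over 30 days allows about 43.2 minutes of downtime, so 21.6 minutes of outage would use up roughly half of the month's budget.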

This approach lets the organization as a whole balance speed and risk against stability. Paying attention to this economy encourages investment in strategies that accelerate the business while minimizing risk: writing error- and chaos-tolerant apps, automating away pointless toil, advancing by means of small changes and evaluating “canary” deployments before proceeding with full releases.

Monitoring systems are key to making this whole, elegant tranche of DevOps/SRE discipline work. It’s important to note (because, remember: Google isn’t running your data center) that this has nothing to do with what kind of technologies you’re monitoring, the processes you’re wrangling or the specific techniques you might apply to stay above your SLOs. In short, it makes just as much sense to apply SRE metrics discipline to conventional enterprise systems as it does to 12-factor apps running on container orchestration.

Applying Google SRE Principles

So here are a few things Google SRE can tell you about monitoring, specifically:

Alert only on failure, or on incipient failure – Alert exhaustion is a real thing, and “paging a human is an expensive use of an employee’s time.”

Monitoring is a significant engineering endeavor – Google SRE teams with a dozen or so members typically employ one or two monitoring specialists. But they don’t busy these experts by having them stare at real-time charts and graphs to spot problems. That’s a kind of work SREs call “toil”; they think it’s ineffective and they know it doesn’t scale.

Post-hoc analysis, no magic – Google SREs like simple, fast monitoring systems that help them quickly figure out why problems occurred, after they occurred. They don’t trust magic solutions that try to automate root-cause analysis, and they try to keep alerting rules as simple as possible, without complex dependency hierarchies. The exception is the (rare) parts of their systems that are in very stable, unambiguous states. Their example of “stable”: once they’ve redirected end-user traffic away from a downed data center, systems can stop reporting on that data center’s latency. Elsewhere, their systems are in constant flux, which causes complex rule sets to produce excessive alerts. One exception to this general rule about simplicity: Google SREs do build alerts that react to anomalous patterns in end-user request rates, since these affect usability and/or reflect external dependency failures (e.g., carrier failures).
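
To make the idea of a deliberately simple alert on anomalous request rates concrete, here is a minimal sketch. It is not Google's rule: the median baseline, the 50 percent tolerance and the function name are assumptions chosen for this example, and a real deployment would need to account for daily and weekly traffic cycles.

```python
from statistics import median
from typing import Sequence


def request_rate_is_anomalous(
    recent_rates: Sequence[float],
    current_rate: float,
    tolerance: float = 0.5,
) -> bool:
    """Flag the current request rate if it strays too far from the baseline.

    recent_rates: requests per second over earlier, comparable windows.
    tolerance: allowed relative deviation from the median (0.5 = 50 percent).
    """
    if not recent_rates:
        # No baseline yet, so stay quiet rather than page on a guess.
        return False
    baseline = median(recent_rates)
    if baseline == 0:
        # Any traffic at all is a change from a silent baseline.
        return current_rate > 0
    deviation = abs(current_rate - baseline) / baseline
    return deviation > tolerance


if __name__ == "__main__":
    history = [100.0, 110.0, 90.0, 105.0, 95.0]
    for rate in (120.0, 40.0):
        print(rate, request_rate_is_anomalous(history, rate))
```

The rule has one input series and one threshold, which keeps it easy to reason about after the fact. A sudden drop in requests is treated the same as a spike, since a drop may reflect an upstream failure that keeps users from reaching the service at all.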

Heavy use of “white box” monitoring – Google likes to perform deeply introspective monitoring of target systems grouped by application (called business service monitoring in Opsview Monitor). Viewing related metrics from all systems (e.g., databases, web servers) supporting an application lets them identify root causes with less ambiguity (e.g., is the database really slow, or is there a problem on the network link between the DB and the web host?).

Four golden signals – Because part of the point of monitoring is communication, Google SREs strongly favor building SLOs (and SLAs) on small groups of related, easily understood SLI metrics. As has been widely discussed, they believe that measuring “four golden signals”—latency, traffic/demand, errors and saturation—can pinpoint most problems, even in complex systems such as carrier orchestrators with limited workload visibility. It’s important to note, however, that this austere schematic doesn’t automatically confer simplicity, as some monitoring makers have suggested. Google notes that “errors” are intrinsically hugely diverse and range from easy to almost impossible to trap, and “saturation” often depends on monitoring constrained resources (e.g., CPU capacity, RAM, etc.) and carefully testing hypotheses about the levels at which utilization becomes problematic.
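
The four signals can be summarized from very little raw data, as the sketch below suggests. Everything here is an assumption made for illustration: the Request record, the use of a nearest-rank 95th percentile for latency, and the use of a single CPU utilization figure as the saturation proxy. As noted above, choosing what counts as an error and which resource actually saturates first is the hard part, and this code does not solve it.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    """Hypothetical record of one served request."""
    latency_ms: float
    ok: bool


@dataclass(frozen=True)
class GoldenSignals:
    latency_p95_ms: float  # latency
    requests_per_sec: float  # traffic/demand
    error_rate: float  # errors
    saturation: float  # utilization of the most constrained resource


def percentile_nearest_rank(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile, using integer math to avoid rounding surprises."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range 1..100")
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceiling of pct * n / 100
    return ordered[rank - 1]


def golden_signals(
    requests: Sequence[Request], window_seconds: float, cpu_utilization: float
) -> GoldenSignals:
    """Summarize one observation window into the four signals."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if not requests:
        return GoldenSignals(0.0, 0.0, 0.0, cpu_utilization)
    failures = sum(1 for r in requests if not r.ok)
    return GoldenSignals(
        latency_p95_ms=percentile_nearest_rank([r.latency_ms for r in requests], 95),
        requests_per_sec=len(requests) / window_seconds,
        error_rate=failures / len(requests),
        saturation=cpu_utilization,
    )


if __name__ == "__main__":
    # Twenty requests with latencies 10, 20, ..., 200 ms; one of them failed.
    sample = [Request(latency_ms=10.0 * i, ok=(i != 7)) for i in range(1, 21)]
    print(golden_signals(sample, window_seconds=10.0, cpu_utilization=0.62))
```

Keeping the result to four named fields mirrors the communication goal: a small, fixed summary that people outside the operations team can still read.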

Conclusion

The bottom line is that good DevOps monitoring systems need to be more than do-it-yourself toolkits. Though flexibility and configurability are important, more so is the ability of a mature monitoring solution to offer distilled operational intelligence about specific systems and services under observation, along with the ability to group and visualize these systems collectively, as business services.

— John Jainschigg