DevOps without Scale, Part 2 (quick start guide)

In Part 1 of this series, I covered why it’s important to be aware of your product’s health and how you can do so. I covered the user experience and technical types of metrics and how to get started with collecting them. In this part of the series I will walk you through how to use all these metrics to maximize the user experience and product uptime.

Metrics, Metrics Everywhere

There are no silver bullets here: you have to be continuously aware of your product’s health and how it fares against user expectations. It surely takes time and effort, but it’s an essential part of building a good product. Imagine if chair designers didn’t care how people use their creations; those chairs would not be comfortable to sit in. Similarly, understanding user needs is a paramount part of any software engineer’s job.

Once you’ve established a basic understanding of your application’s patterns, you are not done. Every release is an opportunity to introduce more latency or failures, and the baseline is ever-shifting. With any release it is easy to significantly affect latency, and just as easy to break an edge case and drive user failures up by a few percent. The only way to ensure a high-quality user experience is to be aware of your product’s health and your users’ expectations at all times.

This is a case where less is more: users prefer fewer features that work and are responsive. Try it out and see what your users say!

User Experience Monitoring

The first part of a good user experience is low error rates. For web products, these include all HTTP error responses as well as all application errors returned to the user. For APIs and GUIs, this includes application failures (and HTTP errors for REST APIs). One recurring comment I hear is that users submit invalid input, which invariably causes error responses, and so it should not count. It is certainly useful to be able to differentiate hard failures from bad user input, but the line is more blurred than you think. For example, a new release might change input expectations, and previously valid requests can start erroring out. Is that bad user input or did you just break your product?

In either case, if the failure numbers are high, users won’t be happy and you need to deal with this problem. The solution here can be non-technical: Maybe you need better documentation or user training, or maybe the API or UX design is cumbersome and drives misunderstanding and misuse. Whatever the case is, it must be addressed. Users are as likely to walk away from a cumbersome product as from a product that doesn’t work.
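As a rough illustration of tracking both kinds of failures side by side, here is a minimal Python sketch. It assumes that HTTP 4xx responses stand in for "bad user input" and 5xx responses for hard failures; that mapping, the function names, and the sample data are all assumptions made for this example, and a real product may need a finer split.

```python
from collections import Counter

BUCKETS = ("ok", "client_error", "server_error")

def classify(status: int) -> str:
    """Bucket an HTTP status code (assumed mapping: 4xx = user input, 5xx = hard failure)."""
    if status >= 500:
        return "server_error"
    if status >= 400:
        return "client_error"
    return "ok"

def error_rates(statuses):
    """Return the share of each bucket among all responses."""
    counts = Counter(classify(s) for s in statuses)
    total = sum(counts.values())
    if total == 0:
        return {bucket: 0.0 for bucket in BUCKETS}
    return {bucket: counts.get(bucket, 0) / total for bucket in BUCKETS}

# Hypothetical sample of ten responses collected after a release.
sample = [200, 200, 200, 200, 200, 200, 400, 422, 500, 200]
rates = error_rates(sample)
print(rates)
```

Keeping the two error buckets separate but visible together makes it easier to notice when "bad input" jumps right after a release, which is the blurry case described above.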

As with everything else, I suggest that you identify a few high-level and most important parts of the product and keep track of those metrics. If you do notice an issue, then you have the more detailed metrics on hand for a deep dive.

The other part of good user experience is performance, and the approach here is no different. Typically you need to track the latency of the same user actions as you’re tracking for failures. Performance tracking is where you really need to understand what’s acceptable and what’s not for your users and that depends on the domain and your users’ expectations. Two seconds to open a reservation on a public travel site seems like a long time, but two seconds to pull up a loan record for an internal user at a financial company is not bad.
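To make the "acceptable depends on the domain" point concrete, the sketch below checks a high percentile of latency against a per-product threshold. The nearest-rank percentile, the 95th-percentile choice, the thresholds, and the sample latencies are all assumptions for illustration, not recommendations.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def latency_ok(samples_ms, threshold_ms, pct=95):
    """True when the chosen percentile stays within the acceptable threshold."""
    return percentile(samples_ms, pct) <= threshold_ms

# Hypothetical latencies (milliseconds) for one user action.
samples = [120, 150, 180, 200, 220, 250, 300, 400, 900, 2100]

# Hypothetical thresholds: a public site tolerates less than an internal tool.
public_site_ok = latency_ok(samples, threshold_ms=1000)
internal_tool_ok = latency_ok(samples, threshold_ms=2500)
print(public_site_ok, internal_tool_ok)
```

The same measurements pass for the internal tool and fail for the public site, which is why the threshold has to come from knowing your users rather than from a universal number.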

An important fact to remember is that you will always have some failures in the system and that’s normal. No matter how great your product is, how clear its documentation, and how in tune you are with your users, there will always be some failures caused by bad user interactions. At some point, once you feel good about your product overall, you will have to accept the current error rate as your baseline. Then you need to pay attention to variation over time and critically revisit the baseline every once in a while.
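One simple way to watch for variation around an accepted baseline is to compare the current error rate with the mean and spread of recent history. The sketch below is only one possible approach; the three-standard-deviation rule, the function name, and the sample rates are assumptions for illustration.

```python
from statistics import mean, pstdev

def deviates_from_baseline(history, current, k=3.0):
    """True when `current` is more than k standard deviations away from the historical mean."""
    if not history:
        raise ValueError("need some history to form a baseline")
    baseline = mean(history)
    spread = pstdev(history)
    return abs(current - baseline) > k * spread

# Hypothetical daily error rates that were accepted as the baseline (about 2 percent).
history = [0.020, 0.022, 0.019, 0.021, 0.018]

normal_day = deviates_from_baseline(history, 0.021)
bad_release = deviates_from_baseline(history, 0.035)
print(normal_day, bad_release)
```

Refreshing `history` periodically is the code-level equivalent of revisiting the baseline every once in a while.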

Predictive Monitoring

In addition to user experience metrics, there are key technical metrics that are good to watch, as they can point to upcoming problems. Some of the common ones are CPU and memory utilization, thread and database pool sizes, VM heap size and garbage collection times. Just like with user metrics, you need to know the operational signature of your application. But in addition, you need to understand how changes in these metrics impact your users, if at all.

A growing thread pool might mean that response times are slowing down, or it could simply mean that more users are accessing your service. A growing email queue means that emails aren’t going out as fast, but is that bad? Only knowing your domain and users will give you the answer. If the CPU is hovering at 80 percent, everything is fine for now, but you need to scale up quickly before a tiny addition of traffic kicks your whole app over.

When tracking hardware metrics everyone pays close attention to overutilization because that’s what causes outages. But what about underutilization? Performance fluctuates up and down and at times you can find yourself underutilizing resources. Paying attention to this is a good way to stay frugal and keep costs in check.
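A minimal sketch of watching both ends of the utilization range is shown below. The 80 percent upper bound echoes the example above; the 20 percent lower bound, the function name, and the sample readings are assumptions for illustration only.

```python
def utilization_status(cpu_samples, high=80.0, low=20.0):
    """Classify average CPU utilization (percent) as needing more capacity, less, or neither."""
    if not cpu_samples:
        raise ValueError("no samples")
    average = sum(cpu_samples) / len(cpu_samples)
    if average >= high:
        return "scale_up"      # little headroom left before an outage
    if average <= low:
        return "scale_down"    # paying for capacity that sits idle
    return "ok"

# Hypothetical readings from three different hosts.
busy = utilization_status([85, 90, 80])
idle = utilization_status([10, 15, 5])
steady = utilization_status([40, 50, 60])
print(busy, idle, steady)
```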

Alerting

How do you actually know when something breaks? You can’t stare at the graphs all day long. Once you get a feel for the normal characteristics of your system, you’re ready to create automated monitors and alerts. One of the most important things about automated alerting is that it must always be trustworthy. The moment you and your team start getting false alerts, everyone will learn to disregard some of them and the value of the system will plummet. Thus I recommend you start with auto-alerting on only a few of the most critical metrics. Run the whole system for a few weeks and slowly add more metrics over time.

Another aspect to be aware of when setting up automated alerting is metric variations over time, the most common one being day vs. night. Nobody wants to get paged because traffic is too low at 3 a.m. At the same time, if your metrics skyrocket or drop suddenly, it can be a cause for concern. Are you being DDoSed or scraped? Is there a network failure that’s preventing your service from being reached?

Tooling to the rescue: DataDog has a built-in capability to create alarms, called Monitors. Monitors provide a few different ways to avoid false positives. For example, instead of setting absolute thresholds, you can create a change alert that fires only on deltas. And adding a new monitor only takes a few minutes.
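The difference between an absolute threshold and a change alert can be sketched in a few lines of Python. This is not DataDog’s API or its actual evaluation logic; the function names, the 50 percent change limit, and the traffic numbers are hypothetical and only illustrate the idea.

```python
def absolute_alert(value, minimum):
    """Fire whenever the metric is below a fixed floor."""
    return value < minimum

def change_alert(previous, current, max_change_pct):
    """Fire only when the metric moved more than max_change_pct relative to the previous reading."""
    if previous == 0:
        return current != 0
    change_pct = abs(current - previous) / abs(previous) * 100
    return change_pct > max_change_pct

# Hypothetical requests-per-second readings.
# At 3 a.m. traffic is low but stable: the absolute floor pages someone, the change alert stays quiet.
night_absolute = absolute_alert(50, minimum=200)
night_change = change_alert(previous=52, current=50, max_change_pct=50)

# At midday traffic collapses from 1000 to 300: the change alert fires.
midday_change = change_alert(previous=1000, current=300, max_change_pct=50)
print(night_absolute, night_change, midday_change)
```

The change-based check ignores the normal day versus night difference in level and reacts only to sudden jumps or drops, which is the behavior described above.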

Finally, someone needs to get paged when an alert goes off. DataDog can send email notifications, but nobody checks email at night, and even during core hours it takes time to notice a new message. For critical issues, someone needs to get notified immediately.

Services like PagerDuty and VictorOps have made it trivial to do this. As expected, DataDog integrates with both of them natively. Both services support on-call rotation scheduling with an escalation policy, calling out to phones, text messaging, emails, and more.

Production Troubleshooting

So you’ve been alarmed and need to quickly figure out what’s going on and what to do. Your next stop is typically logs, but looking through log files one server at a time is a sure way to waste quite a bit of time.

Log management tools and services have matured over time and there are good alternatives to Splunk. Loggly and LogEntries cover more than just the basics and provide indexing and search capabilities, graphing of log occurrence, understanding of JSON, and out-of-the-box integration with just about all logging frameworks out there. At their core they provide a quick way to look through all logs without having to go to individual servers. Moreover, you can monitor various aspects of logging à la DataDog and alarm (via VictorOps or PagerDuty) when metrics are off.

The above approach already solves most needs, but I’d like to touch on a more advanced problem, especially for a microservice architecture. When multiple services are involved, finding the failure at the entry point is easy but figuring out which back-end service failed in the call chain while processing the user request can be quite difficult.

An architectural pattern that I call log stitching solves this issue. With log stitching you can pull up all logs across your entire architecture for a given user request. This allows you to reconstruct the complete call chain and quickly find the root cause of failures. Conceptually this technique is straightforward: You generate a unique request ID at the entry point, pass it down the chain for every call, and log it. Now you can pull up all logs with that request ID by searching for it. In practice it can get complicated because you need to: a) ensure consistency across all calls, b) account for all entry points including batch jobs, and c) preserve the request ID internally in multithreaded services.
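The core of log stitching can be sketched in Python as follows. This is a single-process simulation under stated assumptions: the `X-Request-ID` header name, the logger names, and every function here are hypothetical, and in a real system the two "services" would be separate processes with the ID travelling in the actual request.

```python
import contextvars
import io
import logging
import uuid

# Holds the request ID for the current thread or async task.
request_id_var = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Attach the current request ID to every log record."""
    def filter(self, record):
        record.request_id = request_id_var.get()
        return True

def build_logger(name, stream):
    """Create a logger that writes the request ID on every line."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(name)s request_id=%(request_id)s %(message)s"))
    handler.addFilter(RequestIdFilter())
    logger.addHandler(handler)
    return logger

def backend_service(headers, log):
    """Downstream service: reuse the caller's ID, or create one if it is missing."""
    request_id_var.set(headers.get("X-Request-ID") or uuid.uuid4().hex)
    log.info("loading record")

def entry_point(frontend_log, backend_log):
    """Entry point: generate the ID once and pass it down the call chain."""
    request_id = uuid.uuid4().hex
    request_id_var.set(request_id)
    frontend_log.info("request received")
    backend_service({"X-Request-ID": request_id}, backend_log)
    return request_id

def stitch(log_text, request_id):
    """Pull every log line that belongs to one user request."""
    return [line for line in log_text.splitlines() if f"request_id={request_id}" in line]

# Both services log to one stream here, standing in for a central log service.
stream = io.StringIO()
frontend = build_logger("frontend", stream)
backend = build_logger("backend", stream)
rid = entry_point(frontend, backend)
lines = stitch(stream.getvalue(), rid)
print("\n".join(lines))
```

A context variable is used so the ID does not have to be threaded through every function signature. Work handed to a thread pool does not automatically carry that context in Python, so it would need to be passed along explicitly (for example with `contextvars.copy_context()`), which is one face of complication (c) above.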

Log stitching is not something I recommend from the get-go; however, it’s definitely something to consider as your system matures. If you are going down the microservices path, this really becomes a must, since tracing call chains gets progressively more difficult as the number of services grows.

Putting it All Together

There is not much development work in putting it all together, but nevertheless you will need to put in some time. I’d like to emphasize how important it is to work in iterations. Set up one to three alarms, connect your DataDog to VictorOps or PagerDuty, study them for a couple of weeks and then move on. Keep in mind that you will not get it right at first; your thresholds will be off and you will misinterpret or misunderstand system behavior.

Operational pattern analysis is not trivial, but don’t spend too much time analyzing the hell out of everything up front. Configure some alarms, get them out there, learn and adjust as you go. When you find an alarm that went off erroneously, take some time to understand why. Specifically, you need to focus on why you thought it was the correct alarm setup vs. why you now know it is not. You will learn by leaps and bounds if you take this systematic approach.

As I mentioned above, configuring Loggly or LogEntries is super easy and you should do that right away. Log stitching, however, is an involved endeavor that requires planning. Due to its nature it only becomes useful when adopted by all, or most, of your systems. Thus, you should only start it when you feel the need and when you have the buy-in from the rest of your development organization.

What’s Next?

Part 3 will cover the topics of continuous integration and continuous delivery, breaking them down into multiple stages of complexity so you can pick and choose what makes the most sense for you from the cost/benefit point of view.