DevOps Without Scale, Part 1

When most people think of DevOps, what comes to mind first are continuous integration (CI) and continuous delivery (CD) pipelines. While they are arguably the most important parts of DevOps, they are not all of it. So, then, what is DevOps? It is a philosophy of how to build and operate software products and systems to provide the best user experience. It’s a way of thinking that blends the previously independent disciplines of development and operations so that software products can both evolve functionally and be stable operationally. In practical terms, it means development teams are involved in operations and, in many instances, fully own the operation of their products.

The top companies are already there because they otherwise couldn’t deliver their massive and feature-rich products with 99.9 percent uptime. But there are still too many companies and developers who aren’t “doing it.” And it’s typically reflected in product quality.

It seems DevOps has been stigmatized as difficult, dependent on expensive tools and generally needed only if you’re big. In this blog I’ll walk you through different aspects of DevOps and share the business value behind each, the common implementation techniques and the free (or inexpensive) tools that will let you jump-start your DevOps in a matter of weeks.

Product Health: The User Experience

Our goal as engineers is to build products that will be used by others. Typically we call them users and sometimes we have a strained relationship with them. But if users don’t like our products and aren’t using them, we have failed at the very core of our job. So what do users want, besides features, that is?

Users want products that are reasonably fast and work when they need them. Your site going down is an obvious example of why it’s important to be aware of your product’s health. But there are more nuanced examples. Sometimes a new release breaks an edge case and isn’t caught by regression testing. Sometimes a service becomes slow—it still works and nobody is complaining, but it’s dangerously close to being unusable. You need to be aware of your users’ experience so you can be proactive in fixing problems and delivering quality products.

Thus, the first metrics you need to measure are user experience metrics. These typically include success/failure and latency of all external endpoints: REST, web pages, etc. If the users are external, you want to measure the experience from their point of view, which includes the network latency and HTML/CSS/JS execution time for web pages. If this were a car, these metrics would be your low coolant red light. If the light is on, you must stop as soon as possible and add coolant before your engine block cracks from overheating.

To measure and monitor customer experience for web pages, Google Analytics is by far the most popular choice. It’s easy to set up and gives you exactly what you’re looking for: error stats per page, page load times, including individual page components and even browser parsing time for HTML. You can slice and dice the data by geographic region, browser type and so on to zero in on improvement or problem areas. Google Analytics is a great tool for tracking other product aspects as well, and you can kill many birds with one stone by adopting it.

However, Google Analytics will provide data only as long as there are users and your site works. But what if the site breaks, maybe partially or maybe only for a certain geographic region? And what do you do for public APIs that aren’t instrumentable with Google Analytics? You’ll need a synthetic monitoring service. These services emulate clients by calling your APIs or opening your web pages in an automated fashion. They are typically scriptable and run from multiple places around the world. They can tell you when your services are up or down and can measure (emulated) user latencies by geographic region. Pingdom is one of the newer generation of such services that delivers what you need for a modest price. It also integrates with a graphing and alarming tool called Datadog, which I’ll talk about later.
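To make the idea concrete, here is a minimal sketch of what a single synthetic check does, written in Python with only the standard library. The names `run_check`, `CheckResult` and the injectable `fetch` parameter are my own assumptions for illustration. This is not how Pingdom is implemented; a hosted service adds scheduling, multiple probe locations and alerting on top of something like this.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    url: str
    ok: bool
    status: Optional[int]
    latency_ms: float
    error: Optional[str] = None


def http_fetch(url: str, timeout: float = 5.0) -> int:
    """Fetch the URL like a client would and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def run_check(url: str, fetch: Callable[[str], int] = http_fetch) -> CheckResult:
    """Call the endpoint once and record success/failure and latency."""
    start = time.perf_counter()
    try:
        status = fetch(url)
        error = None
    except Exception as exc:  # timeouts, DNS failures, HTTP 4xx/5xx, ...
        # urllib's HTTPError carries the status in `.code`; other errors do not.
        status = getattr(exc, "code", None)
        error = str(exc)
    latency_ms = (time.perf_counter() - start) * 1000.0
    ok = error is None and status is not None and 200 <= status < 400
    return CheckResult(url, ok, status, latency_ms, error)
```

Passing `fetch` as a parameter keeps the timing and success logic testable without a network. Running the same check from several regions is what turns it into the geographic view described above.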

Product Health: Technical Innards

Now you have a solid understanding of your user experience and can detect user problems in real time. But how do you find the root causes? And can you do even better by detecting user problems before they happen? What you need to do is go a level deeper and monitor key technical aspects of your apps so you can correlate them with user problems. The technical metrics are important to watch because they can be early signs of trouble, which can be averted if you act quickly.

This can be dangerous territory, as the first desire is to monitor everything. Please keep your cool here, lest you drown yourself in hundreds of metrics that nobody understands and miss the important ones. In software, failure typically happens at integration points, so that’s what you should focus on: database and cache interactions, calls to other services, disk operations, queue sizes (for pub/sub architectures) and so on. Typically you don’t need to instrument business logic, as it either works or doesn’t and problems are caught in testing. Business logic is also where it is easy to create many metrics in the blink of an eye. A notable (and rare) exception is load-intensive parts of code that you want to monitor for performance reasons.

If you see latency in database calls creeping up, the cache hit ratio going down or a thread pool or queue size growing, you can frequently identify and fix the problem before it affects the users. Going even deeper, you want to keep an eye on the virtual machine (VM) and server performance. If your garbage collection times are increasing, if your thread count grows or if your server is showing signs of strain, it’s time to pay extra attention. If this were a car, these would be your check engine yellow light. If the light is on, you need to check it out soon but can keep on driving for a little bit longer.
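As a rough illustration of watching integration points, the Python sketch below times a database call and counts cache hits and misses so that a hit ratio can be derived. Everything here (`IntegrationMetrics`, `load_user`, the metric names and the in-memory dictionary cache) is hypothetical and kept in memory for simplicity. A real setup would hand these numbers to a metrics library rather than store them in lists.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class IntegrationMetrics:
    """In-memory store for latencies and counters (illustrative only)."""

    def __init__(self):
        self.latencies_ms = defaultdict(list)
        self.counters = defaultdict(int)

    @contextmanager
    def timed(self, name):
        """Record how long the wrapped block took, even if it raises."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.latencies_ms[name].append(elapsed_ms)

    def count(self, name, amount=1):
        self.counters[name] += amount

    def cache_hit_ratio(self, prefix="cache"):
        """Return hits / (hits + misses), or None if nothing was counted."""
        hits = self.counters[prefix + ".hit"]
        misses = self.counters[prefix + ".miss"]
        total = hits + misses
        return hits / total if total else None


metrics = IntegrationMetrics()
cache = {}


def load_user(user_id, query_db):
    """Look up a user, going to the database only on a cache miss."""
    if user_id in cache:
        metrics.count("cache.hit")
        return cache[user_id]
    metrics.count("cache.miss")
    with metrics.timed("db.load_user"):  # only the integration point is timed
        user = query_db(user_id)
    cache[user_id] = user
    return user
```

Note that only the cache and database interactions are measured. The surrounding logic is left alone, which keeps the number of metrics small.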

The most popular way to measure application metrics is with code instrumentation. Presently, the Coda Hale (a.k.a. Dropwizard) Metrics framework and its ports to most popular languages are the way to go. Given the abundance of documentation and integration with all sorts of systems, this isn’t difficult. There is a plethora of coding patterns and external libraries you can use to avoid writing boilerplate code. Note that you want to include as much as possible in measuring requests (serializing, logging, etc.), so you need to put this logic as high up the filter chain or middleware as you can.
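For a Python web app, one way to sit high up the chain is a WSGI middleware that wraps the whole application. The sketch below is an assumption-laden example rather than part of any Metrics port: `TimingMiddleware` and the `record` callback are hypothetical names, and `record` stands in for whatever metrics library you use.

```python
import time


class TimingMiddleware:
    """WSGI middleware that reports path, status code and latency per request."""

    def __init__(self, app, record):
        self.app = app
        self.record = record  # callable(path, status_code, elapsed_ms)

    def __call__(self, environ, start_response):
        start = time.perf_counter()
        captured = {"status": 500}  # assume failure until the app says otherwise

        def capturing_start_response(status, headers, exc_info=None):
            # status looks like "200 OK"; keep only the numeric code
            captured["status"] = int(status.split(" ", 1)[0])
            return start_response(status, headers, exc_info)

        try:
            # Note: a streaming body is iterated by the server after this
            # returns, so that part of the response is not timed here.
            return self.app(environ, capturing_start_response)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.record(environ.get("PATH_INFO", ""), captured["status"], elapsed_ms)
```

Wrapping the outermost application object, for example `app = TimingMiddleware(app, record)`, means that work done by inner middleware is included in the measurement.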

The last piece of the puzzle is VM and server monitoring. Traditionally, this has been the territory of specialized ops tools, but Datadog, a service that I will talk about in the next section, provides this out of the box. Its agents run on each machine or VM and collect server metrics, JVM/JMX data, IIS info (yep, it supports even Windows) and more.

Processing and Displaying the Data

Now your code is instrumented and generates all sorts of interesting data. What do you do with it? And where does the data go in the first place? All this data is useless if it just sits there. You want to be able to look at the trends, correlate them and get a basic set of statistical information such as averages, medians, standard deviations and so on. You also want to see pretty graphs (who doesn’t?) and be able to create dashboards. This is the cool part, where all the previous work comes to fruition.
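To show what that basic set of statistics looks like, here is a small Python sketch using the standard `statistics` module. The `summarize` function and the choice to include a 95th percentile are my own additions for illustration. A hosted tool computes these for you, so treat this as an explanation of the numbers rather than something you need to build.

```python
import statistics


def summarize(samples_ms):
    """Summarize latency samples (in milliseconds); needs at least two."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    ordered = sorted(samples_ms)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95 = statistics.quantiles(ordered, n=100, method="inclusive")[94]
    return {
        "count": len(ordered),
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "stdev": statistics.stdev(ordered),
        "p95": p95,
        "max": ordered[-1],
    }


if __name__ == "__main__":
    # One slow request pulls the mean well above the median.
    for key, value in summarize([10, 20, 30, 40, 100]).items():
        print(f"{key}: {value}")
```

Looking at the median and a high percentile next to the average matters because a handful of slow requests can hide behind an average that still looks healthy.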

Datadog is a great way to start. It provides everything you need and more. All you have to do is install its agent on each box and configure the Metrics framework to send data to Datadog. Most Metrics ports already include Datadog adapters so there is no extra coding required.
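If you are curious what actually travels to the agent, the Datadog agent accepts metrics over DogStatsD, a StatsD-style text protocol sent over UDP (port 8125 by default). The Python sketch below builds and sends one such datagram by hand. The helper names `format_metric` and `send_metric`, the metric names and the tags are hypothetical, and in practice a Metrics reporter or a Datadog client library does this for you.

```python
import socket


def format_metric(name, value, metric_type, tags=None):
    """Build a DogStatsD datagram: name:value|type|#tag1,tag2

    metric_type is a StatsD type code such as "c" (counter),
    "g" (gauge), "ms" (timer) or "h" (histogram).
    """
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram


def send_metric(datagram, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send to a locally running agent (assumed address)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


if __name__ == "__main__":
    line = format_metric("db.load_user.latency", 12.5, "ms", ["env:prod"])
    print(line)
    send_metric(line)
```

Because the transport is UDP, sending a metric does not block your request or fail it when the agent is down, which is part of why this style of reporting is cheap to add.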

Now you have everything and can slice and dice the data whichever way you want. You can export data for further analysis. You can check it periodically for problems. You can even display dashboards on big screens around the office (looks cool, eh?).

Putting It All Together

I have seen all of this put together in a couple of weeks. However, if this whole area is new to you it might take longer, as you’ll be learning the concepts along with the tools. Sometimes the deadlines are tight and even a few weeks is a luxury. However, launching a product without any insight into user experience or product health is not a good idea. If I had limited time and had to prioritize, I’d start with instrumenting user experience metrics and hooking them up to Datadog. Even for web pages you can start with only monitoring the REST services that power them, or the controller layer if you’re using model-view-controller. Soon you will realize how nice it is to have hard metrics and will want to add Google Analytics and deeper code monitoring to your product.

All of this might seem like a lot, but you’re only dealing with four tools here: Pingdom, Google Analytics, the Metrics framework and Datadog. If your product is an API, you don’t need Google Analytics, and if your product is internal, you don’t need Pingdom.

As I mentioned earlier, the Metrics framework has already been ported to multiple languages. But even if you have specific requirements that aren’t supported out of the box, it’s easy to extend. For example, at Guaranteed Rate we wrote and open-sourced the Metrics/Datadog adapter for .NET and a library for code instrumentation and Datadog integration for Clojure.

.NET is particularly known to be behind on DevOps. Until recently one of the reasons was the lack of tools; however, that’s not the case anymore. All tools mentioned here, as well as in Parts 2 and 3, work in .NET and have adapters and ports where needed. With .NET entering the new open-source age, there are no more excuses to stay behind.

Give yourself two weeks, read the docs, go one step at a time, and you’ll be surprised how far along you can get!

What’s Next?

Part 2 will cover what you can do to maximize product uptime based on the data you now have. It also will discuss making sense of the data, automated alerting and troubleshooting when the inevitable production problems happen.

About the Author/Jack Korabelnikov

Jack Korabelnikov wrote his first computer program when he was 10. Fifteen years later, after success as a senior engineer at Orbitz.com, he transitioned into management. He still enjoys the tech side of things and continues to tinker with cool new tech. He is presently the VP of Engineering at Guaranteed Rate, leading the company’s technical evolution and the buildout of next-gen mortgage products.

Prior to that, he worked at Orbitz Worldwide in multiple business areas such as hotels, private label and B2B web services. He played a key role in rolling out agile and, most recently, was the main driver behind the continuous delivery and DevOps adoption across the organization.

jack@korabelnikov.com | LinkedIn