DevOps Without Scale, Part 1

By Jack Korabelnikov on April 1, 2016

When most people think of DevOps, what comes to mind first are continuous integration (CI) and continuous delivery (CD) pipelines. While they are arguably the most important parts of DevOps, they are not all of it. So, then, what is DevOps? It is a philosophy of how to build and operate software products and systems to provide the best user experience. It’s a way of thinking that blends the previously independent disciplines of development and operations so that software products can both evolve functionally and be stable operationally. In practical terms, it means development teams are involved in operations and, in many instances, fully own the operation of their products.

The top companies are already there because they otherwise couldn’t deliver their massive and feature-rich products with 99.9 percent uptime. But there are still too many companies and developers who aren’t “doing it.” And it’s typically reflected in product quality.

DevOps has been stigmatized as difficult, requiring expensive tools and something you only need once you’re big. In this post I’ll walk you through different aspects of DevOps and share the business value behind each, the common implementation techniques and the free (or inexpensive) tools that will let you jump-start your DevOps in a matter of weeks.

Product Health: The User Experience

Our goal as engineers is to build products that will be used by others. Typically we call them users, and sometimes we have a strained relationship with them. But if users don’t like our products and aren’t using them, we have failed at the very core of our job. So what do users want, besides features?

Users want products that are reasonably fast and work when they need them. Your site going down is an obvious example of why it’s important to be aware of your product’s health. But there are more nuanced examples. Sometimes a new release breaks an edge case and isn’t caught by regression testing. Sometimes a service becomes slow—it still works and nobody is complaining, but it’s dangerously close to being unusable. You need to be aware of your users’ experience so you can be proactive in fixing problems and delivering quality products.

Thus, the first set of metrics you need to measure are user experience metrics. These typically include success/failure and latency of all external endpoints: REST, web pages, etc. If the users are external, you want to measure the experience from their point of view, which includes the network latency and HTML/CSS/JS execution time for web pages. If this were a car, these metrics would be your low coolant red light. If the light is on, you must stop as soon as possible and add coolant before your engine block cracks from overheating.

To measure and monitor customer experience for web pages, Google Analytics is by far the most popular choice. It’s easy to set up and gives you exactly what you’re looking for: error stats per page, page load times, including individual page components and even browser parsing time for HTML. You can slice and dice the data by geographic region, browser type and so on to zero in on improvement or problem areas. Google Analytics is a great tool for tracking other product aspects as well, and you can kill many birds with one stone by adopting it.

However, Google Analytics will provide data only as long as there are users and your site works. But what if the site breaks, maybe partially or maybe only for a certain geographic region? And what do you do for public APIs that aren’t instrumentable with Google Analytics? You’ll need a synthetic monitoring service. These services emulate clients by calling your APIs or opening your web pages in an automated fashion. They are typically scriptable and run from multiple places around the world. They can tell you when your services are up or down and can measure (emulated) user latencies by geographic region. Pingdom is one of the newer generation of such services that delivers what you need for a modest price. It also integrates with a graphing and alarming tool called Datadog, which I’ll talk about later.
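
To make the idea concrete, below is a minimal, vendor-neutral sketch of the kind of probe a synthetic monitoring service runs for you on a schedule and from many locations: call an endpoint, record the status code and latency, and compare them against a budget. The URL and the thresholds are placeholders, not values from this article.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// A toy version of what a synthetic monitoring check does: call an endpoint,
// record the status code and latency, and flag the result against a threshold.
// The URL and the 2-second budget below are placeholders.
public class SyntheticCheck {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/api/health")) // placeholder endpoint
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();

        long start = System.nanoTime();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        long latencyMs = (System.nanoTime() - start) / 1_000_000;

        boolean healthy = response.statusCode() == 200 && latencyMs < 2_000;
        System.out.printf("status=%d latencyMs=%d healthy=%b%n",
                response.statusCode(), latencyMs, healthy);
    }
}
```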

Product Health: Technical Innards

Now you have a solid understanding of your user experience and can detect user problems in real time. But how do you find the root causes? And can you do even better by detecting user problems before they happen? What you need to do is go a level deeper and monitor key technical aspects of your apps so you can correlate them with user problems. The technical metrics are important to watch because they can be early signs of trouble, which can be averted if you act quickly.

This can be dangerous territory, as the first desire is to monitor everything. Please keep your cool here, lest you drown in hundreds of metrics that nobody understands and miss the important ones. In software, failure typically happens at integration points, so that’s what you should focus on: database and cache interactions, calls to other services, disk operations, queue sizes (for pub/sub architectures) and so on. Typically you don’t need to instrument business logic, as it either works or doesn’t and problems are caught in testing; it is also where it’s easy to create many metrics in the blink of an eye. A notable (and rare) exception is load-intensive parts of the code that you want to monitor for performance reasons.

If you see latency in database calls creeping up, cache hit ratio going down or thread pool or queue size growing, you can frequently identify and fix the problem before it affects the users. Going even deeper, you want to keep an eye on the virtual machine (VM) and server performance. If your garbage collection times are increasing, if your thread count grows or if your server is showing signs of strain, it’s time to pay extra attention. If this were a car, these would be your check engine yellow light. If the light is on, you need to check it out soon but can keep on driving for a little bit longer.
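
For a concrete picture of where those signals come from, here is a minimal sketch that reads GC time, thread count and heap usage from the JVM’s standard management beans. In practice an agent (such as Datadog’s) or a metrics library polls these continuously; this one-shot snapshot is only illustrative.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.ThreadMXBean;

// A one-shot snapshot of the JVM-level "check engine" signals mentioned above:
// cumulative GC time, live thread count and heap usage, read from the standard
// management (JMX) beans that agents and metrics libraries poll continuously.
public class JvmHealthSnapshot {
    public static void main(String[] args) {
        long totalGcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            totalGcMillis += Math.max(gc.getCollectionTime(), 0); // -1 means "not available"
        }

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();

        System.out.printf("gcTimeMs=%d threads=%d heapUsedBytes=%d%n",
                totalGcMillis,
                threads.getThreadCount(),
                memory.getHeapMemoryUsage().getUsed());
    }
}
```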

The most popular way to measure application metrics is with code instrumentation. Presently, the Coda Hale (a.k.a. Dropwizard) Metrics framework and its ports to most popular languages are the way to go. Given the abundance of documentation and integrations with all sorts of systems, this isn’t difficult. There is a plethora of coding patterns and external libraries you can use to avoid writing boilerplate code. Note that you want to include as much of the request as possible in the measurement (serializing, logging, etc.), so you need to put this logic as high up the filter chain or middleware as you can.
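
As a sketch of that filter-level instrumentation, here is a servlet filter built on Dropwizard Metrics (metrics-core) plus the Servlet API. The metric names, the 5xx error rule and the constructor-injected registry are my own illustrative choices rather than anything prescribed here.

```java
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;
import java.io.IOException;

// Times every request and counts server errors at the filter level, so the
// measurement covers as much of the request as possible (serialization,
// logging, etc.). Metric names and the 5xx error rule are illustrative.
public class RequestMetricsFilter implements Filter {
    private final Timer requests;
    private final Meter errors;

    public RequestMetricsFilter(MetricRegistry registry) {
        this.requests = registry.timer("http.requests");
        this.errors = registry.meter("http.errors");
    }

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        try (Timer.Context ignored = requests.time()) { // records latency on close
            chain.doFilter(req, res);
        } finally {
            if (res instanceof HttpServletResponse
                    && ((HttpServletResponse) res).getStatus() >= 500) {
                errors.mark();
            }
        }
    }

    @Override
    public void init(FilterConfig filterConfig) { }

    @Override
    public void destroy() { }
}
```

Register a filter like this first in the chain (or its middleware equivalent outside the servlet world) so that everything downstream is included in the timing.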

The last piece of the puzzle is VM and server monitoring. Traditionally, this has been the territory of specialized ops tools, but Datadog, a service I will talk about in the next section, provides this out of the box. Its agents run on each machine or VM and collect server metrics, JVM/JMX data, IIS info (yep, it even supports Windows) and more.

Processing and Displaying the Data

Now your code is instrumented and generates all sorts of interesting data. What do you do with it? And where does the data go in the first place? All this data is useless if it just sits there. You want to be able to look at the trends, correlate them and get a basic set of statistical information such as averages, standard deviations and so on. You also want to see pretty graphs (who doesn’t?) and be able to create dashboards. This is the cool part, where all the previous work comes to fruition.

Datadog is a great way to start. It provides everything you need and more. All you have to do is install its agent on each box and configure the Metrics framework to send data to Datadog. Most Metrics ports already include Datadog adapters so there is no extra coding required.
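
For concreteness, here is a minimal sketch of wiring a reporter to a Metrics registry. The bundled ConsoleReporter stands in for a Datadog adapter, since the adapters included with the various Metrics ports follow the same forRegistry(...)/build()/start(...) pattern, typically configured with a Datadog API key or pointed at the local agent.

```java
import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.MetricRegistry;

import java.util.concurrent.TimeUnit;

// Attaches a scheduled reporter to a Metrics registry. ConsoleReporter ships
// with metrics-core and is used here only as a stand-in for a Datadog adapter.
public class ReporterSetup {
    public static void main(String[] args) throws InterruptedException {
        MetricRegistry registry = new MetricRegistry();

        ConsoleReporter reporter = ConsoleReporter.forRegistry(registry)
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build();
        reporter.start(10, TimeUnit.SECONDS); // push a report every 10 seconds

        registry.meter("demo.events").mark(); // something to report on
        Thread.sleep(11_000);                 // wait for one reporting cycle
        reporter.stop();
    }
}
```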

Now you have everything and can slice and dice the data whichever way you want. You can export data for further analysis. You can check it periodically for problems. You can even display dashboards on big screens around the office (looks cool, eh?).

Putting it All Together

I have seen all of this put together in a couple of weeks. However, if this whole area is new to you it might take longer, as you’ll be learning the concepts along with the tools. Sometimes the deadlines are tight and even a few weeks is a luxury. Still, launching a product without any insight into user experience or product health is not a good idea. If I had limited time and had to prioritize, I’d start with instrumenting user experience metrics and hooking them up to Datadog. Even for web pages you can start by monitoring only the REST services that power them, or the controller layer if you’re using model-view-controller (MVC). Soon you will realize how nice it is to have hard metrics and will want to add Google Analytics and deeper code monitoring to your product.

All of this might seem like a lot, but you’re only dealing with four tools here: Pingdom, Google Analytics, the Metrics framework and Datadog. If your product is an API, you don’t need Google Analytics, and if your product is internal, you don’t need Pingdom.

As I mentioned earlier, the Metrics framework has already been ported to multiple languages. But even if you have specific requirements that aren’t supported out of the box, it’s easy to extend. For example, at Guaranteed Rate we wrote and open-sourced a Metrics/Datadog adapter for .NET and a library for code instrumentation and Datadog integration for Clojure.

.NET is particularly known for being behind on DevOps. Until recently, one of the reasons was a lack of tools; that’s no longer the case. All the tools mentioned here, as well as in Parts 2 and 3, work with .NET and have adapters and ports where needed. With .NET entering its new open-source age, there are no more excuses to stay behind.

Give yourself two weeks, read the docs, go one step at a time, and you’ll be surprised how far along you can get!

What’s Next?

Part 2 will cover what you can do to maximize product uptime based on the data you now have. It also will discuss making sense of the data, automated alerting and troubleshooting when the inevitable production problems happen.

About the Author: Jack Korabelnikov

Jack Korabelnikov wrote his first computer program when he was 10. Fifteen years later, after success as a senior engineer at Orbitz.com, he transitioned into management. He still enjoys the tech side of things and continues to tinker with cool new tech. He is presently the VP of Engineering at Guaranteed Rate, leading the company’s technical evolution and the buildout of next-gen mortgage products.

Prior to that, he worked at Orbitz Worldwide in multiple business areas such as hotels, private label and B2B web services. He played a key role in rolling out agile and, most recently, was the main driver behind the continuous delivery and DevOps adoption across the organization.

[email protected] | LinkedIn

Filed Under: Blogs, Doin' DevOps Tagged With: data, devops, metrics, user experience, web operations, web performance
