DevOps.com

Site Reliability Engineering: How to Make the Operations Side of DevOps Actually Work

By: Dominic Wellington on April 28, 2017

The DevOps movement has often been accused of focusing too much on the first half (Development) and not enough on the second half (Operations). Certainly more attention has been paid to deploying payloads than to operating running systems, leading to the dismissal of the handover between Dev and Ops as “throwing it over the wall.”

Lately, we have seen the emergence of a new focus on the stability and reliability of the systems that are the targets of those deployments, with the creation of the new practice of site reliability engineering (SRE). This has been a welcome addition to the IT toolbox, but it can still seem to put all the onus on Ops teams to catch whatever comes over the wall.

The Problem with Site Reliability Engineering = IT Infrastructure

In SRE, the IT infrastructure is expected to be highly automated and self-healing in the face of any event. Here is the problem: That approach works well for foreseeable events, but less well for unforeseen ones. As a result, it works well in environments that run only a few types of workload, but do so at very large scale. If you’re thinking that sounds like Google and Facebook, you’d be right.

What this amounts to is massive engineering of the infrastructure to withstand known or foreseeable problems. However, typical enterprise IT environments are not like that. A large bank might have thousands of applications, each of which must accommodate changes on a pace that is dictated from outside the IT environment. Applying assumptions from one environment to the other is asking for trouble.

To illustrate this, let’s look at an example from outside IT. April 15 recently marked the anniversary of the sinking of the Titanic in 1912. The RMS Titanic itself embodied the principles of SRE: it was engineered to prevent or survive all manner of emergencies. The ship was equipped with state-of-the-art everything to transport its passengers in comfort and safety across the Atlantic.

As we all know, that plan did not quite work out. A combination of unexpected changes in the environment (more icebergs than normal), business imperatives to maintain speed and mishandled warnings about the ice from other ships led the Titanic to disaster.

While the consequences of an IT failure are rarely quite as dramatic as the sinking of an ocean liner, many of the same factors apply. IT administrators may believe that in the worst case their backups and disaster recovery plans will be sufficient to handle any problems, but these plans are designed around known and foreseeable problems. Unexpected circumstances can easily lead to cascading failures, which is why it is critical to be ready for the unexpected when it, inevitably, occurs.

The Problem with Traditional IT Ops = Static Models

The key flaw in IT Ops is reliance on static models of the world. These models come in many shapes – the CMDB is the “model” that many think of first, but static rulesets are also a model. The most dangerous models, though, are the ones that are invisible: namely, the filters that are put in place to determine which alerts and events are even worth considering.

This is the flaw that ultimately sank the Titanic: The crew were unable to correctly prioritize events and react to them in time to avoid failure. It is also the flaw that causes countless far more minor issues every day in data centers everywhere. There is the filesystem that filled up with logs, taking down the database that the critical business application relied on, and the filter that had not forwarded that alert because it usually was harmless. Or there is the loss of one leg of the redundant network link, which was ignored because the other leg was still up, so there was still a link, until that one failed too and an entire site went dark.

The key factor is that the filters were not wrong to suppress those alerts. By definition, informational or warning alerts are not the same as major or critical ones, and most of the time they can be safely ignored, to be dealt with later.
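The kind of static filter described here can be sketched in a few lines of Python. This is only an illustration: the severity scale, the `Alert` fields and the sample alerts are all hypothetical, not taken from any particular monitoring product.

```python
from dataclasses import dataclass

# Hypothetical severity scale; names and ordering are assumptions for illustration.
SEVERITY = {"info": 0, "warning": 1, "major": 2, "critical": 3}


@dataclass(frozen=True)
class Alert:
    host: str
    check: str
    severity: str


def static_filter(alerts, threshold="major"):
    """Forward only alerts at or above a fixed severity threshold."""
    floor = SEVERITY[threshold]
    return [a for a in alerts if SEVERITY[a.severity] >= floor]


# Made-up sample alerts echoing the scenarios in the text.
alerts = [
    Alert("db01", "disk_usage_80pct", "warning"),   # the filesystem filling up
    Alert("edge-sw1", "link_a_down", "warning"),    # one leg of the redundant link
    Alert("app03", "process_crash", "critical"),
]

# Only the critical alert gets through; both early warnings are dropped.
forwarded = static_filter(alerts)
```

The filter does exactly what it was told to do, and that is the point: the two warnings that foreshadow an outage never reach a human.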

Every now and then, though, a pattern of those alerts, if understood correctly, can identify a future problem in the making. These developing issues could be nipped in the bud, if only Ops teams had enough hours in the day to look at them and review them. But, of course, they don’t—every Ops team I have ever met is drowning in issues, and behind those there is a long and lengthening to-do list.
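One simple way to spot such a pattern, shown here as a rough sketch rather than a recommended design, is to count how often the same low-severity alert repeats on the same host within a sliding time window. The function name, the ten-minute window, the threshold of three and the sample events are all assumptions made for this example.

```python
from collections import defaultdict, deque


def find_recurring(events, window_seconds=600, min_count=3):
    """Return sorted (host, check) pairs seen at least min_count times
    within any span of window_seconds.

    events: iterable of (timestamp_seconds, host, check) tuples,
    assumed to be sorted by timestamp.
    """
    recent = defaultdict(deque)  # (host, check) -> timestamps still in window
    flagged = set()
    for ts, host, check in events:
        key = (host, check)
        window = recent[key]
        window.append(ts)
        # Drop timestamps that have fallen out of the sliding window.
        while window and ts - window[0] > window_seconds:
            window.popleft()
        if len(window) >= min_count:
            flagged.add(key)
    return sorted(flagged)


# Made-up warning-level events: (seconds, host, check).
events = [
    (0, "db01", "disk_usage_warning"),
    (120, "web02", "cpu_warning"),
    (240, "db01", "disk_usage_warning"),
    (480, "db01", "disk_usage_warning"),
    (5000, "web02", "cpu_warning"),
]

recurring = find_recurring(events)
```

Here the repeated disk warning on `db01` is flagged, while the two isolated CPU warnings on `web02` are not. Fixed thresholds like these are themselves a static model, so a sketch of this kind only shifts the problem one step.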

What Can DevOps Do to Make Operations Better

New approaches are emerging to make the principles behind SRE more widely accessible and applicable. In particular, more dynamic noise reduction and correlation is now possible, to sift the important alerts from the constant background noise and put them together into a picture of what is really happening. The key is to do this in real time, without a human having to laboriously plan out in advance every scenario they might need to know about.

Gartner has called this new approach Algorithmic IT Operations, or AIOps. The idea is to bring together all possible sources of events, whether those are alerts from the compute or network infrastructure, transaction slowdowns reported by an APM tool, automated deployments being run from a CI/CD toolchain, or anything else that might conceivably be relevant. All of this information then can be sifted by algorithms to understand what is actually important to Ops and brought to the attention of the right specialists who can work on the issues and get them solved fast. Part of that process is also the integration with systems of record (which generally means IT service management), and with automation and orchestration tools that can accelerate remediation activities.
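As a toy illustration of bringing events from several sources together, the sketch below groups events that affect the same service and arrive close together in time. It is a deliberately naive heuristic and not the algorithm used by any AIOps product; the event fields, the two-minute gap and the sample data are assumptions.

```python
def correlate(events, max_gap_seconds=120):
    """Group events into clusters by service and time proximity.

    events: iterable of dicts with 'ts', 'source', 'service' and 'message'.
    An event joins the open cluster for its service if it arrives within
    max_gap_seconds of that cluster's latest event; otherwise it starts
    a new cluster.
    """
    clusters = []
    open_by_service = {}
    for ev in sorted(events, key=lambda e: e["ts"]):
        cluster = open_by_service.get(ev["service"])
        if cluster is None or ev["ts"] - cluster[-1]["ts"] > max_gap_seconds:
            cluster = []
            clusters.append(cluster)
            open_by_service[ev["service"]] = cluster
        cluster.append(ev)
    return clusters


# Made-up events from three hypothetical sources.
events = [
    {"ts": 0, "source": "ci_cd", "service": "payments", "message": "deployment started"},
    {"ts": 45, "source": "apm", "service": "payments", "message": "transaction slowdown"},
    {"ts": 50, "source": "infra", "service": "search", "message": "disk usage warning"},
    {"ts": 90, "source": "infra", "service": "payments", "message": "database connection errors"},
    {"ts": 4000, "source": "apm", "service": "payments", "message": "transaction slowdown"},
]

clusters = correlate(events)
for cluster in clusters:
    sources = ", ".join(ev["source"] for ev in cluster)
    print(f"{cluster[0]['service']}: {len(cluster)} event(s) from {sources}")
```

With this sample data the deployment, the slowdown and the database errors on `payments` land in one cluster, which is the kind of picture a specialist would want to see in one place. Real environments would need far richer signals than a service name and a timestamp.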

This is how we can get the Ops side of DevOps to where it needs to be: able to accommodate the ever-accelerating pace of change, whether it comes from Dev teams wanting to run ever more frequent deployments, from business users needing help with their own goals, or from the next unexpected issue to come down the line.

— Dominic Wellington

