DevOps.com

DevOps culture meets the SLA

By: David Owczarek on May 6, 2014

The goal of every SaaS provider should be 100% availability, and that goal should be ingrained in their culture. But it’s far more common to see a series of nines. Numerically, these are clearly not the same value, and culturally, they are a world apart. When you have an availability target that is less than 100%, you are explicitly tolerating some amount of downtime. That changes both the requirements and the perspective of everyone involved in delivering the application to your customers. When we think of the nines, we invariably fall into thinking about the differences between them, especially the technical differences. But to get to 100%, you need culture: a culture that is fanatically dedicated to achieving that goal.

A review

This is well-trodden ground, but let’s take another look at the most common SLA measurement and what it means. System availability is measured as the ratio of the minutes the system was actually available to the total minutes in some period of time: a week, a month, a quarter, a year. We usually talk in terms of the monthly SLA for a 30-day month. At 99.9% availability, that allows 43.2 minutes of downtime per month (about 43.8 if you use an average-length month of 365.25/12 days). 99.99% allows 4.32 minutes per month, and 99.999% just 25.9 seconds per month.
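The arithmetic behind those figures is simple enough to sketch. The snippet below is only an illustration; the function name `allowed_downtime_minutes` and its `days` parameter are made up for this example, and the 30-day default mirrors the month length assumed above.

```python
def allowed_downtime_minutes(availability_pct: float, days: float = 30) -> float:
    """Minutes of downtime permitted in a period at a given availability target."""
    total_minutes = days * 24 * 60  # a 30-day month has 43,200 minutes
    return total_minutes * (1 - availability_pct / 100)


# Print the downtime allowance for each of the common "nines".
for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.2f} minutes ({minutes * 60:.1f} seconds)")
```

For a 30-day month this works out to roughly 43.2 minutes, 4.32 minutes and 25.9 seconds respectively. Calling `allowed_downtime_minutes(99.9, days=365.25 / 12)` gives about 43.8 minutes, which is the figure you get when you use an average-length month instead.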

What I find interesting about these numbers is the implication of trying to assure a higher SLA. You can achieve 99.9% with process and dedication. But that won’t get you to 99.99%. To achieve that, you need at least fail-over automation. To get the fifth nine, you need self-healing automation. Each additional nine cuts the allowed downtime by an order of magnitude. Likewise, the effort and expertise required to attain that availability are greater.

What’s your relationship to the SLA?

When there is a disruption of service, the first obligation an organization has is to restore that service as quickly as possible with the least amount of risk. In mature organizations, there are typically automated and self-healing processes that provide a first line of defense for the most common types of failures. But there are always unanticipated problems that can manifest at any time. The expertise required to address those varies depending on the nature of the problem. It might require someone who primarily identifies as development, DBA, or security. Everyone else attending to that problem is secondary to the key resource required to fix it. I have spent many hours feeling helpless during outage war rooms because I was not the person with the specific skills required to solve the problem. When you expand this problem out, you begin to see that in order to prevent the outage from occurring again, you need that key resource to be thinking about outage and problem mitigation all the time, rather than only in response to a specific incident.

Sometimes, during times of trouble like this, I’ll be approached and asked what we should do to address the problem. On many occasions, my answer has been, “we need to build a system that doesn’t have outages.” After an awkward silence, the asker begins to realize that the askee is actually quite serious. And I am. The problem is that to truly attain that goal, or something reasonably close to it, you need everyone involved at all times at some level.

SLAs are meaningless

About 15 years ago, I experienced a catastrophic failure in a name-brand storage array I inherited by way of employment. It was one of those managed, completely redundant, fault-tolerant arrays that consumed a considerable amount of OpEx, with a 99.5% availability in the SLA. When it inevitably failed (you knew that was coming, right?) and I looked up the contract, I found that we could not claim a material breach of performance unless the availability of the storage device fell below 85%. (For scale, 85% availability over a 30-day month works out to 108 hours, or four and a half days, of downtime.) I was stunned. My business would be dead long before 85%.

Likewise, you can have a 99.95% month, but if the tiny amount of downtime happens to impact your biggest and most demanding customer, the SLA is immaterial. They become angry and are at risk as a customer because you have failed them. What I learned as an operations executive is that the commitment behind the SLA is far more important than the SLA itself. Put another way, you can make bad situations into positive ones if you handle them exceptionally well every time. That means ruthless dedication to uptime from everyone in the company, from rank-and-file individual contributors to the CEO, all the time, not just during the incident. If that really is a goal – and it’s infused into the culture of the company – everyone you interact with will see it and understand that commitment.

There’s another artifact of SLAs that can be detrimental. When the focus is on the availability number, the target is too low. The goal of every SaaS provider should be 100% uptime. When your goal is 99.9%, you effectively have a budget of downtime that you can draw from while still meeting your goal. Here’s an example from my own experience. I once had a service that required periodic downtime for maintenance. One month, I had incurred 30 minutes of downtime via an unplanned incident against that service, which had a 99.9% SLA, or roughly 43 minutes of allowed downtime in a month. We also had a software update planned for that month that was going to require 20 minutes of downtime. Together, the two would have exceeded the allowance, so I had the update moved to the following month so that it wouldn’t blow the SLA for the current month. It didn’t make much sense at the time and didn’t feel very customer-centric either, but as Dr. Eliyahu M. Goldratt wrote in The Goal, “Tell me how you will measure me and I will tell you how I will behave.” I had a chance to hit the SLA for that month, so I did everything I could to hit it. Given that I deferred a software update that would have delivered new features to my customers, the choice I made seemed arbitrary and was counter to being a customer-focused organization.
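To make the "budget of downtime" idea concrete, here is a minimal sketch of the bookkeeping behind that decision. The `DowntimeBudget` class and its method names are hypothetical, invented for this illustration, and the 30-day period is an assumption carried over from the review above.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass
class DowntimeBudget:
    """Tracks downtime against an availability target for one period (hypothetical helper)."""

    target_pct: float          # availability target, e.g. 99.9
    period_days: float = 30    # assumed 30-day month
    used_minutes: float = 0.0  # downtime already incurred this period

    @property
    def total_minutes(self) -> float:
        # Total downtime the target tolerates over the whole period.
        return self.period_days * MINUTES_PER_DAY * (1 - self.target_pct / 100)

    @property
    def remaining_minutes(self) -> float:
        return self.total_minutes - self.used_minutes

    def record_downtime(self, minutes: float) -> None:
        self.used_minutes += minutes

    def can_afford(self, planned_minutes: float) -> bool:
        # True if the planned downtime still fits inside what is left.
        return planned_minutes <= self.remaining_minutes


budget = DowntimeBudget(target_pct=99.9)
budget.record_downtime(30)  # the unplanned incident
print(f"Remaining this month: {budget.remaining_minutes:.1f} minutes")
print("20-minute update fits this month:", budget.can_afford(20))
```

With 30 minutes already spent, about 13 minutes remain, so the 20-minute update does not fit. The metric answers the question "can I still hit the number?" and says nothing about whether deferring the update is good for customers, which is exactly the behavior the Goldratt quote describes.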

It takes a culture

One truly exciting thing about the DevOps movement is that it advocates the spread of knowledge throughout the organization. From my selfish operations perspective, that means that individuals who formerly would only be contacted in the event of dire emergency now have a much more visceral connection to the availability and performance of their services. By being exposed to the operations experience, those formerly external parties now see challenges and problems. And most people in our industry love a good challenge. What I get from this exchange is the injection of new perspectives and a diverse and healthy debate about subjects that have long since been worn out living only in the realm of operations. To have the attention of that level and breadth of expertise is a fantastic opportunity for me to evangelize availability and performance. More importantly, if all that talent is super motivated to achieve 100% uptime, performance and availability are going to follow. Anything less is a compromise and I don’t want to compromise on behalf of my customers. I want them delighted 100% of the time.

Filed Under: Blogs, Doin' DevOps Tagged With: SLA, uptime

Powered by Techstrong Group, Inc.

© 2022 · Techstrong Group, Inc. All rights reserved.