DevOps culture meets the SLA

The goal of every SaaS provider should be 100% availability – it should be ingrained in their culture. But it’s far more common to see a series of nines. Numerically, these are clearly not the same value, and culturally, they are a world apart. When you have an availability target that is less than 100%, you are explicitly tolerating some amount of downtime. That changes the requirements and the perspective of everyone involved in delivering the application to your customers. When we think of the nines, I think we invariably fall into thinking about the differences between them – especially the technical differences. But to get to 100%, you need culture: a culture that is fanatically dedicated to achieving that goal.

A review

This is well-trodden ground, but let’s take another look at the most common SLA measurement and what it means. System availability is measured as the ratio of the minutes the system was actually available to the total minutes in some period of time – a week, a month, a quarter, a year. We usually talk in terms of the monthly SLA for a 30-day month, which is 43,200 minutes. 99.9% availability allows 43.2 minutes of downtime per month. 99.99% allows 4.32 minutes per month, and 99.999% allows just 25.9 seconds per month.
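To make the arithmetic concrete, here is a minimal sketch in Python. The function name and the 30-day default are my own choices for illustration, not part of any standard SLA tooling.

```python
def downtime_budget_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Return the minutes of downtime a target allows over the period.

    Assumes the period is measured in whole days of 24 * 60 minutes.
    """
    period_minutes = period_days * 24 * 60  # 43,200 minutes for 30 days
    return period_minutes * (100.0 - availability_pct) / 100.0


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = downtime_budget_minutes(target)
        print(f"{target}% -> {minutes:.2f} minutes ({minutes * 60:.1f} seconds)")
```

Passing a different period_days (7 for a week, 365 for a year) shows how the same percentage translates into a very different absolute allowance.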

What I find interesting about these numbers is the implication of trying to assure a higher SLA. You can achieve 99.9% with process and dedication. But that won’t get you to 99.99%. To achieve that, you need at least fail-over automation. To get the fifth nine, you need self-healing automation. Each additional nine cuts the permitted downtime by an order of magnitude. Likewise, the effort and expertise required to attain that availability is greater.

What’s your relationship to the SLA?

When there is a disruption of service, the first obligation an organization has is to restore that service as quickly as possible with the least amount of risk. In mature organizations, there are typically automated and self-healing processes that provide a first line of defense for the most common types of failures. But there are always unanticipated problems that can manifest at any time. The expertise required to address those varies depending on the nature of the problem. It might require someone who primarily identifies as development, or DBA, or security. Everyone else attending to that problem is secondary to the key resource required to fix it. I have spent many hours feeling helpless during outage war rooms because I was not the person with the specific skills required to solve the problem. When you expand this problem out, you begin to see that in order to prevent the outage from occurring again, you need that key resource to be thinking about outage and problem mitigation all the time, rather than only in response to a specific incident.

Sometimes, during times of trouble like this, I’ll be approached and asked what we should do to address the problem. On many occasions, my answer has been, “we need to build a system that doesn’t have outages.” After an awkward silence, the asker begins to realize that the askee is actually quite serious. And I am. The problem is that to truly attain that goal, or something reasonably close to it, you need everyone involved at all times at some level.

SLAs are meaningless

About 15 years ago, I experienced a catastrophic failure in a name-brand storage array I inherited by way of employment. It was one of those managed, completely redundant, fault-tolerant arrays that consumed a considerable amount of OpEx, with 99.5% availability in the SLA. When it inevitably failed (you knew that was coming, right?) and I looked up the contract, I found that we could not claim a material breach of performance unless the availability of the storage device fell below 85%. I was stunned. My business would be dead long before 85%.

Likewise, you can have a 99.95% month, but if the tiny amount of downtime happens to impact your biggest and most demanding customer, the SLA is immaterial. They become angry and are at risk as a customer because you have failed them. What I learned as an operations executive is that the commitment behind the SLA is far more important than the SLA itself. Put another way, you can make bad situations into positive ones if you handle them exceptionally well every time. That means ruthless dedication to uptime from everyone in the company, from rank-and-file individual contributors to the CEO, all the time, not just during the incident. If that really is a goal – and it’s infused into the culture of the company – everyone you interact with will see it and understand that commitment.

There’s another artifact of SLAs that can be detrimental. When the focus is on the availability number, the target is too low. The goal of every SaaS provider should be 100% uptime. When your goal is 99.9%, you effectively have a budget of downtime that you can draw from while still meeting your goal. Here’s an example from my own experience. I once had a service that required periodic downtime for maintenance. One month, I had incurred 30 minutes of downtime via an unplanned incident against that service – a service with a 99.9% SLA, which allows roughly 43 minutes of downtime in a 30-day month. We also had a software update planned for that month that was going to require 20 minutes of downtime. I had the update moved to the following month so that the combined 50 minutes wouldn’t blow the SLA for the current month. It didn’t make much sense at the time and didn’t feel very customer-centric either, but as Dr. Eliyahu M. Goldratt wrote in The Goal, “Tell me how you will measure me and I will tell you how I will behave.” I had a chance to hit the SLA for that month, so I did everything I could to hit it. Given that I deferred a software update that would have delivered new features to my customers, the choice I made seemed arbitrary and was counter to being a customer-focused organization.
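The decision I faced that month comes down to a simple budget check, sketched here in Python. The function and parameter names are hypothetical, and the 30-day month is an assumption; the point is only to show how a downtime budget turns a scheduling question into a pass/fail test against the number.

```python
def remaining_budget_minutes(sla_pct: float, incurred_minutes: float,
                             period_days: int = 30) -> float:
    """Minutes of downtime still available before the SLA is missed."""
    period_minutes = period_days * 24 * 60
    budget = period_minutes * (100.0 - sla_pct) / 100.0
    return budget - incurred_minutes


def maintenance_fits(sla_pct: float, incurred_minutes: float,
                     planned_minutes: float, period_days: int = 30) -> bool:
    """True if the planned downtime fits in what is left of the budget."""
    remaining = remaining_budget_minutes(sla_pct, incurred_minutes, period_days)
    return planned_minutes <= remaining


if __name__ == "__main__":
    # The month described above: 99.9% SLA, 30 minutes already lost,
    # and a 20-minute update waiting to go out.
    left = remaining_budget_minutes(99.9, incurred_minutes=30)
    print(f"Remaining budget: {left:.1f} minutes")
    print("Update fits this month:", maintenance_fits(99.9, 30, 20))
```

Nothing in that check asks whether customers would benefit from the update, which is exactly the problem with managing to the number.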

It takes a culture

One truly exciting thing about the DevOps movement is that it advocates the spread of knowledge throughout the organization. From my selfish operations perspective, that means that individuals who formerly would only be contacted in the event of a dire emergency now have a much more visceral connection to the availability and performance of their services. By being exposed to the operations experience, those formerly external parties now see its challenges and problems. And most people in our industry love a good challenge. What I get from this exchange is the injection of new perspectives and a diverse and healthy debate about subjects that had long since been worn out by living only in the realm of operations. To have the attention of that level and breadth of expertise is a fantastic opportunity for me to evangelize availability and performance. More importantly, if all that talent is super motivated to achieve 100% uptime, performance and availability are going to follow. Anything less is a compromise and I don’t want to compromise on behalf of my customers. I want them delighted 100% of the time.