
Defining Availability, Maintainability and Reliability in SRE

In the world of reliability engineering, you’ll frequently encounter the three “-ability” words: Availability, maintainability and reliability. They sound similar and have similar meanings. In fact, these words may seem so similar that it can be tempting to use them interchangeably.

That would be a mistake. Availability, maintainability and reliability all have distinct—if related—meanings, and they each play different roles in reliability operations.

Definitions of Availability, Maintainability and Reliability

Let’s start by succinctly defining each of the terms.

Availability, Defined

Availability is the extent to which an IT resource is ready to perform a task when requested.

Thus, an application or server that is responding to requests is available (even if it takes longer to respond than desired or otherwise operates suboptimally, in which case it may have a reliability issue but not an availability issue).

Availability is usually measured as a percentage. A resource that has 99% availability, for example, is one that is up and responding 99% of the time.
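To make the percentage concrete, here is a minimal Python sketch that computes availability from total uptime and downtime over an observation window. The function name and the 30-day window are illustrative assumptions, not part of any particular monitoring tool.

```python
from datetime import timedelta


def availability_percent(uptime: timedelta, downtime: timedelta) -> float:
    """Availability as the share of the observation window the resource was up."""
    total = uptime + downtime
    if total <= timedelta(0):
        raise ValueError("observation window must be positive")
    # timedelta / timedelta yields a float ratio
    return 100.0 * (uptime / total)


# Hypothetical example: a 30-day window containing 7.2 hours of downtime.
window = timedelta(days=30)
down = timedelta(hours=7.2)
print(f"{availability_percent(window - down, down):.2f}%")  # 99.00%
```

In other words, a 99% availability level still allows more than seven hours of downtime per month.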

Maintainability, Defined

Maintainability is a measure of how quickly and easily a resource can be fixed when something goes wrong.

If a buggy application release can be quickly fixed by rolling back to a stable version, the application would have a high degree of maintainability. On the other hand, if you have a server that needs to be rebuilt manually after it fails, it’s not very maintainable.
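Maintainability is commonly quantified as mean time to repair (MTTR): the average time between a failure and the restoration of service. The sketch below assumes you have failure and restoration timestamps for each incident. The incident data and the function name are hypothetical, chosen to contrast a quick rollback with a manual rebuild.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average time from failure to restoration across a list of incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((restored - failed for failed, restored in incidents), timedelta(0))
    return total / len(incidents)


# Hypothetical incidents fixed by rolling back to a stable release.
incidents_rollback = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 10)),
    (datetime(2024, 2, 1, 14, 0), datetime(2024, 2, 1, 14, 14)),
]
# Hypothetical incidents that required manually rebuilding a server.
incidents_rebuild = [
    (datetime(2024, 1, 9, 2, 0), datetime(2024, 1, 9, 6, 0)),
    (datetime(2024, 3, 3, 11, 0), datetime(2024, 3, 3, 17, 0)),
]
print(mean_time_to_repair(incidents_rollback))  # 0:12:00
print(mean_time_to_repair(incidents_rebuild))   # 5:00:00
```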

Reliability, Defined

Reliability is the extent to which a resource functions as required upon request (as opposed to simply being available).

For example, if your users require an application to process each transaction within one second, and it does this 99% of the time while also maintaining a 99% availability level, that would be a relatively reliable application. In contrast, an application that responds to almost all requests but suffers from high latency or error rates would not be very reliable, although it might be highly available.
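One way to express this kind of requirement is as the fraction of requests that both succeed and finish within the latency target. The following Python sketch assumes a simple per-request record with a latency and a success flag; the Request class and the one-second default are illustrative assumptions based on the example above.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_s: float  # time taken to process the transaction, in seconds
    succeeded: bool   # False if the request returned an error


def reliability(requests: list[Request], latency_slo_s: float = 1.0) -> float:
    """Fraction of requests that both succeeded and met the latency requirement."""
    if not requests:
        raise ValueError("no requests to evaluate")
    good = sum(1 for r in requests if r.succeeded and r.latency_s <= latency_slo_s)
    return good / len(requests)


# Two fast successes, one slow success and one fast error.
sample = [Request(0.2, True), Request(0.8, True), Request(1.5, True), Request(0.3, False)]
print(reliability(sample))  # 0.5
```

Note that every request in the sample received a response, so the service was available for all of them, yet only half met the requirement.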

Differences Between Availability, Maintainability and Reliability

The simplest way to spell out the differences between availability, maintainability and reliability is to highlight what’s unique about each concept.

Availability

Unlike maintainability and reliability, availability is essentially binary at any given moment: a system is either available or it's not. Availability status can change over time, and measuring it across a period yields a percentage, but there is no such thing as a system being partially available at a single point in time.
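Because each observation is binary, a figure like 99% only emerges when many observations are aggregated over time. The sketch below assumes availability is sampled by periodic health-check probes, each recorded as up (True) or down (False). The one-minute probe interval and the function name are hypothetical.

```python
def availability_from_probes(probe_results: list[bool]) -> float:
    """Each probe is a binary up/down observation; availability is the share that were up."""
    if not probe_results:
        raise ValueError("no probe results")
    # sum() counts True as 1 and False as 0
    return 100.0 * sum(probe_results) / len(probe_results)


# Hypothetical: one probe per minute for an hour, with 3 failed probes.
probes = [True] * 57 + [False] * 3
print(availability_from_probes(probes))  # 95.0
```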

Availability also stands out because it is often the central metric in SLAs and SLOs. Contracts typically commit to a certain level of availability, expressed as a percentage. They sometimes also specify metrics such as response times, which reflect reliability, but availability is usually the headline commitment within service agreements.
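An availability target in an SLA or SLO implies a fixed amount of downtime that can be tolerated in each measurement window. This minimal sketch converts a target percentage into that downtime budget and checks how much of it remains. The 30-day window, the 99.9% target and the function names are assumptions for illustration.

```python
def downtime_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Maximum downtime allowed in the window while still meeting the availability target."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100.0)


def remaining_budget_minutes(
    slo_percent: float, observed_downtime_minutes: float, window_days: int = 30
) -> float:
    """Downtime budget left after subtracting downtime already observed (negative = breached)."""
    return downtime_budget_minutes(slo_percent, window_days) - observed_downtime_minutes


print(round(downtime_budget_minutes(99.9), 1))          # 43.2
print(round(remaining_budget_minutes(99.9, 30.0), 1))   # 13.2
```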

Maintainability

Maintainability is unique in that it’s a fairly subjective concept. A system that one SRE considers easy to maintain could seem difficult to maintain to another. The methodologies that engineers use to optimize maintainability can vary, too. For instance, someone from a DevOps background may be more likely to focus on optimizing maintainability within the software delivery chain than someone trained in classic site reliability engineering, which focuses more on automating IT operations through code than on the software delivery process.

That said, automation tools help virtually every team maximize maintainability, no matter their preferences or background.

Reliability

Reliability stands out because it reflects how well a system performs. Here again, this is a somewhat subjective metric, because performance requirements may vary from one user to another. Nonetheless, reliability is the most effective means of measuring whether a system meets the performance levels it needs to, even if those requirements change over time.

Why Do Availability, Maintainability and Reliability Matter?

Because availability, maintainability and reliability each measure different aspects of a system’s status, putting them together is a useful means of gaining insight into the overall reliability of a system.

In order to be reliable, a system requires both availability and maintainability. A system can’t be reliable if it’s not available. It’s also unlikely to be highly reliable if it takes a long time to fix issues due to low maintainability.
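A common approximation ties maintainability directly to availability: steady-state availability is roughly MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair. The sketch below uses hypothetical numbers to show that two systems that fail equally often can end up with very different availability purely because of how quickly they are repaired.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Approximate long-run share of time up: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical: both systems fail about once every 500 hours.
print(f"{steady_state_availability(500, 0.2):.2%}")  # 99.96% (12-minute rollback)
print(f"{steady_state_availability(500, 5.0):.2%}")  # 99.01% (5-hour manual rebuild)
```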

At the same time, by focusing on availability, maintainability and reliability individually, you can drill down into specific issues within the IT resources you manage. For instance, you might have systems that are high-performing when they’re available, but that have low reliability rates because of availability issues. In that case, you’ll know that an investment in increased availability is likely to yield the greatest reward for increasing overall reliability.
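To see which factor is holding reliability back, you can split it into availability and the quality of service delivered while the system is up. The sketch below assumes a hypothetical per-attempt record noting whether the system responded at all, how long it took and whether it returned an error. Overall reliability is then the product of the two parts. The example models the scenario just described: every served request is fast and correct, but 30% of attempts hit an unavailable system.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    served: bool      # False if the system was unavailable (no response at all)
    latency_s: float  # processing time in seconds; only meaningful when served
    succeeded: bool   # True if the response was correct (no error)


def decompose(attempts: list[Attempt], latency_slo_s: float = 1.0) -> dict[str, float]:
    """Split reliability into availability and quality of service while available."""
    if not attempts:
        raise ValueError("no attempts to evaluate")
    served = [a for a in attempts if a.served]
    good = [a for a in served if a.succeeded and a.latency_s <= latency_slo_s]
    return {
        "availability": len(served) / len(attempts),
        "quality_when_up": len(good) / len(served) if served else 0.0,
        "overall_reliability": len(good) / len(attempts),
    }


attempts = [Attempt(False, 0.0, False)] * 3 + [Attempt(True, 0.3, True)] * 7
print(decompose(attempts))
# {'availability': 0.7, 'quality_when_up': 1.0, 'overall_reliability': 0.7}
```

Here quality while up is perfect, so the entire reliability gap comes from availability, which points to availability as the place to invest.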

The Bottom Line

While availability, maintainability and reliability all reflect the quality of an IT resource, they measure quality in different ways. Tracking each category separately helps you identify the weakest links in overall system performance and health.

At the same time, it’s important to compare and correlate availability, maintainability and reliability data so that you have continuous insight into a resource’s status. When you can track each metric separately but also combine them to get a complete picture of your system, you are in the best position to optimize reliability operations.



JJ Tang

JJ is the co-founder of Rootly (YC S21), a Slack-native incident management solution. He is based in Toronto, Canada and previously led product at Instacart and IBM. He is obsessed with developer productivity, F1, and his adopted dog.
