DevOps Practice

How to Improve Your Uptime Strategy

Outages are inevitable, but unplanned downtime often carries substantial costs: not only recovery effort and lost revenue, but also damage to customer satisfaction, brand reputation and employee morale.

While there is no foolproof way for companies to avoid outages, there are steps they can take to improve uptime. Development teams should determine what uptime measurements and thresholds are sufficient for the business, take preventive actions to increase the mean time between failures (MTBF) and implement tools that surface coding errors quickly to minimize the mean time to repair (MTTR).

Defining and Measuring Uptime

Uptime is defined as the time when an application or service is operational. A service-level agreement (SLA) between service providers and customers defines the level of uptime providers have to meet. But “operational” can mean different things to different organizations. For example, a provider may have an SLA for how long it takes its system to process an event. Events processed too slowly violate the SLA even though the system is still operational at some level.
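
As a rough illustration, here is a minimal sketch (in Python, assuming a hypothetical 500-millisecond per-event SLA and a caller-supplied handler function) of how a team might flag events that are processed too slowly even though the service itself is still up:

```python
import time

# Hypothetical SLA threshold: each event must be processed within 500 ms.
SLA_MAX_SECONDS = 0.5

def process_with_sla_check(event, handler):
    """Run the handler and flag the event if processing exceeds the SLA threshold."""
    start = time.monotonic()
    result = handler(event)
    elapsed = time.monotonic() - start
    if elapsed > SLA_MAX_SECONDS:
        # The service is still "operational," but this event violated the SLA.
        print(f"SLA violation: event took {elapsed:.3f}s (limit {SLA_MAX_SECONDS}s)")
    return result
```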

Companies that provide digital services and applications should clearly define SLAs with customers and then measure their performance against those objectives. Dropping below the agreed-upon expectations can lead to customer frustration and contract termination. Also, the time and effort development teams spend updating customers, tickets and status pages can distract them from working on new service features.

In SLAs, uptime is measured in “nines,” with each additional nine harder to achieve because it allows significantly less downtime. Here’s a common breakdown for measuring uptime:

  • One nine (90% uptime) is 36.5 days of downtime per year.
  • Two nines (99% uptime) is 3.65 days of downtime per year.
  • Three nines (99.9% uptime) is 8.77 hours of downtime per year.
  • Four nines (99.99% uptime) is 52.6 minutes of downtime per year.
  • Five nines (99.999% uptime) is 5.26 minutes of downtime per year.

Not all companies need five nines. Because each additional nine requires increased effort and cost, development teams should determine how to balance this investment of time and resources in a way that best meets business needs, while focusing on resolving issues quickly when they occur.
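
For teams weighing that trade-off, the arithmetic behind the figures above is straightforward. The following sketch (assuming a 365.25-day year, which matches the numbers listed) converts an uptime target into an annual downtime budget:

```python
# Convert an uptime target into an annual downtime budget.
# Assumes a 365.25-day year (8,766 hours), matching the figures above.
HOURS_PER_YEAR = 365.25 * 24

def downtime_hours_per_year(uptime_percent):
    """Maximum downtime per year, in hours, allowed by a given uptime target."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)

for target in (90, 99, 99.9, 99.99, 99.999):
    print(f"{target}% uptime allows {downtime_hours_per_year(target):.2f} hours of downtime per year")
# 99.9% -> ~8.77 hours; 99.99% -> ~0.88 hours (about 52.6 minutes)
```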

Preventing Errors in Production

Validating code can prevent errors in production and increase MTBF. This type of testing needs to happen in the early stages of the development cycle. To reach three, four or five nines of uptime, automated testing is necessary. While it involves an upfront investment in writing unit tests, automated testing saves time and increases efficiency, because a failing unit test points to a specific block of code that is not working as expected.
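
As a simple illustration, the sketch below uses Python's built-in unittest module and a hypothetical apply_discount function; if either assertion fails, the problem is isolated to that one block of code rather than somewhere in the wider service:

```python
import unittest

def apply_discount(total, percent):
    """Hypothetical function under test: apply a percentage discount to an order total."""
    return round(total * (1 - percent / 100), 2)

class ApplyDiscountTest(unittest.TestCase):
    def test_ten_percent_discount(self):
        self.assertEqual(apply_discount(200.00, 10), 180.00)

    def test_zero_discount_leaves_total_unchanged(self):
        self.assertEqual(apply_discount(99.99, 0), 99.99)

if __name__ == "__main__":
    unittest.main()
```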

After validating code, another step developers can take is a canary deployment: a strategy in which new code is pushed out to only a small subset of users. If something goes wrong, only a small percentage of users is affected, and developers can roll back to the stable version. Once developers are comfortable with how the new version behaves, they can gradually roll it out to all users.
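
In practice, canary traffic splitting is often handled by a load balancer or service mesh, but a minimal application-level sketch (assuming a string user ID and two hypothetical request handlers) looks like this:

```python
import hashlib

CANARY_PERCENT = 5  # Hypothetical rollout: send 5% of users to the new release.

def in_canary(user_id, percent=CANARY_PERCENT):
    """Deterministically assign a user to the canary group based on a hash of their ID."""
    # Hashing keeps each user on the same version across requests.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent

def handle_request_v1(user_id):
    return f"stable response for {user_id}"

def handle_request_v2(user_id):
    return f"canary response for {user_id}"

def handle_request(user_id):
    # A bad release only affects the small canary group and can be rolled back quickly.
    return handle_request_v2(user_id) if in_canary(user_id) else handle_request_v1(user_id)
```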

Uncovering Context with Errors

Receiving an alert for an error is not enough to enable quick remediation. Developers need to know when errors occur, what the impact was, and where the errors reside in the code. Having this level of awareness and context minimizes mean time to repair, and the best way to achieve this is through application monitoring.

Monitoring provides developers with automated, real-time insight into how their service is performing. When alerted to an issue, developers can combine source code management information with monitoring data to identify the change that caused the problem and route the issue to the appropriate developer to resolve it. Developers can’t rely on manual processes and customer feedback to be alerted to issues; these methods are not fast enough. It can take hours to receive and understand an issue and then manually roll back to an earlier version or roll forward to a fixed one.
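
As one example, the sketch below initializes an error-monitoring SDK with release and environment metadata so captured errors can be tied back to the change that introduced them. It assumes the Sentry Python SDK, a placeholder DSN and a hypothetical charge_customer function; any monitoring tool with release tracking works similarly:

```python
import sentry_sdk

# Assumed: the Sentry Python SDK with a placeholder DSN and release string.
sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",
    release="checkout-service@1.4.2",   # ties errors back to the deploy that shipped them
    environment="production",
)

def charge_customer(order):
    # Hypothetical business logic standing in for a real payment call.
    raise RuntimeError("payment gateway timeout")

def process_order(order):
    try:
        charge_customer(order)
    except Exception as exc:
        # The alert carries the stack trace plus release and environment context,
        # so the issue can be routed to the developer who made the change.
        sentry_sdk.capture_exception(exc)
        raise
```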

Code is complex and fragile, and it touches virtually every aspect of business and life. When code stops working, it’s expensive and disruptive: even a brief period of downtime can hurt reputation, retention and revenue.

To improve uptime, development teams should adopt code validation and monitoring strategies. By balancing prevention and awareness of errors, they can reduce broken code in production, decrease time to resolution when errors do occur and improve service uptime.

Neil Manvar

Neil Manvar is a solutions engineering manager at Sentry, an application monitoring and error tracking software company that helps software teams discover, triage and prioritize errors in real time.
