Outages are inevitable. But unplanned downtime often carries substantial costs, not only in recovery and lost revenue but also in customer happiness, brand reputation and employee morale.
While there is no foolproof way for companies to avoid outages, there are steps they can take to improve uptime. Development teams should determine what measurements and thresholds for uptime are sufficient for the company, take preventative actions to increase the mean time between failures (MTBF) and implement tools that provide awareness of coding errors to minimize the mean time to repair or resolve (MTTR).
Uptime is defined as the time when an application or service is operational. A service-level agreement (SLA) between service providers and customers defines the level of uptime providers have to meet. But “operational” can mean different things to different organizations. For example, a provider may have an SLA for how long it takes its system to process an event. Events processed too slowly violate the SLA even though the system is still operational at some level.
Companies that provide digital services and applications should clearly define SLAs with customers and then measure their performance against those objectives. Dropping below the agreed-upon expectations can lead to customer frustration and contract termination. Also, the time and effort development teams spend updating customers, tickets and status pages can distract them from working on new service features.
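In SLAs, uptime is measured in “nines.” Each additional nine is harder to achieve because it allows only one-tenth the downtime of the level before it. Here’s a common breakdown for measuring uptime, assuming a 365-day year:
Two nines (99%): about 3.65 days of downtime per year
Three nines (99.9%): about 8.76 hours of downtime per year
Four nines (99.99%): about 52.6 minutes of downtime per year
Five nines (99.999%): about 5.26 minutes of downtime per year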
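The downtime budget behind each level of nines is simple arithmetic. The following is a minimal Python sketch, assuming a 365-day year; the function name `allowed_downtime_minutes` is illustrative and does not come from any particular tool.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year (525,600 minutes)


def allowed_downtime_minutes(nines: int) -> float:
    """Return the yearly downtime budget, in minutes, for a number of nines.

    Two nines means 99% uptime, three nines means 99.9%, and so on.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10 ** -nines  # e.g. three nines -> 0.001
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(2, 6):
        percent = 100 * (1 - 10 ** -n)
        minutes = allowed_downtime_minutes(n)
        print(f"{percent:.{n - 2}f}% uptime allows about {minutes:.2f} minutes of downtime per year")
```

Because each extra nine divides the budget by ten, moving from three nines to four cuts the yearly allowance from hours to minutes, which is why the cost of each step rises so sharply.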
Not all companies need five nines. Because each additional nine requires increased effort and cost, development teams should determine how to balance this investment of time and resources in a way that best meets business needs, while focusing on resolving issues quickly when they occur.
Validating code can prevent errors in production and increase MTBF. This validation needs to happen in the early stages of the development cycle. To reach the three-, four- and five-nines levels of uptime, automated testing is necessary. While such testing involves an upfront investment in writing unit tests, it saves time and increases efficiency, because a failing unit test points to a specific block of code that is not working as expected.
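As an illustration of how a unit test pins a failure to one block of code, here is a small sketch written in the style pytest discovers (functions prefixed with `test_`). The function `is_within_sla` and the 500 ms threshold are hypothetical, chosen only to echo the event-processing SLA example above.

```python
def is_within_sla(latency_ms: float, threshold_ms: float) -> bool:
    """Return True when an event was processed within the agreed threshold."""
    if latency_ms < 0 or threshold_ms <= 0:
        raise ValueError("latency must be >= 0 and threshold must be > 0")
    return latency_ms <= threshold_ms


def test_fast_event_meets_sla():
    assert is_within_sla(120, threshold_ms=500)


def test_slow_event_violates_sla():
    assert not is_within_sla(750, threshold_ms=500)


def test_boundary_counts_as_met():
    # An event that takes exactly the threshold is treated as compliant here.
    assert is_within_sla(500, threshold_ms=500)


def test_negative_latency_is_rejected():
    try:
        is_within_sla(-1, threshold_ms=500)
    except ValueError:
        return
    raise AssertionError("expected ValueError for negative latency")
```

If someone later changes the comparison to `<`, only `test_boundary_counts_as_met` fails, which tells the developer exactly which behavior broke before the change reaches production.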
After validating code, another step developers can take is to use a canary deployment. This is a strategy where new code is pushed out to a subset of users. If something goes wrong, then only a small percentage of users are affected, and developers can roll back to a stable version. Once developers are comfortable with the results of the new version, they can gradually deploy the new code to all users.
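One way to picture the routing behind a canary deployment is a deterministic hash that places each user in one of 100 buckets. This is a sketch under assumptions, not how any specific platform implements it: the names `in_canary` and `pick_version`, the version labels and the bucket count are all hypothetical.

```python
import hashlib


def in_canary(user_id: str, rollout_percent: int) -> bool:
    """Return True if this user falls inside the canary slice.

    The same user always lands in the same bucket, so raising the
    percentage only adds users and never swaps existing ones out.
    """
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return bucket < rollout_percent


def pick_version(user_id: str, rollout_percent: int,
                 stable: str = "v1", canary: str = "v2") -> str:
    """Choose which version serves the user (version labels are placeholders)."""
    return canary if in_canary(user_id, rollout_percent) else stable


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    for percent in (0, 5, 25, 100):
        count = sum(in_canary(u, percent) for u in users)
        print(f"{percent}% rollout -> {count} of {len(users)} users on the canary")
```

Setting `rollout_percent` back to 0 acts as the rollback: every user returns to the stable version without a redeploy.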
Receiving an alert for an error is not enough to enable quick remediation. Developers need to know when errors occur, what the impact was, and where the errors reside in the code. Having this level of awareness and context minimizes mean time to repair, and the best way to achieve this is through application monitoring.
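To see how repair time feeds into uptime, the sketch below derives MTBF, MTTR and availability from a list of incidents in an observation window. The function name `reliability_summary` and the sample incidents are made up for illustration; the relationship it relies on, availability = MTBF / (MTBF + MTTR), is the standard steady-state formula.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Incident = Tuple[datetime, datetime]  # (outage start, outage end)


def reliability_summary(incidents: List[Incident],
                        window_start: datetime,
                        window_end: datetime) -> Tuple[timedelta, timedelta, float]:
    """Return (MTBF, MTTR, availability) for non-overlapping incidents in the window."""
    if not incidents:
        raise ValueError("at least one incident is needed to compute MTBF and MTTR")
    total = window_end - window_start
    downtime = sum((end - start for start, end in incidents), timedelta())
    uptime = total - downtime
    mttr = downtime / len(incidents)   # average time to repair
    mtbf = uptime / len(incidents)     # average operating time per failure
    availability = uptime / total      # equals MTBF / (MTBF + MTTR)
    return mtbf, mttr, availability


if __name__ == "__main__":
    start = datetime(2024, 1, 1)
    end = start + timedelta(days=30)
    sample = [
        (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),
        (datetime(2024, 1, 20, 2, 0), datetime(2024, 1, 20, 3, 30)),
    ]
    mtbf, mttr, availability = reliability_summary(sample, start, end)
    print(f"MTBF: {mtbf}, MTTR: {mttr}, availability: {availability:.5f}")
```

With the number of failures held constant, halving MTTR halves total downtime, which is why faster diagnosis shows up directly in the uptime figure.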
Monitoring provides developers with automated real-time insights into how their service is performing. When alerted to issues, developers can use source code management information with monitoring insights to identify the change that caused the problem and get the issue to the appropriate developer to resolve it. Developers can’t rely on manual processes and customer feedback to be alerted to issues. These methods are not fast enough. It can take hours to receive and understand an issue and then manually roll back to an earlier version or forward to a new version.
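A simple version of combining source code management information with monitoring data is to look up the most recent deployment before the first occurrence of an error. The sketch below assumes a hypothetical `Deploy` record and `suspect_deploy` helper; real monitoring tools expose this data in their own formats, and the latest change is a starting point for investigation, not proof of cause.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence


@dataclass(frozen=True)
class Deploy:
    commit: str            # hypothetical commit identifier
    author: str            # who to route the issue to
    deployed_at: datetime  # when the change reached production


def suspect_deploy(deploys: Sequence[Deploy],
                   first_error_at: datetime) -> Optional[Deploy]:
    """Return the latest deploy at or before the first error, or None if there is none."""
    candidates = [d for d in deploys if d.deployed_at <= first_error_at]
    return max(candidates, key=lambda d: d.deployed_at, default=None)


if __name__ == "__main__":
    history = [
        Deploy("a1b2c3", "dana", datetime(2024, 3, 1, 9, 0)),
        Deploy("d4e5f6", "lee", datetime(2024, 3, 1, 14, 0)),
        Deploy("0789ab", "sam", datetime(2024, 3, 2, 11, 0)),
    ]
    suspect = suspect_deploy(history, datetime(2024, 3, 1, 14, 7))
    if suspect is not None:
        print(f"Investigate commit {suspect.commit}; notify {suspect.author}")
```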
Code can be complex and fragile. It touches virtually every aspect of business and life. When code stops working, it’s expensive and disruptive. Even brief downtime can damage reputation, retention and revenue.
To improve uptime, development teams should adopt code validation and monitoring strategies. By balancing prevention and awareness of errors, they can reduce broken code in production, decrease time to resolution when errors do occur and improve service uptime.