LogicMonitor Releases IT Downtime Detection and Mitigation Study

Although 99.999% availability may be IT’s ambition, it’s far from reality. In March 2019, Google’s Gmail experienced a 4.5-hour global outage. In the same month, Facebook suffered a 14-hour outage, its largest to date, crippling app access worldwide.
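To put those figures in perspective, here is a rough sketch of the arithmetic behind "five nines." It assumes availability is measured over a 365-day year and that each outage counts as total downtime; the function names are illustrative and not taken from the report.

```python
# Assumption: a 365-day year (leap years ignored), measured in minutes.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_budget_minutes(target_pct: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return (100.0 - target_pct) / 100.0 * MINUTES_PER_YEAR

def yearly_availability_pct(outage_hours: float) -> float:
    """Best-case yearly availability if outage_hours were the only downtime."""
    return 100.0 * (1 - outage_hours * 60 / MINUTES_PER_YEAR)

if __name__ == "__main__":
    print(f"99.999% budget: {downtime_budget_minutes(99.999):.2f} min/year")
    print(f"After a 4.5-hour outage: {yearly_availability_pct(4.5):.4f}%")
    print(f"After a 14-hour outage: {yearly_availability_pct(14):.4f}%")
```

Under these assumptions, a five-nines target leaves a budget of roughly five minutes of downtime per year, so a single multi-hour outage exhausts it many times over.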

Downtime and low availability will likely affect every system eventually, especially when code changes are introduced. When outages occur, it’s the detection systems and smart mitigation processes that separate quick rebounds from long recoveries.

LogicMonitor has released its IT Downtime Detection and Mitigation Report, a survey of IT professionals meant to uncover trends and tactics impacting availability in 2020. The report surveyed over 300 IT decision-makers at organizations with 2,500 or more employees within the U.S., U.K., Australia and New Zealand.

Below, we examine the LogicMonitor report’s key findings. We’ll discover, on average, what sort of projects are causing the most outages and brownouts. We’ll also see what strategies top IT teams are using to prevent, detect and mitigate disruptions.

Which Transformation Initiatives Hurt Availability the Most?

Ninety-six percent of organizations surveyed had experienced at least one outage in the past three years. But why? One reason LogicMonitor conducted this report was to pinpoint which digital transformation initiatives and IT trends are the leading contributors to high-profile outages and brownouts.

When determining causation, answers revolve around novel cloud IT initiatives. Fifty-nine percent of respondents felt that mobile computing was causing more brownouts/outages. Fifty-seven percent found AI and edge computing were making outages more common. Other high-ranking reasons included digital transformation (57%) and IoT (53%).

According to the report, “LogicMonitor’s research suggests that IT decision-makers hold IT transformation initiatives responsible for increasingly frequent outages and brownouts.”

Though we generally view digital transformation initiatives in a positive light, they do come with caveats. The move from private to cloud infrastructure, for example, can bring unforeseen costs. Also, some authorities note a lack of talent specializing in cloud and hybrid environments. The report also suggested that rapidly adopted cloud technologies may need time to mature before they deliver positive business returns (and higher availability).

Critical Strategies for Preventing Outages

Another goal of the study was to discover what sort of tactics IT professionals are actively using to prevent outages.

The study exposed many strategies that IT folks are currently undertaking. The top three tactics were performing preventative maintenance (75%), reviewing system logs (71%) and increasing the capacity of systems (71%). Other top strategies included designing redundancy into systems, and keeping an eye on customer support tickets.

With the rise of a completely remote workforce, some IT executives do anticipate higher expenditures due to AI and cloud architecture, which could bring added pressure. However, it’s worth noting that the cloud hasn’t negatively affected all groups, or even reached all companies. According to another study conducted by Adobe and Fortune, only one in three organizations stores its data in a public cloud. Of course, one could argue that by staying out of the public cloud, such companies trade innovation for fewer breaking changes.

Proactive monitoring is another necessary part of keeping a vigilant eye on the digital ecosystem. According to the report, 74% of teams rely on proactive monitoring to detect outages. DevOps professionals have long championed continuous monitoring as a way to support more frequent releases and to provide a window into detrimental interruptions.
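As a rough illustration of what a proactive check might look like, the sketch below classifies a service from a batch of synthetic probe results. It is not drawn from the report: the function name, the thresholds and the working definitions (an outage as most probes failing, a brownout as a service that responds but is slow or partly failing) are all assumptions made for the example.

```python
from statistics import median

def classify_service(probes, slow_ms=1000.0, min_success=0.5):
    """Classify a service from probe results.

    probes: list of (ok, latency_ms) tuples from hypothetical health checks.
    slow_ms and min_success are illustrative thresholds, not recommendations.
    """
    if not probes:
        return "unknown"  # no data: do not guess
    ok_latencies = [latency for ok, latency in probes if ok]
    success_rate = len(ok_latencies) / len(probes)
    if success_rate < min_success:
        return "outage"  # most probes failed
    if success_rate < 1.0 or median(ok_latencies) > slow_ms:
        return "brownout"  # up, but degraded
    return "healthy"

if __name__ == "__main__":
    print(classify_service([(True, 120.0), (True, 95.0), (True, 110.0)]))
    print(classify_service([(True, 1800.0), (True, 2100.0), (True, 1500.0)]))
    print(classify_service([(False, 0.0), (False, 0.0), (True, 140.0)]))
```

In practice a team would run a check like this on a schedule and alert on any change of state, so that a brownout is noticed before it turns into a full outage.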

Careful analysis ahead of time, active monitoring to spot problems and continuous review of system logs were the most important tactics IT leaders reported using to mitigate outages and brownouts.

Best Practices: Prevent, Detect, Mitigate

In terms of takeaways (other than the ultimate LogicMonitor product sell), the IT Downtime Detection and Mitigation Report does reveal some interesting trends in what IT leaders view as the leading causes of system outages. It will be interesting to see how this evolves and which technologies mature (or remain headaches) in the future.

The report leaves IT professionals with some common, albeit helpful, advice to consider going forward: