IT Administration

Are We Nearing the End of IT Service Outages?

Over the last two decades, the evolution of IT has been something more like a revolution; in some cases, the transformation has even underpinned real-life uprisings. Today we’re acutely aware of advances in technology and the internet, particularly as the global COVID-19 pandemic creates a sweeping reliance on tele-everything and online commerce to sustain companies’ livelihoods.

The current surge in online traffic underscores the massive strides made across IT, ensuring critical support functions largely go uninterrupted—quietly impacting millions of lives. Behind the scenes, ITOps teams continue working to ensure the customer experience is resilient and incidents are addressed even before end users notice an issue. The irony of these seamless digital experiences, though, is that the delivery is built on escalating complexity. Providers seeking agility, speed and efficiency are increasingly adopting virtualization, microservices, containerization and cloud-based services. These capabilities often meet their designated needs, but the trade-offs can be costlier than expected.

IT Complexity Demands Smart Machine Intelligence

The industry has rallied around AIOps, looking to AI and machine learning to scale to modern complexity. The goal is to evolve from human-driven monitoring and analysis to automated detection and remediation. The human inability to keep up with service alerts derived from machine-driven operations has created a gulf between machine speed and the traditional, event-correlation approach to troubleshooting.

The issue is clearly—and commonly—illustrated by the war-room posture assumed when a service degradation occurs and ITOps teams are besieged by demands for situation reports, impact assessments and projected resolution timelines. Amid a “sea of red,” when the operator has no idea which of the volumes of trouble tickets to work on first or where to focus attention, it’s undeniable that event correlation is wholly inadequate for navigating the modern-day minefields of data exhaust. It’s like diagnosing an illness solely by taking a patient’s temperature.

Applying advanced AI/ML will improve the thermometer, but it won’t fundamentally improve the patient’s diagnosis. Similarly, it’s much more promising to work through service interruptions by taking a holistic approach, tracking and connecting behavioral attributes and anomalies across the IT estate instead of solely relying on event data.

Behavioral correlation provides a deeper, more contextual understanding of what’s happening at the service level, starting with dynamic baselines of “normal” activity levels upon which ML is applied to detect and flag anomalies captured across a broad variety of real-time data. With an aggregated, service-level topology, it’s much easier to assess service health through availability and risk, so that decision-makers can prioritize what’s most pressing and visualize potential business impact.
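To make the idea concrete, here is a minimal sketch of a dynamic baseline with ML-style anomaly flagging. The class name, window size and z-score threshold are illustrative assumptions, not part of any specific product; real AIOps platforms use far richer models, but the core pattern—learn “normal” from recent samples, then flag deviations—is the same.

```python
from collections import deque
from math import sqrt

class DynamicBaseline:
    """Maintain a rolling baseline of 'normal' activity for one metric
    and flag samples that deviate beyond a z-score threshold.
    (Illustrative sketch; names and thresholds are assumptions.)"""

    def __init__(self, window=60, threshold=3.0):
        self.window = deque(maxlen=window)  # recent samples define "normal"
        self.threshold = threshold          # z-score cutoff for an anomaly

    def observe(self, value):
        """Return True if `value` is anomalous against the current baseline."""
        anomalous = False
        if len(self.window) >= 10:  # require enough history for a stable baseline
            mean = sum(self.window) / len(self.window)
            var = sum((x - mean) ** 2 for x in self.window) / len(self.window)
            std = sqrt(var)
            if std > 0 and abs(value - mean) / std > self.threshold:
                anomalous = True
        self.window.append(value)   # the new sample joins the baseline
        return anomalous

# Normal traffic hovers around 100 requests/sec; the spike to 500 is flagged.
baseline = DynamicBaseline(window=60, threshold=3.0)
flags = [baseline.observe(v)
         for v in [100, 102, 98, 101, 99, 100, 103, 97, 101, 100, 500]]
```

The key design choice is that the baseline is dynamic: it is recomputed from a sliding window rather than a fixed threshold, so gradual shifts in workload don’t trigger false alarms, while sudden deviations do.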

Traversing New Realities: Technology as a Front-Line Offense

As the global COVID-19 pandemic continues to transform how people live and work, it’s clear that technology is the most common and often the most important tool in moving forward. By now most of the workforce has straightened out the initial kinks in telework and figured out general best practices. With some sense of stability now in place, leaders are shifting their focus to ensuring business continuity and instituting the necessary infrastructure to sustain new operational models indefinitely.

In organizations where digital transformation is already underway, the adoption of new tools and protocols will benefit from that existing foundation. Companies and agencies already making strides in harnessing IT to boost operations and outcomes will already be on their way to achieving better visibility and improved efficiency. Implementing behavioral correlation will, in those cases, apply the power of ML to aid in faster root-cause analysis and resolution—and a better overall customer experience.

That’s not to say that organizations less mature in their digital transformation won’t also benefit from eschewing the ineffective methodologies of event correlation and reactive ITOps. By evolving beyond legacy systems of piecemeal products and services, IT teams instead can leapfrog ahead to sophisticated analytics, data synthesis and comprehensive modeling that monitors for and detects anomalous activity—correlating broader behaviors, not just events. By jettisoning the focus on events and instead incorporating behavioral correlation into IT service, troubleshooting and remediation become fluid.
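The leap from event correlation to behavioral correlation can be sketched in a few lines: instead of presenting each component-level anomaly as its own ticket, anomalies are grouped through a service topology so the operator sees one prioritized, service-level incident. The topology and component names below are hypothetical, invented purely for illustration.

```python
from collections import defaultdict

# Hypothetical service topology: component -> the service it belongs to.
TOPOLOGY = {
    "db-primary": "checkout",
    "payment-api": "checkout",
    "cart-cache": "checkout",
    "cdn-edge": "content",
}

def correlate(anomalies):
    """Group component-level anomalies into service-level incidents and
    rank services by how many of their components are misbehaving."""
    incidents = defaultdict(list)
    for component, signal in anomalies:
        service = TOPOLOGY.get(component, "unknown")
        incidents[service].append((component, signal))
    return sorted(incidents.items(), key=lambda kv: len(kv[1]), reverse=True)

# Four raw anomalies collapse into two incidents; "checkout" surfaces first
# because three of its components are degrading together.
anomalies = [
    ("db-primary", "latency spike"),
    ("payment-api", "error-rate spike"),
    ("cart-cache", "eviction surge"),
    ("cdn-edge", "latency spike"),
]
ranked = correlate(anomalies)
```

This is the essence of the "sea of red" fix described above: the operator no longer triages four tickets, but one incident with a visible blast radius and an obvious place to start.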

Amid seismic shifts in the demands on IT services, it’s more crucial than ever to deliver reliable services and capabilities. That means getting to the root causes of problems quickly and resolving them faster—a feat that’s now a reality through the integration of service metrics and IT automation.

The world of IT has changed dramatically over the past 10 years, and especially in the past few months. Traditional processes and tools are inadequate to manage the speed and complexity of the modern IT environment. A new era is arriving, and not a moment too soon.

David Link

David Link is the chairman of ScienceLogic and an IT visionary. He founded and built ScienceLogic by identifying large emerging markets, gaining intimate knowledge of customer IT problems, challenging conventional wisdom and bringing targeted, innovative products to market. As ScienceLogic's CEO, David used his market knowledge and customer focus to lead the ScienceLogic IT management system to dramatically exceed the needs and expectations of clients. Prior to founding ScienceLogic in 2003, David was Senior Vice President and a corporate officer at Interliant, Inc., where he led the establishment of Interliant's strong presence in the ASP/MSP market. He previously held senior management positions within IBM's Software Division, leading the development of Internet commerce products. David also spent nine years in IT solutions with CompuServe, building innovative global online communication solutions while establishing the market for business and consumer online services. David earned his BS at Denison University.
