The State of IT Ops

In our most recent article, we highlighted some of the results of our recent IT Ops survey. We noted, for example, that 80 percent of IT teams are alerted to critical events via email and 41 percent of teams are experiencing excessive alerts. These stats go a long way to explaining why IT Ops is experiencing so many issues with alert fatigue.

In this article, we will delve into the second half of our findings and examine our results on how IT teams reported they handle alerts after they come in.

Off Scheduler

Efficient IT operations teams realize the need for directed action when critical issues come in. This means that when significant IT issues occur, efficient teams will have alerts sent directly to the on-call engineer, who can manage the incident most effectively. The engineer will be the individual who is best-equipped to manage and quickly resolve the issue at hand.

Our survey, however, showed that most teams lack automatic schedulers that could ensure the correct individual is alerted. Rather, alerts are often sent to the team as a whole, or to individuals who are unable to resolve the issue. According to our statistics, alerts frequently go to the whole team (67.3 percent) instead of the on-call individual.

The impact of this poor workflow is twofold: issues reach individuals who are unable to resolve them, and time is wasted in the process. Ideally, team managers should invest in scheduling software that sends alerts directly to the on-call engineer. By sending alerts to the wrong individual or team, companies add fuel to the fire of alert fatigue and burnout.

Unfortunately, most teams lack the scheduling technology to enable this outcome. Our survey found that only 28 percent of teams use an automated scheduler to ensure the correct individual or team is alerted.

When teams fail to use a system that ties alerts to on-call schedules, managers receive alerts via email, SMS or phone that they are not necessarily responsible for resolving. Managers then need to track down the on-call engineer who can manage the incident. The pain of this scenario is multiplied if the alert comes after hours or during the night. Again, this workflow leads to significant delays in resolution time.
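To make the idea of tying alerts to an on-call schedule concrete, here is a minimal sketch in Python. It is not how any particular scheduling product works: the `Shift` class, the `route_alert` function and the engineer names are hypothetical, and the rotation is a simple assumed set of three daily shifts. The point is only that the alert is addressed to one person, chosen by the schedule, rather than to the whole team.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class Shift:
    engineer: str   # hypothetical contact name
    start: time     # shift start (inclusive)
    end: time       # shift end (exclusive); may wrap past midnight

    def covers(self, moment: time) -> bool:
        if self.start <= self.end:
            return self.start <= moment < self.end
        # Overnight shift, for example 22:00 to 06:00
        return moment >= self.start or moment < self.end


def on_call_engineer(shifts, when: datetime) -> str:
    """Return the engineer whose shift covers `when`."""
    for shift in shifts:
        if shift.covers(when.time()):
            return shift.engineer
    raise LookupError("no engineer scheduled for " + when.isoformat())


def route_alert(message: str, shifts, when: datetime) -> dict:
    """Address an alert to the on-call engineer only, not the whole team."""
    return {"to": on_call_engineer(shifts, when), "message": message}


# Assumed rotation, for illustration only.
ROTATION = [
    Shift("alice", time(6, 0), time(14, 0)),
    Shift("bob", time(14, 0), time(22, 0)),
    Shift("carol", time(22, 0), time(6, 0)),
]

# A 3 a.m. alert goes straight to the overnight engineer.
alert = route_alert("Database unreachable", ROTATION, datetime(2017, 6, 1, 3, 0))
```

With a lookup like this in place, no manager has to be woken up to work out who should handle the incident.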

No Time for Downtime

As IT Ops teams are often tasked with maintaining server or site uptime, the prospect of downtime is something engineers work hard to avoid. IT Ops teams are under significant pressure to maintain five-nines (99.999 percent) uptime.

For this outcome to occur, teams need to automate menial tasks and ensure they are quickly alerted to complex downtime issues that demand their attention. The goal needs to be getting the issue in front of the individual best able to resolve it as quickly as possible.

However, 45 percent of teams in our survey indicated that it takes more than 10 minutes before they even begin to resolve critical IT issues. Given that downtime costs more than $5,000 per minute, according to Gartner, a 10-minute delay alone can cost companies or their clients more than $50,000.
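The arithmetic behind these figures is easy to check. The short Python sketch below assumes a flat rate of $5,000 per minute (the lower bound of the cost cited above) and a non-leap year; the function names are illustrative only. It compares the yearly downtime allowance implied by five-nines uptime with the cost of a 10-minute response delay.

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600 minutes in a non-leap year
COST_PER_MINUTE_USD = 5_000        # assumed flat rate: lower bound of the cited cost


def downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year allowed by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def downtime_cost_usd(minutes: float, cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Cost of an outage lasting `minutes` at a flat per-minute rate."""
    return minutes * cost_per_minute


budget = downtime_budget_minutes(0.99999)   # five nines
delay_cost = downtime_cost_usd(10)          # the 10-minute response delay

print(f"Five-nines budget: {budget:.2f} minutes per year")
print(f"Cost of a 10-minute delay: ${delay_cost:,.0f}")
```

Under these assumptions, five nines allows only about 5.26 minutes of downtime per year, so a single incident that sits unattended for 10 minutes uses up nearly twice the annual allowance before anyone has started work on it.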

The significant expense caused by downtime has been seen in many recent incidents. For example, in late May 2017, British Airways suffered a spectacular downtime event due to the failure of a power system in the company’s data center. After this major IT system failure, the airline was forced to cancel all its flights leaving from London airports, affecting more than 75,000 passengers. Reimbursement costs were approximately $68 million, and the stock price of the parent company dropped 2.8 percent.

Failure to Collaborate

In the world of DevOps and IT Ops, we are often told of the importance of collaboration and how it leads to great innovation. Indeed, the goal of DevOps is to bring developers and operations into closer proximity, so they can collaborate more effectively. Better collaboration means better innovation and decreased costs.

However, our survey shows that while IT Ops teams might have the mindset for collaboration, they lack the necessary tools to ensure that collaboration happens. According to our results, 86 percent of teams collaborate on tickets through email.

Yet email is a truly poor way to collaborate when critical incidents come up. Email has few ways to bring critical messages to the top of the stack and keep them top of mind. Instead, email’s default is to let messages get buried under other emails. For teams attempting to collaborate on key issues, email fails to offer rapid communication methods that highlight critical messages.

Tools are available that enable better and faster communication among team members. Slack, whose motto is that it’s “where work gets done,” is popular among the IT Ops engineers who drive effective collaboration. Yet our survey showed that only 16 percent of IT Ops teams use ChatOps as a form of collaboration during incidents. On their own, tools such as Slack lack a method for bringing important issues to the top. If attached to incident management systems such as OnPage, however, Slack and similar tools can alert users that a critical issue needs to be addressed.
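As a rough illustration of what such an attachment can look like, the Python sketch below formats an incident as a message for a Slack incoming webhook and flags critical incidents with a channel-wide mention so they are harder to miss. This is a generic sketch, not OnPage's integration: the function names, the message layout and the severity labels are assumptions, and the webhook URL must be one you create in your own Slack workspace.

```python
import json
import urllib.request


def build_slack_payload(summary: str, severity: str, on_call: str) -> dict:
    """Format an incident as a Slack incoming-webhook message body."""
    # "<!channel>" is Slack's syntax for notifying everyone in the channel;
    # here it is reserved for critical incidents only.
    prefix = "<!channel> " if severity == "critical" else ""
    return {"text": f"{prefix}[{severity.upper()}] {summary} (on call: {on_call})"}


def post_to_slack(webhook_url: str, payload: dict) -> int:
    """POST the payload as JSON to the webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Hypothetical incident details, for illustration only.
payload = build_slack_payload("API latency above threshold", "critical", "carol")
# post_to_slack("<your incoming webhook URL>", payload)  # requires a real webhook
```

Keeping the mention for critical incidents only matters here: if every message notifies the whole channel, the chat tool simply recreates the alert fatigue described earlier.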

Improving Workflow

The goal of improving how teams handle alerts is to minimize downtime and improve efficiency. These shouldn’t seem like lofty goals. However, by not developing specific workflows that bring issues to the top and enable them to be resolved quickly, IT Ops teams are adding to the time until resolution and the ensuing cost.

At its core, the lack of specific workflows and tools to enable improved workflow points to teams that have much room for improvement. IT Ops teams need to find better ways to manage day-to-day operations so that when critical incidents do arise, the path toward resolution is quick and doesn’t require engineers to embark on a guessing game.

Conclusion

Unfortunately, the state of IT Ops has not progressed to the level of maturity one would hope. There are still many poor alerting and workflow practices that could be improved if a more robust incident management system were employed.

There are a number of additional insights that can be garnered from our survey. I encourage you to take a moment and download a copy of the study and see what you can learn that will help your team.

— Orlee Berlove