The amount of routine toil that site reliability engineers (SREs) perform declined slightly in the last year even though IT environments in general are becoming more complex to manage.
An annual survey of 300 SREs conducted by Catchpoint, a provider of an IT monitoring platform, in collaboration with VMware and the DevOps Institute suggests that the amount of toil—defined as low-level manual tasks—declined 15% in the last year.
Only 22% of respondents claim to be scientifically measuring toil, so, to a degree, the survey results are anecdotal. Overall, the survey, published this week, finds SREs are, not surprisingly, spending the bulk of their time responding to incidents and outages, followed by participating in postmortem analysis and on-call rotations. The primary sources of toil cited in the survey are too much technical debt, lack of alignment of priorities and goals, lack of appreciation of business value, lack of training and support, and lack of collaboration.
In terms of automation adoption, the most common use cases cited are release management, infrastructure management and application management. The tools most commonly employed by SREs are used to monitor infrastructure, networks and application performance. Adoption of artificial intelligence (AI) tools to manage IT operations (AIOps) remains low. The survey also finds respondents are split roughly evenly between centralized (42%) and decentralized (38%) approaches to managing IT. Another 20% said they have adopted a hybrid approach. Lack of visibility across the stack (53%) was the most cited cloud application monitoring challenge.
The survey also finds only half of respondents continually refine their service level objectives (SLOs).
Leo Vasiliou, director of product marketing for Catchpoint, said it’s clear that SREs are becoming more proactive in terms of preventing and containing the impact of IT incidents. The challenge they face is lack of visibility and context as those IT environments become more complex. Complexity is increasing, in part, because of the rise of cloud-native computing platforms running microservices-based applications that have lots of dependencies.
Less clear right now is the percentage of IT environments managed by SREs today. In theory, a single SRE can do the same amount of work as multiple IT administrators by automating processes using various DevOps tools rather than relying on traditional graphical tools. However, SREs are hard to find and retain. It’s also not always apparent what level of automation must be attained to qualify as an SRE or, for that matter, what certifications may be required. Interest in becoming an SRE is rising, in part, because SREs typically command higher salaries.
In the meantime, it’s clear organizations will need to find ways to automate the management of IT to a much greater extent. Most organizations cannot afford to hire a small army of IT professionals to manage IT environments that become more complex with each passing day. DevOps best practices, next-generation observability tools that promise to provide more context, and AI platforms will all be part of the mix. The challenge is finding a way to implement those tools before IT management spins further out of control.