How IT Ops Can Exceed Service Level Objectives in Digital Transformations

IT Ops can manage the pace of change by defining service level objectives, and more, starting in dev environments

Mobile applications, data lakes, microservices, data visualizations, SaaS integrations, automations, IoT data streams, machine learning models: all of these technical capabilities, whether in proofs of concept, pilots or scaling production environments, and whether for customer-facing capabilities or employee workflows, are developed, deployed and enhanced faster today than ever before.

I spoke to Jason Walker, field CTO at AIOps platform BigPanda, about how IT Ops is affected by the speed of deployment, the breadth of technology services that transforming businesses are developing, greater security threats, and rising reliability and performance requirements.

Walker believes that of all the things we’re trying to do in IT—more, faster, smarter, safer, innovative, secure, reliable—it’s the speed that’s the driving force. “The most significant impact is velocity; the dev-test-deploy cycle time is drastically reduced,” he said. “Without the right guardrails, that breeds unnecessary complexity and a gradual loss of operational awareness.”

He explained that many of the barriers that once slowed down development teams are addressable today when developing service-based architectures on the cloud. “Developers realize that the traditional constraints, either technical dependencies or organizational capability, are much reduced,” Walker noted. “Developing in the cloud for a cloud-based service, leveraging an ecosystem of microservices for inputs, an agile team can move very quickly.”

Why IT Ops Can’t Slow Down Transformations

IT Ops can’t easily say “no” or “slow down” to business stakeholders investing in digital transformation to improve customer experiences and gain competitive advantages with data, analytics and machine learning. Some IT leaders attempted that command-and-control approach during the early days of public clouds, but today, DevOps practices, SRE responsibilities and AIOps capabilities are integral to mainstream IT Ops teams in keeping up with transformational velocities.

So instead of saying “no,” progressive IT Ops teams say “yes, but” by defining service level objectives (SLOs), capturing service level indicators (SLIs) and managing to error budgets.

Walker agreed. “SLIs, SLOs and error budgets are a very useful way to manage the critical inputs and outputs at the interfaces between microservices and at a high level across the business service, allowing developers to keep changing the ‘interior’ pieces.”

These tools change the operating model and mindset across the entire IT organization by exposing trade-offs to business stakeholders. For example, if a web application has a 99.9% SLO, the whole IT team has a 0.1% error budget. If the SLO is missed, a service level policy identifies areas of investment to improve performance, reliability, security and automation or to address technical debt.
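To make the error budget arithmetic concrete, here is a minimal Python sketch. It assumes an availability-style SLO measured either over a time window or over a request count; the function names and the 30-day window are illustrative choices, not something Walker or BigPanda prescribes.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, window: timedelta) -> timedelta:
    """Time-based error budget: how much of the window may fail the SLO."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100, exclusive")
    return window * (1 - slo_percent / 100)


def budget_remaining(slo_percent: float, total_requests: int,
                     failed_requests: int) -> float:
    """Request-based error budget: fraction still unspent.

    A negative result means the budget is overspent and the SLO is missed.
    """
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100, exclusive")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = total_requests * (1 - slo_percent / 100)
    return 1 - failed_requests / allowed_failures


if __name__ == "__main__":
    # A 99.9% SLO over 30 days leaves roughly 43.2 minutes of budget.
    print(allowed_downtime(99.9, timedelta(days=30)))
    # 250 failures out of 1,000,000 requests spends about a quarter of it.
    print(round(budget_remaining(99.9, 1_000_000, 250), 3))
```

A team could review the remaining fraction with business stakeholders each sprint: plenty of budget left supports shipping faster, while a negative value is the trigger for the service level policy described above.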

What IT Teams Should Do to Implement Service Level Objectives

Defining service level objectives helps bring business stakeholders, development teams, SREs and IT Ops together to align on reliability objectives and trade-offs. It’s an important step, but not sufficient for teams that want to exceed service levels during digital transformation.

Walker offered several technical recommendations for IT Ops groups transitioning to service level objectives:

Where there are dependencies between microservices and interfaces, frequent check-ins between adjacent teams are required.
APIs, including their inputs and outputs, need to remain consistent over time, with changes communicated and receipt confirmed.
Knowledge management processes delivering accurate, up-to-date documentation have to be baked into the SDLC to prevent the generally small, modular teams from sprawling away from each other and developing incompatibilities.
Consolidated change awareness for operations is also a must-have. Whether changes are human or automated, they have to be tracked and relatable to service events and alerts.
The later phases of the SDLC, when cloud-native, containerized microservices are running and supporting customers, have to be monitored. A monitoring strategy is necessary to ensure effectiveness and minimize the work and noise involved.
Synthetics and client telemetry can be very useful macro-indicators of overall service performance. As with all monitoring efforts, actionability is key. Signal-to-noise in monitoring has to be measured.
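Two of these recommendations lend themselves to a short illustration: relating tracked changes to alerts, and measuring signal-to-noise. The Python sketch below is only one way to approach it. The `Change` and `Alert` records, the 30-minute lookback and the `actionable` flag are all assumptions made for the example, not fields from any particular monitoring or AIOps product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """A tracked change, whether applied by a human or by automation."""
    service: str
    applied_at: datetime
    description: str


@dataclass(frozen=True)
class Alert:
    """A monitoring alert; `actionable` records whether someone had to act."""
    service: str
    fired_at: datetime
    actionable: bool


def recent_changes(alert: Alert, changes: list[Change],
                   lookback: timedelta = timedelta(minutes=30)) -> list[Change]:
    """Changes to the alerting service applied shortly before the alert fired."""
    return [
        c for c in changes
        if c.service == alert.service
        and timedelta(0) <= alert.fired_at - c.applied_at <= lookback
    ]


def actionable_ratio(alerts: list[Alert]) -> float:
    """Share of alerts that required action: a simple signal-to-noise measure."""
    if not alerts:
        return 0.0
    return sum(a.actionable for a in alerts) / len(alerts)
```

Matching on service name and a time window is deliberately crude. It still shows why consolidated change records matter: without them, there is nothing to join an alert against.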

These are balanced recommendations, with the first two focused on how development teams engineer microservices and the last two on how IT Ops teams use monitoring and AIOps to implement actionable SLIs. The middle two recommendations on knowledge and change management processes help the entire organization stay in sync through a fast-paced operating environment.

“The velocity, flexibility and variability of this type of development means leaders at all levels need to understand and align the strategic and tactical goals, and to prevent their teams from drifting away from business goals,” Walker said.

Why AIOps is a Force Multiplier for IT Transformation

So, the business leaders align on service level objectives, development teams engineer observable containerized microservices, IT Ops executes a monitoring strategy and the CIO ensures communication, collaboration and knowledge sharing. Is that all that’s required?

The issue is that most IT Ops teams are understaffed and get overwhelmed supporting the new cloud-native microservices, legacy systems and everything in between. To address the gap, IT leaders are investing in AIOps, and even hundred-year-old enterprises are successful at adopting machine learning and automation to accelerate IT Ops.

Automation and machine learning event correlation applied to monitors, alerts and observable artifacts are the force multipliers. Open-box machine learning enables IT Ops to triage incidents and improve their mean time to resolution, while automation reduces manual efforts and keeps teams in sync. Organizations that are modernizing applications and supporting hybrid clouds require these capabilities to manage the complexities, run at business speed and manage databases, microservices and applications to higher service level objectives.
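To give a feel for what event correlation buys an understaffed team, the Python sketch below groups raw alerts into incidents. It is not machine learning: it is a simple rule-based stand-in that merges alerts for the same service when they fire within a few minutes of each other. The function name, the tuple layout and the five-minute gap are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def group_alerts(alerts: list[tuple[str, datetime]],
                 gap: timedelta = timedelta(minutes=5)
                 ) -> list[tuple[str, list[datetime]]]:
    """Group (service, fired_at) alerts into incidents.

    An alert joins the latest incident for its service if it fired within
    `gap` of that incident's most recent alert; otherwise it opens a new one.
    """
    incidents: list[tuple[str, list[datetime]]] = []
    latest_by_service: dict[str, list[datetime]] = {}
    for service, fired_at in sorted(alerts, key=lambda a: a[1]):
        current = latest_by_service.get(service)
        if current is not None and fired_at - current[-1] <= gap:
            current.append(fired_at)
        else:
            current = [fired_at]
            latest_by_service[service] = current
            incidents.append((service, current))
    return incidents
```

Even this crude rule turns a burst of repeated alerts into a single item to triage. Production AIOps platforms correlate on far richer signals, such as topology, tags and change records, but the goal is the same: fewer, more actionable incidents.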

As Walker noted, increasing velocity is important for responding to customer opportunities and changing conditions. Driving faster digital transformations is more achievable today when IT Ops leverages automation and AIOps to stabilize speed with reliability and performance.