By adopting a multilevel approach to site reliability engineering and arming your team with the right tools, you can unleash benefits that impact the entire service-delivery continuum.
In today’s application-driven economy, the infrastructure supporting business-critical applications has never been more important. In response, many companies are recruiting site reliability engineering (SRE) specialists to help them improve reliability, availability, latency and a host of other metrics that impact customer satisfaction. These new hires focus on keeping sites and services running in peak form.
There are clear limitations to this operations focus, though. A business-critical service is made up of many application components, and any one of them can affect its reliability. How do you ensure reliability is baked into each app and each code change if your engineers focus only on backend performance?
Some trailblazers are beginning to tackle this issue head-on, tasking their SREs with taking on new roles that support key stages of the DevOps life cycle. The goal is to take early, proactive steps to ensure quality and reliability are built in from the beginning.
A Path Pioneered by Test Teams
To understand why this “shift left” is so important, consider recent revolutionary changes in the software testing function. Not that long ago, testing was considered a distinct phase in the software life cycle, timed to begin well after the work of application developers was done. Testing Centers of Excellence uncovered defects and passed code back to the development team for rework.
As developers moved to agile sprints, though, this approach to testing quickly became a bottleneck. By the time Centers of Excellence uncovered issues, developers had moved on to other projects. In response, new continuous testing programs emerged, staffed by a new breed of software development engineers who had honed their testing skills.
These individuals are now embedded with development teams to ensure applications are built better from the start. Open source test automation tools help them test early and often, without impeding agile delivery schedules.
It’s important to note that Testing Centers of Excellence haven’t gone away. Instead, they have taken on a broader leadership role and are now responsible for coordinating testing across the enterprise in support of continuous software integration and delivery.
A 3-Level Model for SRE
There is much that SRE teams can learn from the experiences of their testing colleagues. To produce optimal results, it is important to integrate reliability experts into your organization’s overarching agile framework. Consider adopting the following three-level model that shifts the function to the left and integrates skilled engineers at key points in the DevOps life cycle:
- Application level. Embed SREs with development and testing teams to ensure application components and new releases won’t disrupt reliability. These individuals focus on application-level service objectives, error budgets and integration into the DevOps pipeline.
- System level. Assign SREs to focus on the applications in your organization’s release train. These individuals support release and launch coordination, evaluate system architecture readiness and ensure your organization meets systemwide service-level objectives.
- Enterprise level. Establish a Site Reliability Engineering Center of Excellence, staffed by engineers who oversee the governance of your enterprise architecture. These individuals establish best practices and select the tools and resources required to support your site reliability function companywide.
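To make the application-level idea of an error budget concrete, here is a minimal sketch of the arithmetic, assuming a request-based service-level objective. The class name, field names and the release gate are hypothetical illustrations, not part of any particular SRE tool: the budget is simply the share of requests the objective allows to fail, and a team might hold back risky releases once that share is spent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Error budget for one service over one measurement window (hypothetical model)."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int   # requests served in the window
    failed_requests: int  # requests that violated the objective

    @property
    def allowed_failures(self) -> float:
        # The budget: how many requests may fail before the SLO is missed.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means the budget is untouched; 0.0 or below means it is spent.
        allowed = self.allowed_failures
        if allowed == 0:
            return 0.0 if self.failed_requests else 1.0
        return 1.0 - self.failed_requests / allowed

    def release_allowed(self, min_remaining: float = 0.0) -> bool:
        # Assumed policy: ship only while more than `min_remaining` of the budget is left.
        return self.remaining_fraction > min_remaining


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"Allowed failures: {budget.allowed_failures:.0f}")
    print(f"Budget remaining: {budget.remaining_fraction:.0%}")
    print(f"Release allowed: {budget.release_allowed()}")
```

An SRE embedded with a development team could run a check like this in the delivery pipeline, while system-level SREs could apply the same calculation to objectives that span several applications. The threshold and the window length are policy decisions each organization would set for itself.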
An Expanded Toolset
To fill this broadened role, your engineers will need tools that work across the entire DevOps life cycle. You will need to be able to see and track the issues engineers uncover and to evaluate system-level readiness and performance against your service-level objectives. You will also need a big-picture view that spans activity across the entire enterprise.
Fortunately, a new generation of solutions is now available to help, powered by artificial intelligence, machine learning and intelligent automation. These resources can help your team process massive amounts of data from disparate toolsets and turn that data into actionable insights. As a result, you’ll have the life cycle analytics you need to be effective in your new and broadened role.
SRE and Shift Left: Broad New Benefits
By adopting a multilevel approach to site reliability engineering and arming your team with the right tools, you can unleash benefits that impact the entire service-delivery continuum.
You can build a continuous cycle of feedback, collaboration and governance that spans design, development and the launch and operation of new services. You can better manage configuration changes, service levels and error budgets. You can get a clearer view of application reliability and production readiness to guide planning initiatives and to help you prioritize investments. And you can build better-informed business cases to move your organization forward. Shifting left is clearly a win-win move.