A Strategic SRE Assessment is critical in helping organizations successfully transform traditional IT operations to SRE
Site reliability engineering (SRE) is rapidly becoming more important to organizations that need to scale IT operations while maintaining or improving reliability and security, and that need to modernize operations to keep pace with the growth of DevOps. Without thorough consideration of best practices for people, processes and technologies, the transformation from traditional IT operations to SRE is unlikely to succeed. Despite the challenges, the transformation to SRE is compelling because it provides an opportunity to reduce total operations costs and improve security and reliability while scaling services.
To be successful, IT leaders need a comprehensive understanding of SRE best practices, along with a well-engineered yet affordable strategy and road map that effectively implements those practices and balances the many aspects needed to transform people, processes and technology. A Strategic SRE Assessment approach, based on best practices, is a key tool to help leaders gain that understanding, shape the strategy and create a road map that will successfully transform traditional IT operations to SRE.
There is no “standard” SRE assessment approach in the industry. This article describes a practical and proven Strategic SRE Assessment approach. The heart of the approach is a survey of SRE best practices categorized under nine pillars: Culture, Work Sharing, Toil Reduction, SLAs/SLOs/SLIs, Measurements, Anti-Fragility, Deployments, Performance Monitoring and Incident Management. The survey collects importance and capability scores for each SRE practice, establishing the organization’s current state and preferences for each of the nine pillars.
The following are examples of best practices for each pillar that are part of the Strategic SRE Assessment survey:
Culture – Leaders need to understand and sponsor a clear vision for SRE.
- The culture embraces working practices that proactively shift the “Wisdom of Production” left.
- The culture encourages learning from failure and continuous improvement.
- The SRE role is clearly defined and understood and SREs are considered a vital part of the organization.
- Rules of engagement and principles for how SRE teams interact with their environment are codified—not only the production environment but also the product development teams, testing teams, users and so on. Those rules and work practices help to maintain SRE focus on engineering work as opposed to operations work.
Work Sharing – Successful SRE requires that work be shared between SRE teams and development.
- SRE teams are composed of software developers with strong operations knowledge or IT operations people with strong software development skills.
- SRE workloads are managed to a budget with typically 50% or less spent on operations work.
- SRE on-call work budgets are typically 25% or less. Developers share on-call work.
- Technical debt is paid down in small increments.
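To make the budget idea concrete, here is a minimal sketch of how a team might track time against the operations and on-call budgets above; the categories, thresholds and hours are purely illustrative assumptions, not part of the assessment itself.

```python
# Sketch: checking SRE time allocation against the operations and on-call
# budgets described above. Categories, thresholds and hours are illustrative.

OPS_BUDGET = 0.50      # at most 50% of time on operations work
ONCALL_BUDGET = 0.25   # at most 25% of time on on-call work

def budget_warnings(hours_by_category: dict[str, float]) -> list[str]:
    """Return a list of budgets exceeded in the period, if any."""
    total = sum(hours_by_category.values())
    warnings = []
    if hours_by_category.get("operations", 0) / total > OPS_BUDGET:
        warnings.append("operations work exceeds the 50% budget")
    if hours_by_category.get("on_call", 0) / total > ONCALL_BUDGET:
        warnings.append("on-call work exceeds the 25% budget")
    return warnings

# Example period: 40h engineering, 90h operations, 30h on-call.
print(budget_warnings({"engineering": 40, "operations": 90, "on_call": 30}))
# ['operations work exceeds the 50% budget']
```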
Toil Reduction – SREs work to reduce non-value-added work through automation and standards.
- Work standardization and automation—of both tools and processes—are employed to improve scalability, repeatability and other important qualities.
- Work standardization is used to reduce time spent on non-value-added work.
- Automation is used to reduce time spent on non-value-added work.
SLAs/SLOs/SLIs – Service level metrics, based on user perspectives, combined with error budget policies, are critical to managing flow and reliability.
- SLOs are selected to measure availability, performance and other metrics in terms that matter to an end user.
- An internal availability service level objective (SLO) and a corresponding error budget (error budget = 1 − availability SLO) are established for each service. SLOs are set to be more stringent than the external service level agreements (SLAs) promised to customers. The SLO budget considers what level of availability users will accept, the alternatives available to dissatisfied users, and how the product is used at different availability levels.
- An error budget policy prescribes consequences if an error budget is spent. For example, the service freezes changes (except urgent security and bug fixes addressing any cause of the increased errors) until either the service has earned back room in the budget or the count period resets.
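As a concrete illustration of the error budget arithmetic above, here is a minimal sketch; the SLO value, request counts and function name are hypothetical.

```python
# Sketch of error-budget arithmetic for an availability SLO.
# The SLO value and request counts below are illustrative only.

SLO_AVAILABILITY = 0.999                 # internal SLO, stricter than the external SLA
ERROR_BUDGET = 1.0 - SLO_AVAILABILITY    # fraction of requests allowed to fail

def budget_remaining(total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent for the current period."""
    allowed_failures = total_requests * ERROR_BUDGET
    if allowed_failures == 0:
        return 0.0
    return max(0.0, 1.0 - failed_requests / allowed_failures)

# Example: 10 million requests this period, of which 7,500 failed.
print(budget_remaining(10_000_000, 7_500))   # ~0.25: 75% of the budget is spent
```

When the remaining budget approaches zero, the error budget policy above takes effect, for example by freezing non-urgent changes until the budget recovers.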
Measurements – Smart monitoring implements observability to accelerate incident response.
- Monitoring solutions are designed to produce service level indicators (SLIs) from which SLOs can be calculated automatically.
- Alerting does not require a human to interpret any part of the alerting domain. Instead, observability software does the interpreting, and humans are notified only when they need to act.
- Monitoring systems use telemetry and instrumentation to ensure the data being monitored is accurate, relevant and available in real-time.
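The shift from human interpretation to software interpretation can be sketched as a simple burn-rate check; the threshold and function names below are assumptions for illustration, not part of the assessment approach.

```python
# Sketch: derive an availability SLI from raw counters and page a human only
# when the error budget is burning fast. Thresholds and names are illustrative.

def availability_sli(good_events: int, total_events: int) -> float:
    """SLI = fraction of events that met the service's success criteria."""
    return good_events / total_events if total_events else 1.0

def should_page(sli: float, slo: float, burn_rate_threshold: float = 14.4) -> bool:
    """Page only when the budget burn rate demands human action.
    burn_rate = (1 - sli) / (1 - slo); 14.4 is a commonly cited fast-burn
    threshold for a short alerting window against a 30-day SLO period."""
    if sli >= slo:
        return False
    return (1.0 - sli) / (1.0 - slo) >= burn_rate_threshold

# Example: SLO 99.9%, observed SLI 98.0% in the window -> burn rate ~20 -> page.
print(should_page(availability_sli(98_000, 100_000), 0.999))   # True
```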
Anti-Fragility – Systems of people, processes and technologies are continuously tested and improved to assure they are resilient enough to serve applications as they scale.
- Anti-fragility strategies are practiced to proactively assure the resilience of applications, infrastructures and pipeline services.
- Fire drills are periodically conducted to proactively determine weaknesses in systems of people, processes and technologies that could potentially impact SLOs.
- Chaos monkey or equivalent infrastructure testing strategies are used to proactively determine weaknesses in infrastructure systems that could potentially impact SLOs.
- Security testing and DevSecOps practices reduce the risk of security vulnerabilities.
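As an illustration of how fire drills and chaos testing can be guarded by SLOs, here is a hedged sketch; terminate_instance and current_sli are hypothetical hooks into a platform, not the API of any specific chaos tool.

```python
# Sketch of an SLO-guarded chaos drill: inject one failure only when enough
# error budget remains, then check whether the SLO was impacted.
# terminate_instance() and current_sli() are hypothetical platform hooks.
import random

def run_chaos_drill(instances, current_sli, terminate_instance,
                    slo=0.999, max_budget_spent=0.5):
    """Skip the drill if more than half the error budget is already spent."""
    budget_spent = (1.0 - current_sli()) / (1.0 - slo)
    if budget_spent > max_budget_spent:
        return "skipped: not enough error budget to absorb an induced failure"
    victim = random.choice(instances)
    terminate_instance(victim)            # inject the failure
    if current_sli() >= slo:              # observe the impact on the SLI
        return "passed: the system absorbed the failure"
    return f"weakness found: SLO impacted after terminating {victim}"
```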
Deployments – Deployment strategies reduce the risk of deployment failures.
- Non-emergency deployments use progressive rollouts in which changes are applied to small fractions of traffic and capacity at one time. If unexpected behavior is detected, the changes are rolled back first and diagnosed afterward to minimize mean time to recovery.
- Automation is used to implement progressive rollouts, detect problems and roll back changes safely when problems arise, because outages are primarily caused by changes to a live system (see the sketch after this list).
- Deployment strategies such as blue-green, feature flag rollouts, A/B testing and canary rollouts are used to reduce the blast radius of failed deployments.
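A minimal sketch of the progressive rollout logic described above; set_traffic_split, error_rate and rollback are hypothetical hooks into a deployment system, and the stages, soak time and error threshold are illustrative.

```python
# Sketch of an automated progressive rollout with roll-back-first handling.
# The hooks, stages, soak time and error threshold are all illustrative.
import time

def progressive_rollout(set_traffic_split, error_rate, rollback,
                        stages=(0.01, 0.05, 0.25, 1.0),
                        max_error_rate=0.001, soak_seconds=600):
    for fraction in stages:
        set_traffic_split(fraction)          # expose a small slice of traffic
        time.sleep(soak_seconds)             # let the change soak
        if error_rate() > max_error_rate:    # unexpected behavior detected
            rollback()                       # roll back first ...
            return False                     # ... diagnose afterward
    return True                              # rollout reached 100% cleanly
```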
Performance Monitoring – Proactive testing and monitoring ensure applications and infrastructures have the capacity necessary to flex and scale when needed.
- The SRE team is responsible for provisioning and monitoring utilization because utilization is a function of how a given service works and how it is provisioned.
- The SRE team is responsible for capacity planning and provisioning. Regular load testing of the system is used to correlate raw capacity (servers, disks and so on) to service capacity.
- Services are provisioned to handle a simultaneous planned and unplanned outage without making the user experience unacceptable. This results in an “N + 2” configuration, in which peak traffic can be handled by N instances (possibly in degraded mode) while two instances are unavailable.
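The “N + 2” sizing reduces to simple arithmetic, sketched below with illustrative numbers; in practice, per-instance capacity would come from the regular load testing mentioned above.

```python
# Sketch of N + 2 provisioning arithmetic: size the fleet so peak traffic is
# still served while one planned and one unplanned outage overlap.
import math

def instances_needed(peak_rps: float, capacity_per_instance_rps: float,
                     redundancy: int = 2) -> int:
    n = math.ceil(peak_rps / capacity_per_instance_rps)   # N, from load tests
    return n + redundancy                                  # N + 2

# Example: 12,000 requests/second at peak, 1,000 rps per instance -> 14 instances.
print(instances_needed(12_000, 1_000))
```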
Incident Management – Incidents are managed in an efficient, blameless manner, with the goal of using each incident to improve future responses.
- Playbooks record the best practices for human response actions because playbooks produce roughly a 3x improvement in mean time to recovery.
- Emergency response performance is measured using mean time to repair (MTTR) metrics instead of mean time to failure (MTTF) because the most relevant metric in evaluating the effectiveness of emergency response is how quickly the response team can bring the system back to health—that is, the MTTR.
- Postmortems are blameless and focus on process and technology, not people.
- Postmortems are conducted for significant incidents regardless of whether they triggered a page; postmortems of incidents that did not trigger a page are even more valuable, as they likely point to clear monitoring gaps. All root causes of the event are identified, and actions are assigned to correct the problem or to improve how it is handled next time.
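As a small illustration of the MTTR measurement above, here is a sketch that computes MTTR from incident records; the timestamps are illustrative only.

```python
# Sketch: measuring emergency response with MTTR computed from incident
# records. The (detected_at, restored_at) timestamps are illustrative.
from datetime import datetime, timedelta

def mean_time_to_repair(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """MTTR = average time from detection to restoration of service."""
    total = sum((restored - detected for detected, restored in incidents),
                timedelta())
    return total / len(incidents)

incidents = [
    (datetime(2023, 5, 1, 9, 0), datetime(2023, 5, 1, 9, 45)),    # 45 minutes
    (datetime(2023, 5, 9, 14, 0), datetime(2023, 5, 9, 14, 30)),  # 30 minutes
]
print(mean_time_to_repair(incidents))   # 0:37:30
```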
SRE Assessment Surveys Are Not Enough
The survey is an essential tool for collecting information about the current state of, and preferences for, each SRE practice. However, it is only one part of the comprehensive Strategic SRE Assessment approach.
The Strategic SRE Assessment approach starts with workshops to align leaders and teams on big-picture goals prior to conducting the survey. This is necessary so that everyone who answers the survey shares a common perspective and can respond with considered importance and capability scores for each practice. The survey results include an analysis that highlights specific pillars and individual practices with high gap scores, in which practices are rated as important but not yet practiced well. The high-priority gaps are then used to organize a value stream mapping working session designed to find the highest-priority people, process and technology bottlenecks in the value stream, relative to the organization’s goals and gaps. With the goals, gaps and value stream map data in hand, the next step is to formulate an implementation road map that directly addresses the goals, gaps and bottlenecks in the value stream.
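For illustration, the gap-score analysis at the center of this step might look like the following sketch; the 1-to-5 scoring scale and the sample practices are assumptions for the example, not prescribed by the approach.

```python
# Sketch of the gap-score analysis: practices rated important (high importance)
# but weak (low capability) rise to the top. Scale and sample data are illustrative.

def gap_scores(survey: dict[str, dict[str, int]]) -> list[tuple[str, int]]:
    """gap = importance - capability, sorted with the largest gaps first."""
    gaps = {practice: scores["importance"] - scores["capability"]
            for practice, scores in survey.items()}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)

sample = {
    "Error budget policy defined":    {"importance": 5, "capability": 2},
    "Blameless postmortems":          {"importance": 4, "capability": 4},
    "Progressive rollouts automated": {"importance": 5, "capability": 1},
}
print(gap_scores(sample))
# [('Progressive rollouts automated', 4), ('Error budget policy defined', 3),
#  ('Blameless postmortems', 0)]
```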
The final step is to obtain strategic alignment on the big-picture plan by reviewing the findings and road map with the leadership and implementation teams. The effort required for this entire Strategic SRE Assessment approach, from goal alignment through surveys, value stream mapping, road map creation and strategic alignment, varies depending on several factors, such as the size of the organization, the number of applications and the size of the deployment domains included in the assessment.
A Rapid Strategic SRE Assessment for a sample set of model applications is typically conducted in 21 days. More complex Strategic SRE Assessments involving many applications may take up to 90 days.
In most cases, the Rapid Strategic SRE Assessment is the best place to start. It is usually unwise, and too expensive, to “boil the ocean” by making changes across all applications and deployment domains at once. Starting with a small number of model applications makes it easier to implement, measure and demonstrate success, and to use the results at each step to justify the ongoing investments needed to complete the transformation across the organization. Another advantage of the Rapid Strategic SRE Assessment is that it is a quick and affordable way to checkpoint progress as SRE is implemented progressively across applications and deployment domains.
Key Takeaways for SRE Transformation Leaders
SRE offers immense value for modernizing and scaling IT operations reliably and securely, and for complementing a DevOps transformation. Adherence to SRE best practices is essential to reduce risk and assure security as the transformation is implemented.
Each organization is different and has its own requirements and priorities for its applications and deployment domains. The specific SRE practices that are the highest priority vary accordingly. The Rapid Strategic SRE Assessment approach, based on nine pillars of SRE best practices, is quick, effective and affordable.
The Rapid Strategic SRE Assessment is a preferred approach to kick-start the SRE transformation. This approach is also a preferred and affordable tool to reassess the direction as SRE is implemented progressively across the organization.
Summary
There is no “standard” SRE assessment approach in the industry. The Rapid Strategic SRE Assessment approach described in this article has been tried and proven with both large and small organizations. The author invites your comments regarding this approach and the SRE best practices.