SRE (Part 2): A Practical Approach

In my last piece on cultural SRE, I covered the basics of defining what SRE means for your team. However, some teams may not need dedicated SRE support. This is where a central organization comes in handy. Its dedicated staff focus on building guidelines, standards, services and platforms for consumption by teams that can operate autonomously in your environment. A broadly communicated central team also lets those other teams know where to go when they need assistance or would like recommendations on their technology stack.

Application/Service SRE Versus Infrastructure SRE

Application SRE includes:

Embedding within application/service teams.
Architecture guidance for new services and infrastructure.
Design and implementation for modern technologies.
Support within the application/service team.
Automation for apps and services for the team.
Automation for operational work.
Monitoring/metrics for application/service team.
Benchmarking and performance assistance for code.
Infrastructure deployment and architecture for applications and services.
Documentation and runbooks for alerts issued by the application/service stack.

Infrastructure SRE includes:

Build and management of “plumbing” technical infrastructure (provisioning, OS, DNS, DHCP, networking, central auth, etc.).
Automation of infrastructure services (telemetry, monitoring, log aggregation, configuration management, anomaly detection, orchestration, etc.).
Build and management of consumable services and tools (message queues, databases, distributed compute farms, API services/integrations, containers, etc.).
Infrastructure as Code (IaC).
Implementation of the global IR&IM process.
Support for Application SRE teams.
Architectural guidelines and best practice documentation.

Organizational Layout

Organizational structures, whether we like to believe it or not, really help the flow of communication throughout the team. They allow the team leaders to form a single mission and have that mission carried out by their team members. I mentioned in Part 1 of my SRE Narnia guide that I believe this organization must be centralized to be successful. Below is a high-level overview that shows the separation between application SRE and infrastructure SRE.

The leads can handle multiple teams and disciplines. The structure can shrink and scale as you see fit within your organization. Application SRE is straightforward. Infrastructure SRE could be embedded across existing infrastructure teams or form a new team focused on a specific project.

Example SRE Interaction

Below is an example interaction between an application team (this one runs a Ruby app) and an SRE hierarchy. You will see a team called “Frontline Support” as the initial point of contact. This team is not required but is nice to have. It helps in any environment, big or small, and can offload a significant amount of the operational workload while keeping a global view of the issues coming in.

The duties of frontline support are to follow runbooks for alarms flowing in from the monitoring systems. It is similar to a traditional NOC but with more expertise across the stack being watched. It also allows for trend analysis and for bringing data to conversations about recurring issues or bugs in the system.

I do recommend using this team as a starting point for junior-level team members. It allows them to learn the system quickly and buddy up with more seasoned members who are either on this team or have the pager for the week. It is a great way to train and onboard your technical employees. Nothing works better than showing them what is broken.
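
To make the frontline support duties described above a little more concrete, here is a minimal Python sketch of the two jobs: pointing an incoming alarm at its runbook, and counting repeats so recurring issues can be brought to the table with data. Everything in it is a hypothetical illustration rather than a real tool: the Alert type, the RUNBOOKS index, the service and alert names, the file paths, and the threshold of three are all assumptions of mine.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str  # hypothetical service name, e.g. the Ruby app in the example
    name: str     # hypothetical alert identifier from the monitoring system


# Hypothetical runbook index: (service, alert name) -> runbook location.
RUNBOOKS = {
    ("ruby-app", "high_error_rate"): "runbooks/ruby-app/high-error-rate.md",
    ("ruby-app", "queue_backlog"): "runbooks/ruby-app/queue-backlog.md",
}


def triage(alert: Alert) -> str:
    """Return the runbook to follow, or escalate when no runbook exists."""
    runbook = RUNBOOKS.get((alert.service, alert.name))
    if runbook is None:
        # No documented procedure: hand off to whoever holds the pager.
        return "escalate: page the on-call SRE"
    return f"follow: {runbook}"


def recurring(alerts: list[Alert], threshold: int = 3) -> list[tuple[Alert, int]]:
    """Return alerts seen at least `threshold` times, most frequent first."""
    counts = Counter(alerts)
    return [(alert, n) for alert, n in counts.most_common() if n >= threshold]
```

An alert with no runbook entry is a useful signal in its own right: it either needs a runbook written by the application SRE or it should not be paging anyone. The output of recurring() is the kind of data frontline support can bring to a conversation about repeat offenders.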

Summary

Overall, everyone’s journey will be different. There are quite a few books and examples out there, but please use them only as ideas to form your own opinion on how these organizations should work. There is no prescription and nothing is ever perfect. I have left out quite a bit of information in this post, but I am always interested in having conversations about this topic and others like it. There is one consistent theme you’ll see across the board, and that’s shaping the culture. Focus on your culture and hire great talent along with great leaders. Good luck and happy hacking.

— Anthony Caiafa