Applications are the lifeblood of a modern business, and uptime is crucial — even more so during a pandemic. The COVID-19 crisis has shown us that we can adapt to a new normal. It has also shown us that customers will take their business elsewhere if an application experience does not meet their expectations.
The pandemic has put business-critical applications under unprecedented strain, and has forced organizations across all industries to rethink how their business continuity plans keep applications up and running. In these troubled times, it has often fallen to the humble site reliability engineers (SREs) to keep applications available and to ensure that they deliver the best possible user experience.
SREs as First Responders: Paramedics for Your Applications
When unforeseen events occur that cause issues with your applications, it’s the SREs who come to your aid. SREs are your first responders — the paramedics for your applications. They are the team on call to administer life-saving CPR to your applications to keep them alive.
During outages, SREs know what to look for, how to interpret what they see and how to work out quickly what’s wrong. By monitoring your application’s vital signs, SREs can relay crucial information to your developers and infrastructure team so that they can decide which corrective surgery to perform to restore the application’s health.
SREs are the sutures that connect the developers and infrastructure teams with the product team to ensure that your applications are available and provide the intended functionality for your end users — always.
SREs as Wellness Coaches: Preventive Care Providers for Your Applications
Thomas Edison said that genius is 1% inspiration and 99% perspiration. It’s the same for SREs: They don’t become the heroes of the hour without extensive preparation. SREs are not only the paramedics for your applications during sudden and unexpected events, but also the wellness coaches who take preventive healthcare seriously.
Like wellness coaches, SREs define and develop strategies and action plans that improve the overall wellness of applications and keep them from crashing. They set up regular application monitoring to detect issues early, so that they can formulate a plan to restore application health before users are affected.
The skill of SREs lies in their ability to wear multiple hats and translate business goals into corresponding technical details. They understand the business risk of application failure and work tirelessly on models to quantify it. And then they work with developers and product teams to understand what should be measured as a precursor to failure. They set and maintain service-level objectives (SLOs) aligned with business needs and monitor them constantly to spot and analyze trends.
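To make the idea of monitoring an SLO for trends concrete, here is a minimal sketch in Python. It assumes a request-based availability objective of 99.9%; the `Window` type, the function names and the sample numbers are all hypothetical and chosen only for illustration, not taken from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts for one measurement window (hypothetical data shape)."""
    label: str
    total_requests: int
    failed_requests: int


def availability(window: Window) -> float:
    """Fraction of requests that succeeded in the window."""
    if window.total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return 1 - window.failed_requests / window.total_requests


def windows_below_slo(windows: list[Window], slo_target: float) -> list[str]:
    """Labels of the windows whose availability fell below the SLO target."""
    return [w.label for w in windows if availability(w) < slo_target]


def is_degrading(windows: list[Window]) -> bool:
    """True if availability dropped in every consecutive window."""
    values = [availability(w) for w in windows]
    return all(later < earlier for earlier, later in zip(values, values[1:]))


SLO_TARGET = 0.999  # assumed objective: 99.9% of requests succeed

windows = [
    Window("mon", 100_000, 50),
    Window("tue", 100_000, 90),
    Window("wed", 100_000, 150),
]

print(windows_below_slo(windows, SLO_TARGET))
print(is_degrading(windows))
```

In this made-up data, only the last window misses the objective, but the steady decline across all three is the kind of trend an SRE would want to catch before the miss happens.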
SREs determine error budgets, the amount of unreliability an SLO allows over a given period, so that they know how bad things can get before an issue becomes a problem. And, when something is trending out of bounds, they help diagnose which services are causing the issue, define corrective actions and set the thresholds that trigger auto-scaling to meet fluctuations in demand.
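The arithmetic behind an error budget is simple enough to sketch. The example below assumes a time-based availability SLO measured over a 30-day period; the function names and the downtime figure are hypothetical, and real teams may define their budgets in requests rather than minutes.

```python
def error_budget_minutes(slo_target: float, period_days: int = 30) -> float:
    """Minutes of downtime the SLO tolerates over the period."""
    return (1 - slo_target) * period_days * 24 * 60


def budget_remaining(slo_target: float, downtime_minutes: float,
                     period_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, period_days)
    return (budget - downtime_minutes) / budget


# Assumed example: a 99.9% SLO with 10.8 minutes of downtime so far this period.
budget = error_budget_minutes(0.999)
remaining = budget_remaining(0.999, downtime_minutes=10.8)

print(f"budget: {budget:.1f} minutes")   # budget: 43.2 minutes
print(f"remaining: {remaining:.0%}")     # remaining: 75%
```

A 99.9% objective leaves about 43 minutes of downtime in a 30-day month, which is why each extra "nine" in an SLO is a business decision as much as a technical one.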
A Thousand Salutes in a Hundred Languages
SREs are, by definition, the proactive enforcers of business continuity. Their role is quite literally to prevent disasters from affecting the business by anticipating the unanticipated. They spend their days (and nights) looking at architectures and applications to assess what can go wrong and to determine how problems can be mitigated. In a pandemic, it is so often the emergency and preventive work of the SRE team that saves the business.
SREs are, indeed, the unsung heroes, and we should salute them. So, SREs across the world: thank you, dankie, gracias, merci, danke, dhanyavād, grazie, xièxiè, dziękuję, спасибо, ευχαριστώ, arigatô, obrigado, mulţumesc. Have you thanked your SRE today?