Yes, you read that correctly: SRE means that strong application performance (speed, reliability and availability) is no longer simply a best practice. It is a must. Site reliability engineering, or SRE, is a term Google coined several years ago, and while there is no universal definition, SRE can be thought of as a discipline that blends software engineering with IT operations management. At the heart of the SRE model is a site reliability engineer who knows how to code as well as handle ops. The SRE concept is similar to DevOps, which is intended to bridge the divide between development and operations teams, but until now SRE has not caught on the way DevOps has.
The Need for SRE
The rise of the digital economy is now propelling end-user application performance to the top of the IT priority list. According to a recent EMA survey, 55 percent of IT professionals rank end-user experience monitoring as their top application management priority. At the same time, growing IT complexity, including the cloud and multi-platform applications, is making it more difficult than ever to ensure strong user performance and identify the root source of problems before they escalate. We believe this performance imperative is driving the SRE resurgence, giving DevOps teams the necessary “teeth” to govern and resolve performance-related disconnects swiftly and decisively.
Performance issues traditionally have been a lightning rod of conflict for even the most well-oiled DevOps teams. Developers, relentlessly focused on rolling out new products and features, keep throwing their “completed” work over the fence to operations. Operations teams, conversely, are focused on minimizing change and maintaining stability as much as possible. These opposing desires naturally create friction.
Performance problems only exacerbate this. Developers, who have already moved on to the next must-have feature, say it’s operations’ fault. Operations points the finger at the developers, saying that poorly written code is to blame. No one assumes responsibility, and the development and operations teams are stuck in a stalemate. Meanwhile, users are suffering, the call center is being inundated, revenues likely are circling the proverbial drain and the situation is quickly reaching a boiling point.
SRE is valuable because it finally brings an end to this suffering through a set of clear, irrefutable, mathematically based rules to govern all DevOps efforts and decisions. If a new piece of software achieves a targeted performance level, further rollouts can proceed. If not, future releases will halt until the first piece of software is fixed. The site reliability engineer, who is well-versed in both development and operations, serves as an independent arbiter who provides a clear assessment of where the root cause lies—on the development or operations side of the house. With guidelines firmly established and a central, unbiased person in charge of determining the right course of action, the developer/operations friction can be avoided and problem resolution can commence.
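The release rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not say which metric or threshold is used, so the class name `ReleaseGate`, the choice of request availability as the measured performance level, and the 99.9 percent target are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReleaseGate:
    """Hypothetical release gate; the metric and threshold are assumptions."""

    # Agreed-upon performance level, e.g. 0.999 means 99.9% of requests succeed.
    target_availability: float

    def measured_availability(self, successful: int, total: int) -> float:
        """Fraction of requests that succeeded in the measurement window."""
        if total <= 0:
            raise ValueError("total must be positive")
        return successful / total

    def may_release(self, successful: int, total: int) -> bool:
        """Rollouts proceed only if the measured level meets the target."""
        return self.measured_availability(successful, total) >= self.target_availability


gate = ReleaseGate(target_availability=0.999)
print(gate.may_release(successful=99_950, total=100_000))  # True: 0.9995 meets the target
print(gate.may_release(successful=99_800, total=100_000))  # False: 0.998 misses it, releases halt
```

The point of writing the rule down this way is that the decision depends only on an agreed number and a measurement, not on which team argues more persuasively.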
In an SRE model, developers and IT operations also share a staffing pool. If there’s budget for one hire, and a product is not delivering an agreed-upon performance level, the hire goes to IT operations. If the product is delivering the required performance level, the hire can go to development, which will help the development team roll out new features and products.
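The shared staffing rule can be expressed the same way. The function below is a minimal sketch, assuming performance is summarized as a single number compared against the agreed-upon level; the name `allocate_hire` and its parameters are hypothetical and not taken from any real tool.

```python
def allocate_hire(measured_level: float, agreed_level: float) -> str:
    """Decide which team receives the single budgeted hire (hypothetical rule).

    If the product meets the agreed-upon performance level, the hire goes to
    development; otherwise it goes to IT operations.
    """
    return "development" if measured_level >= agreed_level else "operations"


print(allocate_hire(measured_level=0.9995, agreed_level=0.999))  # development
print(allocate_hire(measured_level=0.9980, agreed_level=0.999))  # operations
```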
The real beauty of SRE is that both sides have a vested interest in achieving strong application performance, ideally right out of the gate. If whichever side owns the problem does not fix it, that side can forfeit any hope of adding manpower to its team, at least in the near term. There’s another, more serious implication: for developers, there will be no further development until the problem is fixed; and for operations teams, less software to support may translate to lower headcount requirements.
An SRE approach has many advantages, including fostering a vested interest in performance across the entire DevOps team, and holding feet to the fire. Performance cannot be an afterthought. It must be considered and measured from the very beginning phases of ideation and development all the way through production and post-production modifications. Users don’t care how fast or efficient your DevOps team was if the ultimate work product does not perform. In fact, this means the whole effort likely was in vain. Like a strict adjudicator, SRE may be just what DevOps teams need to bring an end to performance-related squabbles once and for all.
About the Author / Mehdi Daoudi
Mehdi Daoudi is CEO and co-founder of Catchpoint Systems, provider of web performance testing and monitoring solutions. Before Catchpoint Systems, he spent 10+ years at DoubleClick and Google, where he was responsible for Quality of Services, buying, building, deploying, and using monitoring solutions to keep an eye on an infrastructure that delivered billions of transactions daily. Connect with him on LinkedIn and Twitter.