Tag: site reliability engineering
Focus on Basics or Innovation? DevOps Success Requires Both
DevOps teams are experiencing a chasm between traditionalists and innovators — a rift that grows as pressure mounts for teams to leverage the latest advancements. Traditionalists are fighting to preserve the basics, ...
Building an Open Source Observability Platform
By investing in open source frameworks and LGTM tools, SRE teams can effectively monitor their apps and gain insights into system behavior ...
Harnessing AI for Automated and Toil-Free SRE
AI not only reduces toil but also contributes to improving system reliability, efficiency and scalability, forming a critical part of modern SRE practices ...
Revolutionizing the Nine Pillars of SRE With AI-Engineered Tools
In my blog post "Rapid Strategic SRE Assessments Accelerate IT Transformations," published last year, I classified site reliability engineering (SRE) into nine pillars of SRE practices, a comprehensive framework that covers the full scope ...
Why SREs Are Critical to DevOps
Although site reliability engineering is a relatively new concept, site reliability engineers (SREs) have become crucial for DevOps teams, helping to solve an array of operational problems such as network availability and user experience. However, in ...
Best of 2022: Day in the Life of a Site Reliability Engineer (SRE)
As we close out 2022, we at DevOps.com wanted to highlight the most popular articles of the year. Following is the latest in our series of the Best of 2022. By now, ...
SRE Survey Reveals Major Technical and Cultural Challenges
Catchpoint, in partnership with Blameless, today published an annual survey of 559 site reliability engineers (SREs) that found 59% of respondents didn't view tool sprawl as a major concern. Another 40% ...
Scaling Predictive Analytics With AIOps to Drive Next-Gen SRE
Enterprise systems are only as valuable as they are reliable, in the sense that they don’t suffer excessive breakdowns. Otherwise, companies experience costly downtime and added stress for engineers due to the ...
5 Ways to Prevent an Outage
In today’s always-on, ever-connected world, we all expect 100% availability. What gets in the way of this? The devil is in the details. Over time, everything breaks: Disks, nodes, containers, networks, DNS ...
Why More Incidents Are Better
Ask most SREs how many incidents they’d have to respond to in a perfect world, and their answer would probably be 'zero.' After all, making software and infrastructure so reliable that incidents ...
How to Adopt an SRE Practice (When You’re not Google)
Site reliability engineering (SRE) isn’t a new term or practice. The practice of applying software engineering skills and principles to operations problems and tasks happened even before site reliability engineer was a ...
The Pros and Cons of Embedded SREs
To embed or not to embed: That is the question. At least, that’s one of the questions that companies have to answer as they decide how to implement site reliability engineering. They ...