SRE

US DoJ Makes PyPI Give Up User Data ¦ Tape Storage: Not Dead
In this week’s #TheLongView: PyPI complies with a “string of subpoenas,” and LTO continues to grow, despite predictions of its demise ...

Predicting, Preventing and Resolving Incidents With AIOps
IT operations teams, site reliability engineers (SREs) and service providers are on a mission to scale across geographies, expand their digital services and create new experiences for customers. Amid this drive, their ...

Linux Tweak Brings Big Speedup ¦ DCs in SPAAACE (Redux) ¦ Atlassian Fires 500
In this week’s #TheLongView: Intel optimizes Linux multithreaded networking, data centers in space (again), and more DevOps layoffs ...

Voice.ai ‘Stole’ Code ¦ AWS Gets Filthier
In this week’s #TheLongView: Alleged theft of GPL code, and Amazon will run its data centers on gas ...

Why SREs Are Critical to DevOps
Although site reliability engineering is a relatively new concept, site reliability engineers (SREs) have become crucial for DevOps teams, helping to solve an array of operational problems in areas such as network availability and user experience. However, in ...

Agile Sucks (Redux) | Plus: DevOps on Mars
In this week’s The Long View: Agile is bad, but “Wagile” is worse. Plus: This prod is even worse than yours ...

The Rogers Outage of 2022: Takeaways for SREs
When, eight years from now, folks are creating lists of the top IT incidents of the 2020s, there’s a good chance that they’ll include the Rogers outage of 2022. The failure, which ...

5 Ways to Prevent an Outage
In today’s always-on, ever-connected world, we all expect 100% availability. What gets in the way of this? The devil is in the details. Over time, everything breaks: Disks, nodes, containers, networks, DNS ...

Why More Incidents Are Better
Ask most SREs how many incidents they’d have to respond to in a perfect world, and their answer would probably be “zero.” After all, making software and infrastructure so reliable that incidents ...

5 Mean-Time Reliability Metrics To Follow
Most folks working in DevOps or SRE roles are familiar with metrics like mean-time-to-recovery (MTTR). Keeping track of the average time a team takes to respond to incidents is crucial to identifying ...

Site Reliability Engineering (SRE) Comes of Age in 2022
The site reliability engineer (SRE) role is still gathering steam across organizations. In January 2022, LinkedIn ranked SRE 21st among the jobs with the highest global demand over the past five years ...

CNCF Takes LitmusChaos Platform to the Incubation Level
The technical oversight committee (TOC) for the Cloud Native Computing Foundation (CNCF) announced today it is elevating the open source LitmusChaos application testing platform to the incubation level. LitmusChaos is a chaos ...