SRE

US DoJ Makes PyPI Give Up User Data ¦ Tape Storage: Not Dead

Richi Jennings | May 25, 2023 | backup, backup and recovery, backups, Canary, data backup, databackup, database backup, HPE, ibm, LTO, PyPi, Python, Python Software Foundation, Quantum Corporation, Ransomware, storage, tape, The Long View

In this week’s #TheLongView: PyPI complies with a “string of subpoenas,” and LTO continues to grow, despite predictions of its demise ...

Predicting, Preventing and Resolving Incidents With AIOps

Simon Nadar | April 13, 2023 | AIOps, IT operations, outages, resilience, SRE

IT operations teams, site reliability engineers (SREs) and service providers are on a mission to scale across geographies, expand their digital services and create new experiences for customers. Amid this drive, their ...

Linux Tweak Brings Big Speedup ¦ DCs in SPAAACE (Redux) ¦ Atlassian Fires 500

Richi Jennings | March 8, 2023 | Atlassian, data center, disaster recovery, edge computing, hyperthreading, layoffs, linux, Lonestar Data Holdings, moon, performance, The Long View, Thread

In this week’s #TheLongView: Intel optimizes Linux multithreaded networking, data centers in space (again), and more DevOps layoffs ...

Voice.ai ‘Stole’ Code ¦ AWS Gets Filthier

Richi Jennings | February 8, 2023 | Amazon, AWS, Bloom Energy, CO2 emissions, data center, data centers, datacenter, gpl, Green data center, LGPL, open source licensing, The Long View, voice.ai

In this week’s #TheLongView: Alleged theft of GPL code, and Amazon will run its data centers on gas ...

Why SREs Are Critical to DevOps

Nahla Davies | January 5, 2023 | adopting DevOps, DevOps practices, site reliability engineering, SRE

Although site reliability engineering is a relatively new concept, site reliability engineers (SREs) have become crucial for DevOps teams, helping to solve an array of operational problems such as network availability and user experience. However, in ...

Agile Sucks (Redux) | Plus: DevOps on Mars

Richi Jennings | August 18, 2022 | agile, Mars, NASA, scrum, The Long View, wagile, waterfall

In this week’s The Long View: Agile is bad, but “Wagile” is worse. Plus: This prod is even worse than yours ...

The Rogers Outage of 2022: Takeaways for SREs

JP Cheung | August 15, 2022 | incident response, Rogers outage, site reliability, SRE

When, eight years from now, folks are creating lists of the top IT incidents of the 2020s, there's a good chance that they'll include the Rogers outage of 2022. The failure, which ...

5 Ways to Prevent an Outage

Ashley Stirrup | August 15, 2022 | AWS outage, outage, site reliability engineering, SRE

In today’s always-on, ever-connected world, we all expect 100% availability. What gets in the way of this? The devil is in the details. Over time, everything breaks: Disks, nodes, containers, networks, DNS ...

Why More Incidents Are Better

Andre King | July 15, 2022 | application performance management, incident response, site reliability engineering, SRE

Ask most SREs how many incidents they’d have to respond to in a perfect world, and their answer would probably be 'zero.' After all, making software and infrastructure so reliable that incidents ...

5 Mean-Time Reliability Metrics To Follow

Bill Doerrfeld | July 7, 2022 | mean time to repair, mean time to restore, metrics, MTTX, SRE

Most folks working in DevOps or SRE roles are familiar with metrics like mean-time-to-recovery (MTTR). Keeping track of the average time a team takes to respond to incidents is crucial to identifying ...

Site Reliability Engineering (SRE) Comes of Age in 2022

Bill Doerrfeld | January 25, 2022 | High availability, resilience, site reliability engineering, SRE

The site reliability engineer (SRE) role is still gathering steam across organizations. In January 2022, LinkedIn listed SRE as the 21st job with the highest global demand throughout the past five years ...

CNCF Takes LitmusChaos Platform to the Incubation Level

Mike Vizard | January 11, 2022 | chaos engineering, CNCF, DevOps workflows, resiliency testing, SRE

The technical oversight committee (TOC) for the Cloud Native Computing Foundation (CNCF) announced today it is elevating the open source LitmusChaos application testing platform to the incubation level. LitmusChaos is a chaos ...