DevOps.com

What Is Resilience Engineering?

By: Chris Riley on August 21, 2020

Admitting things will go wrong isn’t easy for anyone or any team. But modern engineering practices have moved beyond this fear, giving birth to a new practice in DevOps and site reliability engineering (SRE) known as resilience engineering.

Resilience engineering today isn’t thought of as a function. However, just as DevOps was a description of culture before it was a role, and site reliability was an extension of operations before it was a focus, I wouldn’t be surprised if resilience engineering became a function in the near future. The first question most will ask, however, is, “Isn’t this just SRE?” The purpose of the term is to change the focus from simply reacting to incidents to developing long-term response strategies for them.

Because the expectation in these environments is that things will break, resilience is the responsibility of existing DevOps and cloud operations teams. When applications and services do break, a “fly by the seat of your pants” response strategy will not work.

A Focus on Frameworks

Resilience engineering, while rooted in engineering practices, is largely focused on building strategies and a framework for their execution. The process of building resilience into a system remains largely unestablished, in part because each system is unique. How you respond to issues in that system will likely be unique too, even if the management plane that reports issues is not.

The job of resilience engineering is to:

Establish procedures, habits and decision trees

When things break, fight is the only option. Operators and on-call engineers need to address issues in a systematic and repeatable way and do their best to remove emotion and fear from the equation. This not only helps triage and resolve issues, but it also makes sure the activity associated with the issue leads to meaningful insights in post mortems and future collaboration. Part of that is establishing habits and decision-making processes for those who are on call. These processes help prioritize what to focus on and help catch details, because details are critical.
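As an illustration of what a codified decision tree for on-call triage might look like, here is a minimal Python sketch. The questions, the 5% threshold, the actions and all names (`Incident`, `Node`, `TRIAGE_TREE`, `decide`) are hypothetical and not taken from any particular tool; the point is only that the same questions get asked in the same order every time, and that the path taken is recorded for the post mortem.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union


@dataclass
class Incident:
    customer_facing: bool
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    recent_deploy: bool    # True if a deploy shortly preceded the incident


@dataclass
class Node:
    question: str
    test: Callable[[Incident], bool]
    if_yes: Union["Node", str]   # a leaf is a plain string describing the action
    if_no: Union["Node", str]


def decide(tree: Node, incident: Incident) -> Tuple[str, List[Tuple[str, bool]]]:
    """Walk the tree; return the chosen action and the questions answered on the way."""
    path = []
    current = tree
    while isinstance(current, Node):
        answer = current.test(incident)
        path.append((current.question, answer))
        current = current.if_yes if answer else current.if_no
    return current, path


# Hypothetical tree: every question, threshold and action here is an assumption.
TRIAGE_TREE = Node(
    question="Is the issue customer-facing?",
    test=lambda i: i.customer_facing,
    if_yes=Node(
        question="Was there a deploy shortly before the incident?",
        test=lambda i: i.recent_deploy,
        if_yes="roll back the latest deploy",
        if_no=Node(
            question="Is the error rate above 5%?",
            test=lambda i: i.error_rate > 0.05,
            if_yes="page the secondary on-call and open an incident channel",
            if_no="investigate and post regular status updates",
        ),
    ),
    if_no="file a ticket for business hours",
)

action, path = decide(TRIAGE_TREE, Incident(True, 0.2, False))
print(action)
for question, answer in path:
    print(f"  {question} -> {'yes' if answer else 'no'}")
```

Keeping the recorded path alongside the action is a deliberate choice: it gives the post mortem a factual account of why a decision was made, without relying on anyone's memory of a stressful moment.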

Be data-driven

Resilience engineering must rely on data. This is another place where traditional SRE practices grow with a focus on resilience. In typical SRE environments, the focus is on the now, using real-time dashboards of the current state. When you think about resilience, you are thinking about the past’s impact on the future, not the now. The only way to do that is to make sure the data supports it; thus, part of resilience engineering is making sure the data is there. Instead of the data silos that most organizations have across their delivery chain, resilience engineering should ensure that telemetry across the entire delivery chain is captured, correlated and shared. This is critical because what happens early in the delivery chain directly impacts incidents. That activity can be the source of answers, the trigger for a rollback, or the clarity needed to prevent similar issues in the future. Without continuity between each stage of the delivery chain, it’s easy to miss correlated events that can lead to more systemic problems.
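To make the idea of correlating telemetry across the delivery chain concrete, here is a small Python sketch. It assumes that events from every stage (build, test, deploy) are already collected in one place with a service name and a timestamp; the `Event` and `Incident` types, the `correlate` function and the two-hour window are all hypothetical choices for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class Event:
    stage: str          # e.g. "build", "test", "deploy"
    service: str
    timestamp: datetime
    detail: str


@dataclass(frozen=True)
class Incident:
    service: str
    started_at: datetime


def correlate(incident: Incident, events: List[Event],
              window: timedelta = timedelta(hours=2)) -> List[Event]:
    """Return same-service events from the `window` before the incident, newest first."""
    earliest = incident.started_at - window
    related = [
        e for e in events
        if e.service == incident.service
        and earliest <= e.timestamp <= incident.started_at
    ]
    return sorted(related, key=lambda e: e.timestamp, reverse=True)


# Made-up sample data, purely for illustration.
events = [
    Event("test", "checkout", datetime(2020, 8, 21, 9, 40), "flaky integration test retried"),
    Event("deploy", "checkout", datetime(2020, 8, 21, 10, 5), "release rolled out"),
    Event("deploy", "search", datetime(2020, 8, 21, 10, 10), "release rolled out"),
    Event("build", "checkout", datetime(2020, 8, 20, 16, 0), "dependency upgraded"),
]
incident = Incident("checkout", datetime(2020, 8, 21, 10, 30))

for e in correlate(incident, events):
    print(e.timestamp.isoformat(), e.stage, e.detail)
```

A query like this only works when the stages share identifiers such as the service name; that shared key is exactly what data silos tend to lack.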

Engineering out of reproducible incidents

For most, the best part of resilience engineering is taking what is learned from previous incidents and finding ways to automate future resolution. Learning from data and having consistent habits make it possible to create runbooks and automate remediation for known issues. Often, incident response audit trails read like playbooks for addressing issues of a particular type. When the resolution is not directly related to code and the issue is bound to surface again, building the intelligence to address it automatically saves waking someone up at midnight and shortens the impact on customers.
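One way to picture automated remediation for known issues is a registry that maps an incident signature to a runbook step, falling back to a human when no runbook exists. The Python sketch below is an assumption-laden illustration: the signatures (`disk_full`, `stuck_worker`), the `runbook` decorator and the `respond` function are hypothetical, and the remediation functions only return a description instead of touching a real system.

```python
from typing import Callable, Dict

# Hypothetical registry: incident signature -> remediation function.
RUNBOOKS: Dict[str, Callable[[str], str]] = {}


def runbook(signature: str):
    """Decorator that registers a remediation function for a known issue."""
    def register(func: Callable[[str], str]) -> Callable[[str], str]:
        RUNBOOKS[signature] = func
        return func
    return register


@runbook("disk_full")
def clear_temp_files(host: str) -> str:
    # Placeholder: a real step would run the cleanup learned from past incidents.
    return f"cleared temporary files on {host}"


@runbook("stuck_worker")
def restart_worker(host: str) -> str:
    # Placeholder: a real step would restart the affected process.
    return f"restarted worker on {host}"


def respond(signature: str, host: str) -> str:
    """Run the registered runbook, or escalate to a person if the issue is unknown."""
    handler = RUNBOOKS.get(signature)
    if handler is None:
        return f"no runbook for '{signature}'; paging on-call for {host}"
    return handler(host)


print(respond("disk_full", "web-1"))
print(respond("cert_expired", "web-1"))
```

The fallback matters as much as the automation: anything that is not yet a known, reproducible issue still reaches a person, and its audit trail becomes the raw material for the next runbook.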

The Resilience Stack

Organizations looking to embrace resilience engineering need to have a toolkit built for it. It’s pretty straightforward but worth highlighting. The resilience stack will include:

  1. Observability and/or monitoring tool.
  2. Incident response tool.
  3. On-call strategy documentation.
  4. Post mortem process and documentation.
  5. Documented processes including recommended response steps.
  6. Path to automation.

For those with a relatively mature and automated environment, the next step is chaos engineering—embracing chaos as a way to get ahead of incidents before they happen in the wild.

Get your Markdown skills to the maximum level. There is a lot of documenting that needs to happen with comprehensive resilience engineering. But these documents should not be shelfware; they should be living and ultimately lead to the implementation of automation or feedback to development.

As for monitoring, observability and incident response tools, it’s not enough to simply implement them. Knowing how data will be collected, consumed and actualized is also necessary. So is establishing an on-call strategy with purpose, not just because having everyone on-call is the “cool thing to do.”

It may seem obvious to say that organizations want system resiliency. What is not obvious is how to execute it. When organizations see the gaps (and are often embarrassed by them), they understand that resilience is a focus for either current functions or new ones in the future. That is why it’s worthwhile to talk about resilience engineering and what makes it effective.
