How To Build a Culture of Resilience Through Good Habits

Good habits are hard to form. I’ve been listening to the audiobook “Atomic Habits” by James Clear on my morning runs, and something struck me. At Gremlin, along with our software, what we’re trying to promote are positive new habits for our customers. According to the author, one of the primary reasons new habits don’t stick is that there’s often sacrifice without immediate gratification.

Psychologically, we’re wired to want instant gratification. But not all habits give immediate rewards; in fact, many delay our gratification for some time. So how do we help ourselves pursue good habits? Put simply: We need to make them obvious, attractive and easy.

One of the ways to make them easy is to focus on what the author calls “gateway habits”—the smallest piece of the habit that can reasonably be achieved in two minutes. So if your goal is to eventually run a marathon, the gateway habit is putting on your running shoes every day.

To build a culture of resilience at your company, start small and create gateway habits. If a team runs a GameDay (time dedicated to experimenting on your systems) once a month, or even simply runs its first chaos engineering experiment, then reward that person or team with immediate recognition.

Here are some other ways to build a culture of resilience at your organization:

Recognize the change to new habits.
Create DNR (do not repeat) items.
Adopt “You build it, you own it.”
Track the four golden signals.

Recognize the Change to New Habits

Incentives are a great way of kick-starting a new habit, but they don’t necessarily sustain the good behavior. It’s identifying the improvements that result from the new habit that really makes it stick. We’ll get to some specific metrics you can track later in the article, but notice for now that identifying the improvements is when gratification starts to drive enthusiasm. Ideally, that enthusiasm grows until the new habit becomes part of your identity.

In our example of running a marathon, the moment of most significant change is when the person starts to self-identify as a runner. Then, the habit is no longer a chore, but rather part of who they are. Ideally, we want all computer engineers to adopt a specific set of habits until they consider themselves site reliability engineers (SREs) as part of their identity. That’s when the habit is solidified.

Create DNR (Do Not Repeat) Items

Engineers and product managers want to ship new products and features. There’s nothing quite as satisfying in software development as deploying new code and seeing what you built running out in the world. But, if what you built is consistently breaking or providing a bad user experience, then you are hurting your customers and ultimately your business.

To make sure we are always learning and getting better, at Gremlin we have what we call “DNR,” or Do Not Repeat work. This work consists of action items from outages and incidents that must not ever be repeated, lest we fail to learn our lesson from these failures. Practically, what this means is all feature work is halted until the issues highlighted as DNR work are remedied and the fixes are verified. In other words, you don’t get to write new code until your old code is fixed. We all know that many teams struggle with the trade-offs of moving fast, but ultimately, if you don’t have strict guidelines in place, then more often than not engineers will prioritize shipping something over making sure it’s reliable.

Creating a DNR item is an easy way to incentivize the behavior you want to see internally by appealing to the engineer’s desire to produce new features. We convince them to write better code because better code means they get to spend less time fixing things.
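To make the rule concrete, here is a minimal sketch of how a team might encode it. This is not Gremlin's actual tooling; the `DnrItem` class, its fields and the `feature_work_allowed` function are hypothetical names made up for illustration. The only thing it captures is the policy described above: feature work stays blocked until every DNR item is both fixed and verified.

```python
from dataclasses import dataclass


@dataclass
class DnrItem:
    """A hypothetical Do Not Repeat action item from an incident."""
    title: str
    fixed: bool = False      # the remedy has been implemented
    verified: bool = False   # the fix has been confirmed to work


def open_dnr_items(items):
    """Return the DNR items that are not yet both fixed and verified."""
    return [item for item in items if not (item.fixed and item.verified)]


def feature_work_allowed(items):
    """Feature work is allowed only when no DNR item remains open."""
    return not open_dnr_items(items)


if __name__ == "__main__":
    backlog = [
        DnrItem("Add timeout to payment client", fixed=True, verified=True),
        DnrItem("Alert on queue depth", fixed=True, verified=False),
    ]
    for item in open_dnr_items(backlog):
        print(f"Blocking feature work: {item.title}")
    print("Feature work allowed:", feature_work_allowed(backlog))
```

A check like this could sit in a planning script or a CI step, but where it lives matters less than the fact that a fix without verification still counts as open.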

Adopt “You Build It, You Own It”

This is the driving principle of DevOps. It is the reason behind shifting left. When the team that develops the software is different from the team that operates it, then there’s a misalignment of incentives. If I am a developer being tracked (and promoted) solely on the amount of code I ship, my focus will be on getting more bits out the door and not on ensuring the features I release will withstand the burdens of operation. That’s another team’s concern.

That’s the motivation behind the proliferation of the “you build it, you own it” mindset at top-performing organizations. Hell, my first day at Amazon, they tossed me a pager and said “Good luck.” And while that may sound daunting (and it was), I can tell you that it not only motivated me to make sure I built systems to last, but it also fundamentally changed the way I thought about architecting systems. Spoiler: I’m not a big fan of my pager going off at 3 in the morning.

In other words, the person or team building the system needs to be the same person or team that feels the pain if that system is failing. But it’s not just about punishment and pain; that team also needs to be recognized and rewarded when their system is running reliably. This creates an alignment of incentives that promotes the kind of habits seen across top-performing teams.

Track the Four Golden Signals

In its chapter on monitoring distributed systems, Google’s SRE book outlines the four golden signals of monitoring: latency, traffic, errors and saturation. If I could wave a magic wand and immediately improve the culture of an organization, then I would have service-level objectives (SLOs) tied to these four metrics. But going back to habit formation: If you make acquiring a habit too difficult upfront, then an engineering team will ultimately reject it. So if your organization isn’t mature enough to create SLOs, simply beginning to track these metrics will up-level your game tremendously. They will give you an understanding of what is not working well and help guide your priorities.
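If you want a feel for how small the first step can be, here is a rough sketch that computes the four signals from a window of request records. Everything in it is an assumption for illustration: the `Request` class, the `golden_signals` function, the choice of 95th-percentile latency, and the idea that saturation arrives as a utilization fraction you already measure somewhere else (CPU, memory, queue depth or whatever constrains your service). A real setup would pull these numbers from your monitoring system instead of computing them by hand.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """A hypothetical record of one served request."""
    latency_ms: float
    ok: bool


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, utilization):
    """Summarize latency, traffic, errors and saturation for one window.

    `utilization` is assumed to be a 0.0-1.0 fraction of the most
    constrained resource, measured outside this function.
    """
    if not requests:
        return {"latency_p95_ms": None, "traffic_rps": 0.0,
                "error_rate": 0.0, "saturation": utilization}
    failures = sum(1 for r in requests if not r.ok)
    return {
        "latency_p95_ms": percentile([r.latency_ms for r in requests], 95),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": failures / len(requests),
        "saturation": utilization,
    }


if __name__ == "__main__":
    window = [Request(100.0, True)] * 9 + [Request(900.0, False)]
    print(golden_signals(window, window_seconds=10, utilization=0.7))
```

Once numbers like these are being recorded, turning them into an SLO is mostly a matter of agreeing on a threshold for each one.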

To Summarize

Imagine a world where, after a major incident happens, the points of failure involved become DNR work. The same failure is not allowed to happen again and no new feature work will be completed until the fixes are implemented. And more importantly, no new feature work will be completed until those fixes are verified via a chaos engineering experiment, which is then cataloged and run continuously against your system. Then you take that knowledge and share it with other teams so they can run the same experiments and make sure they are immune to those failures as well.

This is how you build a culture of resilience.