As engineers, we spend a lot of time talking about things such as release processes, QA environments and deployments. But at the end of the day, most software systems fail because the software itself is faulty. The code doesn’t lie: you can almost always find the solution to your problem within the code itself. Of course, looking to the code for answers is a time-consuming process if you’ve deployed too much in one batch. Sometimes the tiniest mistake can derail an entire release, and sorting through a haystack of code to find a tiny needle is not fun.
There are a few corollaries to this mantra that “the code doesn’t lie”:
- “Know your data like you know yourself.” Sometimes the quality of the data itself can cause a system to behave differently.
- “A feature is a bug in a tuxedo,” which is largely for you product folks out there.
- “Where there is one bug, there are many.” When finding one problem leads you to a systemic set of problems, it may seem terrible at first, but in reality you have the opportunity to fix a lot in a short amount of time.
I had a chemistry teacher who used to say, “A mole is a mole is a mole,” referring to Avogadro’s number. My version is: “A bug is a bug is a bug.” A feature that behaves incorrectly (even if built to spec) is a bug. A badly performing system is a bug. A transient failure is a bug. A race condition is a bug. No matter what they look like, it’s our job to squash the bugs.
Branch Readiness is a Joke
A very talented engineering director once made the comment, “Branch readiness is a joke,” about our branch-based development model. As 23 different teams were preparing their branches for integration onto the main trunk, code changes were flying around with very little testing or certification. We even had code that would not compile.
When all 23 branches hit the main release branch, all hell broke loose. And that is where the actual functional, integration and performance testing began. The key problem with this model was that integration and certification happened late in the game. The other problem was that we did not abide by our own criteria for what constituted branch readiness; instead, we kept softening the restrictions until bad code made its way into the system.
We moved to a trunk-based model of development where check-ins are made directly to the trunk, and all code must compile and be pretested. Continuous integration and testing are run against the trunk all day and all night. We can ship from the trunk at any moment.
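To make that concrete, here is a minimal sketch of what continuously verifying the trunk can look like. The git commands, branch name and Make targets are assumptions for illustration, not a description of our actual tooling; the point is that every new trunk revision gets the same build and tests, and the latest green revision is always a candidate to ship.

```python
# A minimal sketch of continuously verifying the trunk, assuming a git
# repository whose trunk is origin/main and a Make-based build; the
# commands are illustrative placeholders, not a specific CI system.
import subprocess
import time

def trunk_tip() -> str:
    """Return the current tip revision of the trunk."""
    subprocess.run(["git", "fetch", "origin"], check=True)
    out = subprocess.run(["git", "rev-parse", "origin/main"],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

def is_green(revision: str) -> bool:
    """Check out the revision and run the same build and tests every time."""
    subprocess.run(["git", "checkout", revision], check=True)
    return all(subprocess.run(cmd).returncode == 0
               for cmd in (["make", "build"], ["make", "test"]))

def watch_trunk(poll_seconds: int = 300) -> None:
    last_checked = None
    last_green = None
    while True:
        rev = trunk_tip()
        if rev != last_checked:
            last_checked = rev
            if is_green(rev):
                last_green = rev   # this revision can ship at any moment
                print(f"trunk is green at {rev}")
            else:
                print(f"trunk is broken at {rev}; last shippable revision: {last_green}")
        time.sleep(poll_seconds)
```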
Now you may be asking yourself, What does this have to do with operations? The answer is: everything. We can control what changes get to the site. We can certify what changes get to the site. We are assured at every moment of a minimum quality in the changes that do get to the site. These smaller changes give us the ability to move bits onto the site more frequently. We are not forced into large-scale, monolithic deployments, which inherently create more risk to the site itself, often with little ability to roll back.
Learning from ‘Branch Readiness is a Joke’
Just because the code doesn’t lie does not mean that you can always understand all of its implications. The key is limiting the amount of code being changed in any individual release to an amount an engineer or automated system can understand. By avoiding large, monolithic releases containing thousands of changes and instead testing, verifying and considering each change individually, we can understand what the code is telling us. We can release to the trunk with confidence that what we build and deploy is going to work.
The One-Character Change
We live in a world of bits and bytes (eight bits per byte), and one byte is the equivalent of one character. In some cases, a one-character change is benign. For example, changing the number of members of a site on an informational page from 134M to 135M (a single character, a “4” becoming a “5”) will not cause any harm to the site.
But sometimes, a one-character change can be disastrous. Consider a DNS change that has a one-character mistake for www.yourcompany.com. Get it wrong and you are off the air. It is very important to understand the impact of a change going awry. If it can cause a large impact, then we need to ensure we understand the change and have a clear plan to roll it back if we need to.
Here’s a great example of a very small change going sideways very fast: We once scheduled a maintenance window to test routing traffic through a European point of presence. Unfortunately, the time-to-live (TTL) parameter on the DNS entry was set to a large value (hours instead of minutes). The result was that the change, once implemented, couldn’t be fully undone for hours. It was a one-character change with a bad outcome.
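What made that mistake so painful is the TTL itself: it tells every resolver on the internet how long it may cache an answer, so once a record with a multi-hour TTL has been handed out, rolling back the change does nothing for clients that already cached it. Below is a minimal sketch of checking a record’s TTL before a planned change so the worst-case rollback window is known ahead of time; the dnspython library, the hostname and the one-hour threshold are assumptions for this example.

```python
# A minimal sketch using the dnspython library (pip install dnspython).
# The hostname and the one-hour threshold are illustrative assumptions.
import dns.resolver

def worst_case_rollback_seconds(hostname: str) -> int:
    """The TTL is how long a resolver may keep serving the old answer
    after the record is changed (or rolled back)."""
    answer = dns.resolver.resolve(hostname, "A")
    return answer.rrset.ttl

if __name__ == "__main__":
    ttl = worst_case_rollback_seconds("www.yourcompany.com")
    print(f"current TTL: {ttl} seconds")
    if ttl > 3600:
        print("WARNING: a bad change could persist for over an hour; "
              "consider lowering the TTL well before the maintenance window")
```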
Learning from the One-Character Change
The code doesn’t lie, but that only matters if we are paying attention. Put simply, not all changes are equal. By taking the time to think about what a change is intended to accomplish and the various ways it could go wrong, we will always be better off than if we didn’t do due diligence. The one-character change can be benign, or it can be catastrophic. It’s important to figure out the possible ramifications of any changes (no matter how small they may seem) before proceeding.
Operations tends to be the first group to get called after hours when something goes wrong. As a result, we have a vested interest in the quality of the code shipping to our site. By working to improve code quality before it hits production, we can significantly reduce the number of problems we encounter.
Start with where the code gets committed into source control. For each change made, ask the simple questions: Does the code still compile? Does the application still build correctly? Does it still work as intended? By automating the process of asking these questions for every change, we gain the ability to deploy new versions of our application based on any commit.
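A minimal sketch of automating those questions is below. The specific commands (Make targets, pytest) are placeholders for whatever your project uses to compile, build and test; what matters is that the same gate runs for every single change, and only a change that passes becomes a candidate for deployment.

```python
# A hedged sketch of a per-change gate; the commands are placeholder
# assumptions for a project's real compile, build and test steps.
import subprocess
import sys

CHECKS = [
    (["make", "compile"], "Does the code still compile?"),
    (["make", "build"],   "Does the application still build correctly?"),
    (["pytest", "-q"],    "Does it still work as intended?"),
]

def gate_change() -> bool:
    for cmd, question in CHECKS:
        print(f"{question} -> running: {' '.join(cmd)}")
        if subprocess.run(cmd).returncode != 0:
            print("FAILED - this change is not ready for the trunk")
            return False
    print("All checks passed - this commit is a deployable candidate")
    return True

if __name__ == "__main__":
    sys.exit(0 if gate_change() else 1)
```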
Once we have good code quality making it into source control, we can begin making further improvements through the use of a canary process and start focusing on things such as performance. Remember, the code doesn’t lie. If we can understand each change, then we can start preventing problems before they begin.
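As an illustration only, a canary decision might look roughly like the sketch below: the new version serves a small slice of traffic, its error rate and latency are compared against the version already running, and it is rolled back automatically if it looks worse. The deploy and metrics hooks are hypothetical stand-ins for whatever tooling you actually use, so they are passed in as callables rather than named as real APIs.

```python
# A hedged sketch of a canary decision loop. The deploy/metrics hooks are
# hypothetical stand-ins for real deployment and monitoring tooling.
import time
from typing import Callable

MAX_ERROR_RATE_DELTA = 0.001   # tolerate at most 0.1% more errors than current
MAX_LATENCY_DELTA_MS = 20.0    # tolerate at most 20 ms extra at p95

def run_canary(
    new_version: str,
    current_version: str,
    deploy_canary: Callable[[str, int], None],  # (version, traffic percent) -> None
    error_rate: Callable[[str], float],         # version -> fraction of failed requests
    p95_latency_ms: Callable[[str], float],     # version -> p95 latency in milliseconds
    promote: Callable[[str], None],
    rollback: Callable[[str], None],
    observation_seconds: int = 600,
) -> bool:
    """Serve a small slice of traffic with the new version, compare it to the
    current version, and only promote it if it does not look worse."""
    deploy_canary(new_version, 1)   # start with roughly 1% of traffic
    time.sleep(observation_seconds)

    errors_ok = error_rate(new_version) <= error_rate(current_version) + MAX_ERROR_RATE_DELTA
    latency_ok = p95_latency_ms(new_version) <= p95_latency_ms(current_version) + MAX_LATENCY_DELTA_MS

    if errors_ok and latency_ok:
        promote(new_version)        # roll out to the rest of the fleet
        return True
    rollback(new_version)           # shift traffic back to the known-good version
    return False
```

The design choice here mirrors the rest of this post: keep the blast radius of each change small, observe it, and keep a clear, fast path back to the known-good state.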
This post is part of the series “Every Day Is Monday in Operations.” Throughout this series we discuss our challenges, share our war stories and walk through the learning we’ve gained as Operations leaders. You can read the introduction and find links to the rest of the series here.
About the Author / David Henke
David Henke has more than 35 years of experience working in technology, including senior engineering leadership positions at LinkedIn, Yahoo!, and AltaVista Company. He’s also been a founder at two different software companies, both of which were acquired. Currently, David serves in a variety of board and advisory positions with organizations like NerdWallet and UC Santa Barbara.