An Embarrassment of Riches

History says people rarely appreciate what they have in the heat of the moment. Our highly connected and high-speed world is no exception to this rule. The things that frustrate us are in our faces, while the things that make nearly everything better/safer/faster are taken for granted.

That is every bit as true in IT as it is in life. We expect our deploys and releases to go easily, because life is far easier in that regard than it was even 10 years ago. We get frustrated when things don’t work out, because they so often do.

One thing Agile has addressed is blame: most shops no longer beat up on the person responsible for an issue until they leave. That’s not to say no one is judgmental; it just isn’t official policy to browbeat people, even over stupid mistakes. The goal is to get better, not to bash the person responsible for the mistake. Most of the time this is a downright positive change, particularly since, in complex systems, it is often not one individual’s error, but many.

Before Agile, I rushed out a fix to a piece of software used by several large banks. They needed the fix, and I was most qualified to do it. I used the file handle reserved for the printer to open a printer, and wrote to it. All went well in test and we shipped it out. The result was catastrophic corruption of one set of databases for several large banks.

Unknown to me, someone had reused the printer file handle: if certain conditions were met, it was opened on the database. Because the handle was reserved for printers, that never occurred to me, and I simply reopened it and used it. The strange conditions that pointed it at the database were never encountered in test, and I was responsible for some serious damage to our reputation and to those banks’ confidence in our software.

Thankfully, our development team was already on “do it fast, don’t point fingers,” so it was mentioned that better testing could have been done, and we implemented plans to do just that, using scrubbed customer data. And the root cause (using a reserved file handle for access to something else) was fixed. My own angst at having created the mess made me sit with operations while they rebuilt the destroyed databases one at a time, using some astounding Unix command line wizardry. In the end, the software and the team were better off for the experience. And I got an early lesson in not trusting that code would behave as expected.

The point being, bashing me for a mistake that put us all under high stress, because our biggest customers were rightfully angry, would not have solved the problem. Letting me work through it, identify root causes with the test team and get things straight worked better.

We’re facing a similar hill in DevOps today. Do you test in production? How much do you test in production? What if you have similar totally unforeseen issues crop up?

The answer is that you must test in production. Inevitably, there will be errors in production. It is far better to go looking for them early and fix them than to end up testing in production by default, because the app is live and there were things you couldn’t test anywhere else. Either way, you’re testing in production; the question is how much control you have.

This is one of the strong suits of DevOps. Using things such as feature flags, monitoring and AIOps, you can spin up a small percentage of production servers with changes. Or deploy parallel servers in production for low-server-count apps, then just stream a few customers to them to test out new changes. Yes, those customers might have a terrible experience if there is some unforeseen issue, but better a few customers than all of your customers if you rolled out without testing. For some industries, building a list of people willing to be used to test new features like this can be a marketing ploy.
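To make the "small percentage" idea concrete, here is a minimal sketch of a percentage-based feature flag in Python. It is an illustration under assumptions, not any particular vendor's API: the function names `in_rollout` and `pick_backend`, the flag name `new-checkout`, and the `canary`/`stable` labels are all hypothetical. Each customer ID is hashed into a stable bucket, so the same customers stay in the test group from request to request.

```python
import hashlib


def in_rollout(user_id: str, flag_name: str, percent: float) -> bool:
    """Return True if this user falls inside the rollout percentage.

    Hypothetical helper: hashes the flag name and user ID together so
    each flag gets its own stable cohort of users.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket in the range 0..9999.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    # percent=5 admits buckets 0..499, i.e. roughly 5% of users.
    return bucket < percent * 100


def pick_backend(user_id: str, percent: float) -> str:
    """Route a user to the changed servers or the existing ones."""
    # "new-checkout" is a made-up flag name used only for this example.
    return "canary" if in_rollout(user_id, "new-checkout", percent) else "stable"


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    canary_users = sum(pick_backend(u, 5) == "canary" for u in users)
    print(f"{canary_users} of {len(users)} users routed to canary")
```

Because the bucket depends only on the flag name and the customer ID, raising the percentage keeps everyone already in the test group and simply adds more customers, while setting it to zero pulls everyone back to the existing servers.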

Our frustration at things going wrong will still be there, but the joy of catching those problems before a major catastrophe hits our production servers should be there, too. Not long ago, you rolled it out to everyone and dealt with the consequences. Crashes were big and even more frustrating. Better to have a few customers impacted and the ability to fix issues before they become an emergency.

I’ve said it before: we work in astounding times, with giant leaps forward in responsiveness. Take advantage of them. And take time away from making and running great software to appreciate them, if only for a moment.

Then add more features and fix more bugs. We are in the heat of the moment, after all.

— Don Macvittie

Tags: agile, AIOps, application monitoring, continuous operations, feature flags, production testing

5 years ago

Don Macvittie

20-year veteran leading a new technology consulting firm focused on the dev side of DevOps, Cloud, Security, and Application Development.
