Between agile development methodologies and DevOps, we have largely removed the stigma of incremental – or even architectural – failures in development, deployment and, more recently, security and operations. Even data management, with increased redundancy and recoverability, has moderated its once-frantic responses to failure.
This is all good news. Fixing the problem and stopping it from happening again are better uses of our time than assigning blame. I’ve pulled some really dumb stunts in my decades of IT (most recently, I trusted a hardware vendor I had historically had trouble with. Unsurprisingly, we are paying that price now. Replacement parts are arriving today, required because they still try to lock you in). So have each and every one of you. No matter how well-trained an individual is, on-the-job training is a fact of life in an industry where most technology innovations last about as long as a cappuccino and we’re asked to know a little bit about a lot of topics to complement our specialty. (I’ve been a dev, worked in ops, served as an architect and worked in security … none of those jobs was doable without a working knowledge of the others.)
All of that is to say, “We’re admonishing each other less and fixing things more, and that’s good.” I don’t need to point out that there is a balance implied in that statement; most managers are well aware that occasionally a person’s skills/interests aren’t a great fit for the position, and changes have to be made.
But the point of this blog is to say, “We’re really good at acknowledging failure at the granular level, correcting, and making sure it doesn’t happen again. But only at the granular level.”
Yes, the person who championed a tool or system into the organization is going to feel the need to defend it, and we have to acknowledge that this is no different from the developer who made a bad architectural choice feeling the need to defend it. Some tools suck. Some implementations are not a great fit for our use case (I know a group of smart people who can help you figure out which ones are; follow my LinkedIn), and some tools we have simply outgrown.
We make technology decisions for the organization almost every day. Some of them won’t work out in the short term, and we’re not terrible about cleaning those up, outside of the inimical “executive sponsorship” that sometimes saddles us with suboptimal solutions. The truly problematic decisions are the ones that are only kind of successful. We end up limping along when most staff know there are better solutions on the market, or even whole new markets that solve the problem more elegantly.
Most organizations have some form of technology review process: many smaller orgs have that one person who makes final decisions, and many larger ones have committees that look across product lines and technologies. Those reviews tend to kick in only when something is obviously a hindrance to operations, because we don’t have time to muck around with things that aren’t broken. And yet, that is not the Agile/DevOps way. We need to be aware of the friction points and smooth things out. I’ve written before about considering standardizing toolchains, and that’s one thing such a review could evaluate (number of tools versus duplicative effort and disparate training), but taking a hard look at any tool that turns up in multiple issues is a good idea, as are “prove to me this is the best solution”-type conversations.
Keep kicking it; accept that there are reasons some suboptimal tools will stick around, but replace the ones for which provably better or more efficient solutions exist for the organization’s needs. I once moved an entire massive data store across platforms because my initial scale-up approach slowly became less appealing than a competitor’s scale-out approach to the same problem. The people I saddled with doing the conversion gave me more grief than my fellow managers did, but things worked better afterward. So be honest, and be willing to admit that one of the hundreds of risks you take didn’t pan out. And keep the company rolling forward. Your work is critical.