Pendulums and DevOps

I have long noted the trend of pendulums in IT, particularly in organizations with longer histories. Centralized IT is achieved, its weaknesses are noticed and a movement begins to decentralize. Then decentralized IT is achieved, people notice its weaknesses in turn, and the push back toward centralization starts.

The same is true of speed of delivery versus quality. Faster is more important for a while, then the reality that faster causes quality problems surfaces, and reliable becomes the keyword, leading to slower rollouts of a better product.

This has been going on since I started in IT, and while it is more evident in enterprises, any company that lasts long enough will see these cycles run through.

Agile and DevOps, as the Great Disruptors, stood a chance of redefining the tensions that cause these pendulum swings. Centralized versus decentralized matters less if stakeholders are all truly involved and informed, which is what the culture side of the DevOps crowd aimed to achieve. Similarly, both speed and quality were promised: recovery from inevitable mistakes would be rapid, while integrated testing would uncover most issues.

We have enough years under our belts at this point to know that these issues are, at best, resolved with a large asterisk, and the fine print behind that asterisk is only sometimes spelled out.

One nice thing about DevOps (more than agile) is that it has not stopped evolving yet, which means there is still hope that rapid delivery and high quality can both be achieved, and that involving business owners and all the various IT groups will dampen the centralized/decentralized cycle. But we’re not there yet.

Split.io, a purveyor of feature flagging/tagging tools, has a pretty good blog about what they discovered when they went out and talked to hundreds of shops. It shows that failures are happening at an alarming rate, and that at some number of shops the average released bug is still being fixed when the next release goes out. That is asking for trouble, in my experience.

Better integrated testing is certainly part of the answer. TDD is good, but not enough, since its tests are more granular than some problems require: issues that only appear when pieces are combined can slip past them. Split believes that feature flags are part of the answer, and I believe they’re right. But on its own, this approach is a band-aid on the problem.

By the time you flip a feature flag to turn off a bad feature, you’ve already released a bad feature. And you will be spending time researching and fixing it. I believe feature flags should be used to contain the damage when this (inevitably) happens in a complex coding environment, but at the same time the source of the issues should be looked into. An ounce of prevention is indeed worth a pound of cure.
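For anyone who has not worked with one, here is a minimal sketch of what a runtime-flippable flag looks like. It is an illustration only, not how Split or any other product implements flags. The FeatureFlags class, the checkout_total function and the "new_discount_engine" flag name are all made up for the example, and a real system would typically keep flag state in a shared service rather than in process memory.

```python
import threading


class FeatureFlags:
    """Hypothetical in-memory flag store that can be flipped while the app runs."""

    def __init__(self, defaults):
        self._flags = dict(defaults)
        self._lock = threading.Lock()

    def is_enabled(self, name):
        # Unknown flags are treated as off, so a typo never enables new code.
        with self._lock:
            return self._flags.get(name, False)

    def set(self, name, enabled):
        # Flip a flag at runtime, with no redeploy or restart.
        with self._lock:
            self._flags[name] = bool(enabled)


def checkout_total(prices, flags):
    """Hypothetical business function with old and new code paths."""
    if flags.is_enabled("new_discount_engine"):
        # New (possibly buggy) path: applies a 10% discount.
        return round(sum(prices) * 0.9, 2)
    # Old, known-good path.
    return round(sum(prices), 2)


flags = FeatureFlags({"new_discount_engine": True})
with_new_path = checkout_total([10.0, 20.0], flags)

# A problem is reported in production: turn the feature off immediately.
flags.set("new_discount_engine", False)
with_old_path = checkout_total([10.0, 20.0], flags)
```

Note what the flip does and does not do: users stop hitting the new path right away, but the bug is still sitting in the codebase and still has to be researched and fixed.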

Figuring out what is allowing bugs to slip through into production, rather than counting on remediating them (using precious time to do re-work), should be a priority moving forward. Is there a process or policy at the source? Is it simply a training issue? Training needs are recurring: teaching today does nothing for new algorithms and tools put to use after today, and turnover can create a training gap if you’re not staying on top of it. For some shops, is the speed of delivery faster than some piece of the delivery chain can adequately keep up with? (That piece is normally testing, but some shops also struggle with development on certain platforms.) Do you need to re-evaluate your toolchain and delivery timelines?

We are delivering more code, faster than ever, and that is exciting. But quality matters, and anyone who relies on end-users to test is playing a very dangerous game. Find ways to ensure quality without slowing down the process more than necessary. At this point, I honestly would implement runtime-flippable feature flags as a starting point, simply to be more responsive to issues. From there, work to resolve the underlying reasons that bugs are cropping up.

Until then, keep cranking it. Keep working with the business to get quality systems in front of users, and enjoy doing some great work.

— Don Macvittie