Operational Overhead - or why hubris is bad.

Stop me if you’ve heard this one: A guy walks into a meeting and says “We need a system that does something”, and the engineer in the room replies “Oh, we can totally build that”. Most of the punchlines to this joke go “we can totally build it, and now we have to own its lifecycle and we have sacrificed valuable time and resources for something we should never have built.” ba-dum-ching

I am guilty of playing my part in these conversations, as I’m sure many of you are. About two years ago, as our budding startup was really starting to take off, I was in one of those meetings. Some of the developers wanted to try a recommendation engine, and were going to use Mongo as part of it. Frankly, I thought they wanted to screw around with Mongo and needed an excuse as much as I thought they were going to really do something. At the time I was drowning – running our cloud systems, our LAN, facilities, purchasing, and I had just been tasked with bringing some sanity to the tech side of our video post-production process. I. Did. Not. Have. Time. For. A. Mongo. Setup.

See, here’s the thing. Most of the time the system administrator in that conversation, trying to be a good devops partner, replies “Oh sure, I can stand up Mongo”. But let’s be honest, it’s not just standing up Mongo. It’s handling upgrades, security, failover/HA, backups and restores. If this recommendation engine actually worked it was quickly going to become a key part of our infrastructure. I had just enough presence of mind to know that while I probably could find an hour to stand up Mongo, I didn’t have the several other hours needed to do it well. And really, I didn’t have an hour. So I quipped a now-famous-and-often-used-to-troll-me response: “Look guys, we need to reduce our operational overhead. Let’s find an outsource.”

Ten minutes and a credit card later, we were set up with MongoLab. Two years later? The recommendation engine is awesome, and I have not cared one ounce about Mongo. No upgrades. No middle-of-the-night pages. No architecture discussions. It just works. I’d like to spend a few words dealing with the standard objections:

OK we can try outsourcing, and if it works we’ll bring it in house. Right? Because you never run key pieces of the infrastructure out of your control. Right? Because, if you do… then… that’s bad. Right! Ye Olde World IT Wisdom is pretty clear on this point, you can only ever be sure that something is good if you’re running it in-house. In my experience however, you’re probably more likely to fumble a backup, fail an upgrade, or bungle a failover than an outfit whose sole job is to do those things.

But! It’s expensive! Right? Just about any *aaS provider has an entry-level pricing structure that lets you screw around for less than the cost of that team lunch you’re buying on Friday. If you’re in production, and you are making money, or even better yet more money because of this infrastructure choice… justifying the cost is stupidly easy. As long as the cost of the service is less than the money you are bringing in, good job. If you aren’t actually making money, then what exactly are you doing?

We can’t trust Service Provider X. Guys, these companies are in business to provide 1 thing for you. They literally have 1 job. You probably have 14 jobs. Are you asking for a 15th? A 15th thing that you have to watch, and care for, and feed, and love, and hate? A thing that is not core to your actual business? Shouldn’t you make that 15th job something directly beneficial to your team/code/organization? Isn’t that what you’re here for?

OH GOD NO NOT VENDOR LOCK-IN. Hey genius, you know what? You have to write code, and whether you are writing code talking to a database in your environment, or a queuing system in someone else’s… it’s just code. I hope to spend most of the rest of my life writing about how utterly stupid this argument is. Yes! If you fire a vendor you have to refactor. You know what? If you upgrade your in-house stuff you probably have to refactor. If you change the in-house system you have to refactor. Anyone prescient enough to foresee all possible outcomes, and who can make perfect, refactor-free decisions can talk to me about vendor lock-in. The rest of you should move past this objection and get back to doing something productive.

You know what my favorite operational overhead story is? Amazon SQS (Simple Queue Service). Different day, different developers, same idea. Wow, our little website is now processing enough orders that we need asynchronous fulfillment… I know! Let’s set up a queuing system! I used ActiveMQ once! Or RabbitMQ! Or… we can use the same queuing system that runs the largest ecommerce site on the planet. Where there is literally a horde of highly trained monkeys watching every teensy little thing that makes it work 24x7x365. Someone else has already thought of every imaginable permutation of scale/resiliency/security. And you know what? It costs $0.0000005 per request. I can process 14 million orders at a net cost of $7. Yeah, let’s totally spend time standing up something better and cheaper.
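If you want to check that math yourself, here is a rough back-of-the-envelope sketch. It assumes the per-request price quoted above and, like the paragraph above, counts one request per order; the function name `queue_cost` is made up for illustration, and real-world pricing and request counts per order may well differ.

```python
from decimal import Decimal

# Per-request price as quoted in the post (an assumption here, not current pricing).
PRICE_PER_REQUEST = Decimal("0.0000005")


def queue_cost(requests: int, price_per_request: Decimal = PRICE_PER_REQUEST) -> Decimal:
    """Return the total cost in dollars for a given number of queue requests."""
    # Decimal avoids floating-point rounding noise on very small prices.
    return requests * price_per_request


# Simplification from the post: one request per order.
cost = queue_cost(14_000_000)
print(f"${cost:.2f}")  # prints "$7.00"
```

Swap in your own order volume and whatever price your provider quotes; the point is that the whole comparison fits in a few lines.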

Good engineers are good because they know they can build things. Even when they don’t know they can, they are pretty sure they can figure it out. DevOps should be about making smart decisions, and hubris is not smart. Use the best tool at hand. Rebuilding every wheel, administering every platform, doing everything yourself… is just adding to your operational overhead, and subtracting from your ability to be impactful to your business.