The wave of digital transformation enabled by cloud computing has brought both new opportunities and fresh challenges. On one hand, application developers can now deliver innovative new functionality to end-users almost instantaneously. On the other hand, this agility has created an environment in which users expect constant improvements and demand ever-higher levels of service. Indeed, the need to increase agility and quality simultaneously is largely responsible for the growing popularity of DevOps practices.
When I talk to customers who say they are “doing DevOps”, the conversation always seems to revolve around speed. Iterate faster. Turn out more innovations and new features – multiple times a week. Pull in microservices from this cloud provider over here, program in polyglot languages over there, virtualize and automate everything on no-name infrastructure, and do it all as quickly as possible. Leverage continuous builds and immediately push to production in a perpetual beta model. Test only at code time, or skip testing entirely and let the end-users test it for you. Forget about run-time instrumentation and monitoring – just get the feature out the door, wait for social media to report any problems, and in the meantime, move on to coding the next feature.
It all sounds fun and productive, in the same way that dot-com-era rapid application development was fun and productive. And it certainly can reduce the cycle time on new features and provide opportunities to introduce new innovations. But that scenario as described isn’t truly DevOps, and it is built around a dangerous assumption about unlimited end-user tolerance for failures.
Take a break from coding the next feature, stop, and reread that scenario. Notice that the only actual “Ops” in it is the automation used to enable the development process and push to production. In other words, it’s all “Dev” – as if all important activities stop the moment an actual user touches the application. But does that make any sense? After all, aren’t we building all of these cool new features for end-users? And isn’t the reason we want end-users to consume these features that the applications support a business goal of some kind? And for end-users to consume these features, isn’t it important that our applications actually work and provide high levels of service in production?
Here are a few tips to ensure there is some “Ops” in your DevOps efforts.
- Stop denying there is a problem. IT Operations is “so ten years ago,” many say, because the infinite elasticity of cloud and built-in exception-handling of modern programming environments renders all of that old-school NOC stuff as antiquated as the pager those folks used to wear. But that assumption is patently incorrect – just look at November 2015’s “Black Friday” shopping season, which was a mess for many online retailers, as noted in the Wall Street Journal, Business Insider, CBC, Express and the Independent. IT Operations is more important now, not less. DevOps without Ops is not sufficient.
- Eliminate operational information silos so you eliminate finger-pointing. Collaboration is a central tenet of DevOps, and yet most people troubleshoot through inefficient “war rooms” full of individual component owners, each feverishly looking through their own console to prove the problem isn’t with “their” stuff. Many of those well-intentioned component owners bang their heads against the table in frustration at how hard it is even to access the information they need in the first place, not to mention the fun job of manually correlating timestamps and hostnames between their own and others’ operational data stores. There’s no reason to perpetuate the silos of yesteryear – instead, put your operational data in one place and let today’s machine-learning-powered management tools do the heavy lifting for you. You will certainly reduce finger-pointing and troubleshoot faster, and you may be able to eliminate the “war room” entirely.
- Monitor what (really) matters – your actual end-users. The concept of real end-user monitoring (monitoring the actual activity inside browsers, mobile devices, etc.) has been around for a while, yet it remains an under-utilized capability in most DevOps organizations – perhaps because it used to be technically hard to implement. That is no longer true: today’s next-generation application performance monitoring tools make end-user instrumentation simple to deploy. It’s important to see real user activity not only because end-users are the whole point of the exercise, but because server-side monitoring alone doesn’t tell you what’s actually going on. As late as June 2014, in an Oracle Applications User Group/Unisphere survey entitled “Performance Under Pressure: The State of Enterprise Web App. Quality and Availability,” 85% of customers reported that they usually learn about application problems from end-user complaints – a staggering number which is actually up from 79% in 2009. There is simply no excuse for such a reactive view of monitoring.
- It’s in the logs! Logs are everywhere, but most organizations don’t use them because they are overwhelmed: logs contain too much data, are too widely dispersed, and are too hard for humans to decipher. Yet logs represent one of the best ways for development and operations to collaborate, since the logging enabled by application and platform developers only yields its full value in production. Next-generation management clouds designed to ingest big data at enterprise scale can cope with today’s log data volume and velocity, which can be on the order of terabytes per day. Once the information is collected, if your unified management regime can automatically take you to the few individual rows in the few individual logs that correspond to the exact problem you’re troubleshooting, this mountain of data becomes a treasure trove of information. Similarly, if machine learning algorithms can discover patterns and anomalies automatically, your logs become ongoing sources of high-value IT and business insights. The data is there; you just need to use it.
- Planning is an everyday activity. DevOps organizations are often necessarily focused “in the moment” – but future business results depend on accurate advance planning. In most organizations, planning takes place in artificial time windows (calendar year, budget year, off-season, etc.) with incomplete information and a whole lot of guessing – and is often performed by analysts with no direct connection to the operational context. Instead of treating planning as a once-a-year exercise in an ivory tower, regularly apply analytical capabilities to your unified store of operational information to answer forward-looking questions. Your DBA wants to know if she has enough database capacity for next year? Analyze at the individual instance level. Your marketing people are worried that you can’t support their upcoming launch? Analyze at the application level. Your CIO wants to know where people are spending their troubleshooting time and the corresponding health of your entire infrastructure? Analyze across your entire estate. If you’ve followed the advice in steps 1 through 4 above, you already have all the data you need. Now it’s time to use it.
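To ground the silo-elimination tip above, here is a minimal Python sketch, with invented record shapes, hostnames and timestamps, of what “put your operational data in one place” means at its simplest: normalize each team’s events into one schema and merge them into a single timeline, rather than hand-correlating timestamps and hostnames across consoles.

```python
from datetime import datetime, timezone

# Hypothetical per-team event records, each in its own format (invented for
# illustration; real tools emit many more fields).
app_events = [
    {"ts": "2015-11-27T10:02:05Z", "host": "web01", "msg": "HTTP 500 on /checkout"},
]
db_events = [
    {"time": 1448618520, "server": "db03.example.com", "event": "lock wait timeout"},
]

def normalize(record):
    """Map either record shape onto one schema: (utc_datetime, short_hostname, message)."""
    if "ts" in record:  # app-team format: ISO-8601 string, short hostname
        when = datetime.strptime(record["ts"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        return when, record["host"].split(".")[0], record["msg"]
    # db-team format: UNIX epoch seconds, fully-qualified hostname
    when = datetime.fromtimestamp(record["time"], tz=timezone.utc)
    return when, record["server"].split(".")[0], record["event"]

# One merged timeline instead of two silos: every event, sorted by time.
timeline = sorted(normalize(r) for r in app_events + db_events)
for when, host, msg in timeline:
    print(when.isoformat(), host, msg)
```

With both records in one schema, the database lock timeout visibly precedes the application’s HTTP 500 by five seconds; no war room required.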
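The end-user monitoring tip can be illustrated with a toy calculation (all timings invented): the server-side average looks healthy, while what end-users actually experienced, server plus network plus client rendering, is far worse. Only client-side instrumentation sees the second column.

```python
# Per-request timings in ms (invented sample): server processing time vs the
# total time the end-user actually waited (server + network + client render).
requests = [
    {"server_ms": 80, "user_ms": 310},
    {"server_ms": 85, "user_ms": 295},
    {"server_ms": 78, "user_ms": 2900},   # slow mobile network
    {"server_ms": 82, "user_ms": 3100},   # heavy client-side rendering
    {"server_ms": 79, "user_ms": 320},
]

def avg(values):
    return sum(values) / len(values)

server_avg = avg([r["server_ms"] for r in requests])
user_avg = avg([r["user_ms"] for r in requests])

print(f"server-side average: {server_avg:.0f} ms")  # looks healthy
print(f"end-user average:    {user_avg:.0f} ms")    # tells a different story
```

A dashboard built only on the server-side numbers would report everything green while two of five users waited around three seconds.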
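As a sketch of the log pattern discovery described above, and not any particular vendor’s algorithm, the snippet below collapses invented log lines into templates and flags rare templates as candidate anomalies. Production tools use far more sophisticated clustering, but the principle is the same.

```python
import re
from collections import Counter

# A small sample of raw log lines (invented for illustration).
log_lines = [
    "2015-11-27 10:00:01 GET /home 200 12ms",
    "2015-11-27 10:00:02 GET /home 200 11ms",
    "2015-11-27 10:00:03 GET /cart 200 20ms",
    "2015-11-27 10:00:04 GET /home 200 13ms",
    "2015-11-27 10:00:05 GET /checkout 500 904ms",
    "2015-11-27 10:00:06 GET /cart 200 19ms",
]

def template(line):
    """Collapse variable parts (numbers, timestamps) so similar lines share a template."""
    return re.sub(r"\d+", "<N>", line)

counts = Counter(template(line) for line in log_lines)

# Templates that occur only once are candidate anomalies worth a human look.
anomalies = [t for t, n in counts.items() if n == 1]
print(anomalies)
```

Six lines shrink to three templates, and the lone failing checkout request surfaces by itself; the same approach scales the “few individual rows in the few individual logs” idea to terabytes.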
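Finally, the planning tip can start as simply as a trend projection. This sketch fits a least-squares line to invented monthly database-usage figures and forecasts a year ahead against a hypothetical 400 GB quota, the kind of instance-level question the DBA above is asking.

```python
# Monthly database storage usage in GB (invented sample data).
usage_gb = [120, 135, 149, 166, 180, 196]

# Fit a straight line by least squares over month indices 0..5.
n = len(usage_gb)
xs = range(n)
x_mean = sum(xs) / n
y_mean = sum(usage_gb) / n
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, usage_gb)) \
    / sum((x - x_mean) ** 2 for x in xs)
intercept = y_mean - slope * x_mean

def projected(month_index):
    """Extrapolate the fitted line to a future month."""
    return intercept + slope * month_index

# Will a hypothetical 400 GB quota hold for the next 12 months?
months_ahead = 12
forecast = projected(n - 1 + months_ahead)
print(f"Projected usage in {months_ahead} months: {forecast:.0f} GB")
```

Run against a unified operational store instead of a spreadsheet, the same arithmetic answers the question continuously rather than once a budget year.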
But hang on a minute, you say. Let’s not forget that we are doing DevOps for speed – and all this management, while clearly useful, could take forever to implement and slow everything down in the process. A few years ago, you might have been right. But technology marches on and today’s cloud-based management regimes can be leveraged instantaneously, with no changes to your infrastructure, and start providing the kind of value you need within minutes. With cloud-based IT Operations Management, your DevOps efforts can truly combine development and operations best practices while still delivering the speed you need.
No more excuses. Put the “Ops” back in DevOps.
About the Author: Dan Koloski
Dan Koloski is a software industry expert with broad experience both as a technologist on the IT side and as a management executive on the vendor side. Dan is a Senior Director of Product Management for Enterprise Manager – Oracle’s integrated IT operations and Hybrid Cloud Management product line. Dan holds a B.A. from Yale University and an M.B.A. from Harvard Business School.