Managing changes to cloud assets is a universal pain felt by many engineering leaders today, despite all of the advancements in tooling and practices like GitOps. This is because, in reality, it is simply impossible to always have everything completely locked down—we don’t live in a utopia with zero incidents. If your engineering organization prevents any changes from being made through the cloud console or to your infrastructure-as-code (IaC) without complying with strict GitOps practices or change management processes via CI/CD, then it’s likely you have very frustrated developers—developers who can’t troubleshoot or debug in real-time and who have little control in a real-world incident.
Engineering, like everything else in life, is all about balance.
The Pendulum Swings
The pandemic created a whole new mindset and practice for managing distributed, high-velocity engineering and operations. Overnight, companies that were not built for remote work had to continue operating in a global, distributed and asynchronous way they weren't completely familiar with. This required a new way to think about software delivery, and it accelerated the DevOps practices that support that delivery. Self-service infrastructure removed barriers for developers, ensuring continued performance and velocity.
At the same time, your cloud cannot be the wild west with everyone creating bespoke infrastructure. It becomes impossible to manage, and misconfigurations can be risky. Guardrails and policy automation have become a hot topic. Today, with tech markets depressed and cloud costs escalating, there seems to be a growing trend toward locking things down again, even at the risk of frustrating developers.
This raises the question: How can you achieve unencumbered infrastructure for your developers while simultaneously following policies and best practices for compliance, risk and cost? There is a way to find balance.
Like many aspects of security, we've learned that when constraints and barriers are too high, users ultimately find ways to bypass them. This is true for operations as well. While sometimes it might seem easier to lock everything down than to design a better, more balanced way for developers to move fast, eventually this approach backfires. This is the exact same evolution the application and cloud-native security industries are undergoing right now: all of the guardrails and controls applied have created too much friction in development processes, and developers ultimately bypass them.
Real-Time Insight is Key
CloudOps can learn a lot from the disruption the security industry is going through today. In the same way that point-in-time security scanning has been rendered completely useless, delayed or after-the-fact alerting on infrastructure drift just won't cut it when managing an ephemeral cloud. What is really needed is the same kind of real-time, continuous scanning of cloud assets and IaC that we already apply to our systems through monitoring and observability. Those tools became an essential backbone of the business, ensuring the continuous operation and availability of cloud services.
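As a concrete illustration, and only as a minimal sketch, a scheduled job can lean on Terraform's own plan machinery to flag drift whenever the live cloud and the state file disagree. The ./infra working directory and the five-minute interval are assumptions for the example, and alert_on_drift() is a placeholder for whatever paging or chat-ops integration you already use:

```python
"""Minimal drift-check sketch: periodically ask Terraform whether the live
cloud still matches the state file, and alert when it does not."""
import subprocess
import time

CHECK_INTERVAL_SECONDS = 300  # assumption: compare desired vs. actual every 5 minutes


def drift_detected(workdir: str) -> bool:
    """Return True when Terraform reports drift between state and real infrastructure."""
    # -refresh-only compares real infrastructure against the state file;
    # -detailed-exitcode makes `terraform plan` exit with 2 when differences exist.
    result = subprocess.run(
        ["terraform", "plan", "-refresh-only", "-detailed-exitcode", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    if result.returncode == 1:
        raise RuntimeError(f"terraform plan failed: {result.stderr}")
    return result.returncode == 2  # 0 = in sync, 2 = drift


def alert_on_drift(details: str) -> None:
    # Placeholder: wire this to Slack, PagerDuty, or your ticketing system.
    print(f"DRIFT DETECTED:\n{details}")


if __name__ == "__main__":
    while True:
        if drift_detected("./infra"):  # assumption: IaC lives in ./infra
            alert_on_drift("Cloud resources no longer match the Terraform state in ./infra")
        time.sleep(CHECK_INTERVAL_SECONDS)
```

In practice you would run something like this from a scheduler or pipeline rather than a long-lived loop, but the core idea is the same: the comparison runs continuously, not once at deploy time.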
As we embrace IaC and the benefits it brings, everything-as-code unlocks greater agility and visibility, enabling you to remediate automatically without locking everything down. As the DevOps mantra goes, "fail forward and fail fast." Rather than focusing on never making a mistake, focus on how to fix it immediately.
The Solution
By providing a continuous comparison of actual cloud assets with their desired state through IaC and GitOps, it's possible to surface configuration drift and policy violations immediately, much like any other kind of breach or major system failure. Failures and incidents are inevitable. It's unrealistic, and even dangerous, to build systems whose very design prevents you from changing something at the cloud console at 2:00 A.M. during an outage.
With a locked-down approach, a developer would need to wait for change-management approval during a high-pressure production incident, oftentimes at 3:00 A.M. (because the incident gods always make them happen in the middle of the night) or on a weekend. We don't need to look much further than the catastrophic 45-minute, $440 million Knight Capital incident to understand that sometimes time is not on our side and that a delay can have severe consequences.
Cloud drift has become as critical to continuous business operations as uptime or any other mission-critical concern, and that's why we need to apply the same principles to monitoring drift in real time as we do to CPU and load. This makes it possible to be alerted the moment your cloud and your IaC are no longer aligned, and even to flag issues before they cause damage. You can automate ticket creation and escalation, and triage each finding: fix it immediately, with remediation suggestions provided, or open a ticket to address it later. The choice should be yours, not an administrator's with far less context about the systems and the ultimate business impact.
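To make that triage concrete, here is a rough sketch of routing a drift finding either to an immediate page with a suggested fix attached or to a ticket for later. The severity labels and the page_on_call()/open_ticket() helpers are illustrative placeholders, not any particular product's API:

```python
"""Rough triage sketch: severe drift findings page someone right away with a
suggested fix; everything else becomes a ticket to handle later."""
from dataclasses import dataclass


@dataclass
class DriftFinding:
    resource: str       # e.g. "aws_security_group.web" (illustrative)
    description: str    # what changed outside of IaC
    severity: str       # "critical" | "high" | "low" (illustrative labels)
    suggested_fix: str  # e.g. a terraform command or a rollback step


def page_on_call(finding: DriftFinding) -> None:
    # Placeholder for a real paging integration (PagerDuty, Opsgenie, ...).
    print(f"[PAGE] {finding.resource}: {finding.description}\n"
          f"       Suggested fix: {finding.suggested_fix}")


def open_ticket(finding: DriftFinding) -> None:
    # Placeholder for a ticketing integration (Jira, GitHub Issues, ...).
    print(f"[TICKET] {finding.resource}: {finding.description}")


def triage(finding: DriftFinding) -> None:
    """Route a finding: fix now with a suggestion attached, or fix later."""
    if finding.severity in ("critical", "high"):
        page_on_call(finding)
    else:
        open_ticket(finding)


if __name__ == "__main__":
    triage(DriftFinding(
        resource="aws_security_group.web",
        description="Ingress rule 0.0.0.0/0:22 added outside of Terraform",
        severity="critical",
        suggested_fix="Remove the rule or codify it and run `terraform apply`",
    ))
```

The point is that the routing logic, and therefore the business context, stays with the team that owns the system rather than with a distant change-approval queue.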
It is Not ‘One and Done’
Some vendors preach codifying your cloud into IaC and then locking it down so that you'll never have to do it again. This feels like deja vu; we've definitely heard this before. It sounds a lot like the "write once, run anywhere" JVM promise that never quite panned out.
By going down this path, you're resigning yourself to never adding new cloud assets or changing your cloud configurations. If there's one thing we've learned about the cloud, it's that it's a constantly moving target. (Read what cloud asset management can learn from the world of finance, another industry in constant flux.) If this didn't work decades ago with the JVM, it certainly won't work in a dynamic, constantly changing environment like the cloud. That's a fool's errand.
Continuous Cloud Up to Code
There are major downsides to locking down agility and velocity. This is hard-earned advice, learned firsthand from legacy CMDB tools and from startups that have forgotten that the cloud is ephemeral, which is arguably the reason we all use it in the first place. So if our cloud is constantly changing and is a continuously moving target, shouldn't our tools be built the same way to keep up? Solutions built for the rapidly evolving cloud-native era must continuously look for new, unmanaged and changed assets and ensure that cloud infrastructure always follows policies and regulatory standards.
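As one sketch of what "continuously looking for unmanaged assets" can mean in practice, the snippet below compares the S3 buckets that actually exist in an AWS account against the ones Terraform knows about. The ./infra directory and the S3-only scope are assumptions kept small for illustration; a real scanner would cover every resource type, account and policy:

```python
"""Find-unmanaged-assets sketch: list what exists in the cloud (S3 buckets,
via boto3) and subtract what Terraform is managing (via `terraform show -json`).
Whatever is left over was created outside of IaC."""
import json
import subprocess

import boto3  # requires AWS credentials in the environment


def buckets_in_state(workdir: str) -> set[str]:
    """Bucket names Terraform manages, read from the state with `terraform show -json`."""
    out = subprocess.run(
        ["terraform", "show", "-json"],
        cwd=workdir, capture_output=True, text=True, check=True,
    ).stdout
    state = json.loads(out)
    resources = state.get("values", {}).get("root_module", {}).get("resources", [])
    return {
        r["values"]["bucket"]
        for r in resources
        if r.get("mode") == "managed" and r["type"] == "aws_s3_bucket"
    }


def buckets_in_cloud() -> set[str]:
    """Bucket names that actually exist in the account right now."""
    s3 = boto3.client("s3")
    return {b["Name"] for b in s3.list_buckets()["Buckets"]}


if __name__ == "__main__":
    unmanaged = buckets_in_cloud() - buckets_in_state("./infra")  # assumption: IaC in ./infra
    for name in sorted(unmanaged):
        print(f"Unmanaged bucket (not in IaC): {name}")
```

Run on a schedule, the leftover set is exactly the "shadow" infrastructure that never made it into code.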
Once we are able to find the balance between agility and control, we'll be able to tap into the benefits of speed and safety that, as DORA research shows time and again, are the hallmark of high-performing teams. Automating cloud asset management can provide governance, policies and control while still enabling the velocity and agility that engineering organizations require today.