When I first started migrating companies to the cloud (about seven years ago), they would ask me, “How come the app isn’t scaling? How come users are still complaining about the slowness?” I would explain that the cloud enables auto-scaling, but you still have to make sure your app can scale in the cloud. In people’s minds, auto-scaling and the cloud were synonyms.
I don’t get asked that question anymore (maybe people realized the cloud is not as magical as it seems). But now, people think the cloud is synonymous with speed, agility and cost-effectiveness. Sure, the cloud enables all that. But you still need to optimize your cloud setup to get there. I see many companies in the opposite situation: their accounts are full of resources with names such as test1, test2 and finaltest2, they put together a SWAT team to reduce an AWS bill, or security is an afterthought. With so many engineers making changes, it’s hard to know who is doing what.
Before you run into this situation (maybe you’re already there), you have to create tools and processes to optimize your cloud for scale.
Establish Strong Policies
The volume of changes in cloud infrastructure is high. Unless you establish strong policies, managing all the resources across multiple cloud accounts gets out of control pretty quickly. The most effective approach we have seen is tagging resources. For example, add a tag to each resource that identifies which department it belongs to. Typically, this works very well when the tagging process is embedded into the automation. If new resources don’t comply, automation should terminate them. You should enforce policies like these sooner rather than later.
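To make the tagging policy concrete, here is a minimal sketch of a compliance check. It assumes the inventory has already been exported into plain Python objects; the required tag keys, the `Resource` class and the sample IDs are all hypothetical, and a real setup would pull the inventory from your cloud provider’s API before deciding what to terminate.

```python
from dataclasses import dataclass, field

# Hypothetical policy: every resource must carry these tag keys.
REQUIRED_TAGS = {"department", "owner", "environment"}


@dataclass
class Resource:
    """Simplified stand-in for a cloud resource (assumed shape)."""
    resource_id: str
    tags: dict = field(default_factory=dict)


def missing_tags(resource, required=REQUIRED_TAGS):
    """Return the required tag keys that are absent or empty on a resource."""
    return {key for key in required if not resource.tags.get(key)}


def find_noncompliant(resources, required=REQUIRED_TAGS):
    """Map each non-compliant resource ID to a sorted list of its missing tags."""
    report = {}
    for resource in resources:
        missing = missing_tags(resource, required)
        if missing:
            report[resource.resource_id] = sorted(missing)
    return report


# Made-up inventory for illustration only.
inventory = [
    Resource("i-001", {"department": "billing", "owner": "ana", "environment": "prod"}),
    Resource("i-002", {"owner": "raj"}),
    Resource("test1"),
]

noncompliant = find_noncompliant(inventory)
for resource_id, missing in noncompliant.items():
    print(f"{resource_id}: missing {', '.join(missing)}")
```

The sketch only reports violations. Whether the automation terminates, stops or just flags those resources is a policy decision you would wire in separately.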
And just because resources were provisioned with proper tagging, it doesn’t mean they are used to their full capacity. You should build automation to find underutilized instances and establish a process for downsizing them. To ensure everyone complies with these policies, you should also establish a change management process.
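Finding underutilized instances can start out very simple. The sketch below assumes you have already collected average CPU readings per instance from your monitoring system; the 10% threshold, the minimum sample count and the instance IDs are illustrative assumptions, not recommendations.

```python
from statistics import mean

# Assumed thresholds; tune these to your own workloads.
CPU_THRESHOLD_PERCENT = 10.0
MIN_SAMPLES = 3


def find_underutilized(cpu_samples, threshold=CPU_THRESHOLD_PERCENT,
                       min_samples=MIN_SAMPLES):
    """Return {instance_id: average CPU %} for instances below the threshold.

    Instances with fewer than `min_samples` readings are skipped, since
    there is not enough data to judge them.
    """
    candidates = {}
    for instance_id, samples in cpu_samples.items():
        if len(samples) < min_samples:
            continue
        average = mean(samples)
        if average < threshold:
            candidates[instance_id] = round(average, 1)
    return candidates


# Made-up CPU utilization readings (percent) per instance.
cpu_samples = {
    "i-001": [72.0, 65.5, 80.1],
    "i-002": [3.0, 4.5, 2.5, 6.0],
    "i-003": [1.0],  # too few samples to judge
}

downsize_candidates = find_underutilized(cpu_samples)
print(downsize_candidates)
```

CPU alone is a rough signal. Memory, network and disk usage would normally feed into the same decision before anything is downsized.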
Change Management and CMDB
The objective of the change management process is to ensure that infrastructure changes cause minimum disruption and meet internal guidelines. Many say that the ITIL change management process is not compatible with fast-moving environments such as the cloud. I agree that the traditional way of doing change management, tracking changes manually, just doesn’t work given the volume of changes. But in the cloud, you can leverage automation to streamline the change management process; for example, you can use automation to separate normal changes from manual changes. Once all the AWS changes are reviewed and approved, change management not only brings visibility to those changes for security and reliability, it can also help reduce waste.
Most companies lean heavily on auto-scaling, where new resources are launched based on specified metrics. However, this behavior makes approving changes difficult. If resources keep appearing and disappearing, how do we track what truly changed in the infrastructure? To get a true picture, you need an infrastructure delta, which shows the updates made to still-running resources over a given time period. You can also use automation to build a dynamic configuration management database (CMDB), and use auto-discovery to keep the CMDB in sync.
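One way to picture an infrastructure delta is as a comparison of two inventory snapshots that ignores resources launched or terminated in between. The sketch below assumes each snapshot is a dictionary of resource ID to configuration; the snapshot format, attribute names and IDs are hypothetical.

```python
def infrastructure_delta(before, after):
    """Return attribute changes for resources present in BOTH snapshots.

    Resources that only appear in one snapshot (for example, auto-scaled
    instances that came or went) are deliberately left out.
    Result shape: {resource_id: {attribute: (old_value, new_value)}}
    """
    delta = {}
    for resource_id in before.keys() & after.keys():
        old, new = before[resource_id], after[resource_id]
        changed = {
            key: (old.get(key), new.get(key))
            for key in old.keys() | new.keys()
            if old.get(key) != new.get(key)
        }
        if changed:
            delta[resource_id] = changed
    return delta


# Made-up snapshots taken at the start and end of a review period.
before = {
    "i-001": {"type": "m5.large", "security_group": "sg-web"},
    "sg-web": {"ingress_ports": [443]},
    "i-asg-1": {"type": "t3.small"},  # terminated by auto-scaling
}
after = {
    "i-001": {"type": "m5.xlarge", "security_group": "sg-web"},
    "sg-web": {"ingress_ports": [443, 22]},
    "i-asg-2": {"type": "t3.small"},  # launched by auto-scaling
}

delta = infrastructure_delta(before, after)
for resource_id, changes in sorted(delta.items()):
    print(resource_id, changes)
```

Here the resized instance and the newly opened port show up for review, while the auto-scaling churn does not.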
Shared Responsibility
DevOps calls for breaking down the walls between operations and development teams to deliver code faster. If you release software fast, you can rapidly add value to your customer. The cloud enables this culture because developers no longer have to wait for IT to provision resources. What doesn’t work well, though, is when no one takes responsibility for optimizing cost and maintaining security. If engineers can launch resources, they are also responsible for cost and security; it’s a shared responsibility. You should give each department visibility into its costs, based on the tagging we talked about earlier, so they know the cloud is not free 🙂 Anyone can launch resources as long as they understand the security best practices for each service they are using.
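Per-department cost visibility follows naturally from the department tag. As a rough sketch, assume you have exported billing line items with their tags attached; the field names (`cost_usd`, `tags`) and the figures below are made up for illustration.

```python
from collections import defaultdict


def cost_by_department(line_items, tag_key="department"):
    """Sum cost per department tag; untagged spend is grouped as 'untagged'."""
    totals = defaultdict(float)
    for item in line_items:
        department = item.get("tags", {}).get(tag_key, "untagged")
        totals[department] += item["cost_usd"]
    return {dept: round(total, 2) for dept, total in sorted(totals.items())}


# Made-up billing line items for one month.
line_items = [
    {"resource_id": "i-001", "cost_usd": 120.50, "tags": {"department": "billing"}},
    {"resource_id": "i-002", "cost_usd": 80.25, "tags": {"department": "search"}},
    {"resource_id": "db-001", "cost_usd": 200.00, "tags": {"department": "billing"}},
    {"resource_id": "test1", "cost_usd": 45.00, "tags": {}},
]

department_costs = cost_by_department(line_items)
for department, total in department_costs.items():
    print(f"{department}: ${total:.2f}")
```

The "untagged" bucket is worth watching on its own, since it is spend that nobody has claimed responsibility for.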
Conclusion
Lean thinking calls for reducing waste. What’s waste? Anything that doesn’t add value to the customer. You should work fearlessly to reduce waste. You move to the cloud because you can innovate fast and it’s cost-effective, so you can quickly add value to your customer. You should continue to optimize your cloud infrastructure to ensure it’s meeting its original intent.
About the Author / JT Giri
JT is the creator of nOps, an AWS cloud security tool. He has spent the last 10 years helping companies migrate to the cloud and build automated infrastructures. nOps is a next-generation security and collaboration solution for the cloud. nOps applies change management practices to deliver integrated security and reduce cloud waste.