DevOps Pro Tips on Cloud Management

In this DevOps Chat we are joined by Andy Richman of ParkMyCloud and Samir Mehra of CloudHealth Technologies. Both of these cloud experts give sound advice on how to manage your cloud infrastructure as efficiently as possible, whether it is on AWS, Azure, Google or anywhere else. If you are interested in optimizing your cloud deployments and cloud management in general (who isn’t?), then have a listen.

As usual, the streaming audio of our conversation is immediately below, followed by the transcript of our conversation so you can read along. Enjoy!

BTW, we have recorded two other podcasts with ParkMyCloud. You can listen to them here and here.

Alan Shimel: Hey, everyone, this is Alan Shimel, DevOps.com, and you’re listening to another episode of DevOps Chat. We’ve got a great chat lined up today. We’re gonna be talking a little bit about managing your cloud infrastructure, cost, and how much do you really know about or how much insight do you really have into your cloud infrastructure.

We’ve got two great guests lined up to join us and help us today. First, I have Andy Richman. Andy is with ParkMyCloud, and Andy, welcome to DevOps Chat.

Andy Richman: Thank you, Alan. Looking forward to this.

Shimel: Okay, and you know what, Andy, before we introduce Samir, why don’t we just, for anyone who may not be familiar with ParkMyCloud, why don’t you give us a little background on what your role is there?

Richman: Sure. So, I’m the Product Manager over here at ParkMyCloud, and I think about all things automation and saving money in the cloud. The way we describe it is cost optimization. We launched the company in September, 2015, so we’ve been around just over a couple of years now, and we work with all manner of companies, from large Fortune 500 type companies, right, all the way through to smaller startups born in the cloud, and we’re helping all of those people effectively manage their infrastructure and get the most bang for the buck out of it.

Shimel: Excellent. Okay, we’ll come back to you in a second, Andy, but first let us introduce Samir Mehra—and I messed that up Samir, I apologize. Why don’t you correct me?

Samir Mehra: It’s Samir Mehra.

Shimel: Oh, I wasn’t too bad, then. And Samir, why don’t you tell us your company and background?

Mehra: Yeah, so I’m Director of Product Management here at CloudHealth Technologies. CloudHealth Technologies is a SaaS platform that provides visibility into all your cloud environments, so customers can really look at areas for optimization, and this can include cost, usage, configuration, security, and performance. And then we also have ways that you can automatically remediate some of these optimizations with our policy engine.

Shimel: Got it. So, Andy, Samir—not to boil it down to dollars and cents, but let’s boil it down to dollars and cents. Is cost sort of the biggest factor in what companies are looking at in terms of managing their cloud infrastructure? I mean, you guys do similar but different tasks for organizations using cloud. What’s your opinion? Is cost the number one priority there? Andy, let’s go with you first.

Richman: Sure. So, it depends on where you are in terms of your company’s cloud evolution. So, if you’re moving from an on-prem data center type operation into the cloud, then no, it probably isn’t. You’re thinking about migration, getting workloads moved over, you know, getting performance in the cloud.

Once you’re in the cloud, you know, obviously, you’re looking to optimize the performance, you know, you’re probably looking to scale up what you’re doing there. And again, cost is moving up, but it probably isn’t number one or number two. But once you’ve been in the cloud for a couple of years and, you know, month on month, you’ve seen those cloud bills climbing, then typically, yeah, it begins to get a lot of attention. And, for most of the companies we’re working with, they’re all at different stages, but probably the majority are pretty focused, and it probably is in that top three, top five for the company.

Shimel: Excellent. Samir, how about you?

Mehra: So, I look at it in a different way. It depends on what type of company you are. If you’re a born in the cloud company, you know about how cloud is supposed to be managed. I think cost is not the important thing, it’s the efficiency with which you want to execute things and how quickly you want to get up with your production infrastructure or your test dev infrastructure. And really, I think those companies are very apt at managing some of their costs and making sure their cost is optimized, because they only have been dealing with opex.

But, when it comes to enterprise companies, I feel some of them, even though they’ve migrated a bunch to the cloud, they still think of cloud as something which is capex. They generally overprovision when they don’t need to overprovision. You know, cost is secondary for them, too, but they have so much of extra spend in what you call cloud sprawl that they really need somebody to help them visualize it and give them ways to optimize that cost.

Shimel: Yep. So, you know, I always like to say that the costs are almost symptomatic of something else, though, right? Especially when we talk about things like cloud, guys. You know, the costs are relatively set. You can look up on a price chart and see, you know, what a gigabyte of storage is, how much CPU time costs or what have you.

I think when we talk about the cost of cloud and stuff like that, really, the underlying thing is a question of control. Do we really have insight into what we’re using and how we’re using it, right? And, because we don’t, it translates into a cost we may not understand or expect or even know what the heck’s going on. But really, the underlying issue is control and more management of what this is.

And both of your organizations kind of play a role in that. Why don’t you—and not to create a knock-down, drag out, but can you give me a contrast, Samir, let’s say, in what your company does versus what Andy and the folks over at ParkMyCloud do?

Mehra: So, I’ll start with the fact that, you know, from what we’ve seen our customers spend, IaaS and virtual machines or EC2 instances are really still the biggest spend. People haven’t really transitioned into a lot of PaaS services, though PaaS services definitely are picking up.

The way we do it is, you know, to help prevent cloud sprawl, we do have a policy engine where you can set up a policy which tells you if there’s an instance that is started that is more expensive than what you want an individual to set up. So, if they go spin up a really huge instance, you can get alerted, or you can actually set up a policy which turns that instance down, even before it gets to run. And the other thing is, you know, the way we prevent sprawl also is, we, again, use a policy engine and tell you, “You know what? There’s somebody who spun up an instance,” and they haven’t tagged those instances. So, there’s no way to know who’s actually spun that instance up, and if you do know who spun that up, you can actually now go and ask them, “Why did you spin it up, and what’s the purpose of this instance?” if it is an expensive instance that they’ve spun up.

So, we definitely rely on our policy engine a lot, on making sure we have these governance policies implemented to help prevent cloud sprawl from the beginning.
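The two governance checks Mehra describes (an instance that costs more than an individual is allowed to spin up, and an instance with no tag identifying who launched it) can be sketched in a few lines of Python. This is a minimal illustration only, not CloudHealth's policy engine or API: the `Instance` record, the `owner` tag name, and the `max_hourly_cost` threshold are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Hypothetical record for a running cloud instance."""
    instance_id: str
    hourly_cost: float                        # on-demand price in dollars per hour
    tags: dict = field(default_factory=dict)  # e.g. {"owner": "alice"}


def evaluate_policy(instance, max_hourly_cost=1.00, required_tag="owner"):
    """Return the list of policy violations for one instance.

    Both the cost ceiling and the required tag name are illustrative
    assumptions; a real policy engine would make them configurable.
    """
    violations = []
    if instance.hourly_cost > max_hourly_cost:
        violations.append("too_expensive")
    if required_tag not in instance.tags:
        violations.append("untagged")
    return violations


if __name__ == "__main__":
    fleet = [
        Instance("i-small", 0.10, {"owner": "alice"}),
        Instance("i-huge", 4.50, {"owner": "bob"}),
        Instance("i-mystery", 0.25),
    ]
    for inst in fleet:
        print(inst.instance_id, evaluate_policy(inst))
```

What happens on a violation (an alert, or shutting the instance down before it runs) is a separate decision that would sit on top of a check like this.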

Shimel: Excellent. Andy, your take on it?

Richman: Yeah, sure, so we do have a slightly different viewpoint on this. So, when that cloud bill drops in at the end of the month, typically about 70 percent of it is being spent on compute, just as Samir said. If you actually look within that compute, what you’ll quickly discover is, about half of it is for production systems—these are things that typically need to run 24/7—but the other half is your dev, test, QA, data analytics type environments. And there’s at least the potential to turn those instances off when they’re not being used, and we’re really focused on that half of the infrastructure.

So, what we do is, we give DevOps, IT Ops, you know, engineers an easier way to get those instances turned off. So, in a week, there’s 168 hours, so if you didn’t know, that’s about 730 hours in a month. And even if your team’s working 60 hours a week, which is a pretty hard working team, it still means you can save 65 percent by just turning things off when they’re not being used.
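Richman's arithmetic is easy to check. The short Python sketch below assumes an instance is billed only for the hours it is running and ignores storage and other charges that continue while it is stopped; the function name is illustrative.

```python
HOURS_PER_WEEK = 7 * 24          # 168 hours in a week
HOURS_PER_MONTH = 365 * 24 / 12  # 730 hours in an average month


def schedule_savings(hours_on_per_week):
    """Fraction of compute cost saved by running only the given hours per week."""
    if not 0 <= hours_on_per_week <= HOURS_PER_WEEK:
        raise ValueError("hours_on_per_week must be between 0 and 168")
    return 1 - hours_on_per_week / HOURS_PER_WEEK


if __name__ == "__main__":
    # A team working 60 hours a week leaves 108 of 168 hours idle.
    print(f"{schedule_savings(60):.0%} saved")
```

For a 60-hour week the saving works out to a little over 64 percent, in line with the roughly 65 percent quoted above.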

And so, what we’ve really focused on is the automation piece of this. So, to ensure that when things aren’t being used and there’s no value being added, they’re just simply turned off. And we have semi-similar kind of policy engines that automate this, but an ability, effectively, to create a simple schedule, have it applied across your various different teams, and throwing off some pretty significant savings pretty quickly.

Shimel: Got it. So, automation is something that, in DevOps, we understand very well, right? It’s key to what a lot of what DevOps is about. And so, the idea of turning down, shutting off, closing up unused instances, unused resources—I mean, just, it’s a no-brainer, you would think, right?

But why do you think more people or more organizations aren’t doing that kind of thing? Andy, why don’t you go first, if you’d like?

Richman: Oh, sure. I mean, some of them are doing it, and they’re writing their own scripts, but they tend not to do it in a particularly systematic manner. And when you actually talk to the people (I spend quite a lot of time talking to our customers), a lot of it is because, you know, they don’t want to take away control from the teams who are out there, you know, delivering code, delivering value to the company. And often, they just don’t have the insights in terms of how those various instances, VMs are being utilized. And it really requires buy-in from the teams themselves.

And so, what we’ve seen in the companies that are doing this well is that they create a lot of buy-in. The thing that people quickly realize is that oftentimes it’s not just about stripping costs out and spending them elsewhere in the enterprise, this is just about efficiency, and that money often will get recycled back into other R&D projects, you know, other things that people wanna be doing.

So, I think the more progressive organizations get it. I think what they’re able to do with a tool like ours is to do it in a scalable way. So, you’ve got multiple teams, lots of individuals, lots of different workloads. We give them a way to do that very, very quickly, simply. It adds a couple of minutes of work to their week and they can kinda schedule it in place and reap the benefits.

The other thing that’s awesome is visibility. This may be something that Samir can talk about a bit more. It’s, once you begin to give back the DevOps teams, you know, the actual visibility on the savings they’re getting, it becomes a kind of a, somewhat of a self-fulfilling prophecy or a kinda closed loop. They see the savings they’re getting, they do more of it; they see more savings, they do more of it—that kind of positive affirmation.

Mehra: Yeah, actually, we—sorry, go ahead, Alan.

Shimel: No, no. I was gonna say, Samir, what about you? [Laughter] I’m sorry.

Mehra: No, that’s fine. So, we’ve seen, we use our product, our DevOps engineers within CloudHealth also use CloudHealth to manage the AWS environment that we have for production for our SaaS environment. And our DevOps engineer is definitely not a financial person, but like Andy said, when he sees those cost savings, his eyes do light up. He’s like, “I can really save this much money if I actually right size some of these instances that are being underutilized.” And he does, he’ll do, over the weekends, or when we can, do a right size of some of our instances to drop our costs down.

So, we actually use our platform ourselves to do it, and our DevOps engineers are using it. And we do have plenty of customers doing similar things, and it’s not just right sizing a particular instance. You can also, you know, look at it from an automatic scaling group perspective. You know, you want to right size the automatic scaling group or the next automatic scaling group that you spin up. You want it to be right sized because you’re gonna have a cost savings associated with it.

Richman: One of the things, Alan, that’s interesting is that people would often quote you the cost of their infrastructure or say, “This is an instance that’s 25 cents an hour, or 50 cents an hour” or something like that. What they haven’t done is the mental math: well, okay, what is that gonna mean in a month? And so, a 50 cents an hour instance is actually $365.00 a month. And if you can save 50, 60 percent of that, it’s a significant amount of money, particularly for customers who have tens, hundreds, or even thousands of instances.

Mehra: Yeah, I completely agree. I think once you multiply the one instance which costs 50 cents into 100 and 200, that’s where you actually see the huge amount of cost savings that you can have.
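The "mental math" the two guests describe can be written out as a short Python sketch. It assumes a flat on-demand hourly rate and the 730-hour average month mentioned earlier; the fleet size and savings fraction in the example are illustrative inputs, not figures from either company.

```python
HOURS_PER_MONTH = 730  # average month: 365 days * 24 hours / 12 months


def monthly_cost(hourly_rate, instance_count=1):
    """On-demand cost in dollars of running instances around the clock for a month."""
    return hourly_rate * HOURS_PER_MONTH * instance_count


def monthly_savings(hourly_rate, instance_count, fraction_saved):
    """Dollars saved per month if the given fraction of running time is eliminated."""
    return monthly_cost(hourly_rate, instance_count) * fraction_saved


if __name__ == "__main__":
    print(f"One instance: ${monthly_cost(0.50):,.2f} per month")
    print(f"100 instances: ${monthly_cost(0.50, 100):,.2f} per month")
    print(f"Saving 60% on 100 instances: ${monthly_savings(0.50, 100, 0.60):,.2f}")
```

A single 50-cents-an-hour instance comes to $365 a month, so a fleet of 100 costs $36,500 a month, and a 60 percent reduction on that fleet is worth about $21,900 a month.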

Shimel: Excellent. Good. Guys, as I think I mentioned to you when we got on here, time goes really quick, and we only have time for another question or two here, and I want to focus in on a couple of things. Obviously, looking at both of your company services is a great way to get started along this, getting a handle on what you have in the cloud. But if you had to give, let’s say, a top three things that our listeners should do, beyond just signing up—signing up with your company is obvious. But top three things together between the two of you—what would they be?

Richman: Samir, do you want to go first, or…

Mehra: Yeah, I would say, make sure you have visibility into your spend so that you can make sure that you’re not spending where you don’t need to spend.

Shimel: Excellent.

Mehra: So, cost visibility, for sure.

Shimel: Andy, how about you?

Richman: Yeah, so for us, it’s really, you know, focus where the largest amount of potential savings are. And, based on our experience, that’s in your non-production workloads. So, focus on that and then start to, you know, talk to your QA team, talk to your development team, talk to people running analytics workloads, and really understand, you know, how many hours a week those instances truly need to be running. I mean, even if you’re conservative and you only turn things off in the really small hours of the night, the savings really quickly mount up. And then, really start to do it systematically, kinda team by team and rolling it across, rather than just more of an ad hoc kind of effort.

Mehra: Yeah, and I can add a third one along with that: you can also do it on your production infrastructure, and there’s two ways to really do it. One is, you know, right size your production infrastructure, and you can do that during a down time. But also look at reservations, because reservations are a huge cost saver compared to running on demand.

Richman: Yeah, absolutely. I mean, one of the things that we have recently added to the platform and we’re gonna be adding a lot more in the next couple of months is bringing in the utilization data. So, you know, looking at your CPU, your network, your disk usage and really using that to drive these changes.

We’ve recently introduced these kind of simple heat maps where people can come in, look at the platform, and see the hours when their instances are actually being used. Rather than just asking people when they’re using, you know, what’s being used when, you can actually see the actual usage and, post-Christmas, we’re gonna be moving into the right sizing space in a similar way using similar kinds of metrics.

So, there’s definitely a lot of tools out there that can help you do this in a more informed way.
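The heat-map idea Richman mentions, seeing which hours of the week an instance is actually busy, can be sketched from raw utilization samples. The Python below is a rough illustration under stated assumptions, not ParkMyCloud's implementation: the input format (timestamp and CPU percentage pairs) and the 5 percent idle threshold are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def usage_heatmap(samples):
    """Average CPU samples into a 7x24 grid indexed by [weekday][hour].

    `samples` is an iterable of (datetime, cpu_percent) pairs.
    Cells with no samples are left as None. Monday is weekday 0.
    """
    totals = defaultdict(lambda: [0.0, 0])  # (weekday, hour) -> [sum, count]
    for timestamp, cpu in samples:
        bucket = totals[(timestamp.weekday(), timestamp.hour)]
        bucket[0] += cpu
        bucket[1] += 1

    grid = [[None] * 24 for _ in range(7)]
    for (day, hour), (total, count) in totals.items():
        grid[day][hour] = total / count
    return grid


def idle_hours(grid, threshold=5.0):
    """List the (weekday, hour) cells whose average CPU is below the threshold."""
    return [
        (day, hour)
        for day in range(7)
        for hour in range(24)
        if grid[day][hour] is not None and grid[day][hour] < threshold
    ]


if __name__ == "__main__":
    samples = [
        (datetime(2024, 1, 1, 9, 0), 40.0),   # Monday 09:00
        (datetime(2024, 1, 1, 9, 30), 60.0),  # Monday 09:30
        (datetime(2024, 1, 1, 3, 0), 2.0),    # Monday 03:00
    ]
    grid = usage_heatmap(samples)
    print(idle_hours(grid))
```

Hours that show up as consistently idle are candidates for a shutdown schedule, and the same averaged metrics could feed a right-sizing decision.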

Shimel: Yep. Agreed. Well, guys, we’re just about out of time, here. Samir, Andy—thanks so much for being our guests on this episode of DevOps Chat. You know what I didn’t realize—Andy, for people who want to get more information on ParkMyCloud, is it ParkMyCloud.com?

Richman: Yep, that would be ParkMyCloud.com.

Shimel: Yep, and Samir, URL for your company?

Mehra: Yeah, it’s www.CloudHealthTech.com.

Shimel: So, that’s Cloud Health Deck, D-E-C-K.

Mehra: T-E-C-H.

Shimel: Okay. Got it. Alright, T-E-C-H. We’ll put it in the show notes as well, in case anyone is not clear.

Andy, Samir, thanks for being our guests. This is Alan Shimel for DevOps.com, and we’ll see you on another DevOps Chat very soon, everyone. Thanks for listening. Have a great day.

— Alan Shimel