DevOps on the cloud may seem simple enough, but in reality it can be very complex
The ultimate goal of a DevOps administrator is to make sure that everything is running properly and seamlessly—so much so that management is convinced that the DevOps administrator is playing solitaire all day. To that end, a successful DevOps team should have their systems automated such that the level of friction they impose on the organization is minimal.
Leveraging a cloud environment makes it look easy, almost too easy, as though all DevOps needs to do is press a button. IT can be deceived into thinking that DevOps on the cloud is trivial, when in reality it is not.
Running DevOps on the cloud is like renting an apartment in someone else’s building: Even though you don’t own the infrastructure, you’ll have to deal with any issues that may arise. As with a neighbor who’s too noisy, you may see different levels of performance from the same servers with the same capacity at different times of the day, because someone else is running on the same hardware.
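One rough way to check whether a noisy neighbor is at work is to run the same benchmark at intervals and compare the timings by hour of day. The sketch below assumes you already collect (hour, seconds) pairs from a benchmark of your own; the function names and the 1.5x threshold are illustrative choices, not a standard.

```python
from collections import defaultdict
from statistics import median


def median_by_hour(samples):
    """Group (hour_of_day, seconds) pairs and return the median per hour."""
    buckets = defaultdict(list)
    for hour, seconds in samples:
        buckets[hour].append(seconds)
    return {hour: median(values) for hour, values in sorted(buckets.items())}


def noisy_hours(samples, factor=1.5):
    """Return hours whose median is more than `factor` times the best hour.

    `factor` is an arbitrary threshold chosen for illustration.
    """
    medians = median_by_hour(samples)
    if not medians:
        return []
    baseline = min(medians.values())
    return [hour for hour, value in medians.items() if value > factor * baseline]


if __name__ == "__main__":
    # Hypothetical timings: the same job is slower in the afternoon.
    timings = [(3, 1.0), (3, 1.2), (14, 2.0), (14, 2.4)]
    print(median_by_hour(timings))
    print(noisy_hours(timings))
```

A consistent gap between hours on otherwise identical servers does not prove a shared-hardware problem, but it is the kind of evidence worth bringing to the provider.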
There are also security threats that don’t come into play when running on your own hardware. If the same physical machine that runs your most important computations is also being used by Mr. Nefarious, he can use all sorts of tricks and hacks to gain access to valuable data. The Meltdown and Spectre attacks are one example: They affected all cloud providers, caused severe fleetwide performance degradation, sometimes exceeding 10 percent, and required expensive mitigation steps.
Making Sense of Your Gold Mine of Data
The wealth of information provided to system administrators is awe-inspiring, terrifying and frustrating all at once. There is so much information available that, in theory, we should be able to troubleshoot and fix any and all issues. Unfortunately, making sense of the vast amount of information available and nailing down the critical pieces of information necessary can be extremely complex.
The flat network topology typically expected of an on-premises data center gives way to something drastically more complex on the cloud. Beyond just VPN and DMZ, there are multiple isolated networks, with containers running on VMs and on segmented networks, going through NAT with multiple levels of routing and dispatch at play. Trying to troubleshoot in such an environment can be an exercise in frustration, because you don’t own the entire network.
Certain tools and abilities you might take for granted in your own data center are not available, such as tcpdump, a low-level tool for monitoring network packets. For example, in a VPN packet loss scenario, your cloud VPN server is shared with other clients, so you won’t be able to capture packets to troubleshoot the connection issues.
The same applies to managed services, databases in particular: The debug endpoints and tools often used to troubleshoot in other environments are not available, to avoid exposing data from other users. For example, you cannot get the full logs from the server to figure out what it is doing. That can hide critical goings-on that are too important to ignore in production.
Evaluating Cloud Providers
Since it’s inevitable that problems will occur, one of the important considerations for selecting a cloud provider is the level and quality of support they offer. It’s important to go beyond simple SLAs and evaluate the actual quality and effectiveness of the support you can receive.
Take, for instance, a situation where a machine would freeze, effectively at random, for long periods of time. All monitoring and metrics told a simple story: The machine was up and had plenty of resources and capacity to spare, but every now and then it simply would not respond to any connection attempt whatsoever. Troubleshooting involved stripping the problem to its core and reproducing it in a minimal fashion, at which point one could contact the cloud provider’s support to try to understand what was going on. The actual problem turned out to be network port exhaustion in the NAT router, which caused a remote server to reset TCP connections, leading to long timeouts that froze the system. This particular problem took days to resolve and required a major change in the deployment topology.
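A minimal reproduction of this kind of problem can be as small as a loop that opens TCP connections and records how each attempt ends. The sketch below is one possible shape for such a probe, using only Python's standard library; the function names, the retry count and the timeouts are assumptions for illustration, and it is not the tooling used in the incident described above.

```python
import socket
import time
from dataclasses import dataclass


@dataclass
class ProbeResult:
    outcome: str    # "ok", "timeout", "refused" or "error"
    seconds: float  # wall-clock time spent on the attempt


def probe_once(host, port, timeout=3.0):
    """Attempt one TCP connection and classify what happened."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "ok"
    except socket.timeout:
        outcome = "timeout"
    except ConnectionRefusedError:
        outcome = "refused"
    except OSError:
        outcome = "error"
    return ProbeResult(outcome, time.monotonic() - start)


def probe_many(host, port, attempts=20, interval=0.5, timeout=3.0):
    """Run several probes in a row, pausing between attempts."""
    results = []
    for _ in range(attempts):
        results.append(probe_once(host, port, timeout))
        time.sleep(interval)
    return results


def summarize(results):
    """Count outcomes and report the slowest attempt in seconds."""
    counts = {}
    for result in results:
        counts[result.outcome] = counts.get(result.outcome, 0) + 1
    slowest = max((r.seconds for r in results), default=0.0)
    return counts, slowest


if __name__ == "__main__":
    # Hypothetical target; replace with the service that appears to freeze.
    counts, slowest = summarize(probe_many("127.0.0.1", 8080, attempts=5))
    print(counts, f"slowest attempt: {slowest:.2f}s")
```

A run that shows mostly successful connections with occasional timeouts, while the machine's own metrics look healthy, points away from the host and toward something in the network path, which is a far more useful starting point for a support ticket than "the machine freezes sometimes."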
Running DevOps on the cloud is not an easy task, and the cloud itself is not a silver bullet. Rather, it’s a way to democratize hardware and operation resources and work effectively at scale. If you can manage the new challenges, you can enjoy a new level of efficiency.