Tips for Migrating and Managing Data Workloads in the Cloud

Unless you are an alien life form who just beamed down from space, you’ve heard all about the why of moving data workloads to the cloud. The cloud is a more scalable, more flexible, more available and possibly less expensive solution for storing and processing data.

What is not always obvious is the how of moving data workloads to the cloud. In several key respects, cloud-based data workloads are fundamentally different from those hosted on-premises.

Understanding those differences is critical for moving data workloads to the cloud effectively. While it’s true that the cloud offers a variety of benefits for hosting data workloads, actually realizing those benefits requires a plan for managing data effectively once it is in the cloud.

Let’s take a look at what that entails.

The Special Challenges of Cloud-Based Data Workloads

Let’s start by identifying the main ways in which cloud-based data workloads differ from those hosted on-premises:

  • Migration: The ways in which you migrate data to the cloud (and migrate it from one location to another within the cloud) are different from those you use on-premises. Not only do the tools typically differ (in the cloud, you’ll usually need to use your cloud vendor’s tools and interfaces for uploading and moving cloud-based data), but the speed and process are usually different, too (in the cloud, you’ll typically have to transfer data over the network, which is slower than some on-premises data transfer techniques). In extreme cases, migrating data over the network into the cloud may simply be unfeasible, which is why AWS Snowmobile is a thing.
  • Availability: In some respects, cloud-based data workloads are more available; in general, the cloud is less likely to fail or become unresponsive than your own data center. That said, the considerations for data availability in the cloud differ from those on-premises. For example, DDoS attacks are typically a greater threat to data availability in the cloud. As a result, the strategies (discussed below) for maximizing cloud data availability are also different.
  • Performance: The factors that impact data workload performance in the cloud differ in some respects from those on-premises. In the cloud, it’s not the number of CPU cores or the amount of memory in your servers that matters, since these resources can easily be increased. It’s the way you build your data workload architecture and how effectively you avoid bottlenecks that shapes the performance of your workloads.
  • Cost: In an on-premises data processing environment, the lion’s share of your costs are the upfront capital expenditures required to build your infrastructure. In the cloud, your costs are monthly bills based primarily on resource consumption and the types of services you use. This means cost structures and the strategies for optimizing costs are very different for cloud-based data workloads.
  • Security: Cloud security is a discipline unto itself. Depending on which types of services you use in the cloud and how much control you have over the physical infrastructure (typically, you have minimal control), your security strategy for cloud workloads may look very different from the one you use to secure on-premises infrastructure.
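
The migration point above is easy to underestimate. A back-of-the-envelope calculation shows why bulk transfer over the network can be unfeasible (the figures below, including the 80% link-efficiency assumption, are illustrative, not a benchmark):

```python
def transfer_days(data_tb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Rough estimate of how many days a bulk data transfer takes.

    efficiency is an assumed factor for protocol overhead and link contention.
    """
    bits = data_tb * 1e12 * 8                    # terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86400                       # seconds -> days

# Moving 500 TB over a dedicated 1 Gbps link:
print(round(transfer_days(500, 1.0), 1))  # -> 57.9 (days)
```

At petabyte scale the same arithmetic stretches into years, which is exactly the niche physical-transfer services like AWS Snowmobile exist to fill.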

Working Effectively With Cloud-Based Data Workloads

How do you manage the special challenges that cloud-based data workloads impose? Following are some tips.

Perform a Migration Assessment

As noted above, data migration into the cloud is a common pain point. Given this, and the fact that your migration strategy lays the foundation for the ongoing success of your data workload, it’s important to perform a migration assessment before beginning your migration.

A migration assessment involves determining not only how to get your data into the cloud, but also how you will adapt your data architectures and strategy to fit the cloud. Will you simply lift and shift data into the cloud, keeping the same general architecture in place? Or will you modernize your data workloads by taking advantage of new technologies you didn’t use in your on-premises environment?

The major cloud vendors offer tools to help answer these questions, such as AWS Migration Hub and Azure Migrate.

Use the Network Effectively

The network can be the biggest bottleneck in a data pipeline. Given the centrality of the network to the cloud, it’s important to ensure you use it efficiently.

One way to do that (especially during the data migration stage) is to take a provisioning approach, which involves moving data gradually into the cloud and provisioning cloud infrastructure to fit each specific data workload in a cost-efficient way. This can be more cost-effective than lifting and shifting all of your data into the cloud at once, which can leave you with hefty data migration fees.

Replicate Cloud Data Intelligently

One of the best ways to maximize data availability in the cloud is to replicate it across multiple cloud data centers and/or regions. Doing so maximizes the chances of keeping your data available in the event that one cloud data center fails, which is rare but does happen.

Of course, the more data replication you use in the cloud, the more you will pay, so you need to balance your replication strategy with your budget.
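
On AWS, for example, cross-region replication for S3 data can be configured through a replication rule. The sketch below builds such a configuration; the bucket names and IAM role ARN are hypothetical placeholders, and the actual API call is left commented out since it requires versioned buckets and credentials:

```python
import json

# Hypothetical names -- substitute your own bucket names and IAM role ARN.
SOURCE_BUCKET = "my-data-bucket"
REPLICATION_ROLE = "arn:aws:iam::123456789012:role/s3-replication"

replication_config = {
    "Role": REPLICATION_ROLE,
    "Rules": [
        {
            "ID": "replicate-all-to-eu",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {},  # empty filter -> replicate every object
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {
                "Bucket": "arn:aws:s3:::my-data-bucket-replica-eu",
                # a cheaper storage class for the replica keeps replication costs down
                "StorageClass": "STANDARD_IA",
            },
        }
    ],
}

# Applying it (requires versioning enabled on both buckets):
# import boto3
# boto3.client("s3").put_bucket_replication(
#     Bucket=SOURCE_BUCKET, ReplicationConfiguration=replication_config
# )
print(json.dumps(replication_config, indent=2))
```

Note the `StorageClass` on the destination: replicating into a lower-cost tier is one way to keep the availability benefit while softening the cost trade-off.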

Use Tiered Storage Effectively

Most cloud vendors offer different tiers of storage. The default tier is designed for data that needs to be accessed on a frequent basis, but you can save money by choosing tiers designed for infrequently accessed data or archival storage. The price you pay for using lower-cost storage tiers is slower data access time, but in cases where you don’t need to access data quickly or frequently, low-cost tiers can do a lot for your budget.
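
In S3 terms, tiering is typically automated with a lifecycle configuration that transitions objects to cheaper classes as they age. A minimal sketch, with hypothetical prefix and retention periods:

```python
# Hypothetical lifecycle rule: tier down objects under the "logs/" prefix
# as they age, then expire them after roughly seven years.
lifecycle_config = {
    "Rules": [
        {
            "ID": "tier-down-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # infrequent access after a month
                {"Days": 365, "StorageClass": "GLACIER"},     # archive after a year
            ],
            "Expiration": {"Days": 2555},
        }
    ],
}

# Applying it:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-data-bucket", LifecycleConfiguration=lifecycle_config
# )
```

The trade-off is baked into the transitions: each step down costs less per gigabyte but makes retrieval slower and, for archival tiers, adds retrieval fees.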

Access Control, Access Control, Access Control

In the cloud, you often have little control over the underlying infrastructure or software environments. You can’t harden the host operating system for your cloud servers, and the metrics you can feed from a cloud environment into your SIEM are often limited.

What you can do to help secure most cloud environments, however, is to set strict access controls. Configuring IAM policies may not be most engineers’ idea of a good time, but it’s critical to perform this tedious work to help secure cloud-based data workloads.
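
The guiding principle is least privilege: grant each role only the actions and resources it actually needs. As an illustration (the bucket name, prefix and statement ID are hypothetical), an AWS IAM policy for a role that may only read one prefix of one bucket might look like this:

```python
import json

# Hypothetical least-privilege policy: the analytics role can list one bucket
# and read objects under its "analytics/" prefix, and nothing else.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadAnalyticsDataOnly",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my-data-bucket",             # ListBucket applies to the bucket
                "arn:aws:s3:::my-data-bucket/analytics/*", # GetObject applies to the objects
            ],
        }
    ],
}
print(json.dumps(policy, indent=2))
```

Writing dozens of statements like this is tedious, but an explicit, narrow allow-list is far easier to audit than a broad grant you intend to tighten later.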

Backup

When you move data workloads to the cloud, it can be tempting to assume you don’t need to back them up, because the cloud never disappears.

It’s true that the cloud is very resilient, but that doesn’t eliminate the need for data backups. There is always a chance, however small, that cloud-based data could be lost permanently. You also face the more common risk of having your cloud-based data become corrupted or infected with malware, in which case it’s useful to have clean backup copies from which you can restore.

So, back up your cloud-based data, too. As a best practice, it’s wise to follow a 3-2-1 backup strategy: keep at least three copies of your data, on at least two different types of storage, with at least one copy held offsite (or outside your primary cloud provider).
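
The 3-2-1 rule (three copies, two media or storage types, one offsite) is simple enough to encode as a sanity check over a backup inventory. A sketch with hypothetical inventory entries:

```python
def satisfies_3_2_1(copies):
    """Check a backup inventory against the 3-2-1 rule:
    >= 3 copies, >= 2 distinct media/storage types, >= 1 copy offsite."""
    media_types = {c["media"] for c in copies}
    has_offsite = any(c["offsite"] for c in copies)
    return len(copies) >= 3 and len(media_types) >= 2 and has_offsite

# Hypothetical inventory for one dataset:
inventory = [
    {"media": "s3-standard", "offsite": False},              # primary copy
    {"media": "s3-glacier", "offsite": False},               # archival copy, different tier
    {"media": "other-cloud-object-store", "offsite": True},  # copy outside the primary provider
]
print(satisfies_3_2_1(inventory))  # -> True
```

Treating "a different cloud provider" as the offsite copy is one common interpretation of the rule for cloud-native workloads.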

Conclusion

The cloud is a fundamentally different beast than your on-premises infrastructure. Making the most of cloud-based data requires you to rethink your strategies for data migration, availability, security and more in certain respects. You need not discard everything you know about on-premises data management, but you need to adjust your operations to address the special challenges of the cloud.

This sponsored article was written on behalf of Unravel.

Chris Tozzi

Christopher Tozzi has covered technology and business news for nearly a decade, specializing in open source, containers, big data, networking and security. He is currently Senior Editor and DevOps Analyst with Fixate.io and Sweetcode.io.
