Enterprises rarely think about cloud portability before transferring large amounts of data to a cloud service provider such as Amazon AWS, Microsoft Azure or Google Cloud Platform. They are more likely to ask about adoption by other players in the industry ("the herd effect"), the reliability of the cloud service and the durability of the data, among other things. Enterprises are starting to realize that as they store increasing amounts of data in the cloud, it is easy to import data but difficult to get data out. While sophisticated users may be able to work around the problem, most run the risk of locking in their databases, the "crown jewels," with the cloud provider.
A common solution to the cloud portability issue is to develop an abstraction layer. Placing an abstraction layer between an application and the cloud data service minimizes the amount of code restructuring required should an enterprise wish to migrate to a new cloud service. One example is the Amazon S3 API: Google Cloud Storage offers an S3-compatible API to make it easy for enterprises to migrate their applications. The open-source community is also working to make migration seamless; for example, the Python Boto library supports both the Amazon S3 service and the Google Cloud Storage service under the same abstraction layer.
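To make the idea concrete, here is a minimal sketch using the classic Python Boto library, which exposes Amazon S3 ("s3") and Google Cloud Storage ("gs") behind the same storage_uri interface; the bucket, object and file names below are hypothetical.

```python
# Minimal sketch of the abstraction-layer idea with the classic boto library.
# Credentials are assumed to be configured (e.g. in ~/.boto); names are hypothetical.
import boto

def upload(provider, bucket, key, local_path):
    """Upload a local file to either 's3' or 'gs' with identical calling code."""
    uri = boto.storage_uri('%s/%s' % (bucket, key), provider)  # provider: 's3' or 'gs'
    uri.new_key().set_contents_from_filename(local_path)

# The application code does not change when the storage backend does:
upload('s3', 'my-app-bucket', 'reports/q1.csv', '/tmp/q1.csv')
upload('gs', 'my-app-bucket', 'reports/q1.csv', '/tmp/q1.csv')
```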
However, while an abstraction layer solves the application migration issue, what about the data itself? How easy would it be to migrate swathes of data from one cloud service provider to another? Let us do some simple mathematics. Assume you would like to migrate 10TB of bucket data from Amazon S3 to Google Cloud Storage. How long do you think it will take? The answer may surprise you: by reasonable estimates, almost a month (a quick back-of-the-envelope calculation follows). The sheer operational cost of managing a data transfer over a month would dwarf the storage and network bandwidth charges accrued in both clouds. It might be cheaper to copy the contents of the Amazon S3 bucket onto a local hard drive, ship the drive to Google and ask the company to upload the drive contents into a Google Cloud Storage bucket.
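The estimate below shows where the month-long figure comes from; the 30 Mbps sustained cross-provider throughput is an assumption for illustration, not a measured number.

```python
# Rough transfer-time estimate for moving 10 TB between cloud providers.
# The sustained WAN throughput is an assumed figure used only for illustration.
DATA_TB = 10
THROUGHPUT_MBPS = 30  # assumed sustained throughput between providers

data_bits = DATA_TB * 8 * 10**12            # 10 TB expressed in bits
seconds = data_bits / float(THROUGHPUT_MBPS * 10**6)
print('%.1f days' % (seconds / 86400))      # roughly 31 days at 30 Mbps
```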
How would someone solve the multi-cloud data lock-in problem? Would you:
- Run copies of applications on multiple clouds and keep data in sync?
- Maintain one cloud as the primary and other clouds as passive secondaries?
- Trust one cloud for Tier 0 applications and another cloud for Tier 2 or Tier 3 applications?
One of the key issues is data transfer time. We use intelligent change capture techniques to minimize the amount of data sent over the WAN links that connect multiple cloud providers. For example, take a version of an Apache Cassandra database running in AWS, store that version in Google Cloud Storage and restore it to an Apache Cassandra database running in Microsoft Azure. This is the future of data protection: not only does it protect against user errors and logical corruption, it also gives you protection across clouds and insurance against lock-in to a single cloud service provider.
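As an illustration of the change-capture idea only (not a description of any particular product's internals): Cassandra's SSTable files are immutable, so a version can be shipped across clouds by copying only the files the target bucket does not already hold. The snapshot path, bucket name and boto-based upload below are assumptions for the sketch.

```python
# Hedged sketch: ship a Cassandra snapshot to Google Cloud Storage, sending only
# the SSTable files not already present in the bucket (SSTables are immutable,
# so unchanged files never need to be re-sent). Paths and names are hypothetical.
import os
import boto

SNAPSHOT_DIR = '/var/lib/cassandra/data/ks/tbl/snapshots/v42'  # hypothetical path
bucket = boto.storage_uri('gs://dr-versions', 'gs').get_bucket()
already_stored = set(k.name for k in bucket.list(prefix='cassandra/v42/'))

for fname in os.listdir(SNAPSHOT_DIR):
    remote_name = 'cassandra/v42/' + fname
    if remote_name not in already_stored:          # change capture: new files only
        key = bucket.new_key(remote_name)
        key.set_contents_from_filename(os.path.join(SNAPSHOT_DIR, fname))
```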
Application Recovery Management: The Rise of Polyglot Persistence
Enterprises are using a variety of data stores to meet the needs of a diverse set of applications and their access patterns. Relational data stores might be the right choice for multi-faceted normalized data, but data stores today have to cater to widely varying access patterns; for example, a relational database cannot keep up with the high volume of a social media stream, and, likewise, a NoSQL data store cannot perform joins efficiently on a normalized data set. This pattern is referred to as "polyglot" persistence, wherein composite applications are spread across multiple data stores for different types of data and methods of data manipulation. In earlier days, data architects chose a relational data store such as Oracle and then mapped the application to it. Today, the landscape has changed significantly: data architects first classify the data types and expected manipulation methods and then choose the appropriate data store to fit those needs.
Consider an example in which an enterprise needs to identify all the customers from the past year, their purchase characteristics, the customer acquisition method, the social media comments on the post-purchase experience, as well as any support interactions. Remember that in today's information age, the processing has to be in real time: it would be very imprudent for a business to react to a negative social media reference months after it happened. As one can guess, multiple sources of data with varying degrees of structure need to be collected and analyzed to meet even these basic requirements. This type of problem cannot be solved easily or cost-effectively with a single database technology. Even though some of the basic information is transactional and probably in a relational data store, the rest is non-relational and will require several different types of persistence engines: document stores for static images and text collateral, spatial stores for geo-locating mobile customer interactions, and graph stores for deriving relations between the multi-faceted data described above.
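A minimal, self-contained sketch of the idea is shown below; sqlite3 and plain Python dicts stand in for the relational, document and graph engines a real deployment would use, and all names are illustrative.

```python
# Self-contained polyglot-persistence sketch: route each kind of data to a
# purpose-fit store. sqlite3 and dicts are stand-ins for dedicated engines.
import sqlite3

relational = sqlite3.connect(':memory:')   # transactional purchase records
relational.execute('CREATE TABLE purchases (customer TEXT, item TEXT, amount REAL)')
documents = {}                             # free-form post-purchase comments
graph = {}                                 # customer-to-customer relationships

def record_purchase(customer, item, amount):
    relational.execute('INSERT INTO purchases VALUES (?, ?, ?)', (customer, item, amount))

def record_comment(customer, text):
    documents.setdefault(customer, []).append(text)

def relate(customer_a, customer_b):
    graph.setdefault(customer_a, set()).add(customer_b)

record_purchase('alice', 'laptop', 999.0)
record_comment('alice', 'Support resolved my shipping issue quickly.')
relate('alice', 'bob')   # e.g. a referred-by relationship
```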
But has data protection kept up? Remember that the data is highly correlated, with complex interdependencies between the data stores. A logical error in one data store is likely to propagate to others, so recovery has to span the entire spectrum of polyglot persistence. Today's data protection is highly siloed, with recovery limited to a single type of data store. This is clearly not enough. What one has to do is synchronize versioning across multiple data stores according to the needs of the application: an application-consistent polyglot version. The key advantage of such a version is that it allows us to recover an entire application rather than just parts of it.
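A hedged sketch of what an application-consistent polyglot version might look like follows: every participating store is snapshotted under the same version id so that recovery can restore them together. The snapshot call is a placeholder for whatever mechanism each store actually provides (nodetool snapshot, pg_dump, mongodump and so on).

```python
# Illustrative only: tag a snapshot of every store with one version id so the
# whole application can be recovered as a unit. snapshot_store is a placeholder.
import time

def snapshot_store(store_name, version_id):
    # Placeholder: invoke the store-specific snapshot mechanism here.
    print('snapshot %s as %s' % (store_name, version_id))

def take_polyglot_version(stores):
    version_id = 'v%d' % int(time.time())
    for store in stores:              # in practice, quiesce writes first
        snapshot_store(store, version_id)
    return version_id                 # recovering this id restores the whole app

take_polyglot_version(['cassandra', 'postgres', 'mongodb'])
```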
Conclusion
It is time to rethink and reinvent data protection. The world of enterprise IT is experiencing massive change as the cloud becomes the de facto infrastructure of choice and multi-faceted big data becomes the hallmark of the next generation of applications. Data protection must adapt to these changing environments and deliver the next generation of services to the enterprise.