
Cloud data portability: Obstacles to it and how to achieve it

Cloud providers are keen to get your data into their systems, but it can be difficult or costly to move it elsewhere. We look at solutions to cloud data portability

The public cloud promises significant benefits in moving from a capital-based and technology-focused IT strategy to one where costs are based on operational expense (opex) and the headache of managing infrastructure is offloaded to the service provider.

With the ubiquitous move to virtual machines – typically called “instances” in the public cloud – and the impending adoption of container technology, applications have never been more mobile.

However, what good is an application without its data? To exploit cloud fully, we need data portability between clouds, whether between on-premise and public cloud or between public providers.

In this article, we take a look at some of the issues around data portability and how suppliers are helping to mitigate the problem of data mobility.

Moving applications to the public cloud offers IT organisations the ability to offload a major part of their day-to-day responsibilities for the management and deployment of infrastructure. The cloud provider carries out these tasks and provides IT services that hide the complexities of managing infrastructure from the customer.

IT organisations can now focus on applications, data and business value. The benefits of this model are realised as reduced costs, with services bought as they are needed rather than infrastructure bought up front.

Of course, applications need data, and that data needs to be portable for a number of reasons, including: 

  • Resiliency – keeping data mobile between public and private clouds means it remains accessible even when a single cloud provider experiences technical problems.
  • Features – the features offered by cloud suppliers vary significantly. Google is well known for its suite of analytics tools. Other cloud providers offer unique features or might be attractive for compliance or regulatory reasons. Customers might not want to commit to a single provider just for their data offerings.
  • Cost – public cloud storage providers are pretty close to each other on cost, but with very large volumes of data, savings can be made by archiving to cheaper tiers of storage from another provider.
  • Exploiting cheap compute – cloud suppliers all offer spot instances: compute offered at a discount, with prices that fluctuate over time as spare capacity and demand change. Data needs to be portable enough to be quickly available to use this cheap compute when it becomes available (see the sketch after this list).
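As a rough illustration of that last point, the sketch below uses boto3 (assumed to be installed, along with AWS credentials) to compare recent spot prices across a couple of regions. The region list and instance type are placeholders, and the final staging step is an assumption about how a pipeline might react, not a recommended workflow.

```python
# Minimal sketch: find the region with the cheapest recent spot price, on the
# assumption that the dataset would then be staged locally to that region.
import boto3

REGIONS = ["eu-west-1", "us-east-1"]   # candidate regions, illustrative only
INSTANCE_TYPE = "m5.xlarge"            # illustrative instance type

def cheapest_spot_region():
    """Return (region, price) for the lowest recent spot price seen."""
    best = None
    for region in REGIONS:
        ec2 = boto3.client("ec2", region_name=region)
        history = ec2.describe_spot_price_history(
            InstanceTypes=[INSTANCE_TYPE],
            ProductDescriptions=["Linux/UNIX"],
        )["SpotPriceHistory"]
        if not history:
            continue
        price = min(float(item["SpotPrice"]) for item in history)
        if best is None or price < best[1]:
            best = (region, price)
    return best

region, price = cheapest_spot_region()
print(f"Cheapest recent spot price for {INSTANCE_TYPE}: ${price:.4f}/hr in {region}")
# A real workflow would stage (or pre-replicate) the dataset into storage local
# to that region before launching spot instances there.
```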

Cloud’s technical, operational and cost issues

So, data mobility is a good thing, but there are a number of issues that make ubiquitous data access a problem.

The first is that of simple physics. The speed of light imposes latency on input/output (I/O) operations, so we always want to make data as local as possible to the application. For cloud, this means having the data in the same cloud as the application.

If the application moves, the data must move. Moving data between clouds takes time and is restricted by throughput and bandwidth limitations. There’s also a cost factor.

Operationally, there are other issues to consider. Cloud storage providers use their own standards for data access and storage. There’s no direct compatibility between platforms, or even between cloud and on-premise, except where suppliers have chosen to support de facto standards such as Amazon’s S3 application programming interface (API).
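Where the S3 API is supported, the practical effect is that the same client code can talk to more than one platform simply by changing the endpoint it points at. The sketch below uses boto3; the second endpoint URL, the credentials and the bucket name are hypothetical placeholders rather than real services.

```python
# Minimal sketch of S3 API compatibility: the same boto3 client code can target
# AWS S3 or any S3-compatible store by swapping the endpoint URL.
import boto3

def make_s3_client(endpoint_url=None):
    # With endpoint_url=None, boto3 talks to AWS S3 itself; pointing it at
    # another provider's S3-compatible endpoint reuses the same code path.
    return boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id="EXAMPLE_KEY",        # placeholder credentials
        aws_secret_access_key="EXAMPLE_SECRET",
    )

aws_s3 = make_s3_client()
other_provider = make_s3_client("https://s3.example-provider.com")  # hypothetical endpoint

# The upload call is identical regardless of which provider sits behind the client.
for client in (aws_s3, other_provider):
    client.put_object(Bucket="my-dataset", Key="data/file.csv", Body=b"example,data\n")
```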

Cloud providers have no incentive to make it easy to take your data out of their platform, so we see pricing models that make it cheap to get data in, but levy egress charges to take data out – for example, to switch to another provider.
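To give a feel for what those egress charges mean in practice, here is a back-of-the-envelope Python calculation. The per-gigabyte rate is purely illustrative, not any provider’s actual tariff; real pricing is usually tiered by volume, region and destination.

```python
# Illustrative egress cost calculation. The flat per-GB rate is an assumption
# for the example only, not a real provider's price list.
ILLUSTRATIVE_EGRESS_RATE_PER_GB = 0.09   # USD, assumed flat rate

def egress_cost(terabytes, rate_per_gb=ILLUSTRATIVE_EGRESS_RATE_PER_GB):
    """Cost of pulling a dataset out of a cloud provider at a flat per-GB rate."""
    gigabytes = terabytes * 1024
    return gigabytes * rate_per_gb

for tb in (1, 10, 100):
    print(f"{tb:>4} TB out of the cloud = ${egress_cost(tb):,.0f}")
# At this rate, 100 TB works out at roughly $9,200 - a real cost factor in any
# decision to switch providers.
```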


Having said that, some cloud providers are starting to address egress charges, either reducing them for a small amount of monthly traffic or making them fixed rather than variable costs.

Even where data can be moved efficiently between cloud providers, the issue of data consistency has to be considered. If an application dataset takes two hours to copy to the cloud provider, that data may be inaccessible during that time, simply to ensure the data reaches the cloud provider in the same state it left. This means “cloudbursting” processes might not be as efficient as the industry would have us believe.
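One common way to get that assurance, at the cost of the access window described above, is to freeze the dataset, record a digest before the transfer and verify the copy once it lands. A minimal sketch, assuming the dataset is quiesced for the duration and using placeholder file paths:

```python
# Minimal consistency check: hash the dataset before the copy, hash the copy
# after it arrives, and refuse to proceed if they differ. A real pipeline would
# snapshot the volume rather than keep the application offline for the window.
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

before = sha256_of("/data/app_dataset.db")            # taken before the copy starts
# ... the multi-hour transfer to the cloud provider happens here ...
after = sha256_of("/mnt/cloud_copy/app_dataset.db")   # re-read from the destination

if before != after:
    raise RuntimeError("Dataset changed or was corrupted in flight - copy is not consistent")
```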

Part of the problem here is with the techniques used to get data into or between public clouds. With on-premise storage, replication is available to get data between geographically distant hardware, mainly for resiliency and disaster recovery purposes.

One benefit of replication is that data at the remote location is in sync, or very closely in sync, with the source, so moving operations to the disaster recovery site can be done with very little downtime.

Native cloud doesn’t provide the same level of flexibility as traditional storage – there’s no equivalent of array-based replication – which means other solutions at the virtual machine or application level are needed.

It’s also worth pointing out that array-based replication is a very static process, with replication source and targets fixed well in advance. Cloud, by contrast, is a much more dynamic infrastructure and customers may not know in advance where they want to run their application, especially with spot pricing.

Resolving cloud data mobility

What solutions exist to solve the data portability problem? Suppliers use a number of techniques:

Caching – caching solutions extend a view of the data from wherever it permanently resides to wherever the compute/application is running. The actual caching techniques used include network-attached storage (NAS) caching from companies such as Avere and Primary Data, VM caching from Velostrata, and storage gateways from cloud providers such as Amazon Web Services (AWS) and StorSimple/Microsoft. Caches do a great job of getting data local to the application on a temporary basis, but don’t directly allow data to be permanently moved to the cloud.

Global scale-out data – there are two types of solution here. There are global NAS offerings that provide a single geographically distributed file system available from each location in which applications run. These are available from a number of companies, including Panzura, Nasuni and Elastifile. There are also distributed storage platforms that provide object-based and block-based solutions across different geographies. Providers include Hedvig, Datera and NooBaa. With global file systems, some degree of data integrity is provided through global file locking. Object solutions tend to be eventually consistent, whereas block-based solutions usually offer no data integrity protection and leave that job to the application.

Data protection – a number of solutions are based on use of data protection techniques to get data into and out of the cloud. These include replication from Zerto and time-based replication from Reduxio. There are also software-based solutions that can integrate with customer hardware infrastructure, such as HPE StoreVirtual and NetApp Cloud OnTap.

Private storage – at least three companies have started to offer private storage that sits close to the cloud provider in a co-located datacentre or even in the same physical building. Solutions are available from Zadara Storage, NetApp and Nimble Storage (now part of HPE). These products offer traditional features, including replication between hardware that can offer a degree of data portability, without the customer having to pay to deploy the hardware infrastructure.

Using application-based portability

The supplier offerings discussed here tend to fall into the category of infrastructure, but we should remember that application-based data portability can also be used.

NoSQL platforms such as MongoDB offer data mirroring between cloud instances, and replication can also be managed with traditional databases. Google recently introduced Spanner, a globally distributed SQL database that can be accessed from any Spanner-enabled Google Cloud Platform region, allowing customers to build distributed applications without having to think about the issues of structured data portability.
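As a rough illustration of the MongoDB approach, the sketch below uses pymongo to initiate a replica set whose members sit in different clouds, so each location holds a copy of the data. The hostnames are hypothetical, and a production deployment would also need authentication, TLS and carefully chosen write concerns.

```python
# Sketch of application-level portability: a MongoDB replica set with members
# in two public clouds and on-premise, keeping a copy of the data in each.
from pymongo import MongoClient

# Connect directly to the member that will seed the replica set.
client = MongoClient("mongodb://node-aws.example.com:27017", directConnection=True)

config = {
    "_id": "portable-rs",
    "members": [
        {"_id": 0, "host": "node-aws.example.com:27017"},      # instance in one cloud
        {"_id": 1, "host": "node-gcp.example.com:27017"},      # instance in another cloud
        {"_id": 2, "host": "node-onprem.example.com:27017"},   # on-premise member
    ],
}

# Initiate replication; MongoDB then keeps the members in sync, so the
# application can fail over to, or read from, whichever location is closest.
client.admin.command("replSetInitiate", config)
```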

What these solutions show is that, today, we have a patchwork of products that address specific application requirements or issues.

We haven’t yet seen the emergence of a single “data plane” to address many of the issues discussed in this article. Fierce competition means this kind of solution won’t come from the cloud provider, so users will have to continue to be responsible for managing their own data assets.
