Data protection for containers: Why and how to do Docker backup

Containers such as those from Docker are agile, lightweight and can be short-lived, but they and their data often need to be protected. We look at the key options available

Containers are a great way to run applications, with much less overhead than traditional bare metal or virtualised environments. But what about data protection? Do containers need backup and data protection? The answer is yes – and no. In this article, we will look at the possible ways we can back up containers and their data, as well as products available that can help.

Containers have been around for many years, but the use of container technology has been popularised in the last five years by Docker.

The Docker platform provides a framework to create, configure and launch applications in a much simpler way than using the native features of the Linux and Windows operating systems on which they run.

An application is a set of binary files that run on top of an operating system. The application makes calls via the operating system to read and write data to persistent storage or to respond to requests from across the network. Over the past 15 years, the typical method of application deployment has been to run applications within a virtual machine (VM).

VMs take effort to build and manage. They need patching and have to be upgraded. Virtual machines can attract licensing charges, such as operating system licences and application licences per VM, so have to be managed efficiently.

Containers provide a much more lightweight way to run applications. Rather than dedicate an entire VM for each application, containers allow multiple applications to run on the same operating system instance, and these are isolated from each other by segregating the set of processes that make up each application.

Containers were designed to run microservices, be short-lived and not require persistent storage. Data resiliency was meant to be handled by the application, but in practice, this has proved impractical. As a result, containers can now be easily launched with persistent storage volumes or made to work with other forms of shared storage. 

Container data protection

A container is started from a container image that contains the binary files needed to run the application. At launch time, parameters can be passed to the container to configure components such as databases or network ports. This includes attaching persistent data volumes to the container or mapping file shares.
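
As an illustration, the sketch below launches a database container with a launch-time configuration parameter, a published network port and a persistent named volume attached. The image, names and password are placeholders rather than anything specific to this article.

    # Launch a container, passing configuration at start time and
    # attaching a persistent named volume (created if it doesn't exist)
    docker run -d \
      --name appdb \
      -e POSTGRES_PASSWORD=example \
      -p 5432:5432 \
      -v appdb-data:/var/lib/postgresql/data \
      postgres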

In the world of virtual machines, both the VM and its data are backed up. Backing up a virtual machine is largely a matter of convenience: if the VM is corrupted or individual files are deleted, they can be recovered.

Alternatively, the whole VM and its data can be brought back quickly. In practice, though, with a well-configured system it may be quicker to rebuild the VM from a gold master and configure it using automation or scripts.

With containers, rebuilding the application from code is even quicker, making it unnecessary to back up the container itself. In fact, because of the way containers are started by platforms such as Docker, the effort to recover a container backup would probably be much greater than simply starting a new container from the image. The platform simply isn’t designed to recover pre-existing containers.

So, while a running container instance doesn’t need to be backed up, the base image and configuration data do. Without these, the application can’t be restarted.
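
One rough sketch of how to meet this requirement: export the image to a file that conventional backup tools can pick up, or push it to a registry that is itself protected. The image name and registry host below are illustrative.

    # Export the application image for conventional backup tools
    docker image save -o /backup/myapp-1.2.tar myapp:1.2

    # Or push it to a registry that is itself protected
    docker push registry.example.com/myapp:1.2

Runtime configuration, such as compose files and environment files, is best kept in version control rather than only on a single host.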

Equally, this applies to implementing a disaster recovery strategy. Restarting an application elsewhere (eg, in the public cloud or another datacentre) also needs access to the container image and runtime configuration. These components need to be highly available and replicated or accessible across locations. 

Application data

Containers provide multiple ways to store application states or data.

At the simplest level – using Docker as an example – data can be written to the root file system (not a good idea) or stored in a Docker volume on the host running the container. It’s possible that this host could also be a virtual machine.

A Docker volume is a directory on the root file system of the host that runs the container. It’s possible to back up this data and restore it into a running container, but this isn’t a practical solution, nor is it easy to manage when containers can run on many hosts.

It would be very hard to keep track of where a container was running at any one time, and therefore which server to use for recovery. Backup software isn’t aware of the container itself; it sees just a set of directories.
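
For completeness, a common workaround is to archive a volume’s contents by mounting it, read-only, into a throwaway container alongside a host directory, as sketched below; the volume name and paths are illustrative, and the same approach in reverse (extracting the archive back into the volume) handles restore.

    # Archive the contents of a named volume to a host directory
    docker run --rm \
      -v appdb-data:/source:ro \
      -v /backup:/backup \
      alpine tar czf /backup/appdb-data.tar.gz -C /source .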

Other alternatives

One alternative is to use a file system layout on the host that is structured to match the application. Rather than having directories named using random GUIDs, directory names can match application components.

So, when a container is started, a directory is mapped into it with a name that is consistent across container restarts and can easily be identified by traditional backup software.
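
A minimal sketch of this approach, with illustrative names and paths: bind-mount a predictably named host directory rather than relying on a GUID-named volume, so backup software can simply target /srv/appdata on every host.

    # Map a consistently named host directory into the container
    docker run -d \
      --name orders-db \
      -v /srv/appdata/orders-db:/var/lib/postgresql/data \
      postgres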

This still doesn’t provide full recovery or disaster recovery in the event of a server loss.

To address this, Docker and Kubernetes provide the capability to connect external storage to a container. The storage is provided by a shared array or software-defined storage solution that exists independently of any single host.

External storage provides two benefits:

  • Data protection can leverage the capabilities of an external array, such as snapshots or remote replication. This pushes persistence down to the storage and allows the container host to be effectively stateless.
  • Data can exist on an external device and be shared with traditional infrastructure like virtual machines. This provides a potential data migration route from VMs to containers for certain parts of an application.

Storage presented from shared storage could be block or file-based. In general, solutions offered by suppliers have favoured connecting block devices to a single container. For shared arrays, the process has been to mount a logical unit number (LUN) to the host, format it with a file system and then attach it to the container. In Kubernetes, for example, these volumes can be pre-existing or created on demand.
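
As a sketch of the on-demand case, a Kubernetes PersistentVolumeClaim requests storage from whichever provisioner backs the named storage class; the class name and size here are assumptions that depend on the cluster’s configuration.

    # Request a volume on demand via a PersistentVolumeClaim
    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: orders-db-pvc
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: fast-block
      resources:
        requests:
          storage: 20Gi
    EOF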

Many software-defined storage solutions are natively integrated into the container orchestration platform, offering what look like file systems without the complexity and management configuration of external devices.

Solutions for container data protection

What are suppliers doing in this area? Docker provides a set of best practices for backup of the Docker infrastructure, although this doesn’t cover application data. Meanwhile, Kubernetes uses etcd to manage cluster state, and the Kubernetes website provides instructions on how to configure backups of it.
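
On a cluster where etcd can be reached directly, the documented approach amounts to taking a snapshot with etcdctl, as in this sketch; the endpoint and certificate paths vary by installation and are illustrative.

    # Snapshot the etcd datastore that holds Kubernetes cluster state
    ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
      --endpoints=https://127.0.0.1:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key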

Existing backup suppliers are starting to offer container backup. Asigra was probably the first to do so, in 2015. Commvault offers backup of data on container-based hosts.

Vendors including Pure Storage, HPE Nimble, HPE 3PAR and NetApp all provide Docker volume plugins to mount traditional LUNs to container infrastructure. This enables snapshots to be taken at the array level for backup, and the LUNs to be replicated to other hardware if required.
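
In use, such a plugin typically looks like the sketch below: a volume is created through the vendor’s driver, then attached at launch. The driver name and size option are placeholders, as each supplier’s plugin defines its own name and options.

    # Create a volume through a vendor's Docker volume plugin
    docker volume create --driver vendor-driver -o size=100GB db-vol

    # Attach it to a container at launch
    docker run -d --name appdb -v db-vol:/var/lib/postgresql/data postgres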

Portworx, StorageOS, ScaleIO, Ceph and Gluster all offer native volumes for Kubernetes. These platforms also work with Docker, and offer high availability from a clustering perspective and the ability to take backups via snapshots and replicas. Kubernetes is moving to support the Container Storage Interface (CSI), which should enable additional features such as data protection to be added to the specification.

If containers are run within virtual machines then the VM itself could be backed up and individual files restored. However, if the backup solution isn’t container-aware, it may be very difficult to track down individual files unless they’ve been put into the structure already outlined above.

Cloud: A gap in container backup?

This discussion on container backup has focused on the deployment of containers in the datacentre. Public cloud represents a bigger challenge. As yet, services such as AWS Fargate (serverless container infrastructure) don’t offer data persistence and are designed to be stateless.

This represents a potential operational gap when looking to move container workloads into the public cloud. As always, any solution needs to consider all deployment options, which could make the adoption of some public cloud features more difficult and push data management closer towards the developer.
