Erasure coding vs RAID: Data protection in the cloud era

Erasure coding has come to prominence with its use in object storage and, by extension, in the cloud. We run the rule over its pros and cons vs RAID data protection

Since the late 1980s, RAID has been a fundamental method of data protection. But RAID arose in a world of SAN and NAS in hardware array products. Now the cloud and technologies such as object storage are in the ascendant and they are predominantly protected not by RAID, but by erasure coding.

Erasure coding promises to do away with the kind of lengthy rebuild times that come with large drive sizes and RAID. So, is erasure coding a potential replacement for RAID? We look at the pros and cons of erasure coding.

RAID recap

RAID virtualises multiple disks to form one logical drive. If one or more drives fail, the user can recover data by replacing the faulty drive and rebuilding the array. This provides robust data protection at relatively low cost.

But growing data volumes and developing technologies such as cloud and object storage have put pressure on conventional RAID technology.

Recovering a large RAID volume can be too time-consuming to be practical in a working business environment. Industry experts generally say any volume over 8TB will cause unacceptably slow rebuilds.

And conventional RAID can’t fully handle hyper-scale or hyper-converged distributed storage, where object storage suppliers, including cloud providers, keep data across multiple arrays in multiple physical locations. RAID controllers also add complexity.

Enter erasure coding

Erasure coding provides an answer for very large datasets and for applications such as object and software-defined storage.

Erasure coding is parity-based: data is broken into fragments and encoded with redundant information, so the fragments can be stored anywhere. This makes it well suited to protecting cloud storage. Erasure coding can also use less storage capacity than mirrored RAID for the same level of protection, and allows data to be recovered even if two or more parts of a storage system fail.

Erasure coding uses forward error correction, similar to the technology used in radio transmissions, including GSM mobiles, and to the Reed-Solomon error correction used on music CDs. In an erasure-coded system, if a piece of data is broken into 16 fragments, the original can be recovered from any 10 of them.

Erasure coding explained

Erasure coding works by adding extra, redundant symbols to the dataset; the total set of symbols is then spread across drives or locations. The equation for erasure coding is N = K + M, where K is the original number of data symbols, M is the number of extra, redundant symbols that provide protection against failures, and N is the total number of symbols after encoding. EC 10/16 means six redundant symbols are added to 10 original symbols, and any 10 of the resulting 16 can be used to recover the original data. The most common erasure coding algorithm is Reed-Solomon.
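To make the arithmetic concrete, here is a minimal Python sketch that works through the N = K + M sums for a few illustrative schemes. The function name and figures are our own examples, not taken from any supplier’s product.

```python
# Worked example of the N = K + M erasure coding arithmetic.
# All figures are illustrative, not tied to any vendor's implementation.

def ec_parameters(k: int, m: int) -> dict:
    """Return the basic properties of a K+M erasure coding scheme."""
    n = k + m                      # N: total symbols written to drives/sites
    return {
        "total_symbols": n,
        "max_failures": m,         # up to M symbols can be lost
        "min_to_recover": k,       # any K surviving symbols rebuild the data
        "storage_overhead": f"{m / k:.0%}",
    }

# EC 10/16: K = 10 data symbols plus M = 6 redundant symbols
print(ec_parameters(10, 6))   # 16 symbols, survives 6 losses, 60% overhead

# A simple "half code" (M = K/2), e.g. 8 data + 4 parity: 50% overhead
print(ec_parameters(8, 4))

# Straight mirroring, for comparison, duplicates everything: 100% overhead
print(ec_parameters(1, 1))
```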

This makes erasure coding more economical than mirrored RAID. As Bryan Betts, analyst at Freeform Dynamics, notes, the simplest form of EC uses “half codes” for each piece of data, so the coding adds a storage overhead of 50% – half that of straight mirroring.

And because the fragments of data can be stored anywhere, the system is potentially far more robust. Data on a volume protected by erasure coding should be at much less risk from hardware failure than data protected by RAID.

Recovery times will be quicker too, depending on how the storage system is set up. Erasure-coded systems do not need to rebuild an entire volume the way RAID does, so a failure could pass without the user ever noticing, provided enough symbols survive to recreate the data. Re-encoding lost symbols to new drives can happen in the background.
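The principle can be seen in the simplest possible erasure code: a single XOR parity symbol, the same idea RAID 5 uses. Any one lost fragment is recomputed directly from the survivors, with no full-volume rebuild. The toy Python sketch below is purely illustrative; production systems use Reed-Solomon codes that survive multiple simultaneous losses.

```python
from functools import reduce

# Toy single-parity erasure code: K data fragments plus one XOR parity
# fragment. Any ONE lost fragment can be recomputed from the survivors.
# Real systems use Reed-Solomon to survive multiple losses at once.

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

data_fragments = [b"frag-one", b"frag-two", b"frag-3!!"]  # equal lengths
parity = reduce(xor_bytes, data_fragments)                # redundant symbol

# Simulate losing fragment 1, then rebuild it from parity + the survivors
survivors = [f for i, f in enumerate(data_fragments) if i != 1]
recovered = reduce(xor_bytes, survivors, parity)

assert recovered == data_fragments[1]
print(recovered)  # b'frag-two'
```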

Erasure coding is not a backup

Erasure coding does have disadvantages, however. Chief among these is the processing overhead: erasure coding is CPU-intensive, whereas RAID either mirrors data to another disk or computes much simpler parity across a stripe. That CPU load can also add latency. But it is not the only drawback.

“Architecturally, erasure coding can be more demanding on the system to calculate parity,” says Scott Sinclair at analyst ESG.

“It is also important to understand that erasure coding is just one level of protection and doesn’t replace backup. It is just an efficient way to protect against hard drive or SSD failures.”

Erasure coding does not replace conventional backup, especially for on-premises systems. “They are totally different,” says Betts. “Backup means taking an independent second copy, preferably stored with an ‘air gap’. Just because your primary data is protected by erasure coding doesn’t stop it being corrupted or deleted accidentally or maliciously.”

Organisations still need backup to protect against threats such as ransomware.

Also, erasure coding is not a complete replacement for data replication. For companies that use erasure coding to protect data on their own premises – rather than through a cloud service – it is vital to consider how they would recover from a site failure.

Full off-site replication allows operations to restart from the failover site, but erasure coding does not provide a full copy of the data. ESG’s Sinclair recommends that businesses have a secondary copy of all production data, even if they use erasure coding.

Erasure coding can be set up as an alternative to off-site replication, but doing so needs careful planning.

IT directors need to know where the data elements are stored, and ensure they are in enough locations to allow recovery if one location suffers a total failure.
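As a hypothetical sketch of that planning exercise, the following assumes an EC 10/16 scheme with invented site names and fragment counts:

```python
# Hypothetical placement check for an EC 10/16 scheme: can the data
# still be recovered if any one site fails outright? Site names and
# fragment counts here are invented purely for illustration.

K = 10  # fragments needed to recover the data
placement = {"london": 4, "frankfurt": 4, "dublin": 4, "madrid": 4}  # N = 16

for failed_site, lost in placement.items():
    surviving = sum(placement.values()) - lost
    status = "recoverable" if surviving >= K else "DATA LOSS"
    print(f"{failed_site} offline: {surviving} fragments left -> {status}")

# Four fragments per site means any single site failure leaves 12 >= 10,
# so the data survives; losing two sites leaves 8 < 10 and would not.
```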

And distributed environments can affect performance, because of the processing overhead of decoding.

Keep it in the cloud?

To date, erasure coding is mostly associated with object storage, and so with the cloud, and it is seen as less well suited to block or file storage. But it is spreading: NetApp deploys it in its StorageGRID, and it is also used in VMware’s vSAN, Hadoop and Nutanix’s AOS.

Typically, erasure coding works in distributed systems designed to accept a certain level of latency, or where latency is not critical to the end-user. Nutanix, for example, recommends erasure coding for backups, archiving, WORM workloads and email, but not for write-intensive applications.

But for the very largest datasets, erasure coding could be the only practical option for protecting data.

“Object storage environments are typically too large to be able to do full backups regularly,” says Sinclair. “They need a protection technology that ensures higher levels of availability with the primary copy.

“Also, at large scale with large-capacity drives, traditional RAID rebuilds can often take too long, which can put data at risk if another failure happens during the rebuild.”

As a result, erasure coding looks set to play a growing role in the enterprise – but it is only one tool in the data protection toolbox.
