Facebook shares lessons learnt from deploying Raid in HDFS clusters
Facebook deployed Raid in large HDFS clusters to increase capacity and reduce data replication, but it faced many challenges
Facebook deployed Raid in large Hadoop Distributed File System (HDFS) clusters last year to increase capacity by tens of petabytes and to reduce data replication. But the engineering team faced a number of challenges during the process, including data corruption and the difficulty of implementing Raid across large directories.
The social network implemented the technology, which applies erasure codes within HDFS, to reduce the replication factor of the data it stores.
Raid (redundant array of independent disks) is a way of storing the same data in different places (thus, redundantly) on multiple hard disks. The Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. It provides high-performance access to data across Hadoop clusters and has become a key tool in enterprises for managing big data and supporting big data analytics.
The default replication factor of a file in HDFS is three, which carries a significant space overhead, according to Facebook’s engineering team. HDFS Raid technology helped the social network cut data replication and so reduce this overhead.
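To see why the replication factor matters, here is a minimal Python sketch of the arithmetic. The stripe lengths, parity counts and replica counts below are illustrative choices rather than figures from Facebook's deployment; they simply show how adding parity blocks while lowering replication cuts the physical blocks stored per logical block.

```python
def effective_replication(source_blocks, parity_blocks,
                          source_replicas, parity_replicas):
    """Physical blocks stored per logical (source) block in one stripe."""
    physical = source_blocks * source_replicas + parity_blocks * parity_replicas
    return physical / source_blocks

# Plain HDFS: every block stored three times.
print(effective_replication(10, 0, 3, 3))   # 3.0

# Illustrative XOR Raid: one parity block per 10-block stripe,
# source and parity blocks kept at two replicas each.
print(effective_replication(10, 1, 2, 2))   # 2.2

# Illustrative Reed-Solomon Raid: four parity blocks per 10-block stripe,
# a single replica of both source and parity blocks.
print(effective_replication(10, 4, 1, 1))   # 1.4
```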
“With Raid fully rolled out to our warehouse HDFS clusters last year, the cluster’s overall replication factor was reduced enough to represent tens of petabytes of capacity savings by the end of 2013,” the engineering team wrote on Facebook’s official blog.
But deploying Raid in large HDFS clusters with hundreds of petabytes of data presents a lot of challenges. “We wanted to share the lessons we learned along the way,” Facebook engineers said on the blog.
Space-saving lessons
When Facebook deployed Raid to production, it saved much less space than expected. “After investigating, we found a significant issue in Raid, which we christened the ‘small file problem’,” the engineers said.
The engineering team found that files with 10 logical blocks offer the best opportunity for space savings; the fewer blocks a file has, the less space Raid can save. If a file has fewer than three blocks, Raid cannot save any space at all.
The team’s analysis found that more than 50% of the files in the production clusters were small (fewer than three blocks).
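The effect is easy to model. The sketch below assumes a simple XOR-style scheme, with one parity block per stripe of up to 10 source blocks and source replication lowered from three to two; these parameters are assumptions for illustration, but the shape of the result matches the finding above: savings shrink as files get smaller and disappear below three blocks.

```python
def plain_blocks(blocks, replicas=3):
    """Physical blocks under plain three-way replication."""
    return blocks * replicas

def raided_blocks(blocks, stripe=10, source_replicas=2, parity_replicas=2):
    """Physical blocks under an illustrative XOR scheme:
    one parity block per stripe of up to `stripe` source blocks."""
    parity = -(-blocks // stripe)          # ceil(blocks / stripe)
    return blocks * source_replicas + parity * parity_replicas

for n in (1, 2, 3, 5, 10):
    saved = plain_blocks(n) - raided_blocks(n)
    print(f"{n:2d}-block file: {saved:+d} physical blocks saved")
```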
To overcome this problem, the IT team grouped blocks together. “We developed directory Raid to address the small file problem based on one simple observation: in Hive, files under the leaf directory are rarely changed after creation. So, why not treat the whole leaf directory as a file with many blocks and then Raid it?”
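A rough Python sketch of the grouping idea follows. The file layout and stripe length are hypothetical, but the point is that blocks from many small files under one leaf directory are concatenated into a single logical block list before parity is computed, so even one-block files contribute to full stripes.

```python
from functools import reduce
from operator import xor

STRIPE_LENGTH = 10   # source blocks per parity block (illustrative)

def directory_blocks(files):
    """Treat every block under a leaf directory as one logical block list.
    `files` maps file name -> list of block payloads; the layout is
    hypothetical and only illustrates the grouping idea."""
    return [block for name in sorted(files) for block in files[name]]

def xor_parity(stripe):
    """One XOR parity block per stripe (blocks padded to a common length)."""
    size = max(len(b) for b in stripe)
    padded = [b.ljust(size, b"\x00") for b in stripe]
    return bytes(reduce(xor, column) for column in zip(*padded))

# Small files that would each be too short to Raid on their own:
leaf_dir = {"part-0000": [b"aaaa", b"bbbb"], "part-0001": [b"cccc"]}
blocks = directory_blocks(leaf_dir)
parities = [xor_parity(blocks[i:i + STRIPE_LENGTH])
            for i in range(0, len(blocks), STRIPE_LENGTH)]
```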
Preventing data corruption
Another challenge with HDFS Raid was data corruption caused by a bug in the Raid reconstruction logic. To prevent corruption, the team calculated CRC checksums of blocks and stored them in MySQL during the deployment of Raid, so that whenever the system reconstructs a missing block, it compares the checksum with the one in MySQL to verify data correctness, the team said.
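A minimal sketch of that safeguard is shown below. A Python dict stands in for the MySQL table the article describes, and the function and block names are hypothetical; the idea is simply that a reconstructed block is only accepted if its CRC matches the checksum recorded when the block was first Raided.

```python
import zlib

checksum_store = {}   # stands in for the MySQL table described above

def record_checksum(block_id, data):
    """Computed and stored when the block is first Raided."""
    checksum_store[block_id] = zlib.crc32(data)

def verify_reconstructed(block_id, data):
    """Accept a reconstructed block only if its CRC matches the stored one."""
    expected = checksum_store.get(block_id)
    if expected is None:
        raise KeyError(f"no recorded checksum for block {block_id}")
    if zlib.crc32(data) != expected:
        raise ValueError(f"block {block_id} failed CRC verification "
                         "after reconstruction")
    return data

# Usage: record at Raid time, verify after a missing block is rebuilt.
record_checksum("blk_1073741825", b"original block contents")
verify_reconstructed("blk_1073741825", b"original block contents")
```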
Another challenge was that implementing Raid across a directory with more than 10,000 files would take a whole day to finish.
“If there’s any failure during the Raid process, the whole process will fail, and the CPU time spent to that point will be wasted,” the engineers said. To overcome this issue, they “parallelised the Raid” using a mapper-only MapReduce job, where each mapper Raids only one portion of the directory.
“In this way, failure could be tolerated by simply retrying the affected mappers. With MapReduce, we are able to Raid a large directory in a few hours,” the team said.
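The sketch below imitates that structure outside Hadoop, using a Python thread pool rather than a mapper-only MapReduce job: the directory's blocks are split into slices, each slice is Raided independently, and a failure in one slice triggers a retry of only that slice rather than a restart of the whole job. The slice size, block layout and XOR parity are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from operator import xor

def raid_slice(blocks):
    """Stand-in for Raiding one slice of a directory's blocks
    (XOR parity; blocks assumed equal length for brevity)."""
    return bytes(reduce(xor, column) for column in zip(*blocks))

def raid_slice_with_retries(blocks, retries=3):
    # Only the failed slice is retried, mirroring a retried mapper
    # rather than a restart of the whole job.
    for attempt in range(retries):
        try:
            return raid_slice(blocks)
        except Exception:
            if attempt == retries - 1:
                raise

def raid_directory(all_blocks, slice_size=10):
    slices = [all_blocks[i:i + slice_size]
              for i in range(0, len(all_blocks), slice_size)]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(raid_slice_with_retries, slices))

# Example: 100 equal-sized dummy blocks Raided in 10 parallel slices.
parity_blocks = raid_directory([bytes([i % 256]) * 64 for i in range(100)])
```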
Currently, at Facebook, the HDFS Raid technology is implemented as a separate layer on top of HDFS. But this presents a few problems for the team, such as wasted network bandwidth and disk I/O. Also, when files are replicated across clusters, parity files are sometimes not moved together with their source files, which often leads to data loss.
So, Facebook engineers are building Raid to be natively supported in HDFS. “A file could be Raided when it is first created in HDFS, saving disk I/Os,” the engineers said.
“Once deployed, the NameNode will keep the file Raid information and schedule block-fixing when Raid blocks become missing, while the DataNode will be responsible for block reconstruction.
“This removes the HDFS Raid dependency on MapReduce,” they concluded.