Fault Tolerance in HDFS

Payal Das
9 min read · Nov 23, 2020


Fault tolerance refers to the ability of a system to keep working even under adverse conditions, such as component failure. In this article, we will look at how HDFS, the Hadoop Distributed File System, achieves fault tolerance. First, let us look at what big data is.

Introduction to Big Data

In the early years, organizations and applications worked mostly with structured data, i.e. data organized into rows and columns. It was straightforward to store, manage and process this kind of data with relational databases (RDBMS) and traditional tools.

However, the nature and rate of data generation have changed drastically over the years. With evolving technology, data grew exponentially in both volume and velocity. Systems and applications started generating huge amounts of data in a variety of formats at a very fast rate, much of it unstructured or semi-structured. At this point, it became difficult, or nearly impossible, to store, manage, process or analyze this kind of data with traditional databases and tools. This is when Big Data came into the picture.

Fig 1. Word cloud of big data

To define the term, Big Data refers to the techniques used to store, process, manage, analyze and report on huge volumes of varied data at the required speed and within the required time, so as to allow real-time analysis and reaction.

Issues in legacy systems

In traditional systems such as relational databases, all read and write operations are handled by a single machine. If an adverse condition occurs, such as a RAM failure, machine crash, hard disk failure or power outage, users must wait until the problem is fixed manually and cannot access their data in the meantime. In addition, a single server can hold only a limited amount of data, so increasing storage capacity means buying bigger or additional server machines, which drives up costs. The Hadoop Distributed File System overcomes these problems: instead of relying on expensive specialized systems, HDFS is very convenient for both storing and processing enormous amounts of data on clusters of inexpensive machines.

HDFS

HDFS is the main data storage system used by Hadoop applications. It can be deployed on low-cost commodity hardware, and its architecture is designed to detect faults and provide fast, automatic recovery from them. It uses a NameNode and DataNode architecture to implement a distributed file system that provides high performance and efficient storage management.

Main components of HDFS

Apache HDFS is a block-structured file system in which each file is divided into blocks of a predetermined size (128 MB by default in Hadoop 2.x and later). These blocks are stored across a cluster of one or more machines. Hadoop HDFS follows a master/slave architecture, where the cluster consists of a single NameNode (the master node) and all other nodes are DataNodes (slave nodes). Although it is possible to run multiple DataNodes on one machine, in real deployments the DataNodes are distributed across separate machines.
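As a rough illustration of the block-splitting idea, the sketch below chops a local file into fixed-size chunks the way HDFS conceptually does; the function name is made up for illustration, and the 128 MB constant simply mirrors the HDFS default block size.

# Illustrative sketch: splitting a file into fixed-size blocks (not a Hadoop API).
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the default HDFS block size in Hadoop 2.x+

def split_into_blocks(path, block_size=BLOCK_SIZE):
    """Yield (block_index, chunk) pairs of at most block_size bytes each."""
    with open(path, "rb") as f:
        index = 0
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            yield index, chunk
            index += 1

# A 300 MB file would yield two full 128 MB blocks and a final 44 MB block;
# HDFS likewise does not pad the last block of a file to the full block size.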

Fig 2. HDFS Architecture

NameNode

The NameNode is the master node in the Hadoop HDFS architecture that maintains and manages the blocks present on the DataNodes. It is a highly available server that manages the file system namespace and controls access to files by clients. The HDFS architecture is built in such a way that user data never resides on the NameNode; the data resides on DataNodes only.

Functions of NameNode

The NameNode is the master daemon that maintains and manages the DataNodes (slave nodes). It records the metadata of all the files stored in the cluster, e.g. the locations of blocks, file sizes, permissions, hierarchy, and so on. Two files are associated with this metadata. The first is FsImage, which contains a checkpoint of the complete state of the file system namespace. The second is EditLogs, which records all the modifications made to the file system since the most recent FsImage.
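To make the relationship between FsImage and EditLogs concrete, here is a minimal sketch of the checkpointing idea; the class and method names are invented for illustration, and this is not Hadoop code.

# Minimal sketch of the FsImage / EditLog idea (illustrative, not Hadoop code).
class TinyNameNodeMetadata:
    def __init__(self):
        self.fsimage = {}   # checkpointed namespace: path -> metadata
        self.editlog = []   # mutations recorded since the last checkpoint

    def create_file(self, path, replication=3):
        self.editlog.append(("CREATE", path, {"replication": replication}))

    def delete_file(self, path):
        self.editlog.append(("DELETE", path, None))

    def checkpoint(self):
        """Fold the edit log into the image, like building a new FsImage."""
        for op, path, meta in self.editlog:
            if op == "CREATE":
                self.fsimage[path] = meta
            elif op == "DELETE":
                self.fsimage.pop(path, None)
        self.editlog.clear()

md = TinyNameNodeMetadata()
md.create_file("/logs/app.log")
md.delete_file("/tmp/old.dat")
md.checkpoint()   # fsimage now reflects both edits; the edit log is empty again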

The NameNode records every change made to the file system metadata. For example, if a file is deleted in HDFS, the NameNode immediately records this in the EditLog. It regularly receives a heartbeat and a block report from every DataNode in the cluster to confirm that the DataNodes are alive, and it keeps track of all the blocks in HDFS and the nodes on which those blocks are located. The NameNode is also responsible for maintaining the replication factor of all the blocks, which we will discuss in detail later. When a DataNode fails, the NameNode chooses new DataNodes for new replicas, balances disk usage, and manages the communication traffic to the DataNodes.

Fig 3. NameNode receiving Heartbeat message from DataNode
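A hedged sketch of the heartbeat-tracking idea: the NameNode remembers when it last heard from each DataNode and, once a node has been silent past a timeout, treats the blocks it held as candidates for re-replication. The class name and the ten-minute timeout below are assumptions for illustration, not the real NameNode implementation.

import time

# Illustrative heartbeat monitor (not the real NameNode code).
HEARTBEAT_TIMEOUT = 10 * 60   # assume a 10-minute dead-node timeout for this sketch

class HeartbeatMonitor:
    def __init__(self):
        self.last_seen = {}         # datanode id -> timestamp of last heartbeat
        self.block_locations = {}   # block id -> set of datanode ids holding a replica

    def on_heartbeat(self, datanode_id):
        self.last_seen[datanode_id] = time.time()

    def dead_nodes(self):
        now = time.time()
        return {d for d, t in self.last_seen.items() if now - t > HEARTBEAT_TIMEOUT}

    def blocks_to_rereplicate(self):
        """Blocks that lost at least one replica to a dead node."""
        dead = self.dead_nodes()
        return [b for b, nodes in self.block_locations.items() if nodes & dead]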

DataNode

DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode runs on commodity hardware, that is, an inexpensive machine that is not required to be of high quality or highly available. A DataNode is a block server that stores the data on the node's local file system, such as ext3 or ext4.

Functions of DataNode

DataNodes are slave daemons, i.e. processes that run on each slave machine, and they store the actual data. The DataNodes serve the low-level read and write requests from the file system's clients. They also send heartbeats to the NameNode periodically to report the overall health of HDFS; by default, this interval is 3 seconds (controlled by the dfs.heartbeat.interval property).

The NameNode is of great importance, and in Hadoop version 1 it is a single point of failure: if the single NameNode fails, all communication stops and the whole Hadoop cluster becomes unavailable. In Hadoop 1, a Secondary NameNode can be configured to periodically checkpoint the NameNode metadata by merging the FsImage and EditLogs, but it is not a hot standby; if the primary NameNode fails, an administrator has to bring up a new NameNode from the checkpointed metadata, which means downtime for the cluster.

HDFS NameNode High Availability was introduced with Hadoop 2. In this setup, two separate machines are configured as NameNodes: one is in an active state and the other in a standby state. The active NameNode handles all client requests in the cluster, while the standby maintains enough state, by replaying the shared edit log, to provide fast failover if the active NameNode goes down.
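As a purely hypothetical sketch of the active/standby idea (the real HA implementation relies on a shared edit log, typically stored on JournalNodes, and often ZooKeeper for automatic failover), the standby below keeps replaying the active node's edits so that a takeover is fast:

# Hypothetical sketch of a standby NameNode keeping warm (not real Hadoop code).
class StandbyNameNode:
    def __init__(self, shared_edit_log):
        self.shared_edit_log = shared_edit_log  # edits published by the active NameNode
        self.namespace = {}                     # in-memory copy of the namespace
        self.applied = 0                        # number of edits replayed so far

    def tail_edits(self):
        """Replay any new edits so the standby's namespace stays up to date."""
        for op, path in self.shared_edit_log[self.applied:]:
            if op == "CREATE":
                self.namespace[path] = {}
            elif op == "DELETE":
                self.namespace.pop(path, None)
            self.applied += 1

    def become_active(self):
        """Failover is fast because the namespace is already (nearly) current."""
        self.tail_edits()
        return "active"

shared_log = [("CREATE", "/data/a.txt"), ("CREATE", "/data/b.txt"), ("DELETE", "/data/a.txt")]
standby = StandbyNameNode(shared_log)
standby.tail_edits()              # called periodically while standing by
state = standby.become_active()   # promoted when the active NameNode fails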

How HDFS achieves Fault Tolerance

Fault tolerance in Hadoop HDFS refers to the ability of the system to keep working under adverse conditions and to how the system handles such situations. HDFS is extremely fault-tolerant. Before Hadoop 3, it handled failures through replica creation; Hadoop 3 then introduced erasure coding as an additional, more storage-efficient way to provide fault tolerance.

1) Replication Mechanism

Fig 4. Replication in HDFS

Before Hadoop 3 came into play, fault tolerance in Hadoop HDFS was achieved by creating replicas. HDFS makes copies of each data block and stores them on multiple machines (DataNodes). The number of copies is determined by the replication factor, which is 3 by default. If any machine fails, a data block can be read from another machine that holds a copy of the same block. Therefore, there is no data loss, because replicas are stored on different machines.

Example of HDFS Replication Mechanism

Fig 5. Replication Mechanism example

Suppose a user stores a file. HDFS breaks this file into blocks, say B1, B2, B3, B4 and B5, and let's assume there are five DataNodes, say D1, D2, D3, D4 and D5. HDFS creates replicas of each block and stores them on different nodes to achieve fault tolerance: with a replication factor of 3, each block has three replicas on three different nodes.

Suppose block B1 is stored on DataNodes D1, D2 and D3; block B2 on D1, D4 and D5; block B4 on D1, D2 and D5; and the remaining blocks are placed similarly across the cluster.

If DataNode D1 fails, the blocks B1, B2 and B4 that were stored on it are still available to the user from the other DataNodes: B1 from D2 and D3, B2 from D4 and D5, and B4 from D2 and D5. Hence, even in unfavorable conditions, there is no data loss.
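The placement in this example can be written down as a small sketch; the mapping below simply mirrors the example above (the placements of B3 and B5 are assumed), and it is not HDFS's actual placement policy, which also takes racks and disk usage into account.

# Replica map mirroring the example above (B3 and B5 placements are assumed).
replicas = {
    "B1": {"D1", "D2", "D3"},
    "B2": {"D1", "D4", "D5"},
    "B3": {"D2", "D3", "D4"},
    "B4": {"D1", "D2", "D5"},
    "B5": {"D3", "D4", "D5"},
}

def surviving_locations(replicas, failed_node):
    """Where each block can still be read after one DataNode fails."""
    return {block: nodes - {failed_node} for block, nodes in replicas.items()}

print(surviving_locations(replicas, "D1"))
# B1 -> {D2, D3}, B2 -> {D4, D5}, B4 -> {D2, D5}; every block still has live replicas,
# so nothing is lost, and the NameNode would create new copies to restore factor 3.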

2) Erasure Coding

Fig 6. Erasure Coding in HDFS

Erasure coding is another method of providing fault tolerance. Compared with replication, it offers comparable durability while saving a great deal of space; it is the same technique used in RAID (Redundant Array of Independent Disks). Erasure coding works by striping files into small cells and spreading them across various disks or nodes. For each stripe of the original data, a certain number of parity cells are calculated and stored. If any machine fails, the lost cells can be recovered from the remaining data and parity cells. Compared with 3x replication, erasure coding can cut the raw storage required roughly in half.
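To make the space saving concrete, here is a small back-of-the-envelope comparison, assuming the HDFS default of 3x replication on one side and the common RS(6,3) erasure coding scheme on the other; the 100 TB figure is just an arbitrary example.

# Back-of-the-envelope storage comparison: 3x replication vs. RS(6,3) erasure coding.
data_tb = 100                                   # logical data to store, in TB

replication_factor = 3
replicated_storage = data_tb * replication_factor                   # 300 TB on disk

data_cells, parity_cells = 6, 3                                     # RS(6,3)
ec_storage = data_tb * (data_cells + parity_cells) / data_cells     # 150 TB on disk

print(replicated_storage, ec_storage)   # 300 vs 150.0: erasure coding halves the raw storage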

There are two algorithms available in Erasure Coding:

I) XOR Algorithm (Simple EC Algorithm)

This is the simplest implementation of HDFS Erasure Coding. Let’s assume X and Y are data cells; the parity cell is the XOR of these two data cells.

Fig 7. Erasure Coding example using XOR Algorithm

Here, the data durability is 1, since the scheme can tolerate only one simultaneous failure. The storage efficiency depends on the number of data cells per parity cell: with the two data cells X and Y and one parity cell it is 2/3, while a stripe of three data cells and one parity cell gives 75% (3/4).

X ⊕ Y is the XOR operation, which produces only a single parity cell; if any one cell is lost, it can be recovered from the remaining data cells and that parity cell. Because only one parity cell is generated, XOR can tolerate only one failure per group, whatever the group size n, but in return it gives the best storage efficiency.
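A minimal sketch of XOR parity on byte strings follows; it is purely illustrative, since HDFS applies the same idea to much larger cells inside striped block groups.

# XOR parity: one parity cell protects a stripe against the loss of any single cell.
def xor_cells(*cells):
    """Byte-wise XOR of equally sized cells."""
    result = bytearray(len(cells[0]))
    for cell in cells:
        for i, b in enumerate(cell):
            result[i] ^= b
    return bytes(result)

x = b"\x10\x20\x30\x40"
y = b"\x0f\x0e\x0d\x0c"
parity = xor_cells(x, y)            # stored alongside the data cells

# If cell y is lost, it can be rebuilt from the surviving cell and the parity:
recovered_y = xor_cells(x, parity)
assert recovered_y == y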

II) Reed-Solomon Algorithm (Improved EC Algorithm)

The limitation of XOR is addressed by an improved EC algorithm, commonly known as the Reed-Solomon algorithm. Reed-Solomon uses linear algebra to generate multiple parity cells, so instead of tolerating only one failure at a time, we can tolerate multiple failures per group. It works by multiplying a generator matrix (GT) with the d data cells to produce a codeword consisting of the d data cells followed by p parity cells. With Reed-Solomon, fault tolerance is up to p (the number of parity cells) and storage efficiency is d/(d+p), where d is the number of data cells and p the number of parity cells.

Fig 8. Erasure Coding example using Reed-Solomon Algorithm

In this particular example, looking at the codeword, 6 cells (blue) are the actual data cells and 3 cells (red) are the parity cells, obtained by multiplying the data cells by the generator matrix. A storage failure can be recovered by multiplying the inverse of the corresponding rows of the generator matrix with the surviving cells, as long as any d out of the d + p cells are available. Therefore, data durability is 3 (it can tolerate up to three simultaneous failures) and storage efficiency is 67% (6 of the 9 stored cells are data, i.e. 6/9). The parity cells add only a 50% storage overhead (3 extra cells for every 6 data cells), compared with the 200% overhead of three-way replication.
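The encode-and-recover idea can be sketched with ordinary linear algebra, as below. Real HDFS erasure coding performs the same steps over the finite field GF(2^8); the float arithmetic, matrix construction and toy values here are assumptions made purely for illustration.

import numpy as np

# Illustrative Reed-Solomon-style encode/recover with real-valued linear algebra.
d, p = 6, 3                                   # RS(6,3): 6 data cells, 3 parity cells
data = np.array([3., 1., 4., 1., 5., 9.])     # the 6 data cells (toy values)

# Systematic generator matrix: identity on top (data cells pass through unchanged),
# Vandermonde-style rows below (each parity cell is a different linear combination).
vander = np.array([[float(x) ** j for j in range(d)] for x in (1, 2, 3)])
G = np.vstack([np.eye(d), vander])            # shape (d + p, d)

codeword = G @ data                           # 6 data cells followed by 3 parity cells

# Suppose three cells are lost, say cells 0, 2 and 7. Any d surviving cells suffice:
surviving = [1, 3, 4, 5, 6, 8]
recovered = np.linalg.solve(G[surviving], codeword[surviving])
assert np.allclose(recovered, data)           # the original data cells are reconstructed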

Conclusion

Despite the possibility of NameNode failures and breakdowns, Hadoop provides good ways of handling fault tolerance. In this article, we first looked at the factors that led to the emergence of big data and the drawbacks of older technologies for storing and processing it. We then covered the main components of HDFS and their functions, described the two mechanisms HDFS uses to achieve fault tolerance along with suitable examples, and discussed the advantages of erasure coding over the replication mechanism. Fault tolerance is of great importance for big data systems.


Written by Payal Das

Payal Das is currently pursuing her B-Tech degree in IT-Data Science. R enthusiast, writes about tech. Follow her on LinkedIn: linkedin.com/in/payald17
