My learnings being a Software Engineer: HDFS Architecture

HDFS design goals can be found here:
HDFS Design Goals

HDFS stands for Hadoop Distributed File System
HDFS is designed to run on low-cost hardware
HDFS is highly fault-tolerance(as it supports block replication)
HDFS was originally built as infrastructure for the Apache Nutch web search engine project
HDFS is now an Apache Hadoop sub project
An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data
HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access
A typical file in HDFS is gigabytes to terabytes in size
HDFS applications need a write-once-read-many access model for files. This assumption simplifies data coherency issues and enables high throughput data access

Architecture:

HDFS is built using the Java language
HDFS has master/slave architecture: NameNode/DataNode
An HDFS cluster consists of a single NameNode and a number of DataNodes
The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS)
NameNode is a master server that manages the file system namespace and regulates access to files by clients. It executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly.
DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. They are responsible for serving read and write requests from the file system’s clients. They also perform block creation, deletion, and replication upon instruction from the NameNode.
HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes

Conclusion: The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode

Frequently Asked Questions:

Q) What kind of file organization does HDFS supports?
A) Traditional hierarchical file organization

Q) Does HDFS supports user quotas?
A) No

Q) Does HDFS supports soft links or hard links?
A) No

Q) Define replication factor. Who will store this information?
A) The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode

Q) How does HDFS stores a file?
A) HDFS stores each file as a sequence of blocks; all blocks in a file except the last block are the same size

Q) What is the default block size?
A) 64MB

Q) What is the default replication factor?
A) 3

Q) What does a block report contains?
A) A Blockreport contains a list of all blocks on a DataNode

Q) Is data locality principle applicable to HDFS?
A) Yes, when loading data from a DataNode's local drive into HDFS, the NameNode will schedule one copy to go into the local DataNode, and will pick two other machines at random from the cluster(preferably, second replica on a different DataNode on the same rack, third replica could be on a random DataNode on a different rack)

Note: More questions will be added soon....

Reference: Hadoop Official website

Wednesday, August 7, 2013

HDFS Architecture