Hadoop Distributed File System (HDFS) is part of the Apache Hadoop Core project. It is a fault-tolerant distributed file system designed to run on low-cost hardware. It has many similarities with existing distributed file systems. However, it relaxes a few POSIX requirements to enable streaming access to file system data.
In return, it provides high-throughput access to application data (data can be read and processed in parallel across the cluster) and is suitable for applications that have large data sets.
Note: It was originally built as infrastructure for the Apache Nutch web search engine project.
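To make the streaming-access, high-throughput style of I/O described above concrete, here is a minimal sketch of reading a file through Hadoop's Java FileSystem API. The NameNode URI and the file path are hypothetical placeholders; a real client would normally pick them up from the cluster's configuration files.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.net.URI;

public class HdfsStreamingRead {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; a real client reads fs.defaultFS from core-site.xml.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        Path path = new Path("/data/input/events.log"); // hypothetical file

        byte[] buffer = new byte[128 * 1024];
        try (FSDataInputStream in = fs.open(path)) {
            int bytesRead;
            while ((bytesRead = in.read(buffer)) != -1) {
                // Process `bytesRead` bytes sequentially. HDFS is optimized for this
                // kind of large sequential scan rather than low-latency random reads.
            }
        }
    }
}
```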
Assumptions and Goals:
- Hardware Failure: Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
- Streaming Data Access: Applications that run on HDFS need streaming access to their data sets. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access.
- Large Data Sets: Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.
- Simple Coherency Model: HDFS applications need a write-once-read-many access model for files. A file, once created, written, and closed, need not be changed. This assumption simplifies data coherency issues and enables high-throughput data access (a minimal write-once sketch follows this list).
- Moving Computation is Cheaper than Moving Data: A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. HDFS provides interfaces for applications to move themselves closer to where the data is located; the block-location sketch after this list shows one such interface.
- Portability Across Heterogeneous Hardware and Software Platforms: HDFS is designed to be easily portable from one platform to another, which helps it serve as the storage layer for a broad range of applications.
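The following is a minimal sketch, not the project's own example code, of the write-once-read-many pattern using Hadoop's Java FileSystem API. The NameNode URI, path, replication factor, and block size are illustrative assumptions, not recommended values.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class HdfsWriteOnce {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf); // hypothetical NameNode
        Path path = new Path("/data/output/report.txt");                          // hypothetical path

        // Create, write, and close the file exactly once. This overload also lets the client
        // choose the replication factor and block size; a large block size (here 128 MB) suits
        // the big, sequentially read files HDFS targets. The values below are illustrative.
        try (FSDataOutputStream out = fs.create(path, true, 4096, (short) 3, 128L * 1024 * 1024)) {
            out.write("one record per line\n".getBytes(StandardCharsets.UTF_8));
        }

        // From this point on, readers may open the file any number of times,
        // but its contents are not edited in place.
    }
}
```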
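The second sketch, again with a hypothetical NameNode URI and file path, shows the block-location interface that locality-aware schedulers can consult when deciding where to run a computation, illustrating the "move computation to the data" goal above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.net.URI;

public class HdfsBlockLocations {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf); // hypothetical NameNode
        Path path = new Path("/data/input/events.log");                           // hypothetical file

        FileStatus status = fs.getFileStatus(path);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        // A locality-aware scheduler (e.g. MapReduce or YARN) uses this information
        // to place each task on or near a host that already stores the block it reads.
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}
```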