Wednesday, June 5, 2013

HDFS - Hadoop Distributed File System

Filesystems that manage the storage across a network of machines are called distributed filesystems.

HDFS is Hadoop's flagship filesystem, designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

Streaming data access:
HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.
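To make this concrete, here is a minimal sketch of the write-once, read-many pattern using Hadoop's Java FileSystem API. The namenode address and file path are placeholders, and the per-line processing is just illustrative; the point is that a job opens a file once and streams through it sequentially, which is exactly the access pattern HDFS is optimized for.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStreamingRead {
    public static void main(String[] args) throws Exception {
        // Placeholder namenode URI and path; substitute your cluster's values.
        String uri = "hdfs://namenode:8020/data/weblogs/part-00000";

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        // HDFS pays the latency cost once at open, then delivers high
        // throughput for the sequential read that follows.
        try (FSDataInputStream in = fs.open(new Path(uri));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Each analysis typically touches most or all of the dataset,
                // so we simply stream every record through the processing step.
                process(line);
            }
        }
    }

    private static void process(String line) {
        // Placeholder for per-record analysis.
        System.out.println(line);
    }
}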

Reference: Hadoop: The Definitive Guide, by Tom White