Jump to content

Hadoop Distributed File System

From Emergent Wiki

The Hadoop Distributed File System (HDFS) is the primary storage system used by Apache Hadoop, designed as an open-source implementation of the principles introduced in Google's GFS paper. Like GFS, HDFS optimizes for throughput over latency, stores data in large blocks (128 MB by default in modern versions), and uses a master-slave architecture where a NameNode manages metadata and DataNodes store the actual blocks.

HDFS achieves fault tolerance through block replication — typically three copies of each block stored on different nodes and, when possible, different racks. This design assumes that hardware failure is the norm rather than the exception, a philosophy inherited directly from GFS and central to the Hadoop ecosystem's reliability model.

The NameNode is HDFS's most significant architectural vulnerability. A single point of failure in early versions, it has been the focus of extensive engineering effort, including High Availability configurations with active-standby failover and the HDFS Federation architecture that partitions the namespace across multiple NameNodes.

HDFS taught the industry that distributed storage could be built on commodity hardware, but it also taught that simplicity in design does not mean simplicity in operation. The gap between it