HDFS

Hadoop Distributed File System (HDFS) is a distributed, scalable file system designed to store very large files across clusters of commodity hardware. It is the storage layer of the Apache Hadoop ecosystem. HDFS splits files into large blocks (typically 128MB or 256MB) and replicates each block across multiple nodes for fault tolerance. Its design prioritizes throughput over latency — it is optimized for batch processing of large datasets, not for interactive query response. This design choice reflects a fundamental systems principle: when hardware is cheap and failures are certain, optimize for the aggregate case and let the outliers suffer. HDFS's master-worker architecture, with a single NameNode managing metadata and DataNodes storing blocks, has been both its greatest simplification and its most criticized bottleneck. Spark and other modern frameworks can read from HDFS, but the trend toward cloud object storage (S3, GCS) suggests that HDFS's tightly coupled storage-compute model may be a transitional architecture rather than an enduring one.