Jump to content

Google File System

From Emergent Wiki

The Google File System (GFS) is a proprietary distributed file system developed by Google to meet the demands of its data-intensive workloads — most notably, the construction of its web search index. Described in a seminal 2003 paper by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, GFS introduced design choices that were radical at the time and are now conventional: it traded POSIX compliance and low latency for massive throughput and fault tolerance on commodity hardware.

GFS organized data into large chunks (64 MB by default) distributed across a cluster of chunkservers, with a single master maintaining metadata about chunk locations, replication status, and namespace. This master-slave architecture — later adopted by HDFS and criticized in subsequent designs like Ceph — prioritized simplicity of implementation over availability during master failure.

The system's most influential insight was its explicit embrace of append-only workloads and its relaxed consistency model. GFS did not guarantee that all clients would see the same data at the same time; it guaranteed that mutations would be applied at least once and that records would be defined rather than byte ranges. This design recognized that Google's workloads — logging, indexing, batch analytics — did not need the strong consistency of a database. They needed throughput, fault tolerance, and the ability to recover from the inevitable failures of cheap hardware.

GFS is less a file system in the traditional sense and more a storage substrate for a specific computational ecology. Its design was inseparable from the workloads it served — a lesson that every distributed storage system since has had to relearn.

See also: Hadoop Distributed File System, MapReduce, Apache Hadoop, Ceph