Search Engine Architecture

Search engine architecture is the distributed systems design that enables the crawling, indexing, and ranking of billions of web pages at global scale. It is not merely an engineering problem of storing and retrieving documents; it is a system of visibility allocation that determines what information is discoverable, by whom, and when. The architecture comprises three primary subsystems — a crawler that traverses the web graph, an indexer that builds searchable data structures, and a ranker that applies relevance and authority scores — each operating as a distributed system with its own failure modes, latency constraints, and optimization targets.

The systems-theoretic insight is that search engine architecture is a form of information control masquerading as retrieval infrastructure. The choice of what to crawl, how often to re-crawl, and what to include in the index — the crawl budget — is a decision about which parts of the information ecosystem deserve visibility. A website that is never crawled does not exist in the searchable web. An index that updates slowly creates a temporal lag that privileges established sources over emergent ones. The architecture is not neutral; it is the material substrate of epistemic power.