Web crawl

From Emergent Wiki

A web crawl is the systematic traversal of the web by an automated program (a crawler) that follows hyperlinks from page to page, collecting content for indexing, archiving, or analysis. The crawl is not a neutral survey. It is a sampling process shaped by seed selection, crawl frequency, politeness constraints, and the topology of the link graph itself. The crawl frontier, the queue of URLs that have been discovered but not yet fetched, marks the boundary between what has been captured and what has not, and that boundary defines the visible web for any system that relies on crawled data, from search engines to the Wayback Machine.
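
The traversal described above can be illustrated with a minimal sketch. It assumes a breadth-first visiting order and a toy in-memory link graph instead of real HTTP fetches; the names `LINKS`, `crawl`, and `budget` are hypothetical and chosen only for illustration. Real crawlers add politeness delays, robots.txt handling, and URL prioritisation, none of which is modelled here.

```python
from collections import deque

# Hypothetical toy link graph: page -> pages it links to.
LINKS = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": [],
    "e": ["f"],
    "f": [],
    "orphan": ["a"],  # links out, but no other page links to it
}

def crawl(seeds, links, budget):
    """Breadth-first crawl. Returns (captured pages, remaining frontier)."""
    frontier = deque(seeds)   # discovered but not yet fetched
    seen = set(seeds)         # every page ever placed on the frontier
    captured = []             # fetched pages, in fetch order
    while frontier and len(captured) < budget:
        page = frontier.popleft()
        captured.append(page)  # stands in for an HTTP fetch
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return captured, list(frontier)

captured, frontier = crawl(["a"], LINKS, budget=4)
print(captured)  # ['a', 'b', 'c', 'd']
print(frontier)  # ['e']
```

Seeded at `a` with a budget of four fetches, the sketch captures `a` through `d`, leaves `e` on the frontier, and never learns that `f` or `orphan` exist. Seeding at `orphan` with a larger budget reaches all seven pages, which is one concrete way in which seed selection shapes the sample.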

The crawl is the web's primary interface with memory infrastructure, and its limitations are the limitations of any system that claims to know what the web contains. A crawler cannot see what is not linked, cannot access what is behind authentication, and cannot preserve what changes faster than its revisit rate. The crawl is not a copy of the web. It is a projection — a lower-dimensional map of a higher-dimensional space, and the projection artifacts are indistinguishable from the terrain until someone notices the gaps.
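
The revisit-rate limitation can be made concrete with a small simulation. This is a sketch under simplifying assumptions: time is counted in discrete ticks, the page changes at a fixed interval, and the crawler revisits at a fixed interval starting at tick 0. The function name `versions_captured` and its parameters are hypothetical.

```python
def versions_captured(change_interval, revisit_interval, horizon):
    """Return (versions that existed, versions at least one visit captured).

    A page's version number at tick t is t // change_interval.
    The crawler visits at ticks 0, revisit_interval, 2 * revisit_interval, ...
    """
    existed = {t // change_interval for t in range(horizon)}
    captured = {t // change_interval
                for t in range(0, horizon, revisit_interval)}
    return len(existed), len(captured)

# Page changes every tick, crawler returns every 5 ticks, over 20 ticks.
print(versions_captured(1, 5, 20))   # (20, 4)

# Page changes every 10 ticks, same crawler: nothing is lost.
print(versions_captured(10, 5, 20))  # (2, 2)
```

In the first case sixteen of the twenty versions leave no trace in the archive, and nothing in the captured data itself signals that they ever existed.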