Scanner
A scanner — also called a lexer or tokenizer — is the component of a compiler or interpreter that performs lexical analysis. It reads a stream of characters and groups them into tokens: the atomic units of syntactic meaning such as identifiers, keywords, operators, and literals. The scanner's output is a token stream consumed by the parser in the next phase of compilation.
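As a rough illustration of this grouping, here is a minimal sketch of a regular-expression-based scanner in Python. The token kinds, the keyword set, and the names `Token`, `TOKEN_SPEC`, and `scan` are assumptions invented for this example and do not describe any particular compiler.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str   # hypothetical token category, e.g. "IDENT" or "NUMBER"
    text: str   # the exact characters that were grouped into this token


# Hypothetical keyword set for a toy language.
KEYWORDS = {"if", "else", "while", "return"}

# Each entry pairs a token kind with the pattern that recognizes it.
# Order matters: earlier patterns win when several could match.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"==|!=|<=|>=|[-+*/=<>()]"),
    ("SKIP", r"\s+|#[^\n]*"),   # whitespace and line comments are discarded
    ("MISMATCH", r"."),         # any other single character is an error
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def scan(source: str) -> Iterator[Token]:
    """Yield the tokens found in source, skipping whitespace and comments."""
    for match in MASTER.finditer(source):
        kind = match.lastgroup
        text = match.group()
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise SyntaxError(f"unexpected character {text!r} at offset {match.start()}")
        if kind == "IDENT" and text in KEYWORDS:
            kind = "KEYWORD"    # keywords are identifiers with reserved spellings
        yield Token(kind, text)
```

Calling `list(scan("if x1 >= 42  # note"))` yields four tokens (a keyword, an identifier, an operator, and a number); the comment and the whitespace never reach the parser.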
The scanner is the computational equivalent of segmentation in perception. Just as the human visual system segments a scene into objects without conscious deliberation, a scanner segments source code into tokens according to predefined patterns. This segmentation is not neutral: the choice of what constitutes a token shapes the grammar that follows. A language designer who makes whitespace significant (as Python does, where the scanner turns changes in indentation into INDENT and DEDENT tokens) creates a fundamentally different syntactic landscape than one who discards it.
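Python's own scanner can be observed through the standard library's `tokenize` module, which makes the effect of significant whitespace visible. The sketch below is only an illustration: the helper name `token_names` and the sample source are assumptions made for this example.

```python
import io
import token
import tokenize
from typing import List


def token_names(source: str) -> List[str]:
    """Return the kind of every token Python's scanner produces for source."""
    readline = io.StringIO(source).readline
    return [token.tok_name[tok.type] for tok in tokenize.generate_tokens(readline)]


# A two-statement program in which the first statement opens an indented block.
SOURCE = "if x:\n    y = 1\nz = 2\n"
names = token_names(SOURCE)

# The change in indentation is reported as explicit tokens, so the grammar
# can treat a block much like a bracketed group.
print(names)
```

The spaces between `y`, `=`, and `1` produce no tokens at all, while the leading spaces of the second line do: the same character is discarded in one position and meaningful in another.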
Scanners are often generated by tools such as Lex and Flex, or by ANTLR, which generates both lexers and parsers. Hand-written scanners nonetheless remain common in production compilers, where fine control over error messages and performance is required.
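To suggest what that fine control looks like in practice, the following is a minimal sketch of a hand-written scanner that tracks line and column numbers so that errors can point at an exact position. The token kinds, the tuple layout, and the names `ScanError` and `scan_by_hand` are hypothetical and chosen only for this example.

```python
from typing import List, Tuple

# Hypothetical token layout: (kind, text, line, column), all 1-based positions.
Token = Tuple[str, str, int, int]


class ScanError(Exception):
    """Raised when the scanner meets a character it cannot place in any token."""


def scan_by_hand(source: str) -> List[Token]:
    tokens: List[Token] = []
    i, line, col = 0, 1, 1
    while i < len(source):
        ch = source[i]
        if ch == "\n":                      # newline: advance the line counter
            i, line, col = i + 1, line + 1, 1
        elif ch in " \t\r":                 # other whitespace is discarded
            i, col = i + 1, col + 1
        elif ch.isdigit():                  # consume a run of digits
            start = i
            while i < len(source) and source[i].isdigit():
                i += 1
            tokens.append(("NUMBER", source[start:i], line, col))
            col += i - start
        elif ch.isalpha() or ch == "_":     # consume an identifier
            start = i
            while i < len(source) and (source[i].isalnum() or source[i] == "_"):
                i += 1
            tokens.append(("IDENT", source[start:i], line, col))
            col += i - start
        elif ch in "+-*/=()":               # single-character operators
            tokens.append(("OP", ch, line, col))
            i, col = i + 1, col + 1
        else:
            raise ScanError(f"line {line}, column {col}: unexpected character {ch!r}")
    return tokens
```

Because every branch is ordinary code, a hand-written scanner of this kind can tailor its diagnostics or add fast paths for common cases, which is harder to arrange inside generated tables.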
The scanner is the forgotten hero of compilation. It handles the messiest input — raw text with comments, whitespace, encoding issues, line endings — and produces pristine tokens for the parser's consumption. Without the scanner, the parser would drown in irrelevant detail. With the scanner, the parser can pretend the world is clean. That pretense is the scanner's gift.
See also: Lexical Analysis, Lexer Generator, Token, Compiler, Parser, Token Stream