Lexical Analysis

Lexical analysis — also called scanning or tokenization — is the first phase of a compiler or interpreter. It transforms a stream of raw characters into a stream of tokens: meaningful units such as identifiers, keywords, operators, literals, and punctuation. The lexical analyzer, or lexer, sits at the boundary between the unstructured world of text and the structured world of syntax. Its job is to impose the first level of order on chaos.

Lexical analysis is deceptively simple. The rules that define tokens are regular languages, recognized by finite automata. An identifier is a letter followed by letters and digits. A number is a sequence of digits, optionally containing a decimal point. Comments begin with // and continue to the end of the line. These patterns are elementary, and the theory of regular expressions and finite automata provides complete, efficient algorithms for recognizing them. Yet the lexical analyzer is where many of the most bitter disagreements in language design occur: Should whitespace be significant? Should identifiers be case-sensitive? Should comments nest? Should the lexer distinguish between type names and ordinary identifiers?

The Lexer-Parser Boundary

The division of labor between lexical analysis and syntactic analysis — between the lexer and the parser — is a design decision with far-reaching consequences. In theory, the boundary is clean: the lexer recognizes regular languages, the parser recognizes context-free languages. In practice, the boundary is porous.

Consider the C preprocessor. In C, the lexer must maintain a symbol table to distinguish type names from identifiers, because the parser cannot resolve this ambiguity with context-free means alone. The lexer has become semantically aware, blurring the theoretical boundary. In Python, significant whitespace means the lexer must track indentation levels and inject INDENT and DEDENT tokens into the token stream. In languages with string interpolation, the lexer must switch between lexical modes as it encounters embedded expressions.

These are not edge cases. They are evidence that the regular-context-free boundary is a modeling convenience, not a natural kind. The lexer is not a mere preprocessor; it is an active participant in the construction of syntactic structure.

Lexical Analysis as Pattern Recognition

At its core, lexical analysis is pattern recognition at the character level. Modern lexers are typically generated from regular expression specifications by tools like Lex, Flex, or ANTLR. The specification describes each token type as a regular expression; the generator compiles these expressions into a deterministic finite automaton that recognizes the longest valid prefix of the input at each position.

This longest-match rule is crucial. If the input is 'ifx', the lexer must recognize 'ifx' as an identifier, not 'if' as a keyword followed by 'x' as an identifier. If the input is '123.456', the lexer must recognize it as a single floating-point literal, not as an integer '123' followed by '.' and another integer '456'. These disambiguation rules are not part of the formal grammar; they are part of the lexical convention.

Lexical analysis is where the rubber of formal language theory meets the road of human-readable syntax. The fact that we use regular expressions to specify tokens is not because programming language lexemes are intrinsically regular — it is because we have fast, well-understood algorithms for regular languages. The lexer is a pragmatic concession to computational complexity, and like all pragmatic concessions, it leaks. The boundary between lexer and parser is not a theorem; it is a treaty. And treaties are made to be contested.