Lexeme

A lexeme is the concrete sequence of characters in source code that the lexical analyzer matches and classifies as a token. While a token type is an abstract category (KEYWORD, IDENTIFIER, NUMBER), the lexeme is the actual text that was matched. The distinction between token type and lexeme is fundamental to compiler design because it separates what a piece of code is from what it says.
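The pairing of category and text can be sketched as a small record type. This is a minimal illustration, not taken from any particular compiler: the class name Token, its field names, and the sample lexemes are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical token record: an abstract category plus the matched text."""
    type: str    # what the piece of code is, e.g. "IDENTIFIER"
    lexeme: str  # what it says: the exact characters from the source


# Illustrative tokens; the lexemes are made-up sample input.
keyword = Token("KEYWORD", "while")
first_name = Token("IDENTIFIER", "count")
second_name = Token("IDENTIFIER", "total")
number = Token("NUMBER", "42")

# Two tokens can share a type and still differ in their lexemes.
same_type = first_name.type == second_name.type
same_text = first_name.lexeme == second_name.lexeme
```

Here same_type is true while same_text is false: the parser sees two interchangeable identifiers, but the lexemes keep them apart.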

The lexeme preserves information that the token type discards. When a compiler reports that a variable is undeclared on line 47, it retrieves the variable's name from the token's lexeme; the token type alone (IDENTIFIER) would be useless for the error message. In languages with macros or code generation, lexemes may be re-emitted or transformed while their token types guide syntactic analysis.
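A rough sketch of how a diagnostic might draw on the lexeme follows. The function check_declared, the line field, the set of declared names, and the identifier total are hypothetical; real compilers consult a scoped symbol table rather than a flat set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical token record that also remembers its source line."""
    type: str
    lexeme: str
    line: int


def check_declared(token, declared):
    """Return an error message for an undeclared identifier, else None."""
    if token.type == "IDENTIFIER" and token.lexeme not in declared:
        # The token type says only "some identifier"; the lexeme names it.
        return f"line {token.line}: '{token.lexeme}' is undeclared"
    return None


declared = {"count"}                     # assumed set of known names
use = Token("IDENTIFIER", "total", 47)   # made-up use of an unknown name
message = check_declared(use, declared)
```

Under these assumptions, message is "line 47: 'total' is undeclared". Without the lexeme, the most the function could say is that some identifier on line 47 is undeclared.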

The process of mapping characters to lexemes is governed by the maximal munch rule: the lexer consumes the longest possible prefix of the remaining input that forms a valid lexeme. This prevents ambiguity when an identifier begins with the characters of a keyword: the whole run of characters must be read as a single identifier rather than as the keyword followed by the remaining characters.
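The maximal munch rule can be sketched as a loop that tries every rule at the current position and keeps the longest match. The rule set, the tie-breaking order (earlier rules win ties, so keywords beat identifiers of equal length), and the sample inputs are assumptions made for this illustration; production lexers usually compile the rules into a single automaton instead.

```python
import re

# Hypothetical token rules. Order is used only to break ties in length.
RULES = [
    ("KEYWORD", re.compile(r"if|else|while")),
    ("IDENTIFIER", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("NUMBER", re.compile(r"[0-9]+")),
    ("OPERATOR", re.compile(r"==|=|<=|<")),
]


def tokenize(source):
    """Split source into (type, lexeme) pairs using longest-match selection."""
    tokens = []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1  # whitespace separates lexemes but is not one itself
            continue
        best_type, best_lexeme = None, ""
        for token_type, pattern in RULES:
            match = pattern.match(source, pos)
            # Only a strictly longer match replaces the current best,
            # so an earlier rule wins when two matches have equal length.
            if match and len(match.group()) > len(best_lexeme):
                best_type, best_lexeme = token_type, match.group()
        if best_type is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        tokens.append((best_type, best_lexeme))
        pos += len(best_lexeme)
    return tokens
```

With these rules, tokenize("iffy") yields a single IDENTIFIER token with lexeme "iffy", because the identifier match is longer than the keyword match "if". The input "if fy", where whitespace ends the first lexeme, yields the KEYWORD "if" followed by the IDENTIFIER "fy".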

The lexeme is the ghost of the source code that haunts every phase of compilation. Long after the parser has built its tree and the optimizer has rewritten the code, the lexeme survives in symbol tables, debug information, and error messages. A compiler that forgets its lexemes is a compiler that has forgotten where it came from.