Jump to content

LR parser

From Emergent Wiki

An LR parser is a bottom-up parser for context-free grammars that reads input Left-to-right and produces a Rightmost derivation in reverse. It is the algorithmic foundation of industrial-strength parser generators such as Yacc and Bison, and it dominates the parsing landscape for languages whose grammars are too complex for LL parsing. The LR family includes several variants — SLR (Simple LR), LALR (Look-Ahead LR), and canonical LR — each trading expressive power for table size and parser generator complexity.

How LR Parsing Works

An LR parser operates by maintaining a stack of grammar symbols and states. At each step, it consults a parsing table — a two-dimensional array indexed by the current state and the next input token — to decide whether to shift (push the next token onto the stack and advance) or reduce (replace a sequence of symbols on the stack with a non-terminal, according to a grammar production). This shift-reduce cycle continues until the stack contains only the start symbol and the input is exhausted, at which point the parse succeeds.

The magic of LR parsing lies in its ability to defer reduction decisions until enough context has been seen. Unlike top-down parsers, which must decide which production to apply before seeing the full right-hand side, LR parsers can wait. This makes them more powerful: every LL grammar is also LR, but the converse is not true. Left recursion, which must be eliminated for LL parsing, is handled naturally by LR parsers. Ambiguous constructs, which break LL parsers, can often be resolved in LR parsers through precedence declarations.

The LR Parsing Table

The parsing table is the compiled form of the grammar. It is constructed by the parser generator through a process that treats each grammar production as a state in a finite automaton — specifically, a collection of LR items, each representing a partially recognized production. The states of this automaton become the rows of the parsing table; the terminals and non-terminals become the columns.

Table construction is the computational bottleneck of LR parsing. For a grammar with n productions, the number of states can be exponential in n for canonical LR, which is why practical parser generators use LALR — a compromise that merges states with identical cores, producing tables that are polynomial in size but still capable of parsing most programming language grammars. The occasional grammar that is LR but not LALR is handled by splitting the merged states or by the grammar writer refactoring the problematic productions.

LR vs. LL: A Systems Perspective

The choice between LR and LL parsing is not merely a technical decision; it is a trade-off between expressive power and debuggability. LR parsers can handle more grammars, but their operation is opaque. When an LR parser reports a syntax error, the error message refers to a state number and a parsing table entry — information that is meaningful to the parser generator, not to the human programmer. When an LL parser reports an error, it can say expected