Recursive descent
Recursive descent parsing is a top-down parsing technique in which each non-terminal in a context-free grammar is implemented as a function that recognizes the corresponding syntactic construct. The parser descends through the grammar recursively: to parse an expression, it calls the expression function; to parse a term within the expression, it calls the term function; to parse a factor within the term, it calls the factor function. The call stack of the program mirrors the structure of the parse tree, making recursive descent one of the most intuitive and pedagogically valuable parsing techniques.
The technique is as old as structured programming. It was the standard approach for hand-written parsers before the advent of parser generators like yacc, and it has experienced a renaissance in modern compiler construction because it produces excellent error messages and is easy to modify for incremental or error-tolerant parsing.
How It Works
Consider a simple grammar for arithmetic expressions:
- Expression → Term { ('+' | '-') Term }
- Term → Factor { ('*' | '/') Factor }
- Factor → Number | '(' Expression ')'
A recursive descent parser implements each non-terminal as a function:
- parseExpression() calls parseTerm(), then checks for '+' or '-' and calls parseTerm() again if found.
- parseTerm() calls parseFactor(), then checks for '*' or '/' and calls parseFactor() again if found.
- parseFactor() checks if the current token is a number (and consumes it) or a '(' (in which case it calls parseExpression() and expects a matching ')').
The parser maintains a cursor pointing to the current token in the input stream. Each parsing function consumes tokens as it recognizes them and returns a node representing the parsed construct. If a function encounters a token it does not expect, it reports a syntax error.
Advantages
Recursive descent parsers have several practical advantages over table-driven approaches:
Clarity. The structure of the parser directly reflects the structure of the grammar. A programmer reading the parser can see exactly how each syntactic construct is recognized without consulting a parsing table or state machine.
Error messages. Because each parsing function knows what construct it is trying to recognize, it can produce precise error messages. A parseFactor() function that expects a number or left parenthesis can report "expected number or '('" rather than the generic "syntax error" produced by many table-driven parsers.
Flexibility. Hand-written recursive descent parsers can easily incorporate ad hoc rules that are difficult to express in formal grammars. A parser for a language with significant whitespace (like Python) or with context-sensitive features (like type-dependent name resolution) can be written as recursive descent with semantic predicates that would break a pure grammar-based approach.
Debugging. Because the parser is ordinary code, it can be debugged with ordinary tools. Setting a breakpoint in parseExpression() is trivial; setting a breakpoint in a yacc-generated state machine is not.
Limitations
The great limitation of recursive descent parsing is left recursion. A grammar production of the form A → A α is left-recursive: the non-terminal A appears as the leftmost symbol on the right-hand side. A naive recursive descent implementation would call parseA() at the start of parseA(), producing infinite recursion. Left recursion is common in programming language grammars: expression grammars typically write Expression → Expression + Term rather than the right-recursive equivalent Expression → Term + Expression.
Left recursion can be eliminated by grammar transformation. The left-recursive production A → A α | β can be rewritten as the right-recursive equivalent A → β { α }, which a recursive descent parser can handle by iteration rather than recursion. However, this transformation changes the shape of the parse tree and may require post-processing to recover the left-associative structure.
Another limitation is backtracking. A naive recursive descent parser that tries alternatives in sequence and backtracks on failure can exhibit exponential time complexity in the worst case. Predictive parsers avoid backtracking by using lookahead: they examine the next k tokens to decide which alternative to pursue, guaranteeing linear time. A grammar that permits this decision without backtracking is called an LL(k) grammar.
Recursive Descent in Practice
Despite the rise of parser generators, recursive descent remains the dominant technique for production compilers. The parsers for GCC, Clang, Rust, Go, and Swift are all recursive descent or predominantly recursive descent. The reason is not nostalgia. It is that recursive descent parsers produce better error messages, are easier to maintain, and can be incrementally adapted to language evolution in ways that generated parsers cannot.
The Rust compiler's parser is a particularly instructive example. Rust's syntax includes features (like the turbofish operator ::<>) that are context-sensitive and would be difficult to express in a pure context-free grammar. The hand-written recursive descent parser encodes these rules directly, using the full power of the host language to resolve ambiguities.
Recursive descent parsing is the technique that programming language designers reach for when they care about the developer experience more than the elegance of their formalism. It is the parser of pragmatists, not purists. The fact that every major modern language uses it is not a coincidence. It is a vote of no confidence in the table-driven orthodoxy, and it is the right vote.
See also: Parser, LL Parser, LR Parser, Compiler, Top-Down Parsing, Grammar, Context-Free Grammar, Left Recursion, Predictive Parsing, Abstract Syntax Tree