Scanning
The scanner takes in raw source code as a series of characters and groups it into a series of chunks we call tokens.
Is scanning required for every parser?
No—it's possible to parse characters directly. But it is better to do scanning first; otherwise you would simply end up writing scanner code inside the parser anyway.
Lexemes and Tokens
The rules that determine how a particular language groups characters into lexemes are called its lexical grammar.
Our job is to scan through the list of characters and group them together into the smallest sequences that still represent something. Each of these blobs of characters is called a lexeme.
The lexemes are only the raw substrings of the source code. When we take the lexeme and bundle it together with other useful data, the result is a token.
```java
class Token {
  final TokenType type;
  final String lexeme;
  final Object literal;
  final int line;

  Token(TokenType type, String lexeme, Object literal, int line) {
    this.type = type;
    this.lexeme = lexeme;
    this.literal = literal;
    this.line = line;
  }

  public String toString() {
    return type + " " + lexeme + " " + literal;
  }
}
```
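The grouping step itself can be sketched as follows. This toy scanner handles only digit runs and a few operators; the class name `MiniScanner` and its return of bare lexeme strings (rather than full tokens) are illustrative assumptions, not the real scanner:

```java
import java.util.ArrayList;
import java.util.List;

// A minimal sketch of scanning: group a raw character stream into
// lexemes, taking the longest run of characters that still forms one.
public class MiniScanner {
    public static List<String> scan(String source) {
        List<String> lexemes = new ArrayList<>();
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (Character.isDigit(c)) {
                int start = i;
                // Consume the maximal run of digits as a single NUMBER lexeme.
                while (i < source.length() && Character.isDigit(source.charAt(i))) i++;
                lexemes.add(source.substring(start, i));
            } else if (c == '+' || c == '*' || c == '(' || c == ')') {
                lexemes.add(String.valueOf(c)); // single-character lexeme
                i++;
            } else {
                i++; // skip whitespace and anything unrecognized
            }
        }
        return lexemes;
    }

    public static void main(String[] args) {
        System.out.println(scan("12 + (34 * 5)")); // [12, +, (, 34, *, 5, )]
    }
}
```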
Error Recovery
We want to report as many separate errors as we can, but we don’t want to report ones that are merely side effects of an earlier one.
Panic mode
As soon as the parser detects an error, it enters panic mode. It knows at least one token doesn’t make sense given its current state in the middle of some stack of grammar productions.
Before it can get back to parsing, it needs to get its state and the sequence of forthcoming tokens aligned such that the next token does match the rule being parsed. This process is called synchronization.
To do that, we select some rule in the grammar that will mark the synchronization point. The parser fixes its parsing state by jumping out of any nested productions until it gets back to that rule. Then it synchronizes the token stream by discarding tokens until it reaches one that can appear at that point in the rule.
Any additional real syntax errors hiding in those discarded tokens aren’t reported, but it also means that any mistaken cascaded errors that are side effects of the initial error aren’t falsely reported either, which is a decent trade-off.
The traditional place in the grammar to synchronize is between statements.
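Synchronizing between statements can be sketched like this. The `Sync` class, its plain-string tokens, and the particular keyword set are illustrative stand-ins for a real parser's state, assuming statements end with `;` and begin with keywords such as `var` or `print`:

```java
import java.util.List;

// A sketch of panic-mode synchronization: after an error, discard
// tokens until we pass a ';' or reach something that likely begins
// a new statement, then resume parsing there.
public class Sync {
    final List<String> tokens;
    int current;

    Sync(List<String> tokens, int current) {
        this.tokens = tokens;
        this.current = current;
    }

    void synchronize() {
        while (current < tokens.size()) {
            // A ';' most likely ends a statement: resume right after it.
            if (tokens.get(current).equals(";")) { current++; return; }
            // These keywords most likely begin a statement: resume here.
            switch (tokens.get(current)) {
                case "if": case "for": case "while": case "var":
                case "fun": case "return": case "print":
                    return;
            }
            current++; // discard a token belonging to the broken statement
        }
    }

    public static void main(String[] args) {
        // Error detected at index 3 (the stray "+"); sync to "print".
        Sync parser = new Sync(
            List.of("var", "x", "=", "+", "+", ";", "print", "x", ";"), 3);
        parser.synchronize();
        System.out.println(parser.tokens.get(parser.current)); // prints "print"
    }
}
```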
Ambiguity
When parsing, ambiguity means the parser may misunderstand the user’s code. As we parse, we aren’t just determining if the string is valid Lox code, we’re also tracking which rules match which parts of it so that we know what part of the language each token belongs to.
For example, the grammar allows seeing the expression 6 / 3 - 1 as either (6 / 3) - 1 or 6 / (3 - 1).
The way mathematicians have addressed this ambiguity since blackboards were first invented is by defining rules for precedence and associativity.
Precedence determines which operator is evaluated first in an expression containing a mixture of different operators. Equivalently, higher precedence operators are said to “bind tighter”.
Associativity determines which operator is evaluated first in a series of the same operator.
In principle, it doesn’t matter whether you treat multiplication as left- or right-associative—you get the same result either way. Alas, in the real world with limited precision, roundoff and overflow mean that associativity can affect the result of a sequence of multiplications. Consider:
```
print 0.1 * (0.2 * 0.3);
print (0.1 * 0.2) * 0.3;
```
In languages like Lox that use IEEE 754 double-precision floating-point numbers, the first evaluates to 0.006, while the second yields 0.006000000000000001.
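The claim is easy to check in any IEEE 754 language; a quick sketch in Java, which also uses 64-bit doubles (the class name `Assoc` is just for illustration):

```java
// Demonstrates that floating-point multiplication is not associative:
// each multiplication rounds, so the grouping can change the result.
public class Assoc {
    public static void main(String[] args) {
        System.out.println(0.1 * (0.2 * 0.3));
        System.out.println((0.1 * 0.2) * 0.3);
    }
}
```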
Recursive Descent
Recursive descent is considered a top-down parser because it starts from the top or outermost grammar rule (here expression) and works its way down into the nested sub-expressions before finally reaching the leaves of the syntax tree.
A recursive descent parser is a literal translation of the grammar’s rules straight into imperative code. Each rule becomes a function. The body of the rule translates to code roughly like:
| Grammar notation | Code representation               |
|------------------|-----------------------------------|
| Terminal         | Code to match and consume a token |
| Non-terminal     | Call to that rule's function      |
| `\|`             | `if` or `switch` statement        |
| `*` or `+`       | `while` or `for` loop             |
| `?`              | `if` statement                    |
The descent is described as “recursive” because when a grammar rule refers to itself—directly or indirectly—that translates to a recursive function call.
Precedence is baked into the grammar, and associativity is determined by whether a rule is left- or right-recursive.
Each rule matches expressions at its own precedence level or higher.
```
expression → equality ;
equality   → comparison ( ( '!=' | '==' ) comparison )* ;
comparison → term ( ( '>' | '>=' | '<' | '<=' ) term )* ;
term       → factor ( ( '-' | '+' ) factor )* ;
factor     → unary ( ( '/' | '*' ) unary )* ;
unary      → ( '!' | '-' ) unary
           | primary ;
primary    → NUMBER | STRING | 'true' | 'false' | 'nil'
           | '(' expression ')' ;
```
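As a sketch of how such rules translate into code, here is a tiny recursive descent evaluator for the `term`/`factor`/`primary` portion of the grammar, restricted to single-digit numbers. The class name `Descent` and the evaluate-as-you-parse shortcut (returning a value instead of building a syntax tree) are assumptions for illustration:

```java
// Each grammar rule becomes a method; the while loops correspond to
// the ( ... )* repetitions and fold operands left-associatively.
public class Descent {
    private final String src;
    private int pos = 0;

    Descent(String src) { this.src = src; }

    // term → factor ( ( '-' | '+' ) factor )* ;
    double term() {
        double value = factor();
        while (pos < src.length() && (src.charAt(pos) == '+' || src.charAt(pos) == '-')) {
            char op = src.charAt(pos++);
            double right = factor();
            value = (op == '+') ? value + right : value - right; // left-associative fold
        }
        return value;
    }

    // factor → primary ( ( '/' | '*' ) primary )* ;
    double factor() {
        double value = primary();
        while (pos < src.length() && (src.charAt(pos) == '*' || src.charAt(pos) == '/')) {
            char op = src.charAt(pos++);
            double right = primary();
            value = (op == '*') ? value * right : value / right;
        }
        return value;
    }

    // primary → DIGIT | '(' term ')' ;
    double primary() {
        char c = src.charAt(pos++);
        if (c == '(') {
            double value = term();
            pos++; // consume ')'
            return value;
        }
        return c - '0';
    }

    public static void main(String[] args) {
        System.out.println(new Descent("6/3-1").term()); // prints 1.0: (6 / 3) - 1
    }
}
```

Note that `factor` sits below `term`, so `1+2*3` parses the `2*3` before the addition: precedence falls out of which rule calls which.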
Left factoring
Parsing left-associative expressions naturally calls for left-recursive grammar rules.
The basic problem is that in a recursive descent parser, left recursion causes instant death by stack overflow: the rule's function invokes itself again as its very first step, before consuming any token.
A related problem is a grammar in which several productions of the same rule share a common prefix. A top-down parser cannot tell which production to choose until it has already consumed the prefix, which forces it to backtrack or guess.
To tackle this, we convert the grammar into an equivalent one in which no rule's productions share a common prefix. That transformation is called left factoring.
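As an illustrative sketch (not from any particular language), here is a rule whose two productions share the prefix `'if' expr 'then' stmt`, followed by its left-factored equivalent, where the shared prefix is pulled out and the differing tails move into a new rule:

```
stmt → 'if' expr 'then' stmt 'else' stmt
     | 'if' expr 'then' stmt ;
```

After left factoring:

```
stmt     → 'if' expr 'then' stmt elsePart ;
elsePart → 'else' stmt
         | ε ;
```

Now a top-down parser can commit to the single `stmt` production immediately and decide about `elsePart` by looking at one token.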
Predictive Parser
A predictive parser is one that doesn't need to do any backtracking to parse the language. Grammars that a predictive recursive descent parser can handle are called LL(𝑘) grammars, where 𝑘 is the number of lookahead tokens you need in order to choose the right production.
Pratt Parsing
Pratt parsing uses a table that stores, for each token type:

- the function to compile a prefix expression starting with a token of that type,
- the function to compile an infix expression whose left operand is followed by a token of that type, and
- the precedence of an infix expression that uses that token as an operator.
The parsePrecedence(precedence) function parses any expression at the given precedence level or higher, stopping when it reaches an operator whose precedence is lower.
Left associativity is handled by parsing the right operand with precedence + 1.
The algorithm goes like this:

- An expression always begins with a prefix expression, so consume a token and call the prefix function for that (now previous) token's type.
- Then, while the current token's infix precedence is >= the passed precedence, consume it and call its infix function from the table.
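The loop above can be sketched as a tiny self-contained evaluator. Here the "table" is a map from single-character operators to precedences, with the prefix and infix actions inlined; the names (`Pratt`, `INFIX_PRECEDENCE`) and the evaluate-as-you-parse shortcut are illustrative assumptions, not clox's actual API:

```java
import java.util.Map;

// A minimal Pratt parser over single-character tokens that evaluates
// expressions like "1+2*3" with correct precedence and associativity.
public class Pratt {
    private static final Map<Character, Integer> INFIX_PRECEDENCE =
            Map.of('+', 1, '-', 1, '*', 2, '/', 2);

    private final String src;
    private int pos = 0;

    Pratt(String src) { this.src = src; }

    double parsePrecedence(int precedence) {
        // The first token always starts a prefix expression:
        // here, a digit or a unary minus.
        char c = src.charAt(pos++);
        double left = (c == '-') ? -parsePrecedence(3) : c - '0';

        // While the next token's infix precedence is at least the one
        // we were asked for, consume it and apply its infix action.
        while (pos < src.length()
                && INFIX_PRECEDENCE.getOrDefault(src.charAt(pos), 0) >= precedence) {
            char op = src.charAt(pos++);
            // precedence + 1 for the right operand makes the operator
            // left-associative: an equal-precedence operator to the
            // right will not be swallowed by this recursive call.
            double right = parsePrecedence(INFIX_PRECEDENCE.get(op) + 1);
            switch (op) {
                case '+': left += right; break;
                case '-': left -= right; break;
                case '*': left *= right; break;
                case '/': left /= right; break;
            }
        }
        return left;
    }

    public static void main(String[] args) {
        System.out.println(new Pratt("1+2*3").parsePrecedence(1)); // prints 7.0
        System.out.println(new Pratt("8-3-2").parsePrecedence(1)); // prints 3.0 (left-assoc)
    }
}
```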