Sunday, April 5, 2020

GHC: How whitespace sensitive operator lexing works

In GHC, each occurrence of a Haskell operator is classified into one of four categories according to the whitespace around it. For example, the occurrence of ⊕ in a ⊕ b is "loose infix", in a⊕b it is "tight infix", in a ⊕b it is "prefix" and in a⊕ b it is "suffix".

The point of this is that certain operators can be given different meanings depending on the classification of their occurrence and on the language extensions in effect. For example, ! lexes as a strictness annotation (token type ITbang) if its occurrence is prefix (e.g. f !x = rhs) and as an ordinary operator (token type ITvarsym) otherwise (e.g. xs ! 3). Another ready example is the operator @ which, depending on the surrounding whitespace, may be a type application (prefix), an as-pattern (tight infix), an ordinary operator (loose infix) or a parse error (suffix).
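
To make these cases concrete, here is a small sketch of how the different occurrences look in ordinary source code. The function names and the list-indexing operator (!) defined here are illustrative assumptions, not taken from GHC, and the sketch assumes a GHC that implements this whitespace-sensitive lexing.

```haskell
{-# LANGUAGE BangPatterns, TypeApplications #-}

-- A hypothetical list-indexing operator, defined only for this example.
(!) :: [a] -> Int -> a
(!) xs n = xs !! n

-- Prefix occurrence: '!' directly before 'acc' is a strictness annotation.
sumStrict :: [Int] -> Int
sumStrict = go 0
  where
    go !acc []     = acc
    go !acc (y:ys) = go (acc + y) ys

-- Loose infix occurrence: '!' with whitespace on both sides is an
-- ordinary operator.
fourth :: [Int] -> Int
fourth xs = xs ! 3

-- Prefix occurrence: '@' directly before 'Int' is a type application.
parsed :: Int
parsed = read @Int "42"

-- Tight infix occurrence: '@' between 'whole' and '(' is an as-pattern.
firstAndAll :: [a] -> Maybe (a, [a])
firstAndAll whole@(x:_) = Just (x, whole)
firstAndAll []          = Nothing
```

Writing the last pattern as whole@ (x:_) would make the occurrence of @ a suffix one, which is the parse error case mentioned above.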

The implementation of this categorization relies upon two functions: followedByOpeningToken and precededByClosingToken. To explain further:

  • Identifiers, literals and opening brackets (, (#, [|, [||, [p|, [t|, { are considered "opening tokens";
  • Identifiers, literals and closing brackets ), #), ], |], } are considered "closing tokens";
  • Other tokens and whitespace are considered neither opening nor closing.

The classification algorithm is defined by the following rules:

precededByClosingToken   followedByOpeningToken   Occurrence
False                    True                     prefix
True                     False                    suffix
True                     True                     tight infix
False                    False                    loose infix
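
Read as a function of the two predicates, the rules amount to a four-way case split. The following is a minimal sketch; the Occurrence type and the classify function are names invented here for illustration and do not appear in GHC's lexer.

```haskell
-- | The four categories of operator occurrence (hypothetical type).
data Occurrence = Prefix | Suffix | TightInfix | LooseInfix
  deriving (Eq, Show)

-- | Classify an occurrence from the two facts the lexer checks:
-- is the operator preceded by a closing token, and is it followed
-- by an opening token?
classify :: Bool -> Bool -> Occurrence
classify precededByClosing followedByOpening =
  case (precededByClosing, followedByOpening) of
    (True,  True ) -> TightInfix
    (False, True ) -> Prefix
    (True,  False) -> Suffix
    (False, False) -> LooseInfix
```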

The implementation of precededByClosingToken is very straightforward: look backwards one character in the lexing buffer.
precededByClosingToken :: AlexAccPred ExtsBitmap
precededByClosingToken _ (AI _ buf) _ _ =
  case prevChar buf '\n' of
    '}' -> decodePrevNChars 1 buf /= "-"
    ')' -> True
    ']' -> True
    '\"' -> True
    '\'' -> True
    '_' -> True
    c -> isAlphaNum c
Similarly, followedByOpeningToken looks forwards one character in the lexing buffer.
followedByOpeningToken :: AlexAccPred ExtsBitmap
followedByOpeningToken _ _ _ (AI _ buf)
  | atEnd buf = False
  | otherwise =
      case nextChar buf of
        ('{', buf') -> nextCharIsNot buf' (== '-')
        ('(', _) -> True
        ('[', _) -> True
        ('\"', _) -> True
        ('\'', _) -> True
        ('_', _) -> True
        (c, _) -> isAlphaNum c
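
To experiment with these two predicates outside of GHC, they can be modelled on plain strings. The sketch below is an approximation under stated assumptions: it takes the text before and after the operator as String values instead of GHC's lexing buffer, and the names precededByClosing, followedByOpening and aroundOperator are invented here.

```haskell
import Data.Char (isAlphaNum)

-- | Does the text before the operator end in a closing token?
-- An empty prefix counts as "no", mirroring the '\n' default above.
precededByClosing :: String -> Bool
precededByClosing before =
  case reverse before of
    []          -> False
    ('}':'-':_) -> False  -- "-}" closes a comment, not a bracket
    (c:_)       -> c `elem` "})]\"'_" || isAlphaNum c

-- | Does the text after the operator start with an opening token?
followedByOpening :: String -> Bool
followedByOpening after =
  case after of
    []          -> False
    ('{':'-':_) -> False  -- "{-" opens a comment, not a bracket
    (c:_)       -> c `elem` "{([\"'_" || isAlphaNum c

-- | Both facts about one operator occurrence, given its surroundings.
aroundOperator :: String -> String -> (Bool, Bool)
aroundOperator before after =
  (precededByClosing before, followedByOpening after)
```

For a⊕b the call is aroundOperator "a" "b", which gives (True, True); for a ⊕ b it is aroundOperator "a " " b", which gives (False, False).
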
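Armed with these rules, the lexing of operators looks like this (when several rules match the same input, Alex picks the one that appears first, so the order of the rules matters):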
<0> {
  @varsym / { precededByClosingToken `alexAndPred` followedByOpeningToken } { varsym_tight_infix }
  @varsym / { followedByOpeningToken }  { varsym_prefix }
  @varsym / { precededByClosingToken }  { varsym_suffix }
  @varsym                               { varsym_loose_infix }
}

The actions varsym_tight_infix, varsym_prefix, varsym_suffix and varsym_loose_infix are "fed" the operator and can issue extension-specific tokens instead of the general ITvarsym token. For example, varsym_prefix:

varsym_prefix :: Action
varsym_prefix = sym $ \exts s ->
  if | TypeApplicationsBit `xtest` exts, s == fsLit "@"
     -> return ITtypeApp
     |  ...
     | otherwise -> return (ITvarsym s)
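
The shape of such an action can be tried out in isolation with a toy model. The sketch below is not GHC's code: the Token and Ext types, their constructors and the prefixToken function are hypothetical stand-ins, and only the two prefix cases described in this post (@ and !) are modelled.

```haskell
{-# LANGUAGE MultiWayIf #-}

-- Hypothetical stand-ins for GHC's token type and extension bitmap.
data Token = TokTypeApp | TokBang | TokVarSym String
  deriving (Eq, Show)

data Ext = TypeApplications
  deriving (Eq, Show)

-- | Choose a token for an operator that occurs in prefix position.
prefixToken :: [Ext] -> String -> Token
prefixToken exts s =
  if | TypeApplications `elem` exts, s == "@" -> TokTypeApp
     | s == "!"                               -> TokBang
     | otherwise                              -> TokVarSym s
```

With this model, prefixToken [TypeApplications] "@" yields TokTypeApp, while prefixToken [] "@" falls through to TokVarSym "@".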