Friday, April 9, 2021




The aim is to prove out the idea of an interpreter where the “front end” (lexical analysis) is in Rust and the “back end” (interpretation) is in C++.

  • We’ll parse and evaluate a simple language of arithmetic expressions;
    • The parser is implemented using nom (a Rust parser combinator library);
  • The “tagless-final” idiom is used to split the front and back ends;
  • The interop between C++ and Rust is expressed using the CXX library.


Additive expression syntax we’ll define by this grammar:

    expr := term ('+' term)*
    term := lit | '-' term | '(' expr ')'
    lit  := digits

The key idea of the program is this: Don’t define an abstract syntax tree type, values of which are produced by parsing. Rather, as parsing unfolds, call functions defined by the the folllowing trait.

pub trait ExprSyn: Clone {
    fn lit(n: i64) -> Self;
    fn neg(t: Self) -> Self;
    fn add(u: Self, v: Self) -> Self;

With that understood, we implement the parser as a Rust library in the following way.

pub mod parse {
    use nom::{
        character::complete::{digit1 as digit, space0 as space},
        combinator::{map, map_res},
        sequence::{delimited, pair, preceded},

    use super::ExprSyn;
    use std::str::FromStr;

    type ParseResult<'a, E> = IResult<&'a str, E>;

    fn lit<E: ExprSyn>(i: &str) -> ParseResult<E> {
        map_res(delimited(space, digit, space), |x| {

    fn neg<E: ExprSyn>(i: &str) -> ParseResult<E> {
        map(delimited(space, preceded(char('-'), term), space), E::neg)(i)

    fn par<E: ExprSyn>(i: &str) -> ParseResult<E> {
        delimited(space, delimited(tag("("), expr, tag(")")), space)(i)

    fn term<E: ExprSyn>(i: &str) -> ParseResult<E> {
        alt((lit, neg, par))(i)

    pub fn expr<E: ExprSyn>(i: &str) -> ParseResult<E> {
        let (i, init) = term(i)?;
        fold_many0(pair(char('+'), term), init, |acc, (_, val): (char, E)| {
            E::add(acc, val)

To wire that up to C++ we need express a CXX “bridge”.

use cxx::SharedPtr;

pub mod ffi {
    unsafe extern "C++" {
        type Cpp_repr;

        fn lit(i: i64) -> SharedPtr<Cpp_repr>;
        fn neg(e: SharedPtr<Cpp_repr>) -> SharedPtr<Cpp_repr>;
        fn add(l: SharedPtr<Cpp_repr>, r: SharedPtr<Cpp_repr>) -> SharedPtr<Cpp_repr>;

    extern "Rust" {
        fn parse(s: String) -> Result<SharedPtr<Cpp_repr>>;

The header file cpp_repr.hpp referenced by the bridge contains these prototypes.

#pragma once

#include <memory>
#include <cstdint>

struct Cpp_repr {
  int64_t expr;

using cpp_repr_t = std::shared_ptr<Cpp_repr>;

cpp_repr_t lit(int64_t i);
cpp_repr_t neg(cpp_repr_t t);
cpp_repr_t add(cpp_repr_t t, cpp_repr_t u);

The existence of that header is enough to finish off the Rust library.

pub type CppRepr_t = SharedPtr<ffi::Cpp_repr>;

impl ExprSyn for CppRepr_t {
    fn lit(i: i64) -> CppRepr_t {
    fn neg(t: CppRepr_t) -> CppRepr_t {
    fn add(t1: CppRepr_t, t2: CppRepr_t) -> CppRepr_t {
        ffi::add(t1, t2)

pub fn parse(s: String) -> Result<CppRepr_t, String> {
    match parse::expr::<CppRepr_t>(s.as_str()) {
        Ok((_s, rep)) => Ok(rep),
        Err(e) => Err(format!("{}", e)),


We put the C++ part of the implementation in cpp_repr.cpp in a separate library.

#include <iostream>

#include "arith_final_tagless/include/cpp_repr.hpp"

cpp_repr_t lit(int64_t i) {
  return std::shared_ptr<Cpp_repr>{new Cpp_repr{i}};

cpp_repr_t neg(cpp_repr_t t) {
  return std::shared_ptr<Cpp_repr>{new Cpp_repr{-t->expr}};

cpp_repr_t add(cpp_repr_t t, cpp_repr_t u) {
  return std::shared_ptr<Cpp_repr>{new Cpp_repr{t->expr + u->expr}};


The REPL in main.cpp brings the Rust and C++ libraries together into an executable.

#include <iostream>

#include "rust/cxx.h" // 'rust::Error'
#include "arith_final_tagless/src/" // 'parse'

int main() {
  char const* prompt = "\n% ";
  std::cout << "Additive expression evalutator (type CTRL+D to exit)" << prompt;

  std::string line;
  while(std::getline(std::cin, line)) {
    try {
      if (auto p = parse(rust::String{line})) {
        std::cout << line << " = " << p->expr << prompt;
    } catch (rust::Error const& e) {
      std::cerr << e.what() << prompt;

  return 0;

That’s it! If this is interesting to you and you’d like to go deeper into the technical details like, how to write the build scripts and so on for a program like this, the source is online here.

Sunday, February 7, 2021

Configuring Cabal Build Flags


Configuring Cabal build flags

It’s always an emergency when I go looking for this information!

Suppose you are building HLint.

mkdir ~/tmp && cd ~/tmp
curl -o hlint-3.2.7.tar.gz
gunzip  hlint-3.2.7.tar.gz && tar xvf  hlint-3.2.7.tar

Left to their own devices, HLint and ghc-lib-parser-ex will default to auto mode meaning, they will decide for themselves if they should depend on ghc-lib-parser or native ghc libraries.

Sometimes it’s desirable to force the situation though and explicitly make them do one or the other. How you do that? The answer is of course package Cabal flags. There are two scenarios: building with stack or building with cabal.

  • stack.yaml

    • Force link with ghc-lib-parser

            ghc-lib: true
            auto: false
            no-ghc-lib: false
    • Force link with native ghc

            ghc-lib: false
            auto: false
            no-ghc-lib: true
  • cabal.project

    • Force link with ghc-lib-parser

        packages: hlint-3.2.7
        package hlint
          flags: +ghc-lib
        package ghc-lib-parser-ex
          flags: -auto -no-ghc-lib
    • Force link with native ghc

        packages: hlint-3.2.7
        package hlint
          flags: -ghc-lib
        package ghc-lib-parser-ex
          flags: -auto +no-ghc-lib

When working with Cabal in the HLint repository one can configure with command line constraints like this:

cabal new-build --constraint "hlint -ghc-lib" --constraint "ghc-lib-parser-ex -auto +no-ghc-lib"

Monday, January 18, 2021

Two things in Rust


Two things in Rust

Two things I needed to learn before Rust made sense to me.

1 Pattern binding modes

I don’t remember reading about this in the book. Default binding modes come into play when non-reference patterns are encountered.


let x = Some(3);
let y: &Option<i32> = &x;
match y {
    Some(a) -> {
    // `y` is deref'd and `a` is bound as `ref a`
    None => {}

The default binding mode starts as move. Each time a reference is matched using a non-reference pattern; it will automatically derefence the vaue and update the default binding mode - If the reference is &val, set the default binding mode to ref - If the reference is &mut val: if the current default binding is ref, it should remain ref. Otherwise, set the current binding mode to ref mut.

Full details are given in the 2005-match-ergonomics rustlang RFC.


  • Example
match (&Some(3)) {
    Some(p) =>
      // This pattern is a "non-reference pattern".
      // Dereference the `&` and shift the default binding
      // mode to `ref`. `p` is read as `ref p` and given type `i32`.
   x => {
     // In this arm, we are still in `move` mode by default, so `x`
     // has type `&Option<32>`
  • Desugared
    match(&Some(3)) {
      &Some(ref p) => {
      x => {

2. Implict Deref Coercisons with Functions and Methods

This is another ergonomics feature that saves on explicit &s and *s when writing function and method calls. When we pass a reference to a function or method call deref implicitly as needed to coerce to the parameter target type.

  • Example:
fn hello(name: &str) {
    println!("Hello, {}", name);

fn main() {
    let m = MyBox::new(String::from("Rust"));
  • Example:
pub struct Point {
    x: Vec<i32>,
    y: Vec<i32>,

impl Point {
    pub fn x(&self) -> &Vec<i32> {
        match self {
            &Point { ref x, .. } => x,

fn use_i32(_: &i32) -> () {}

fn use_vi32(_: &Vec<i32>) -> () {}

fn use_str(_: &str) -> () {}

fn use_strr(_: &&String) -> () {}

fn main() {
    let p: Point = Point {
        x: vec![],
        y: vec![],

    let rp: &Point = &p;
    let rrp: &&Point = &rp;

    println!("p.x = {:?}", rrp.x());

    let _s: &str = "foo";
    let s: String = String::from(_s);
    let bs: Box<String> = Box::new(s.clone());
    let bsr: Box<&String> = Box::new(&s);
    let bi32: Box<Vec<i32>> = Box::new(vec![]);


    let r: &&String = &bsr;
    let r2: &String = r.deref();


    let p = Point {
        x: vec![1, 2, 3],
        y: vec![4, 5, 6],
    match &p {
        Point { x, y } => {
            println!("{:?}, {:?}", x, y);
    println!("{:?}, {:?}", p.x, p.y);

Friday, October 30, 2020

ghc-lib-parser module count

ghc-lib-parser module count

When a user installs a program like HLint via Cabal, there's a good chance they'll pay a cost of building ghc-lib-parser. The build time for ghc-lib-parser is proportional to the number of modules that it contains, the less there are the faster it builds. The fewer there are, the better the user experience installing HLint.

Back in September last year there was a bit of a scare. At that time, ghc-lib-parser consisted of around 165 modules (and ghc-lib had roughly 300). An MR landed that unintentionally resulted in ghc-lib-parser needing 543 modules (and ghc-lib getting just 25). Fortunately a quick refactoring sorted that out (thanks @sgraf1337!) As @mpickering_ correctly pointed out at that time "any fix should ensure adding a test to make sure it doesn't happen again". To that end, @cocreature and I managed to work out the details of one which we can have a look at in a minute.

Since a year has now elapsed since the test was introduced, it's interesting to see how the module count has changed over time. TL;DR it's not too bad — the number of modules has increased from around 165 a year ago to about 230 today:

So how does the test work? To be comfortable in the knowledge that the number of modules needed in ghc-lib-parser will not signficantly change without anyone noticing, it's enough to have a program that counts them. That program when inserted into GHC's continuous integration system alerts the committer to a limit breach. How do we count them? We use the GHC API to compute the transitive closure of the dependencies of GHC.Parser — the count is the cardinality of that. The code is only a few lines. The key to it is the function hscGetModuleInterface. You can read the source here.

Friday, April 17, 2020

Syntactic ambiguity resolution in the GHC parser

Syntactic ambiguity resolution in the GHC parser

There are places in the Haskell grammar where it's not known apriori whether it's an expression a command or a pattern that is being parsed. This used to be handled by picking a parse (e.g. as an expression say) and if that choice later turned out to be wrong, "rejigging it" (transform the constructed parse tree to its analog in the pattern language). The problem with that approach is that it meant having conflated sub-languages meaning, for example, HsExpr had to have pattern related constructors e.g. EWildPat, EAsPat (and further, these propogated into other compiler phases like the renamer and typechecker). This was the case until roughly a year ago before extraordinary work by Vladislav Zavialov who solved the ambiguity resolution issue by parsing into an abstraction with an overloaded representation:

class DisambECP b where ...
newtype ECP = ECP { unECP :: forall b. DisambECP b => PV (Located b) }
This innovation might be considered to have come at a cost for developers familiar with the "old" parser however. That is, dealing with understanding the apparent complexity introduced by the ambiguity resolution system. This post attempts to provide some intuition about how the system works and hopefully will lead to the realization that it's not that hard to understand after all!

Because this post is about building intuition, there are details that are glossed over or omitted entirely: the reader is encouraged to read Vlad's detailed explanatory comments in RdrHsSyn.hs when neccessary to address that.

We start with something familiar - the GHC parser monad:

newtype P a = P { unP :: PState -> ParseResult a }

This fundamentally is a wrapper over a function PState -> ParseResult a.

The (let's call it the) "ECP system" introduces a new (and as we'll see, very related) concept. The parser validator monad:

newtype PV a = PV { unPV :: PV_Context -> PV_Accum -> PV_Result a }

So a parser validator is a function similar in spirit to a parser where:

  • data PV_Context: The type of essentially a wrapper around the lexer ParserFlags value;
  • data PV_Accum: The type of state accumulated during parsing validation (like errors & warnings , comments, annotations);
  • data PV_Result: The parser validator function's result type that is, data PV_Result a = PV_Ok PV_Accum a | PV_Failed PV_Accum.

Of critical interest is how this type is made a monad.

instance Functor PV where
  fmap = liftM

instance Applicative PV where
  pure a = a `seq` PV (\_ acc -> PV_Ok acc a)
  (<*>) = ap

The above reveals that an expression like return e where e is of type Located b, constructs a function that given arguments ctx and acc returns e. The moral equivalent of const.

instance Monad PV where
  m >>= f = PV $ \ctx acc ->
    case unPV m ctx acc of
      PV_Ok acc' a -> unPV (f a) ctx acc'
      PV_Failed acc' -> PV_Failed acc'

The bind operation composes PV actions threading context and accumlators through the application of their contained functions: given an m :: PV a and a function f :: a -> PV b, then m >>= f constructs a PV b that wraps a function that composes f with the function in m.

PV is a bit more than a monad, it also satisfies the MonadP class for monads that support parsing-related operations providing the ability to query for active language extensions, store warnings, errors, comments and annotations.

instance MonadP PV where
  addError srcspan msg = ....
    PV $ \ctx acc@PV_Accum{pv_messages=m} ->
      let msg' = msg $$ pv_hint ctx in
      PV_Ok acc{pv_messages=appendError srcspan msg' m} ()
  addWarning option srcspan warning = ...
  addFatalError srcspan msg =...
  getBit ext =
    PV $ \ctx acc ->
      let b = ext `xtest` pExtsBitmap (pv_options ctx) in
      PV_Ok acc $! b
  addAnnotation (RealSrcSpan l _) a (RealSrcSpan v _) = ...

The function runPV is the interpreter of a PV a. To run a PV a through this function is to produce a P a.

runPV :: PV a -> P a

That is, given a PV a construct a function PState -> ParseResult a.

runPV m =
  P $ \s ->
      pv_ctx = PV_Context {...} -- init context from parse state 's'
      pv_acc = PV_Accum {...} -- init local state from parse state 's'
      -- Define a function that builds a parse state from local state
      mkPState acc' =
        s { messages = pv_messages acc'
          , annotations = pv_annotations acc'
          , comment_q = pv_comment_q acc'
          , annotations_comments = pv_annotations_comments acc' }
      -- Invoke the function in m with context and state, harvest its revised state and
      -- turn its outcome into a ParseResult.
      case unPV m pv_ctx pv_acc of
        PV_Ok acc' a -> POk (mkPState acc') a
        PV_Failed acc' -> PFailed (mkPState acc')

It is often the case that a production (or set of productions) might result different ASTs depending on the context. Ideally, we just want to write the productions once and reuse them across these different sub-languages (e.g. expressions vs. commands vs. patterns). For example, the production for a parenthesized "thing" is

'(' texp ')'

In the context of a pattern we expect an AST with a ParPat _ p node whereas in the context of an expression we want an AST with an HsPar _ e node. To this end the DisambECP class embodies an abstract set of operations for parse tree construction.

class DisambECP b where

  -- | Return a command without ambiguity, or fail in a non-command context.
  ecpFromCmd' :: LHsCmd GhcPs -> PV (Located b)
  -- | Return an expression without ambiguity, or fail in a non-expression context.
  ecpFromExp' :: LHsExpr GhcPs -> PV (Located b)

  ... Lots of operations like this
  mkHsOpAppPV :: SrcSpan -> Located b -> Located (InfixOp b) -> Located b -> PV (Located b)
  mkHsVarPV :: Located RdrName -> PV (Located b)


The idea is that in the semantic actions of the grammar we construct and compose parser validators in terms of these abstract functions. Running the PVs produces parsers and at the point of execution of parsers we know the context (the nature of the AST we expect to recive) and the concrete choices for each of the abstract functions is thereby fixed (and then, on evaluation, we get the parse result).

The only wrinkle is in the return type of productions that produce parser validators. In general, they will have the form forall b. DisambECP b => PV (Located b). If they were monadic productions though we would be led to P (forall b. DisambECP b => PV (Located b) and that dog don't hunt for GHC's lack of support for impredicative types. There is a standard work-around that can be employed though. This newtype is how impredicative types in monadic productions are avoided:

newtype ECP = ECP { unECP :: forall b. DisambECP b => PV (Located b) }

So here, ECP is a wrapper around a PV (Located b) value where b can be of any type that satisifies the constraints of class DisamECP. So, in a production that looks like

| ... {% return (ECP ...)}

we are dealing with P ECP whereas without a newtype we would be dealing with P (forall b. DisambECP b => PV (Located b)).

Now to produce a P (Located b) from the PV (Located b) in an ECP we can use this expression (of type DisambECP b => ECP -> P (Located b)):

runPV (unECP p)

It takes an ECP value, projects out the parser validator contained therein and "runs" it to produce a function from PState -> ParseResult a (a parser action).

From the DisabmECP instance for HsExpr GhcPs, here's ecpFromCmd':

  ecpFromCmd' (L l c) = do
    addError l $ vcat
      [ text "Arrow command found where an expression was expected:",
        nest 2 (ppr c) ]
    return (L l hsHoleExpr)

Makes perfect sense - you get a parser validator that when evaluated will store a (non-fatal) error and returns an expression "hole" (unbound variable called _) so that parsing can continue.

Continuing, the definition of ecpFromExp':

  ecpFromExp' = return

Also sensible. Simply calculate a function that returns its provided acc argument together with the given constant expression under a PV_Ok result (see the definition of pure in the Appliciatve instance for PV given above).

Parenthesizing an expression for this DisambECP instance means wrapping a HsPar around the given e:

  mkHsParPV l e = return $ L l (HsPar noExtField e)

And so on. You get the idea.

So how does this all fit together? Consider agin the production of parenthesized things:

        | '(' texp ')'  { ECP $
                            unECP $2 >>= \ $2 ->
                            amms (mkHsParPV (comb2 $1 $>) $2) [mop $1,mcp $3] }

We note that the texp production calculates an ECP. Stripping away for simplicity the annotation and source code location calculations in the semantic action, in essence we are left with this.

ECP $ unECP $2 >>= \ $2 -> mkHsParPV $2

The effect of unECP is to project out the forall b. DisambECP b => PV (Located b) value from the result of texp. Recalling that unPV projects out the function that the PV wrapper shields and by substition of the definition of bind, we obtain roughly:

  ECP $ PV $ \ctx acc ->
                case unPV (unECP $2) ctx acc of
                  PV_Ok acc' a -> unPV (mkHsParPV a) ctx acc'
                  PV_Failed acc' -> PV_Failed acc'

The net effet is we construct a new parser validatior (function) from the parser validator (function) returned from the texp production that puts parenthesis around whatever that function when evaluated produces. If used in a context where texp generates a LPat GhcPs that'll be a ParPat node, if an LHsExpr GhcPs, then a HsPar node.

Sunday, April 5, 2020

GHC: How whitespace sensitive operator lexing works

How whitespace sensitive operator lexing works

In GHC, Haskell operator occurrences get classified into one of four categories. For example, the occurrence of ⊕ in a ⊕ b is "loose infix", in a⊕b is "tight infix", in a ⊕b is "prefix" and in a⊕ b, "suffix"

The point of this is that certain operators can be ascribed different meanings depending on the classification of their occurrence and language extensions that may be in effect. For example, ! when encountered will lex as strictness annotation (token type ITbang) if its occurrence is prefix (e.g. f !x = rhs) or an ordinary operator (token type ITvarsym ) if not (e.g. xs ! 3). Another ready example is provided by operator @ which, according to whitespace considerations, may be a type application (prefix), an as-pattern (tight infix), an ordinary operator (loose infix) or a parse error (suffix).

The implementation of this categorization relies upon two functions: followedByOpeningToken and precededByClosingToken. To explain further:

  • Identifiers, literals and opening brackets (, (#, [|, [||, [p|, [t|, { are considered "opening tokens";
  • Identifiers, literals and closing brackets ), #), ], |], } are considered "closing tokens";
  • Other tokens and whitespace are considered neither opening or closing.

The classification algorithm is defined by the following rules:

TrueTruetight infix
FalseFalseloose infix

The implementation of precededByClosingToken is very straightforward: look backwards one character in the lexing buffer.
precededByClosingToken :: AlexAccPred ExtsBitmap
precededByClosingToken _ (AI _ buf) _ _ =
  case prevChar buf '\n' of
    '}' -> decodePrevNChars 1 buf /= "-"
    ')' -> True
    ']' -> True
    '\"' -> True
    '\'' -> True
    '_' -> True
    c -> isAlphaNum c
Similarly, followedByOpeningToken: look forwards one character in the lexing buffer.
followedByOpeningToken :: AlexAccPred ExtsBitmap
followedByOpeningToken _ _ _ (AI _ buf)
  | atEnd buf = False
  | otherwise =
      case nextChar buf of
        ('{', buf') -> nextCharIsNot buf' (== '-')
        ('(', _) -> True
        ('[', _) -> True
        ('\"', _) -> True
        ('\'', _) -> True
        ('_', _) -> True
        (c, _) -> isAlphaNum c
Armed by these rules, the lexing of operators looks like this:
<0> {
  @varsym / { precededByClosingToken `alexAndPred` followedByOpeningToken } { varsym_tight_infix }
  @varsym / { followedByOpeningToken }  { varsym_prefix }
  @varsym / { precededByClosingToken }  { varsym_suffix }
  @varsym                               { varsym_loose_infix }

The actions varsym_tight_infix, varsym_prefix, varsym_suffix and varsym_loose_infix are "fed" the operator and allow for language extension specific issuance of tokens (as opposed to issuance of general ITvarsym tokens). For example, varsym_prefix :

varsym_prefix :: Action
varsym_prefix = sym $ \exts s ->
  if | TypeApplicationsBit `xtest` exts, s == fsLit "@"
     -> return ITtypeApp
     |  ...
     | otherwise -> return (ITvarsym s)

Sunday, March 1, 2020

GHC Haskell Pats and LPats

GHC Haskell Pats and LPats

In the Trees that Grow paper, it is explained that GHC has a single data type HsSyn that crosses several compiler phases; a second data type TH.Syntax for Template Haskell and that other Haskell libraries e.g. haskell-src-exts defnining yet others. Ideally, HsSyn would be reused in Template Haskell and these third-party libraries and motivates the flexibilities offered by the TTG (Trees That Grow) techniques.

Before GHC 8.8, patterns and located patterns were related in the following way:

type LPat = Located Pat
data Pat p
  = ...
  | LazyPat (XLazyPat p) (LPat p)
That is, patterns with locations are represented by values of type LPat and patterns themselves as values of type Pat. Note that LPat values contain Pat values which in turn can contain LPat values hence the name "ping pong style" being given to this idiom.

Since location annotations may (e.g. GHC native) or may not (e.g. Template Haskell) be present for a given application it is realized that "baking" locations into HsSyn is undesirable. For this reason, in 8.8 attempts were made to make their presence a strictly GHC "thing" in the following way:

type LPat p = Pat p
data Pat p
  = ...
  | LazyPat (XLazyPat p) (LPat p)
  | ...
  | XPat (XXPat p)
type instance XXPat (GhcPass p) = Located (Pat (GhcPass p))
That is, in GHC under this approach, locations are stored in the extension constructor - patterns with locations are wrapped in XPat e.g. XPat noExt (L _ (VarPat noExt _)). Of course, now, to get at the location you have to go through an indirection through XPat. For this, the functions cL and dL (and the bi-directional pattern synonym LL) were provided. Applications that don't want locations in the parse tree just don't make use of the XPat constructor.

It turned out that the 8.8 approach wasn't as good an idea as it seemed; it was a bit more complicated than it needed to be and had some unexpected implications for the existing GHC source code base. It was realized that this following alternative approach yields the same benefits and is what we find in 8.10 and beyond:

type family XRec p (f :: * -> *) = r | r -> p f
type instance XRec (GhcPass p) f = Located (f (GhcPass p))

type LPat p = XRec p Pat
data Pat p
  = ...
  | LazyPat (XLazyPat p) (LPat p)
  | ...
  | XPat (XXPat p)
type instance XXPat   (GhcPass _) = NoExtCon
Thus for GHC, ping-pong style is restored and applications other than GHC can define the XRec instance as simply f p so that locations are absent.

In practical terms, going from 8.8 to 8.10 LL becomes L, dL -> is removed and cL is just L.