Thursday, March 19, 2015

Y Combinator

There is this blog post by Caltech computer scientist, Mike Vanier. The code in Mike's article uses the Scheme programming language. This one uses Python.

A $Y$ combinator is a higher-order function which, given argument $f$ (say,) satisfies the equation $Y f = f\;(Y f)$. They are said to compute a "fixed point" $f^{'}$ of their argument $f$ since $f^{'} = Y\;f = f\;(Y\;f) = f \;f^{'}$.

A $Y$ combinator takes a function that isn't recursive and returns a version of the function that is. That is, $Y$ is a function that takes $f$ and returns $f (f (f (\cdots)))$.

The existence of $Y$ combinators is amazing in that it tells us that it is possible to write recursive functions even in a programming language that says nothing about recursion!

The goal here is to derive a $Y$.

Start with the classic recursive definition of the factorial function.

  def fact (n) :
    if n == 0 : 
      return 1 
      return n * fact (n - 1)

We are trying to eliminate explicit recursion. To that end, factor out the recursive call and make it an application of a function argument.

def part_fact (this, n):
  if n == 0 :
    return 1
     return n * this (this, (n - 1))

fact = functools.partial(part_fact, part_fact)
That's sufficient to get rid of the explicit recursion but we mean to push on in search of a "library" solution to this problem, that is, some general result that can be re-used.

Next let's get this down to a function in one argument as is this way in the $\lambda$ calculus.

  def part_fact (this):
    return lambda n : 1 if n == 0 else n * (this (this)) (n - 1)

  fact = part_fact (part_fact)

We'd recover something tantalizingly close to the original factorial function if we factored out this (this) into a function of one argument.

def part_fact (this):
  f = this (this)
  return lambda n : 1 if n == 0 else n * f (n - 1)
fact = part_fact(part_fact)

This would be fine in a lazy language but Python is a strict language and exhibits infinite recursion because to evaluate part_fact (part_fact) requires evaluating f = part_fact (part_fact) and so on. The solution is to delay the evaluation until it's needed.

def part_fact (this):
  f = lambda y : (this (this)) y
  return lambda n : 1 if n == 0 else n * f (n - 1)
fact = part_fact(part_fact)

Refactor this into two parts.

def almost_fact (f):
  return lambda n : 1 if n == 0 else f (n - 1)

def part_fact (this):
  f = lambda y : (this (this))(y)
  return almost_fact (f)

fact = part_fact(part_fact)

Rephrase part_frac as a lambda and change the argument name to x.

def almost_fact (f):
  return lambda n : 1 if n == 0 else f (n - 1)

part_fract = lambda x : almost_fact (lambda y : (x (x))(y))

fact = part_fact (part_fact)

Eliminate 'part_fact'.

def almost_fact (f):
  return lambda n : 1 if n == 0 else f (n - 1)

fact = (lambda x : almost_fact (lambda y : (x (x))(y))) \
          (lambda x : almost_fact (lambda y : (x (x))(y)))

That's it! There's the $Y$ combinator. Generalize.

def Y (f):
 return (lambda x : f (lambda y : (x (x))(y))) \
           (lambda x : f (lambda y : (x (x))(y)))

def almost_fact (f):
  return lambda n : 1 if n == 0 else f (n - 1)

fact = Y (almost_fact)
That is, $Y = \lambda f.(\lambda x. f\;(\lambda y. (x\;x)\;y))\;(\lambda x. f\;(\lambda y. (x\;x)\;y))$. This $Y$ combinator is known s Curry's paradoxical combinator (in its "applicative order" form to account for the fact that Python is strict).

Try this on another function. Fibonacci numbers say.

def almost_fib (f) :
  return lambda n : 1 if n <= 2 else f (n - 1) + f (n - 2)

fib = Y (almost_fib)

print (str (fib (6))+"\n") #Prints '8'

Saturday, March 14, 2015

Labeled and optional arguments

I don't know why this is, but of all the features of the OCaml language, somehow remembering how to make use of labeled or optional arguments is difficult for me. I find when I need to use them, I have to go back and "relearn" the syntax every time. For that reason, I've written these summary notes for myself (based mostly on Jacques Garrigue's "Labels and Variants" chapter of the OCaml reference) and I hope they might be of help you too!


I've seen people write this function from time to time.
let flip (f : 'a -> 'b -> 'c) = fun y x -> f x y

flip applied to a function f in two arguments, produces a new function that when presented reverses their order in the application of f. For example, if sub is defined by let sub x y = x - y, then sub 3 2 will give the result 1 whereas (flip sub) 3 2 will produce the value -1.

The reason you'd find yourself wanting a function like flip is because you want to partially apply f but the argument you want to bind is inconveniently in the 'wrong' position. For example, the function that subtracts 2 from it's argument can be written flip (sub 2) (whereas sub 2 is the function that subtracts its argument from 2).

Labeled arguments do away with the need for functions like flip. You see, arguments in function types can be annotated.

val sub : (x:int) -> (y:int) -> int
In the above, x and y are labels. When you define a function with labeled arguments, you prefix the labels with a tilde ~.
let sub : (x:int -> y:int -> int) = fun ~x ~y -> x - y

Labels and variables are not the same things. A label is a name associated with a variable e.g. let u = 3 and v = 2 in sub ~x:u ~y:v (u, v are variables x, y labels) and this also works when the variables are replaced with literals as in sub ~x:3 ~y:2.

The expression ~x on it's own is shorthand for ~x:x which means that you can conveniently write let x = 3 and y = 2 in sub ~x ~y for example. This is "punning" - meaning if the variable name and the label are the same, the variable name is permitted to be elided in the application.

Labels enable making function application commutative (changing the order of the arguments does not change the result); one can write sub ~y ~x and that will yield the same result as sub ~x ~y. Where this is useful of course is partial application of a function on any any argument. For example, we can create functions from sub by binding either of x or y such as sub ~x:3 or sub ~y:2 without having to resort to ad-hoc trickery like flip sub.

Some technical details:

  • If more than one argument of a function has the same label (or no label), position prevails and they will not commute among themselves;
  • If an application is total, labels can be omitted (e.g. like in sub 3 2). Be wary though - when a function is polymorphic in its return type, it will never be considered as totally applied. Consider for example , let app ~f ~x = f x. Then app has type f:('a -> 'b') -> x:'a -> 'b. Now we attempt total application but omit the required labels app (fun x -> x) 1 and this types to f:('a -> ('b -> 'b) -> int -> 'c) -> x:'a -> 'c. You can see that the arguments have been "stacked up" in anticipation of presentation of f and x and the result although it types, is not at all what we were going for!
  • When a function is passed to a higher order function, labels must match in both types, labels may not be added or removed.

Optional arguments

In a function type involving an optional argument, the annotation ? appears.
val range : ?step:int -> int -> int -> int list
The ? appears again in the value syntax of a function definition involving an optional argument. Additionally, at that time, optional parameters can be given default values. When calling a function with a value for an optional parameter, well, you use the fact it is a labeled argument and provide a ~ as normal.
let rec range ?(step:int = 1) (m : int) (n : int) : int list = 
    if m >= n then [] 
    else m :: (range ~step (m + step) n)

Now, this bit is important. A function can not take optional arguments alone. There must be some non-optional ones there too and it is when a non-optional parameter that comes after an optional one in the function type is encountered is it determined if the optional parameter has been omitted or not. For example, if we reorder the argument list as in the below, we find we can't 'erase' the optional argument anymore.

let rec range (m : int) (n : int) ?(step : int = 1): int list = ...
Warning 16: this optional argument cannot be erased.
That said, optional parameters may commute with non-optional or unlabeled ones, as long as they are applied simultaneously. That is, going back to this definition,
let rec range ?(step:int = 1) (m : int) (n : int) : int list = 
    if m >= n then [] 
    else m :: (range ~step (m + step) n)
(range 0 ~step:2) 100 is valid whereas, (range 0) ~step:2 100 is not.

Optional arguments are implemented as option types. In the event no default is given, you can treat them as exactly that by matching on None or Some x to switch behaviors. Here's a schematic of what that looks like.

let f ?s y =
    match s with
      | None -> ...
      | Some x -> ...  

When you are just "passing through" an optional argument from one function call to another, you can prefix the applied argument with ? as in the following.

 let g ?s x = ...
 let h ?s x = g ?s x  (*[s] is passed through to [g]*)

Friday, March 6, 2015

Heap sort

Given the existence of a priority queue data structure, the heap sort algorithm is trivially implemented by loading the unsorted sequence into a queue then successively pulling of the minimum element from the queue until the queue is exhausted.

There are many ways to implement a priority queue and so we seek an expression for a function for heap sorting that is polymorphic over those choices.

To begin, a module type for priority queues.

(**Priority queues over ordered types*)
module type PRIORITY_QUEUE = sig

    (**Output signature of the functor [Make]*)
    module type S = sig
        exception Empty

        type element (*Abstract type of elements of the queue*)
        type t (*Abstract type of a queue*)

        val empty : t (*The empty queue*)
        val is_empty : t -> bool (*Check if queue is empty*)
        val insert : t -> element -> t (*Insert item into queue*)
        val delete_min : t -> t (*Delete the minimum element*)
        val find_min : t -> element (*Return the minimum element*)
        val of_list : element list -> t

    (**Input signature of the functor [Make]*)
    module type Ordered_type = sig
        type t
        val compare : t -> t -> int

   (**Functor building an implementation of the priority queue structure
      given a totally ordered type*)
    module Make : 
       functor (Ord : Ordered_type) -> S with type element = Ord.t

An implementation of this signature using "leftist heaps" is described for the interested in this Caltech lab but such details are omitted here.

module Priority_queue : PRIORITY_QUEUE = struct 
  module type S = sig .. end
  module type Ordered_type = sig .. end
  module Make (Elt : Ordered_type) : (S with type element = Elt.t) = struct .. end

What I really want to show you is this. We start with the following module type abbreviation.

type 'a queue_impl = (module Priority_queue.S with type element = 'a)
Then, the heap_sort function can be written such that it takes a module as a first class value and uses a locally abstract type to connect it with the element type of the list to be sorted.
let heap_sort (type a) (queue : a queue_impl) (l : a list) : a list =
  let module Queue = 
     (val queue : Priority_queue.S with type element = a) in
  let rec loop acc h =
    if Queue.is_empty h then acc
      let p = Queue.find_min h in
      loop (p :: acc) (Queue.delete_min h) in
  List.rev (loop [] (Queue.of_list l))
There we have it. The objective has been achieved : we have written a heap sorting function that is polymorphic in the implementation of the priority queue with which it is implemented.

Usage (testing) proceeds as in this example.

(*Prepare an [Priority_queue.Ordered_type] module to pass as argument
  to [Priority_queue.Make]*)
module Int : Priority_queue.Ordered_type with type t = int = struct 
  type t = int let compare = 

(*Make a priority queue module*)
module Int_prioqueue : (Priority_queue.S with type element = int) = Priority_queue.Make (Int)

(*Make a first class value of the module by packing it*)
let queue = (module Int_prioqueue : Priority_queue.S with type element = int)

(*Now, pass the module to [heap_sort]*)
let sorted = heap_sort queue [-1; -2; 2] (*Produces the list [-2; -1; 2]*)

Addendum :

These ideas can be pushed a little further yielding a simpler syntax for the parametric heapsort algorithm.
(*Type abbreviations*)

type 'a order_impl = (module Priority_queue.Ordered_type with type t = 'a)
type 'a queue_impl = (module Priority_queue.S with type element = 'a)

(*Module factory functions*)

let mk_ord : 'a. unit -> 'a order_impl =
  fun (type s) () ->
       type t = s 
       let compare = 
     end : Priority_queue.Ordered_type with type t = s

let mk_queue : 'a. unit -> 'a queue_impl =
  fun (type s) ord ->
    let module Ord = 
      (val mk_ord () : Priority_queue.Ordered_type with type t = s) in
    (module Priority_queue.Make (Ord) :
                               Priority_queue.S with type element = s)
For example, now we can write
# heap_sort (mk_queue ()) [-3; 1; 5] ;;
  - : int list = [-3; 1; 5]

Sunday, February 15, 2015

Fold left via fold right

The puzzle is to express fold_left entirely in terms of fold_right. For example, an attempted solution like

let fold_left f e s =
  List.rev (fold_right (fun a acc -> f acc a) e (List.rev s))
is inadmissible because it relies on List.rev and thus is not entirely in terms of fold_right.

Recall that given a function $f$, a seed $e$ and a list $[a_{1}; a_{2}; \cdots; a_{N}]$, fold_left computes $f\;(\cdots f\;(f\;e\;a_{1})\;a_{2}\;\cdots)\;a_{N}$ whereas fold_right computes $f\;a_{1}\;(f\;a_{2}\;(\cdots\;(f\;a_{N}\;e)\cdots))$. There's really no choice but to delay computation and the expression that solves this problem is this.

let fold_left f e s =
  List.fold_right (fun a acc -> fun x -> acc (f x a)) s (fun x -> x) e

For example, in the top-level

 # fold_left (fun acc x -> x * x :: acc) [] [1; 2; 3;] ;;
   - : int list = [9; 4; 1]

To see how this works, consider the right fold over the list $[a_{1}; a_{2}; a_{3}]$ (say)

  • On encountering $a_{3}$ we compute $f_{3} = \lambda x_{3} . i\; (f\;x_{3}\;a_{3})$;
  • On encountering $a_{2}$ we compute $f_{2} = \lambda x_{2} . f_{3}\;(f\;x_{2}\;a_{2})$;
  • On encountering $a_{1}$ we compute $f_{1} = \lambda x_{1} . f_{2}\;(f\;x_{1}\;a_{1})$;
but then, $f_{1}\;e = f_{2}\;(f\;e\;a_{1}) = f_{3}\;(f\;(f\;e\;a_{1})\;a_{2}) = f\;(f\;(f\;e\;a_{1})\;a_{2})\;a_{3}$ as desired.

Saturday, February 7, 2015

Recursive lists in C++

Earlier this week, I had a need for a recursive list, that is, a list defined in terms of itself. I think, "back in the day" implementing a data structure of that sort would have been a snap for the everyday C programmer. Today, in this modern C++ world I found myself struggling a little and came to think that maybe the old ways are fading :)

For motivation, here's a couple of examples of the sort of thing I'm talking about.

(1) The list [0; 1; 0; 1; 0; 1; ...] is a list with a cycle in it. In OCaml you'd write that as let rec l = 0 :: 1 :: l.

(2) An interpreter using the technique of environments and closures can require an environment ((string * value) list) to contain a closure where the closure contains the environment. In OCaml you'd write let rec vars = (tag, V_closure (vars, xpr)) :: !env; env := vars.

Of course with pointers, it's not hard to implement recursive structures in C++. The trouble is having to concern yourself with their memory management due to the absence of garbage collection.

Alright, here is what I came up with. The code is pretty short.

#include <boost/variant.hpp>

#include <memory>
#include <stdexcept>

template <class T> struct node;
template <class T> using node_ptr=typename node<T>::node_ptr;
template <class T> using node_weak_ptr=typename node<T>::weak_ptr;
template <class T> using node_shared_ptr=typename node<T>::shared_ptr;
template <class T> struct ptr_t;
template <class T> using list=ptr_t<node<T>>;
template <class T> using list_ref=node_weak_ptr<T>;

template <class T> list<T> nil (); 
template <class T> bool empty (list<T> l);
template <class T> list<T> cons (T val, list<T> l);
template <class T> T& hd (list<T> l);
template <class T> list<T>& tl (list<T> l);
template <class T> list_ref<T> ref (list<T> src);
template <class T> bool is_ref (list<T> src);

The idea behind the implementation is generalize a pointer to node as a union with two variants, a shared pointer or a weak pointer.

template <class T> struct ptr_t :
  boost::variant <std::shared_ptr<T>, std::weak_ptr<T>> {
  typedef boost::variant <std::shared_ptr<T>, std::weak_ptr<T>> base;
  ptr_t () {}
  ptr_t (std::weak_ptr<T> p) : base (p) {}
  ptr_t (std::shared_ptr<T> p) : base (p) {}

template <class T>
struct node {
  typedef ptr_t<node> node_ptr;
  typedef std::weak_ptr<node> weak_ptr;
  typedef std::shared_ptr<node> shared_ptr;

  T data;
  node_ptr next;

This little bit of implementation detail comes up a couple of times so it's handy to factor it out.

namespace {
  //'get' at the raw pointer in the union of a smart/weak pointer
  template <class T>
  T* get (ptr_t<T> l) {
    if (std::shared_ptr<T>* p=
        boost::get<std::shared_ptr<T>>(&l)) {
      return p->get ();
    return boost::get<std::weak_ptr<T>>(l).lock ().get ();

The rest of the implementation is basically a set of "one-liners".

template <class T> list<T> nil (){ 
  return node_shared_ptr<T> (); 

template <class T> bool empty (list<T> l) { 
  return (get (l)) == nullptr; 

template <class T> list<T> cons (T val, list<T> l) {
  return node_shared_ptr<T> (new node<T>{val, l});

template <class T> T& hd (list<T> l) {   
  if (empty (l))
    throw std::runtime_error ("hd");
  return get (l) -> data;

template <class T> list<T>& tl (list<T> l) { 
  if (empty (l))
    throw std::runtime_error ("tl");
  return get (l) -> next; 

template <class T> bool is_ref (list<T> src) {
  return boost::get<list_ref<T>>(&src)!=nullptr;

template <class T> node_weak_ptr<T> ref (list<T> src)  {
  return node_weak_ptr<T>(boost::get<node_shared_ptr<T>>(src));

OK, well, that's about it. Let's see, regarding usage, (1) could be expressed like this

list<int> l = cons (0, cons (1, nil<int> ())); tl (tl (l)) = ref (l);
or, if we assume the existence of a 'last' function with an obvious definition, could be tidied up to read
list<int> l = cons (0, cons (1, nil<int> ())); tl (last (l)) = ref (l);
and (2) can be stated like this
typedef std::pair<std::string, value_t> p_t;
list<p_t> vars = node_shared_ptr<p_t>(new node<p_t>);
hd (vars) = std::make_pair (tag, V_closure {ref (vars), xpr});
tl (vars) = *env; 
*env = vars;

Monday, December 22, 2014

Compiling regular expressions (II)

This article is a continuation of the earlier post, "Compiling regular expressions (I)".

Automata are modeled as 'state' records with two fields. The pos field contains the set of positions that are valid for recognition in the given state. Transitions are modeled as lists of pairs of symbols and states. In this way a state may contain transitions that reference itself.

type state = {
  pos : Int_set.t;
  mutable trans : (char * state) list ;

We will require a function that for each input symbol $a$ and a given set of positions $s$, computes the list of pairs $(a, s')$ where $s'$ is the subset of $s$ that correspond to $a$.

let (partition : char option array -> Int_set.t 
               -> (char option * Int_set.t) list) =
  fun chars s ->
    let f acc c =
      match c with
      | Some _ ->
        if List.mem_assoc c acc then acc 
          let f i acc = 
            if chars.(i) <> c then acc else Int_set.add i acc in
          (c, Int_set.fold f s (Int_set.empty)) :: acc
      | None -> 
        if List.mem_assoc c acc then acc else (c, Int_set.empty) :: acc in
    List.rev (Array.fold_left f [] chars)
This function makes a list from a set of ints.
let list_of_int_set : Int_set.t -> Int_set.elt list = 
  fun s -> List.rev (Int_set.fold (fun e acc -> e :: acc) s [])
This function, accessible given a state, computes the list of sets that accessible from that state.
let (accessible : state -> Int_set.t array -> char option array 
                                      -> (char * Int_set.t) list) =
  fun s follow chars ->
    let part = partition chars s.pos in
    let f p rest =
      match p with
      | (Some c, l) -> 
           ( (Array.get follow) (list_of_int_set l))
        ) :: rest
      | _ -> rest  in
    List.fold_right f part []
find_state takes a set $s$ and two lists of states (marked and unmarked). It searches for a state which has a pos field equal to $s$ and returns this state or it fails.
let (find_state : Int_set.t -> state list -> state list -> state) =
  fun s l m ->
    let test e = e.pos = s in
      List.find test l
    | Not_found -> List.find test m

The algorithm to compute the automata works like this. Two lists are maintained, marked and unmarked states. The algorithm is initialized such that the only state is unmarked with a pos field containing first_pos $n_{0}$ where $n_{0}$ is the root of the syntax tree; the list of transitions is empty.

For an unmarked state $st$, the algorithm does these things:

  • Calculate a set of numbers accessible from $st$. That is, a set of pairs $(c, s)$, where $c$ is a character and $s$ a set of positions. A position $j$ is accessible from $st$ by $c$ if there is an $i$ in st.pos such that $j$ is in follow $i$ and $i$ numbers the character $c$.
  • For each of the pairs $(c, s)$
    • If there exists a state st' (whether marked or unmarked) such that $s = $st'.pos, it adds $(c, st')$ to the transitions of $st$;
    • Otherwise, a new state $st'$ without transitions is created, added to the transitions of $st$, and $st'$ is added to the list of unmarked states.
  • It marks $st$.
The algorithm terminates only when there are no remaining unmarked states. The result is an array of states obtained from the list of marked states. The terminal states are all those containing the number associated with Accept. Here then is the algorithm in code.
let rec (compute_states : state list -> state list -> Int_set.t array 
                                   -> char option array -> state array) =
  fun marked unmarked follow chars ->
    match unmarked with
    | [] -> Array.of_list marked
    | st :: umsts ->
      let access = accessible st follow chars in
      let marked1 = st :: marked in
      let f (c, s) umsts =
        if Int_set.is_empty s then 
          umsts (*Suppress empty sets*)
            st.trans <- (c, find_state s marked1 umsts) ::st.trans ;
          | Not_found -> 
            let state1 = {pos = s; trans = []} in
            st.trans <- (c, state1) :: st.trans;
            state1 :: umsts in
      let unmarked1 = List.fold_right f access umsts in
      compute_states marked1 unmarked1 follow chars

We are just about ready to write the function to compute the automaton. It is fundamentally a call to compute_states but does one more thing. That is, it searches the resulting array for the index of the initial state and puts the index in the first slot of the array. To do this it uses the utility function array_indexq which performs the search for the index using physical equality. This is because the usual test using structural equality will not terminate on structures that loop.

let (array_indexq : 'a array -> 'a -> int) =
  fun arr e ->
    let rec loop i =
      if i = Array.length arr then
        raise (Not_found)
      else if Array.get arr i == e then i
      else loop (i + 1) in
    loop 0
So, here it is, dfa_of, the function to compute the automaton.
let (dfa_of : augmented_regexp * Int_set.t array * char option array 
                                                        -> state array) =
  fun (e, follow, chars) ->
    let init_state = {pos = first_pos e; trans = []} in
    let dfa = compute_states [] [init_state] follow chars in
    (*Installing initial state at index 0*)
    let idx_start = array_indexq dfa init_state in
    dfa.(idx_start) <- dfa.(0);
    dfa.(0) <- init_state;

We are now on the home stretch. All that remains is to write a function to interpret the automaton. To do this, we'll make use of a mini-combinator library of recognizers. I'll not provide the OCaml code for that today - you could reverse engineer from my earlier 'Recognizers' blog-post or, consult [1].

let (interpret_dfa : state array -> int -> char Recognizer.recognizer) =
  fun dfa accept ->
    let num_states = Array.length dfa in
    let fvect = Array.make (num_states) (fun _ -> failwith "no value") in
    for i = 0 to num_states - 1 do
      let trans = dfa.(i).trans in
      let f (c, st) =
        let pc = Recognizer.recognizer_of_char c in
        let j = array_indexq dfa st in
        Recognizer.compose_and pc (fun l -> fvect.(j) l) in
      let parsers = f trans in
      if Int_set.mem accept (dfa.(i).pos) then
        fvect.(i) <- compose_or_list 
                        (Recognizer.end_of_input) parsers
      else match parsers with
      | [] -> failwith "Impossible"
      | p :: ps -> fvect.(i) <- Recognizer.compose_or_list p ps
We wrap up with a couple of high level convenience functions : compile produces a recognizer from a string representation of a regular expression and match takes a recognizer (that is, a compiled regular expression) and a string and uses the recognizer to categorize the given string as admissible or not (where explode is a simple function that transforms a string into a char list - recognizers operate on lists).
let compile xpr =
    let ((e, follow, chars) as ast) = regexp_follow xpr in
    let dfa = dfa_of ast in
    let parser = interpret_dfa dfa (Array.length chars - 1) in
    fun s -> parser (explode s)

let re_match xpr s =
  let result = xpr s in
  match result with
  | Recognizer.Remains [] -> true
  | _ -> false

Here's a simple test driver that shows how these functions can be used.

let test xpr s = 
  match re_match xpr s with
  | true -> Printf.printf "\"%s\" : success\n" s
  | false -> Printf.printf "\"%s\" : fail\n" s

let _ = 
    let xpr = compile "(a|b)*abb" in
    Printf.printf "Pattern: \"%s\"\n" "(a|b)*abb" ;
    test xpr "abb" ;
    test xpr "aabb" ;
    test xpr "baabb" ;
    test xpr "bbbbbbbbbbbbbaabb" ;
    test xpr "aaaaaaabbbaabbbaabbabaabb" ;
    test xpr "baab" ;
    test xpr "aa" ;
    test xpr "ab" ;
    test xpr "bb" ;
    test xpr "" ;
    test xpr "ccabb" ;
  | Failure msg -> print_endline msg

So that's it for this series of posts on building recognizers for regular expressions. Hope you enjoyed it!

[1] "The Functional Approach to Programming" - Cousineau & Mauny
[2] "Compilers Principles, Techniques & Tools" - Aho et. al.

Thursday, December 18, 2014

Compiling regular expressions (I)

This post picks up from here which was concerned with parsing - obtaining representations of regular expressions as abstract syntax trees. The ultimate goal is, given a string representation of a regular expression $e$ , produce a 'recognizer' for the expression (referred to as compiling a regular expression). That is, a function string -> bool that can be used to categorize strings as either belonging to the language $\mathcal{L_{e}}$ or not.

Having produced an abstract syntax tree for a regular expression $e$, the first step in compiling the expression is to compute an abstract syntax tree of the corresponding augmented regular expression $(e)\#$. This augmented regular expression is the original expression $e$ concatenated with a unique end-marker $\#$. For the given expression $e$, the accepting state for $e$ is given a transition on $\#$. This is a device that allows us to "forget" about accepting states as the computation of a recognizer proceeds; when the construction is complete, any state with a transition on $\#$ must be an accepting state.

Leaves in the abstract syntax tree of the augmented regular expression $(e)\#$ are labeled by $\epsilon$ or a symbol from from $\mathcal{A}$. For those non-$\epsilon$ leaves we attach a unique integer. Accordingly, we will need functions to generate unique integers (positions) that we will employ as we transform the AST of $e$ into the AST of the augmented expression $(e)\#$ leading to our first code example.

let reset_label, generate_label =
 let r = ref (-1) in
 ((fun () -> r := (-1)), (fun () -> r := !r + 1; !r))

As we construct the syntax tree of the $(e)\#$ we compute four functions : null_pos, first_pos, last_pos and following:

  1. null_pos is $true$ for a syntax-tree node $n$ if and only if the sub-expression represented by $n$ has $\epsilon$ in its language. That is, $true$ if the regular sub-expression recognizes the empty string and $false$ otherwise;
  2. first_pos is the set of positions in the sub-tree rooted at $n$ that correspond to the first symbol of at least one string in the language of the sub-expression rooted at $n$. That is, the set of symbols that can begin a string recognized by the regular sub-expression;
  3. last_pos is the set of positions in the sub-tree rooted at the syntax-tree node $n$ that corresponds to the last symbol of at least one string in the language of the sub-expression rooted at $n$. That is, the set of symbols that can terminate a string recognized by the regular sub-expression;
  4. following, for a position $p$ is the set of positions $q$ in the entire syntax-tree such that there is some string $x = a_{1}a_{2} \cdots a_{n}$ in $\mathcal{L_{(e)\#}}$ such that for some $i$, there is some way to explain the membership of $x$ in $\mathcal{L_{(e)\#}}$ by matching $a_{i}$ to $p$ and $a_{i + 1}$ to a position in $q$.
Of these, following is the last to compute as it relies upon the values of first_pos and last_pos. If the definition is confusing for now, don't worry about it. The rules for computing following will come later and will be obvious at that point. We'll focus for now on null_pos, first_pos and last_pos.

The results of first_pos, last_pos and follow are sets of integers. Accordingly, we are going to need a type to represent these.

module Int_set : Set.S with type elt = int = Set.Make (
    let compare =
    type t = int
With this we can present the type of ASTs for augmented regular expressions.
type augmented_regexp =
  | Epsilon
  | Character of char * int
  | Sequence of augmented_regexp * augmented_regexp * pos
  | Alternative of augmented_regexp * augmented_regexp * pos
  | Repetition of augmented_regexp * pos
  | Accept of int
and pos = {

For a given node $n$, the values of its pos record depend only on the sub-expressions of that node. Assuming constructed augmented regular expression syntax trees, we can write null_pos, first_pos and last_pos like this.

let (null_pos : augmented_regexp -> bool)  =
  fun x ->
    match x with
    | Epsilon -> true
    | Character (_, i) -> false
    | Sequence (_, _, p) -> p.null
    | Alternative (_, _, p) -> p.null
    | Repetition (_, p) -> p.null
    | Accept _ -> false

let (first_pos : augmented_regexp -> Int_set.t) =
  fun x ->
    match x with
    | Epsilon -> Int_set.empty
    | Character (_, i) -> Int_set.add i (Int_set.empty)
    | Alternative (_, _, p) -> p.first
    | Repetition (_, p) -> p.first
    | Sequence (_, _, p) -> p.first
    | Accept i -> Int_set.add i (Int_set.empty)

let (last_pos : augmented_regexp -> Int_set.t) =
  fun x ->
    match x with
    | Epsilon -> Int_set.empty
    | Character (_, i) -> Int_set.add i (Int_set.empty)
    | Alternative (_, _, p) -> p.last
    | Repetition (_, p) -> p.last
    | Sequence (_, _, p) -> p.last
    | Accept i -> Int_set.add i (Int_set.empty)

Our strategy in building the syntax-tree of $(e)\#$ from the syntax tree of $e$ will be to visit each node of $e$ and invoke a function to construct the corresponding node of $(e)\#$ inductively. These functions will include the generation of unique integers for the non-$\epsilon$ leaves and encode the rules for building the pos records:

  • null
    • Sequence $(e_{1}, e_{2})$ : null_pos $e_{1}$ and null_pos $e_{2}$
    • Alternative $(e_{1}, e_{2})$ : null_pos $e_{1}$ or null_pos $e_{2}$
    • Repetition : $true$
  • first
    • Alternative $(e_{1}, e_{2})$ : first_pos $e_{1} \cup$ first_pos $e_{2}$
    • Sequence $(e_{1}, e_{2})$ : if null_pos $e_{1}$ then first_pos $e_{1} \cup$ first_pos $e_{2}$ else first_pos $e_{1}$
    • Repetition $e$ : first_pos $e$
  • last
    • Alternative $(e_{1}, e_{2})$ : last_pos $e_{1} \cup$ last_pos $e_{2}$
    • Sequence $(e_{1}, e_{2})$ : if null_pos $e_{2}$ then last_pos $e_{1} \cup$ last_pos $e_{2}$ else last_pos $e_{2}$
    • Repetition $e$ : last_pos $e$

Here then are the augmented regular expression syntax-tree node constructor functions.

let (epsilon : unit -> augmented_regexp) = 
  fun () -> 

and (character : char -> augmented_regexp) = 
  fun c ->
    Character (c, generate_label ())

and (repetition : augmented_regexp -> augmented_regexp) = 
  fun e -> 
    Repetition (e, {null=true;first=first_pos e; last=last_pos e})

and (alternative : augmented_regexp -> augmented_regexp -> augmented_regexp)  = 
  fun e1 e2 ->
    Alternative (e1, e2, 
                 {null=null_pos e1 || null_pos e2;
                  first=Int_set.union (first_pos e1)(first_pos e2); 
                  last=Int_set.union (last_pos e1) (last_pos e2)})

and (sequence : augmented_regexp -> augmented_regexp -> augmented_regexp) = 
  fun e1 e2 ->
    let b1 = null_pos e1 
    and b2 = null_pos e2 in
    Sequence (e1, e2, 
              {null=b1 && b2;
                  if b1 then
                    Int_set.union (first_pos e1)(first_pos e2)
                    first_pos e1; 
                  if b2 then 
                    Int_set.union (last_pos e1) (last_pos e2)
                    last_pos e2})

let (accept : augmented_regexp -> augmented_regexp) = 
  fun e ->
    sequence e (Accept (generate_label ()))

We are now in a position to write the function that transforms a syntax tree of the regular expression $e$ into the syntax tree of the augmented regular expression $(e)\#$.

let rec (augmented_regexp : Syntax.regular_expression -> augmented_regexp) =
  fun x ->
    match x with
    | Syntax.Epsilon -> epsilon ()
    | Syntax.Character i ->  character (Char.chr i)
    | Syntax.Sequence (x, y) -> 
    (*Be very careful here. Evaluation order matters!*)
      let x' = (augmented_regexp x)
      and y' = (augmented_regexp y) in
      sequence x' y'
    | Syntax.Alternative (x, y) -> 
    (*Be very careful here. Evaluation order matters!*)
      let x' = (augmented_regexp x)
      and y' = (augmented_regexp y) in
      alternative x' y'
    | Syntax.Repetition x -> repetition (augmented_regexp x)

We can wrap all of the above up in a convenience function parse_augmented_regexp which first parses a string to build the syntax tree of the regular expression it represents and then transforms the result into the syntax tree of the corresponding augmented regular expression.

let (parse_augmented_regexp : string-> augmented_regexp * int)  =
  fun s ->
    let () = reset_label () in
    let ast = regexp_of_string s in
    let re1 = augmented_regexp ast in
    let re2 = accept re1 in
    let count = generate_label () in
    (re2, count)
Notice that this function returns a pair of the syntax-tree and the number of positions it contains.

The next step in compiling a recognizer from the expression $(e)\#$ is to compute the follow function. To do this we "unite" the information encoded by the first_pos and last_pos functions. Put plainly, follow is a function that takes each symbol (position) in the regular expression to the (set of) symbols (positions) that can follow it. The information is stored in an array of length equal to the number of symbols appearing in the regular expression. There are only two ways a position in a regular expression can follow another:

  • If $n$ is a Sequence node with left child $c_{1}$ and right child $c_{2}$, then for every position $i$ in lastpos $c_{1}$, all positions in firstpos $c_{2}$ are in follow_pos $i$
  • If $n$ is a Repition and $i$ a position in lastpos $n$, then all positions in first_pos $n$ are in follow_pos $i$
In addition to computing follow the code below also stores the association between positions and characters of the regular expression. That information goes into an array. The elements of the array have type char option since the Accept symbol has a position but no character associated with it.
let (compute_follow : Int_set.t array -> char option array -> augmented_regexp -> unit) =
  fun follow chars x ->
    let rec compute x = 
      match x with
      | Sequence (e1, e2, p) ->
        compute e1; compute e2;
        let first2 = first_pos e2 in
        let f i =
          follow.(i) <- Int_set.union first2 (follow.(i)) in
        Int_set.iter f (last_pos e1)
      | Repetition (e, p) ->
        compute e;
        let f i =
          follow.(i) <- Int_set.union (p.first) (follow.(i)) in
        Int_set.iter f (p.last)
      | Alternative (e1, e2, p) -> compute e1; compute e2
      | Epsilon -> ()
      | Accept i -> chars.(i) <- None
      | Character (c, i) -> chars.(i) <- Some c in
    compute x

Now the computation of the augmented regular expression syntax-tree and all four of the associated functions together with the mapping from positions to symbols of $\mathcal{A}$ can be wrapped up in another "high-level" convenience function.

let (regexp_follow : string -> augmented_regexp * Int_set.t array * char option array) =
  fun s ->
    let re, n = parse_augmented_regexp s in
    let follow = Array.make n (Int_set.empty) in
    let chars = Array.make n None in
    compute_follow follow chars re;
    (re, follow, chars)

We're in good shape - but a hop, skip and a jump to computing a recognizer from a regular expression. We'll leave off here on this local maxima for today and finish off the job in the next post!