Monday, December 22, 2014

Compiling regular expressions (II)

This article is a continuation of the earlier post, "Compiling regular expressions (I)".

Automata are modeled as 'state' records with two fields. The pos field contains the set of positions that are valid for recognition in the given state. Transitions are modeled as lists of pairs of symbols and states. In this way a state may contain transitions that reference itself.

type state = {
  pos : Int_set.t;
  mutable trans : (char * state) list ;
}

We will require a function that for each input symbol $a$ and a given set of positions $s$, computes the list of pairs $(a, s')$ where $s'$ is the subset of $s$ that correspond to $a$.

let (partition : char option array -> Int_set.t 
               -> (char option * Int_set.t) list) =
  fun chars s ->
    let f acc c =
      match c with
      | Some _ ->
        if List.mem_assoc c acc then acc 
        else
          let f i acc = 
            if chars.(i) <> c then acc else Int_set.add i acc in
          (c, Int_set.fold f s (Int_set.empty)) :: acc
      | None -> 
        if List.mem_assoc c acc then acc else (c, Int_set.empty) :: acc in
    List.rev (Array.fold_left f [] chars)
This function makes a list from a set of ints.
let list_of_int_set : Int_set.t -> Int_set.elt list = 
  fun s -> List.rev (Int_set.fold (fun e acc -> e :: acc) s [])
This function, accessible given a state, computes the list of sets that accessible from that state.
let (accessible : state -> Int_set.t array -> char option array 
                                      -> (char * Int_set.t) list) =
  fun s follow chars ->
    let part = partition chars s.pos in
    let f p rest =
      match p with
      | (Some c, l) -> 
        (c,
         List.fold_left 
           (Int_set.union) 
           (Int_set.empty) 
           (List.map (Array.get follow) (list_of_int_set l))
        ) :: rest
      | _ -> rest  in
    List.fold_right f part []
find_state takes a set $s$ and two lists of states (marked and unmarked). It searches for a state which has a pos field equal to $s$ and returns this state or it fails.
let (find_state : Int_set.t -> state list -> state list -> state) =
  fun s l m ->
    let test e = e.pos = s in
    try
      List.find test l
    with
    | Not_found -> List.find test m

The algorithm to compute the automata works like this. Two lists are maintained, marked and unmarked states. The algorithm is initialized such that the only state is unmarked with a pos field containing first_pos $n_{0}$ where $n_{0}$ is the root of the syntax tree; the list of transitions is empty.

For an unmarked state $st$, the algorithm does these things:

  • Calculate a set of numbers accessible from $st$. That is, a set of pairs $(c, s)$, where $c$ is a character and $s$ a set of positions. A position $j$ is accessible from $st$ by $c$ if there is an $i$ in st.pos such that $j$ is in follow $i$ and $i$ numbers the character $c$.
  • For each of the pairs $(c, s)$
    • If there exists a state st' (whether marked or unmarked) such that $s = $st'.pos, it adds $(c, st')$ to the transitions of $st$;
    • Otherwise, a new state $st'$ without transitions is created, added to the transitions of $st$, and $st'$ is added to the list of unmarked states.
  • It marks $st$.
The algorithm terminates only when there are no remaining unmarked states. The result is an array of states obtained from the list of marked states. The terminal states are all those containing the number associated with Accept. Here then is the algorithm in code.
let rec (compute_states : state list -> state list -> Int_set.t array 
                                   -> char option array -> state array) =
  fun marked unmarked follow chars ->
    match unmarked with
    | [] -> Array.of_list marked
    | st :: umsts ->
      let access = accessible st follow chars in
      let marked1 = st :: marked in
      let f (c, s) umsts =
        if Int_set.is_empty s then 
          umsts (*Suppress empty sets*)
        else
          try
            st.trans <- (c, find_state s marked1 umsts) ::st.trans ;
            umsts
          with
          | Not_found -> 
            let state1 = {pos = s; trans = []} in
            st.trans <- (c, state1) :: st.trans;
            state1 :: umsts in
      let unmarked1 = List.fold_right f access umsts in
      compute_states marked1 unmarked1 follow chars

We are just about ready to write the function to compute the automaton. It is fundamentally a call to compute_states but does one more thing. That is, it searches the resulting array for the index of the initial state and puts the index in the first slot of the array. To do this it uses the utility function array_indexq which performs the search for the index using physical equality. This is because the usual test using structural equality will not terminate on structures that loop.

let (array_indexq : 'a array -> 'a -> int) =
  fun arr e ->
    let rec loop i =
      if i = Array.length arr then
        raise (Not_found)
      else if Array.get arr i == e then i
      else loop (i + 1) in
    loop 0
So, here it is, dfa_of, the function to compute the automaton.
let (dfa_of : augmented_regexp * Int_set.t array * char option array 
                                                        -> state array) =
  fun (e, follow, chars) ->
    let init_state = {pos = first_pos e; trans = []} in
    let dfa = compute_states [] [init_state] follow chars in
    (*Installing initial state at index 0*)
    let idx_start = array_indexq dfa init_state in
    dfa.(idx_start) <- dfa.(0);
    dfa.(0) <- init_state;
    dfa

We are now on the home stretch. All that remains is to write a function to interpret the automaton. To do this, we'll make use of a mini-combinator library of recognizers. I'll not provide the OCaml code for that today - you could reverse engineer from my earlier 'Recognizers' blog-post or, consult [1].

let (interpret_dfa : state array -> int -> char Recognizer.recognizer) =
  fun dfa accept ->
    let num_states = Array.length dfa in
    let fvect = Array.make (num_states) (fun _ -> failwith "no value") in
    for i = 0 to num_states - 1 do
      let trans = dfa.(i).trans in
      let f (c, st) =
        let pc = Recognizer.recognizer_of_char c in
        let j = array_indexq dfa st in
        Recognizer.compose_and pc (fun l -> fvect.(j) l) in
      let parsers = List.map f trans in
      if Int_set.mem accept (dfa.(i).pos) then
        fvect.(i) <- compose_or_list 
                        (Recognizer.end_of_input) parsers
      else match parsers with
      | [] -> failwith "Impossible"
      | p :: ps -> fvect.(i) <- Recognizer.compose_or_list p ps
    done;
    fvect.(0)
We wrap up with a couple of high level convenience functions : compile produces a recognizer from a string representation of a regular expression and match takes a recognizer (that is, a compiled regular expression) and a string and uses the recognizer to categorize the given string as admissible or not (where explode is a simple function that transforms a string into a char list - recognizers operate on lists).
let compile xpr =
    let ((e, follow, chars) as ast) = regexp_follow xpr in
    let dfa = dfa_of ast in
    let parser = interpret_dfa dfa (Array.length chars - 1) in
    fun s -> parser (explode s)

let re_match xpr s =
  let result = xpr s in
  match result with
  | Recognizer.Remains [] -> true
  | _ -> false

Here's a simple test driver that shows how these functions can be used.

let test xpr s = 
  match re_match xpr s with
  | true -> Printf.printf "\"%s\" : success\n" s
  | false -> Printf.printf "\"%s\" : fail\n" s

let _ = 
  try 
    let xpr = compile "(a|b)*abb" in
    Printf.printf "Pattern: \"%s\"\n" "(a|b)*abb" ;
    test xpr "abb" ;
    test xpr "aabb" ;
    test xpr "baabb" ;
    test xpr "bbbbbbbbbbbbbaabb" ;
    test xpr "aaaaaaabbbaabbbaabbabaabb" ;
    test xpr "baab" ;
    test xpr "aa" ;
    test xpr "ab" ;
    test xpr "bb" ;
    test xpr "" ;
    test xpr "ccabb" ;
  with 
  | Failure msg -> print_endline msg

So that's it for this series of posts on building recognizers for regular expressions. Hope you enjoyed it!


References
[1] "The Functional Approach to Programming" - Cousineau & Mauny
[2] "Compilers Principles, Techniques & Tools" - Aho et. al.

Thursday, December 18, 2014

Compiling regular expressions (I)

This post picks up from here which was concerned with parsing - obtaining representations of regular expressions as abstract syntax trees. The ultimate goal is, given a string representation of a regular expression $e$ , produce a 'recognizer' for the expression (referred to as compiling a regular expression). That is, a function string -> bool that can be used to categorize strings as either belonging to the language $\mathcal{L_{e}}$ or not.

Having produced an abstract syntax tree for a regular expression $e$, the first step in compiling the expression is to compute an abstract syntax tree of the corresponding augmented regular expression $(e)\#$. This augmented regular expression is the original expression $e$ concatenated with a unique end-marker $\#$. For the given expression $e$, the accepting state for $e$ is given a transition on $\#$. This is a device that allows us to "forget" about accepting states as the computation of a recognizer proceeds; when the construction is complete, any state with a transition on $\#$ must be an accepting state.

Leaves in the abstract syntax tree of the augmented regular expression $(e)\#$ are labeled by $\epsilon$ or a symbol from from $\mathcal{A}$. For those non-$\epsilon$ leaves we attach a unique integer. Accordingly, we will need functions to generate unique integers (positions) that we will employ as we transform the AST of $e$ into the AST of the augmented expression $(e)\#$ leading to our first code example.

let reset_label, generate_label =
 let r = ref (-1) in
 ((fun () -> r := (-1)), (fun () -> r := !r + 1; !r))

As we construct the syntax tree of the $(e)\#$ we compute four functions : null_pos, first_pos, last_pos and following:

  1. null_pos is $true$ for a syntax-tree node $n$ if and only if the sub-expression represented by $n$ has $\epsilon$ in its language. That is, $true$ if the regular sub-expression recognizes the empty string and $false$ otherwise;
  2. first_pos is the set of positions in the sub-tree rooted at $n$ that correspond to the first symbol of at least one string in the language of the sub-expression rooted at $n$. That is, the set of symbols that can begin a string recognized by the regular sub-expression;
  3. last_pos is the set of positions in the sub-tree rooted at the syntax-tree node $n$ that corresponds to the last symbol of at least one string in the language of the sub-expression rooted at $n$. That is, the set of symbols that can terminate a string recognized by the regular sub-expression;
  4. following, for a position $p$ is the set of positions $q$ in the entire syntax-tree such that there is some string $x = a_{1}a_{2} \cdots a_{n}$ in $\mathcal{L_{(e)\#}}$ such that for some $i$, there is some way to explain the membership of $x$ in $\mathcal{L_{(e)\#}}$ by matching $a_{i}$ to $p$ and $a_{i + 1}$ to a position in $q$.
Of these, following is the last to compute as it relies upon the values of first_pos and last_pos. If the definition is confusing for now, don't worry about it. The rules for computing following will come later and will be obvious at that point. We'll focus for now on null_pos, first_pos and last_pos.

The results of first_pos, last_pos and follow are sets of integers. Accordingly, we are going to need a type to represent these.

module Int_set : Set.S with type elt = int = Set.Make (
  struct
    let compare = Pervasives.compare
    type t = int
  end)
With this we can present the type of ASTs for augmented regular expressions.
type augmented_regexp =
  | Epsilon
  | Character of char * int
  | Sequence of augmented_regexp * augmented_regexp * pos
  | Alternative of augmented_regexp * augmented_regexp * pos
  | Repetition of augmented_regexp * pos
  | Accept of int
and pos = {
  null:bool;
  first:Int_set.t;
  last:Int_set.t;
}

For a given node $n$, the values of its pos record depend only on the sub-expressions of that node. Assuming constructed augmented regular expression syntax trees, we can write null_pos, first_pos and last_pos like this.

let (null_pos : augmented_regexp -> bool)  =
  fun x ->
    match x with
    | Epsilon -> true
    | Character (_, i) -> false
    | Sequence (_, _, p) -> p.null
    | Alternative (_, _, p) -> p.null
    | Repetition (_, p) -> p.null
    | Accept _ -> false

let (first_pos : augmented_regexp -> Int_set.t) =
  fun x ->
    match x with
    | Epsilon -> Int_set.empty
    | Character (_, i) -> Int_set.add i (Int_set.empty)
    | Alternative (_, _, p) -> p.first
    | Repetition (_, p) -> p.first
    | Sequence (_, _, p) -> p.first
    | Accept i -> Int_set.add i (Int_set.empty)

let (last_pos : augmented_regexp -> Int_set.t) =
  fun x ->
    match x with
    | Epsilon -> Int_set.empty
    | Character (_, i) -> Int_set.add i (Int_set.empty)
    | Alternative (_, _, p) -> p.last
    | Repetition (_, p) -> p.last
    | Sequence (_, _, p) -> p.last
    | Accept i -> Int_set.add i (Int_set.empty)

Our strategy in building the syntax-tree of $(e)\#$ from the syntax tree of $e$ will be to visit each node of $e$ and invoke a function to construct the corresponding node of $(e)\#$ inductively. These functions will include the generation of unique integers for the non-$\epsilon$ leaves and encode the rules for building the pos records:

  • null
    • Sequence $(e_{1}, e_{2})$ : null_pos $e_{1}$ and null_pos $e_{2}$
    • Alternative $(e_{1}, e_{2})$ : null_pos $e_{1}$ or null_pos $e_{2}$
    • Repetition : $true$
  • first
    • Alternative $(e_{1}, e_{2})$ : first_pos $e_{1} \cup$ first_pos $e_{2}$
    • Sequence $(e_{1}, e_{2})$ : if null_pos $e_{1}$ then first_pos $e_{1} \cup$ first_pos $e_{2}$ else first_pos $e_{1}$
    • Repetition $e$ : first_pos $e$
  • last
    • Alternative $(e_{1}, e_{2})$ : last_pos $e_{1} \cup$ last_pos $e_{2}$
    • Sequence $(e_{1}, e_{2})$ : if null_pos $e_{2}$ then last_pos $e_{1} \cup$ last_pos $e_{2}$ else last_pos $e_{2}$
    • Repetition $e$ : last_pos $e$

Here then are the augmented regular expression syntax-tree node constructor functions.

let (epsilon : unit -> augmented_regexp) = 
  fun () -> 
    Epsilon

and (character : char -> augmented_regexp) = 
  fun c ->
    Character (c, generate_label ())

and (repetition : augmented_regexp -> augmented_regexp) = 
  fun e -> 
    Repetition (e, {null=true;first=first_pos e; last=last_pos e})

and (alternative : augmented_regexp -> augmented_regexp -> augmented_regexp)  = 
  fun e1 e2 ->
    Alternative (e1, e2, 
                 {null=null_pos e1 || null_pos e2;
                  first=Int_set.union (first_pos e1)(first_pos e2); 
                  last=Int_set.union (last_pos e1) (last_pos e2)})

and (sequence : augmented_regexp -> augmented_regexp -> augmented_regexp) = 
  fun e1 e2 ->
    let b1 = null_pos e1 
    and b2 = null_pos e2 in
    Sequence (e1, e2, 
              {null=b1 && b2;
               first=
                  if b1 then
                    Int_set.union (first_pos e1)(first_pos e2)
                  else 
                    first_pos e1; 
               last=
                  if b2 then 
                    Int_set.union (last_pos e1) (last_pos e2)
                  else 
                    last_pos e2})

let (accept : augmented_regexp -> augmented_regexp) = 
  fun e ->
    sequence e (Accept (generate_label ()))

We are now in a position to write the function that transforms a syntax tree of the regular expression $e$ into the syntax tree of the augmented regular expression $(e)\#$.

let rec (augmented_regexp : Syntax.regular_expression -> augmented_regexp) =
  fun x ->
    match x with
    | Syntax.Epsilon -> epsilon ()
    | Syntax.Character i ->  character (Char.chr i)
    | Syntax.Sequence (x, y) -> 
    (*Be very careful here. Evaluation order matters!*)
      let x' = (augmented_regexp x)
      and y' = (augmented_regexp y) in
      sequence x' y'
    | Syntax.Alternative (x, y) -> 
    (*Be very careful here. Evaluation order matters!*)
      let x' = (augmented_regexp x)
      and y' = (augmented_regexp y) in
      alternative x' y'
    | Syntax.Repetition x -> repetition (augmented_regexp x)

We can wrap all of the above up in a convenience function parse_augmented_regexp which first parses a string to build the syntax tree of the regular expression it represents and then transforms the result into the syntax tree of the corresponding augmented regular expression.

let (parse_augmented_regexp : string-> augmented_regexp * int)  =
  fun s ->
    let () = reset_label () in
    let ast = regexp_of_string s in
    let re1 = augmented_regexp ast in
    let re2 = accept re1 in
    let count = generate_label () in
    (re2, count)
Notice that this function returns a pair of the syntax-tree and the number of positions it contains.

The next step in compiling a recognizer from the expression $(e)\#$ is to compute the follow function. To do this we "unite" the information encoded by the first_pos and last_pos functions. Put plainly, follow is a function that takes each symbol (position) in the regular expression to the (set of) symbols (positions) that can follow it. The information is stored in an array of length equal to the number of symbols appearing in the regular expression. There are only two ways a position in a regular expression can follow another:

  • If $n$ is a Sequence node with left child $c_{1}$ and right child $c_{2}$, then for every position $i$ in lastpos $c_{1}$, all positions in firstpos $c_{2}$ are in follow_pos $i$
  • If $n$ is a Repition and $i$ a position in lastpos $n$, then all positions in first_pos $n$ are in follow_pos $i$
In addition to computing follow the code below also stores the association between positions and characters of the regular expression. That information goes into an array. The elements of the array have type char option since the Accept symbol has a position but no character associated with it.
let (compute_follow : Int_set.t array -> char option array -> augmented_regexp -> unit) =
  fun follow chars x ->
    let rec compute x = 
      match x with
      | Sequence (e1, e2, p) ->
        compute e1; compute e2;
        let first2 = first_pos e2 in
        let f i =
          follow.(i) <- Int_set.union first2 (follow.(i)) in
        Int_set.iter f (last_pos e1)
      | Repetition (e, p) ->
        compute e;
        let f i =
          follow.(i) <- Int_set.union (p.first) (follow.(i)) in
        Int_set.iter f (p.last)
      | Alternative (e1, e2, p) -> compute e1; compute e2
      | Epsilon -> ()
      | Accept i -> chars.(i) <- None
      | Character (c, i) -> chars.(i) <- Some c in
    compute x

Now the computation of the augmented regular expression syntax-tree and all four of the associated functions together with the mapping from positions to symbols of $\mathcal{A}$ can be wrapped up in another "high-level" convenience function.

let (regexp_follow : string -> augmented_regexp * Int_set.t array * char option array) =
  fun s ->
    let re, n = parse_augmented_regexp s in
    let follow = Array.make n (Int_set.empty) in
    let chars = Array.make n None in
    compute_follow follow chars re;
    (re, follow, chars)

We're in good shape - but a hop, skip and a jump to computing a recognizer from a regular expression. We'll leave off here on this local maxima for today and finish off the job in the next post!

Saturday, December 13, 2014

Recognizers

In my last post, I gave an OCaml program to parse regular expressions. I intend however to show you, over the coming weeks, not just how to parse them, but how to compile them to recognizers too. Doing so will require me to share quite a lot of code that I gleaned from the book The Functional Approach to Programming by Guy Cousineau and Michel Mauny from which I learned how to do this. In doing so, I will along the way, provide updated code in modern OCaml as the book is presented in the Caml language a predecessor of today's OCaml, and not everywhere equivalent. Additionally, there will be functions of my own "filling in" gaps left as exercises in the book. That said, I don't think there's enough effort of my own to justify "plagiarizing" the material so blatantly so am determined to add at least a little value, where my limited skills allow, in order that I might redeem myself.

So today, let us start with a minimal combinator library implementing "recognizers" for recursive descent parsers, based on Cousineau and Mauny transliterated to the C++ programming language. I hope on reading the following, that you will agree, the multi-paradigm language C++-11 admits "The Functional Approach to Programming" as a "first class citizen" and the result here might be viewed as the beginning of a mini Boost.Spirit! Let's begin.

We are concerned with the problem of "recognizing" phrases belonging to a given language. We will produce programs (functions) that accept character strings as input which decide to accept or reject them. First some headers.

#include <boost/variant.hpp>
#include <boost/variant/apply_visitor.hpp>
#include <boost/utility.hpp>
#include <boost/range.hpp>
#include <boost/detail/lightweight_test.hpp>

#include <list>
#include <functional>
#include <numeric>
#include <string>

Recognizers work on lists. When recognition succeeds, part of the list is consumed and the part of the list remaining is returned or, recognition fails. So the type remaining<A> is a sum type with two cases, where A is the type of "tokens" (symbols) contained by the list (typically but not necessarily char).

//Recognition succeeded
template <class A> 
struct remains{ 
  std::list<A> left;
  template <class ItT>
    remains (ItT begin, ItT end) 
      : left (begin, end) 
    {}
};

//Recognition failed
template <class A>
struct recognition_fails {};

//Result of a recognizer. Indicates the result of attempting to
//recognize something from a list
template <class A> using remaining = 
  boost::variant<remains <A>, recognition_fails<A> >;
Recognizers then are but functions from lists of tokens to values of remaining<A>.
//A 'recognizer' is a function from a list to a value of remaining<A>
template <class A> using recognizer = 
  std::function<remaining<A>(std::list<A> const&)>;
Here's an example. This function (constant) produces a recognizer that matches the empty string. The recognizer it produces always succeeds and never consumes any input.
//A recognizer that recognizes the empty string. It always succeeds and
//no input is ever consumed
template <class A>
recognizer<A> epsilon ()
{
  return [](std::list<A> const& cs) -> remaining<A>
    {
      return remains<A> (cs.begin (), cs.end ());
    };
}
This next factory function produces recognizers that recognize elements that satisfy a user provided predicate.
//Given a predicate, 'recognizer_of_token' produces the recognizer
//associated with the elements that satisfy this predicate
template <class A, class F/*bool(A)*/>
recognizer<A> recognizer_of_token (F test)
{
  return
    [=] (std::list<A> const& cs) -> remaining<A>
    {
      if (cs.empty ())
        return recognition_fails<A> ();

      if (test (cs.front ()))
        return remains<A> (boost::next (cs.begin()), cs.end());

      return recognition_fails<A>();
    };
}
Recognizers can be composed. This function takes two recognizers and returns a recognizer that will, when presented with a list of tokens, attempt recognition via the first and if that doesn't work out, backtrack and attempt to recognize via the second. It's the "if then else" of recognizers.
//Recognizer disjunction
template <class A>
recognizer<A> compose_or (recognizer<A> p, recognizer<A> q)
{
  return [=](std::list<A> const& toks) -> remaining<A>
    {
      remaining <A> res = p (toks);
      if (remains<A>* rem = boost::get<remains<A> >(&res)) return *rem;

      return q (toks);
    };
}
This next function is the builder of the corresponding "and then" sort of recognizer.
//Recognizer conjunction
template <class A>
recognizer<A> compose_and (recognizer<A> p, recognizer<A> q)
{
  return [=](std::list<A> const& toks) -> remaining<A>
    {
      remaining <A> res = p (toks);
      if (remains<A>* rem = boost::get<remains<A> >(&res)) 
        return q (rem->left);

      return recognition_fails<A> ();
    };
}
With disjunction and conjunction of recognizers, we can implement a function to compute, given a recognizer, the corresponding Kleene star recognizer. That is, zero or more occurrences of a given phrase.
//The '*' iterator (Kleene star)
template <class A>
recognizer<A> zero_or_more (recognizer<A> p)
{
  return [=] (std::list<A> const& toks) -> remaining <A>
    {
      return compose_or (
         compose_and (p, zero_or_more<A> (p)) , epsilon<A> ()) (toks);
    };
}
Well, let's now get concrete. Here's a function to produce a recognizer of a specific token.
//A function to produce a recognizer that recognizes only the
//given character
template <A>
recognizer<A> char_ (A c)
{
  return recognizer_of_token<A>([=](A x) -> bool { return c == x; });
}
The composite recognizer for a given "string" of tokens can be constructed then like this.
//A function to produce a recognizer of a specific string
template <class C>
recognizer<typename boost::range_value<C>::type> string_ (C s)
{
  typedef typename boost::range_value<C>::type value;

  std::list <recognizer<value> > rs;

  typedef std::back_insert_iterator<std::list<recognizer<value> > > it_t;

  std::accumulate (
      boost::begin (s)
    , boost::end (s)
    , std::back_inserter (rs)
    , [](it_t dst, value c) -> it_t 
      { *dst++ = char_ (c); return dst; });

  return std::accumulate (
      boost::next (rs.begin ())
    , rs.end ()
    , rs.front ()
    , [](recognizer<value> acc, recognizer<value> r) -> recognizer<value>
      { return compose_and (acc, r); });
}
That will do for our mini-recognizer library for today. Let's turn attention to testing. First some utilities.
//Match on a remaining<A> returns 'true' if it contains a 'remains<A>'
//value with all input consumed, 'false' if it contains a 'recognition_fails<A>' value
template <class A>
struct accept_visitor
{
  typedef bool result_type;
  bool operator () (remains<A> const& r) const { return r.left.empty (); }
  bool operator () (recognition_fails<A> const& r) const { return false; }
};

//Function to determine if recognition was achieved
template <class A>
bool accept (remaining<A> const& r)
{
  return boost::apply_visitor( accept_visitor<A> (), r);
}

//Test if the provided recognizer can recognize the given string
bool parse (recognizer<char> parser, std::string const& s)
{
  return accept (parser (std::list<char>(s.begin (), s.end())));
}
Now, a little test.
int main ()
{
  //char_
  BOOST_TEST( parse (char_ ('a'), "a") );
  BOOST_TEST( !parse (char_ ('b'), "a") );
  BOOST_TEST( parse (char_ ('b'), "b") );

  //'*'
  BOOST_TEST( parse (zero_or_more (char_ ('a')), "aa") );
  BOOST_TEST( parse (zero_or_more (char_ ('a')), "") );
  BOOST_TEST( !parse (zero_or_more (char_ ('a')), "ab") );

  //string_
  BOOST_TEST (parse (string_ (std::string ("foo")), "foo") );
  BOOST_TEST (!parse (string_ (std::string ("foo")), "bar") );
  
  return boost::report_errors ();
}
Et voilà. L' approche fonctionnelle de C++! See you soon for more about regular expressions. Happy holidays!

Sunday, December 7, 2014

Parsing regular expressions

Here's a brief refresher on regular expressions derived from Cosineau and Mauny.

Given an alphabet $\mathcal{A}$, a regular expression $e$ over $\mathcal{A}$ is a concise way to describe a set $\mathcal{L}$ of strings of elements of $\mathcal{A}$. The set $\mathcal{L}_{e}$ denoted by an expression $e$ is termed the regular language denoted by $e$.

In their most basic form, regular expressions can be defined like this:

  • $\epsilon$ represents the set $\mathcal{L}_{\epsilon} = \left\{\epsilon\right\}$ containing only the empty string
  • A symbol $c$ of $\mathcal{A}$ denotes $\mathcal{L}_{c} = \left\{c\right\}$
  • $e_{1}e_{2}$ is the set $\mathcal{L}_{e_{1}}\cdot\mathcal{L}_{e_{2}} = \left\{s_{1}s_{2} | s_{1} \in \mathcal{L}_{e_{1}}, s_{2} \in \mathcal{L}_{e_{2}}\right\}$
  • $e_{1}|e_{2}$ is the set $\mathcal{L}_{e_{1}}\bigcup \mathcal{L}_{e_{2}}$
  • $e^{*}$ is the union $\bigcup_{i=0}^{\infty}\mathcal{L}_{e}^{i}$ where $\mathcal{L}_{e}^{0} = \mathcal{L}_{\epsilon}$ and $\mathcal{L}_{e}^{i} = \mathcal{L}_{e}^{i-1}\cdot\mathcal{L}_{e}$
This last operator $*$ is called the Kleene closure operator and $e^{*}$ equivalent to the infinite regular expression $\epsilon|e|ee|eee|\dots$

Regular expressions provide a convenient syntax for searching plain text data for lines matching specific patterns as demonstrated by the venerable Unix function 'grep' (I was taught grep is an acronym for generalized regular expression pattern-matching). To produce a function like grep requires a program for analyzing a regular expression $e$ and computing a recognizer for $\mathcal{L}_{e}$, that is a function string -> bool, that can be used to classify a string $s$ (say) in terms of whether or not $s \in \mathcal{L}_{e}$. This procedure of computing a recognizer for $\mathcal{L}_{e}$ from $e$ is called compiling a regular expression.

Writing a regular expression compiler is not for the faint-of-heart! Doing so requires more work than can be reasonably described in one blog-post so here, I'll focus only on the first part - parsing a regular expression. That is, given a string $s$ a program to produce an abstract syntax tree describing the regular expression $e$ represented by $s$.

To begin, we define the type of regular expressions.

type regular_expression =
  | Epsilon
  | Character of int
  | Sequence of regular_expression * regular_expression
  | Alternative of regular_expression * regular_expression
  | Repetition of regular_expression
We could proceed from here by hand-crafting a recursive descent parser or we might take advantage of a parser generator system like ocamllex/ocamlyacc which is the choice I favored. On reaching that decision it occurred to me that, since ocamllex itself treats regular expressions as inputs, the source code for ocamllex itself might contain a suitable grammar for the expression type above and indeed it does! Things at this point become marvelously recursive - to build a regular expression compiler we can bootstrap (at both the source and binary levels) from an existing regular expression compiler (which itself is a bootstrap by the way)!

The code that follows is an adaptation of the parser and lexer input files for ocamllex (the copyright of Xavier Leroy).

/*parser.mly*/

%{
open Syntax
%}

%token <int> Tchar

%token Tend Tor Tlbracket Trbracket
%token Tstar Tmaybe Tplus Tlparen Trparen

%left Tor
%nonassoc CONCAT
%nonassoc Tmaybe Tstar Tplus
%nonassoc Tchar Tlbracket Tlparen

%start main
%type <Syntax.regular_expression> main
%%

main:
 | regexp Tend
     { $1 }
 ;
regexp:
 | Tchar
     { Character $1 }
 | regexp Tstar
     { Repetition $1 }
 | regexp Tmaybe
     { Alternative (Epsilon, $1) }
 | regexp Tplus
     { Sequence (Repetition ($1), $1)}
 | regexp Tor regexp
     { Alternative ($1, $3) }
 | regexp regexp %prec CONCAT
     { Sequence ($1, $2) }
 | Tlparen regexp Trparen
     { $2 }
 ;
Here is the corresponding lexer input file.
{
open Parser

let incr_loc lexbuf delta =
  let pos = lexbuf.Lexing.lex_curr_p in
    lexbuf.Lexing.lex_curr_p <- { pos with
    Lexing.pos_lnum = pos.Lexing.pos_lnum + 1;
    Lexing.pos_bol = pos.Lexing.pos_cnum - delta;
  }

let char_for_backslash = function
  | 'n' -> '\010'
  | 'r' -> '\013'
  | 'b' -> '\008'
  | 't' -> '\009'
  | c   -> c

}

let special_chars = 
  ['|' '*' '?' '+' '(' ')']

let backslash_escapes =
  ['\\' '\'' '"' 'n' 't' 'b' 'r' ' ']

rule main = parse
    [' ' '\013' '\009' '\012' ] +
    { main lexbuf }
  | '\010'
    { incr_loc lexbuf 0;
      main lexbuf }
  | [^ '\\' '|' '*' '?' '+' '(' ')']
    { Tchar(Char.code (Lexing.lexeme_char lexbuf 0)) }
  | '\\' backslash_escapes
    { Tchar (Char.code (char_for_backslash (Lexing.lexeme_char lexbuf 1))) }
  | '\\' special_chars
    { Tchar (Char.code (Lexing.lexeme_char lexbuf 1)) }
  | '|'  { Tor }
  | '*'  { Tstar }
  | '?'  { Tmaybe }
  | '+'  { Tplus }
  | '('  { Tlparen }
  | ')'  { Trparen }
  | eof  { Tend }

The function to produce an abstract syntax tree from a string then goes like this.
let (regexp_of_string : string -> Syntax.regular_expression) =
  fun s ->
    let parse_buf lexbuf =
      try 
        Parser.main Lexer.main lexbuf
      with 
      | Parsing.Parse_error ->
        begin
          let curr = lexbuf.Lexing.lex_curr_p in
          let line = curr.Lexing.pos_lnum in
          let cnum = curr.Lexing.pos_cnum - curr.Lexing.pos_bol in
          let tok = Lexing.lexeme lexbuf in
          raise (Failure
                   (Printf.sprintf 
                    "file \"\", line %d, character %d\n\
                     Error : Syntax error \"%s\"" line cnum tok
                 ))
        end
    in parse_buf (Lexing.from_string s)

Now, the obvious definition for the symmetric string_of_regexp function.

let rec (string_of_regexp : Syntax.regular_expression -> string) = 
  fun re ->
    match re with
    | Syntax.Epsilon -> 
      "Epsilon"
    | Syntax.Character c -> 
      Printf.sprintf "Character '%c'" (Char.chr c)
    | Syntax.Sequence (p, q) -> 
      Printf.sprintf "Sequence (%s, %s)"
         (string_of_regexp p) (string_of_regexp q)
    | Syntax.Alternative (p, q) -> 
      Printf.sprintf "Alternative (%s, %s)" 
         (string_of_regexp p) (string_of_regexp q)
    | Syntax.Repetition r -> 
      Printf.sprintf "Repetition (%s)" (string_of_regexp r)

With these, a quick little program to glimpse what the AST for an expression like $\left(a|b\right)^*abb$ looks like is easy.
Sequence (
  Repetition (
    Alternative (
      Character 'a',
      Character 'b'
    )
  ), 
  Sequence (
    Character 'a', 
    Sequence (
      Character 'b', 
      Character 'b'
    )
  )
)

Thursday, November 13, 2014

Dimensional Analysis in OCaml

Dimensional analysis in OCaml

In 1994, Barton and Nackman in their book 'Scientific Engineering in C++' [1] demonstrated how one could encode the rules of dimensional analysis into the C++ type system enabling compile-time checking (no run-time cost) of the plausibility (at least up to the dimensional correctness) of computations.

In 2004, Abrahams & Gurtovy in 'C++ Template Metaprogramming' [2] showed the Barton Nackman technique to be elegantly implementable using compile time type sequences encoding integer constants. The key properties of the technique are:

  • Encoding of integers as types;
  • Compile time manipulation of sequences of these integer encodings to deduce/produce new derived types.

For a good while there it escaped me how to approach this problem in OCaml and it bothered me somewhat. I turned to the caml-list for guidance and I'm happy to say some notable 'Camlers' there helped me out (thank-you Octachron, Mario Alvarez Picallo, Thomas Gazagnaire, Roberto Di Cosmo and David Mentre)!

The key idea in the solution to follow is the encoding of integers into the type-system as differences between two Peano numbers. The details of the approach are presented in the excellent paper "Many Holes in Hindley-Milner" by Sam Lindley of the University of Edinburgh.

Credit for the code that follows goes to Mario Alvarez Picallo. I generalized Mario's program to the extent that I could do the "force on a laptop" exercise (as presented in the online Boost.MPL tutorial).

The module interface is where all the work is - getting the "type-math" correct.

module type S = sig

  type +'a s = 'a * 'a
  type (+'a,+'b,+'c,+'d,+'e,+'f,+'g,+'h,+'i,+'j,+'k,+'l,+'m,+'n) t

  (*Base dimensions*)

  val mass : 
    float -> ('a,'a s,'b,'b,'c,'c,'d,'d,'e,'e,'f,'f,'g,'g) t
  val length : 
    float -> ('a,'a,'b,'b s,'c,'c,'d,'d,'e,'e,'f,'f,'g,'g) t
  val time : 
    float -> ('a,'a,'b,'b,'c,'c s,'d,'d,'e,'e,'f,'f,'g,'g) t
  val charge : 
    float -> ('a,'a,'b,'b,'c,'c,'d,'d s,'e,'e,'f,'f,'g,'g) t
  val temperature : 
    float -> ('a,'a,'b,'b,'c,'c,'d,'d,'e,'e s,'f,'f,'g,'g) t
  val intensity : 
    float -> ('a,'a,'b,'b,'c,'c,'d,'d,'e,'e,'f,'f s,'g,'g) t
  val angle :
     float -> ('a,'a,'b,'b,'c,'c,'d,'d,'e,'e,'f,'f,'g,'g s) t

  (*Composite dimensions*)

  val velocity :
    float -> ('a,'a,'b,'b s,'c s,'c,'d,'d,'e,'e,'f,'f,'g,'g) t
  val acceleration :
    float -> ('a,'a,'b,'b s,'c s s,'c,'d,'d,'e,'e,'f,'f,'g,'g) t
  val momentum :
    float -> ('a,'a s,'b,'b s,'c s,'c,'d,'d,'e,'e,'f,'f,'g,'g) t
  val force :
    float -> ('a,'a s,'b,'b s,'c s s,'c,'d,'d,'e,'e,'f,'f,'g,'g) t

  (*Arithmetic*)

  val ( + ) : 
    ('a,'b,'c,'d,'e,'f,'g,'h,'i,'j,'k,'l,'m,'n) t -> 
      ('a,'b,'c,'d,'e,'f,'g,'h,'i,'j,'k,'l,'m,'n) t -> 
        ('a,'b,'c,'d,'e,'f,'g,'h,'i,'j,'k,'l,'m,'n) t
  val ( - ) :
    ('a,'b,'c,'d,'e,'f,'g,'h,'i,'j,'k,'l,'m,'n) t -> 
      ('a,'b,'c,'d,'e,'f,'g,'h,'i,'j,'k,'l,'m,'n) t -> 
        ('a,'b,'c,'d,'e,'f,'g,'h,'i,'j,'k,'l,'m,'n) t
  val ( * ) :
    ('a0,'b0,'c0,'d0,'e0,'f0,'g0,'h0,'i0,'j0,'k0,'l0,'m0,'n0) t -> 
      ('b0,'b1,'d0,'d1,'f0,'f1,'g0,'h1,'i0,'j1,'k0,'l1,'m0,'n1) t -> 
        ('a0,'b1,'c0,'d1,'e0,'f1,'g0,'h1,'i0,'j1,'k0,'l1,'m0,'n1) t
  val ( / ) :
    ('a0,'b0,'c0,'d0,'e0,'f0,'g0,'h0,'i0,'j0,'k0,'l0,'m0,'n0) t -> 
      ('a1,'b0,'c1,'d0,'e1,'f0,'g1,'h0,'i1,'j0,'k1,'l0,'m1,'n0) t -> 
        ('a0,'a1,'c0,'c1,'e0,'e1,'g0,'g1,'i0,'i1,'k0,'k1,'m0,'m1) t

  (*Conversion to float*)

  val value : ('a,'b,'c,'d,'e,'f,'g,'h,'i,'j,'k,'l,'m,'n) t -> float
end

That's the hard part, the module implementation itself is trivial.

module Dim : S = struct

  type +'a s = 'a * 'a
  type (+'a,+'b,+'c,+'d,+'e,+'f,+'g,+'h,+'i,+'j,+'k,+'l,+'m,+'n) t = float

  let mass x = x
  let length x = x
  let time x = x
  let charge x = x
  let temperature x = x
  let intensity x = x
  let angle x = x

  let velocity x = x
  let acceleration x = x
  let momentum x = x
  let force x = x

  let ( + ) = ( +. )
  let ( - ) = ( -. )
  let ( * ) = ( *. )
  let ( / ) = ( /. )

  let value x = x

end

And the motivating "force on a laptop" calculation? Well in the top-level it proceeds like this.

# open Dim ;;
# let m = mass 5.0 ;;
val m : ('a,'a Dim.s,'b,'b,'c,'c,'d,'d,'e,'e,'f,'f,'g,'g) Dim.t =
  <abstr>
# let a = acceleration 9.8 ;;
val a :
  ('a,'a,'b,'b Dim.s,'c Dim.s Dim.s,'c,'d,'d,'e,'e,'f,'f,'g,'g)
  Dim.t = <abstr>
# let f = m * a ;;
val f :
  ('a,'a Dim.s,'b,'b Dim.s,'c Dim.s Dim.s,'c,'d,'d,'e,'e,'f,'f,'g,'g)
  Dim.t = <abstr>
Now to verify the result.
# let m2 = f / a ;;
val m2 :
  ('a,'a Dim.s,'b,'b,'c Dim.s Dim.s,'c Dim.s Dim.s,'d,'d,'e,'e,'f,'f,'g,'g)
  Dim.t = <abstr>
If we got things right, then we'd expect that the difference m2 - m be close to zero (within rounding error).
# value (m2 - m) ;;
- : float = 0.
Indeed it is as we hoped.

The key test though is this, if we had written a/f instead of f/a we want that there be type-check failure preventing the mistake from propagating through the program.

# let m2 = a / f (*oops*) ;;
val m2 :
  ('a Dim.s,'a,'b,'b,'c Dim.s Dim.s,'c Dim.s Dim.s,'d,'d,'e,'e,'f,'f,'g,'g)
  Dim.t = 
# m2 - m ;;
Characters 5-6:
  m2 - m ;;
       ^
Error:
  This expression has type
   ('a Dim.s,'a Dim.s Dim.s,'b,'b,'c,'c,'d,'d,'e,'e,'f,'f,'g,'g) Dim.t
  but an expression was expected of type
   ('a Dim.s,'a,'h,'h,'i Dim.s Dim.s,'i Dim.s Dim.s,'j,'j,'k,'k,'l,'l,'m,'m) Dim.t
  The type variable 'a occurs inside 'a Dim.s * 'a Dim.s
And there it is. Happy days!

[1] John J. Barton and Lee R. Nackman. Scientific and Engineering C++: an Introduction with Advanced Techniques and Examples. Reading, MA: Addison Wesley. ISBN 0-201-53393-6. 1994.

[2] David Abrahams and Aleksey Gurtovy C++ Template Metaprogramming: Concepts, Tools, and Techniques from Boost and Beyond (C++ in Depth Series), Addison-Wesley Professional. ISBN:0321227255. 2004.

Sunday, November 9, 2014

Complexity

Complexity

When you begin to program in OCaml, one of the first pieces of advice you encounter is to prefer the '::' operator over the '@' operator for list construction. The question is, does it matter? Even knowing that '@' exhibits complexity $O(N)$ as opposed to $O(1)$ for '::' should you care? I mean, how much of a difference can it make really?

The answer of course is yes. In practical terms, it matters. No, it really, really matters. Let's quantify it. Here's a version of the range function that uses '@'.

let range s e =
  let rec loop acc s e =
    if s >= e then acc
    else loop (acc @ [s]) (s + 1) e 
  in loop [] s e
As an aside, you'll note that it's been written to be tail-recursive (so as to avoid inducing stack overflow).

Timing this function for building lists of $2^{10} = 1,024$ elements to $2^{16} = 65,536$ elements produces the following table:

ntime (s)
100.011243
110.047308
120.197137
130.866350
144.130998
1522.769691
16142.506546

Ouch! To build a list of just $65,536$ elements, the program takes over 2 minutes!?!

Ok, here's the version of range that uses '::' to build the list (and necessarily does a List.rev on the result in order that the elements don't come out "back to front"):

let range s e =
  let rec loop acc s e =
    if s >= e then acc
    else loop (s :: acc) (s + 1) e 
  in List.rev (loop [] s e)

And the timings for this one?

ntime (s)
100.000037
110.000097
120.000143
130.000324
140.002065
150.001195
160.005341

That is, this one builds the list of $65,536$ elements in just $5$ milliseconds!

Convinced?

Sunday, October 19, 2014

Tail-recursion

Tail-recursion

Stack overflow refers to a condition in the execution of a computer program whereby the stack pointer exceeds the address space allocated for the stack. The usual result of "blowing the stack" is swift and brutal abnormal termination of the program.

The amount of memory allocated by the operating system for a given program's stack is finite and generally little can be done by the programmer to influence the amount that will be made available. The best the programmer can really do is to use what's given wisely.

We can get a sense of the limits of the stack in practical terms with a program like the following.

let rec range s e = 
  if s >= e then [] 
  else s :: range (s + 1) e

let rec loop i =
  let n = 2.0 ** (i |> float_of_int) |> int_of_float in
  try
    let _ = range 0 n in
    loop (i + 1)
  with
  | Stack_overflow -> 
    Printf.printf "Stack overflow at i = %d, n = %d\n" i n
    
let () = loop 0
range is a function that produces the half-open range $\left[s, e\right)$ - the ordered sequence $\left\{s, s + 1, s + 2, \dots, e - 2, e - 1\right\}$. Note that range is defined in terms of itself, that is, it is a recursive function. The idea is to use it in an unbounded loop to build sequences of increasing lengths of powers of $2$ : ${2^0, 2^1, 2^2, \dots}$. We set it off and when we encounter stack overflow, terminate the program gracefully reporting on the power of $2$ found to give rise to the condition. In my case I found that I was able to make $\approx 2^{19} = 524,288$ recursive calls to range before the stack limit was breached. That's very limiting. For realistic programs, one would hope to be able to produce sequences of lengths much greater than that!

What can be done? The answer lies in the definition of range and that thing called tail-recursion. Specifically, range is not tail-recursive. To be a tail-recursive function, the last thing the function needs do is to make a recursive call to itself. That's not the case for range as written as the recursive call to itself is the second to last thing it does before it returns (the last thing it does is to 'cons' a value onto the list returned by the recursive call).

Why being tail-recursive is helpful is that tail-calls can be implemented by the compiler without requiring the addition of a new "stack frame" to the stack. Instead, the current frame can be replaced in setting up the tail-call being modified as necessary and effectively the recursive call is made to be a simple jump. This is called tail-call elimination and its effect is to allow tail-recursive functions to circumvent stack overflow conditions.

Here's a new definition for range, this time implemented with tail-calls.
let range s e = 
  let rec aux acc s e = 
    if s >= e then acc
  else aux (s :: acc) (s + 1) e
  in List.rev (aux [] s e)
With this definition for range I find I can build sequences of length up to around $\approx 2^{26} = 67,108,864$ elements long without any sign of stack overflow which is a huge improvement! At around this point though, my sequence building capabilities start to be limited by the amount of physical memory present on my PC but that's a different story entirely.

Sunday, October 12, 2014

Dimensional analysis (and the units of the universal gas constant 'R')

Dimensional analysis

The problem at hand is to find by dimensional analysis, the SI units of the universal gas constant $R$ (forgive me - whilst this entry is not explicitly about computer programming - it is in fact one of my daughter's homework problems - the obvious relationship to type systems makes it seem to me at least tangentially relevant).

$R$ is defined by the Ideal Gas Law: $PV = nRT$ were $P$ is the absolute pressure of the gas, $V$ is the volume of the gas, $n$ is the amount of substance of gas (measured in moles), and $T$ is the absolute temperature of the gas.

The obvious dimensions are as follows :

$\left[P\right]$: $M \cdot L^{-1}\cdot T^{-2}$, $\left[V\right]$: $L^3$ and $\left[T\right]$: $\Theta$.

Now, one mole of a substance is defined to be $6.0221367\times 10^{23}$ atoms of that substance (Avogardro's number) but even dimensionless numbers can be part of a dimensioned system. The trick is to realize that if one quantity in an equation is "per mole" then so too must be any quantity added to it. Accordingly, if we define a (pseudo) dimension $\Lambda$ for the amount $n$ we can reason that $\left[R\right]$: $M \cdot L^{2} \cdot T^{-2} \cdot \Theta^{-1} \cdot \Lambda^{-1}$. This is enough for us to say the fundamental units for $R$ are

$kg \cdot m^{2} \cdot s^{-2} \cdot K^{-1} \cdot mol^{-1}$.

We can go a little further though. Since $work = force \times length$ we see that $M \cdot L \cdot T^{-2}$ can be expressed in units of energy and indeed $1J = kg \cdot m^{2} \cdot s^{-2}$. Thus we arrive at our final conclusion. $R$ can be written with units

$J \cdot K^{-1} \cdot mol^{-1}$.

The beautiful thing though is this. The physical interpretation of the ideal gas law is saying that, for an ideal gas, of any kind, "the energy per degree per mole" is a constant (that constant being $\approx 8.3144 \frac{^J/_K}{mol}$)!

Friday, October 3, 2014

Recursive list

Recursive list

Way, way back we looked at a function to estimate the value of the sin function utilizing its representation as a Taylor polynomial about x = 0. When one looks closely at that Taylor polynomial, one can observe a pattern in the coefficients involving the infinite sequence 0, 1, 0, -1, 0, 1, 0, -1, .... My buddy Marcelo in his formulation of that Taylor polynomial wrote this function that for a given n produces the first n values of this sequence as a list:

let rec sin_coeff n =
  match n with
  | 0 -> []
  | 1 -> [0.]
  | 2 -> [0.; 1.]
  | 3 -> [0.; 1.; 0. ]
  | _  -> 0. :: 1. :: 0. :: -1. :: sin_coeff (n-4)
He mused to me on the day that "surely there is a more elegant way to produce this list?". Oh, by golly Marcelo there certainly is!
let rec sin_coeff = 0 :: 1 :: 0 :: -1 :: sin_coeff
This little construction builds the list value sin_coeff recursively (that is, in terms of itself). Now, of course it's not impossible to define structures like this in C or C++ but good luck matching the brevity and elegance afforded to us in OCaml for this sort of thing! We still need a little function of course that can pull off this endless list a list containing just the first n elements. The ubiquitous take function will do for this purpose and I show it here just for completeness.
let rec list_take k l =
  if k <= 0 then []
  else
    match l with
    | [] -> []
    | (x :: xs)  -> x :: (list_take (k-1) xs)

Sunday, September 21, 2014

Correlation Coefficient

Correlation coefficient

It's been a while since we looked at anything from the domain of statistics so here's another little bite-sized piece - a function to compute Pearson's "product moment correlation coefficient".

It's a measure of dependence between two data sets. We'll express it in terms of unbiased standard deviation which I didn't write out before so I'll include that function too.

let unbiased_standard_deviation t =
  (*http://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation

    In statistics and in particular statistical theory, unbiased
    estimation of a standard deviation is the calculation from a
    statistical sample of an estimated value of the standard deviation
    (a measure of statistical dispersion) of a population of values,
    in such a way that the expected value of the calculation equals
    the true value.

  *)
 let av = arithmetic_mean t in
 let squared_diffs =
   List.fold_left (fun acc xi -> ((xi -. av) *. (xi -. av)) :: acc) [] t
 in sqrt ((sum squared_diffs)/.((float_of_int (List.length t)) -. 1.0))

let correlation_coefficient x y =
  (*http://en.wikipedia.org/wiki/Correlation_and_dependence

    The most familiar measure of dependence between two quantities is
    the Pearson product-moment correlation coefficient, or "Pearson's
    correlation coefficient", commonly called simply "the correlation
    coefficient". It is obtained by dividing the covariance of the two
    variables by the product of their standard deviations.  
  *) 

  let x_b = arithmetic_mean x in
  let y_b = arithmetic_mean y in
  let s_x = unbiased_standard_deviation x in
  let s_y = unbiased_standard_deviation y in

  if s_x = 0. || s_y = 0. then 0.
  else
    let f acc x_i y_i =
      acc +. ((x_i -. x_b) *. (y_i -. y_b)) in
    let n = float_of_int (List.length x) in
    let s = List.fold_left2 f 0.0 x y  in
    s/.((n -. 1.) *. s_x *. s_y)

Saturday, September 6, 2014

Concatenation of a list of strings

Concatenation of a list of strings

Here's another fun (but probably silly) exercise. Its value I posit, is in highlighting the fundamental similarities that exist between the C++ and OCaml languages (that emerge when one "peeks" beyond the apparent dissimilarities on the surface). Maybe this sort of comparison aids in "lowering the barrier to entry" for the C++ programmer embarking on a journey into OCaml? Anyway, here we go.

The OCaml String module contains a function concat which concatenates a list of strings whilst inserting a separator between each of the elements. Prior to OCaml 4.02 at least, it's implementation went as follows:

let concat sep l =
  match l with
    [] -> ""
  | hd :: tl ->
      let num = ref 0 and len = ref 0 in
      List.iter (fun s -> incr num; len := !len + length s) l;
      let r = create (!len + length sep * (!num - 1)) in
      unsafe_blit hd 0 r 0 (length hd);
      let pos = ref(length hd) in
      List.iter
        (fun s ->
          unsafe_blit sep 0 r !pos (length sep);
          pos := !pos + length sep;
          unsafe_blit s 0 r !pos (length s);
          pos := !pos + length s)
        tl;
      r

So, a faithful translation of this program into C++ (unsafe_blit 'n all), yields this:

#include <boost/range.hpp>

#include <string>
#include <numeric>
#include <cstring>

namespace string_util /*In honor of Stefano of http://spacifico.org/ :)*/ 
{
  template <class RgT>
  std::string concat (std::string const& sep, RgT lst)
  {
    if (boost::empty (lst)) return "";

    std::size_t num = 0, len = 0;
    std::accumulate (
      boost::begin (lst), boost::end (lst), 0,
      [&](int _, std::string const& s) -> 
      int { ++num, len += s.size(); return _; } );
    std::string r(len + sep.size () * (num - 1), '\0');
    std::string const& hd = *(boost::begin (lst));
    std::memcpy ((void*)(r.data ()), (void*)(hd.data ()), hd.size());
    std::size_t pos = hd.size();
    std::accumulate (
      boost::next (boost::begin (lst)), boost::end (lst), 0,
      [&](int _, std::string const& s) -> 
      int {
        std::memcpy((void*)(r.data()+pos),(void*)(sep.data()),sep.size());
        pos += sep.size ();
        std::memcpy ((void*)(r.data()+pos),(void*)(s.data()),s.size());
        pos += s.size ();
        return _; });
  
    return r;
  }
}//namespace<string_util>
For example, this fragment
  #include <boost/assign/list_of.hpp>

  // ...

  std::list  lst = boost::assign::list_of ("foo")("bar")("baz");
  std::string r = string_util::concat (",", lst);
will produce the string "foo,bar,baz".

So there it is... As usual, a little more verbosity required on the C++ side but otherwise, not much between them IMHO. Agree?

Monday, August 25, 2014

Terms With Variables (C++)

Terms with Variables (C++)

In this earlier post I showed a nifty OCaml type for modeling terms with variables for problems involving substitutions. I got interested in what it would take to implement the type in C++(03) (doing 'sum' types in C++ elegantly is a perennial problem). It ain't nowhere near as succinct but we got there nonetheless.
#include <list>

#include <boost/variant.hpp>

/*
  type ('a, 'b) term =
    | Term of 'a * ('a, 'b) term list
    | Var of 'b
*/

template <class A, class B> struct term;
template <class B> struct var;

template <class A, class B>
struct make_tree
{
  typedef boost::variant <
    boost::recursive_wrapper<term <A, B> >, var<B> >   type;
};

template <class A, class B>
struct term
{
  typedef typename make_tree <A, B>::type tree;
  A a;
  std::list <tree> children;
  term (A a
    , std::list<tree> const& children)
    :       a (a), children (children)
  {}
};

template <class A, class B> 
inline term <A, B> make_term (
  A a, std::list<typename make_tree<A, B>::type> const& c, B const*)
{
  return term<A, B> (a, c);
}

template <class B> 
struct var
{
  B tag;
  var (B tag) : tag (tag) {}
};

template <class B>
inline var<B> make_var (B tag) { return var<B> (tag); }
For example, this little program builds the term represented by the concrete syntax "a(b(), c)".
int main ()
{
  typedef make_tree<std::string, std::string>::type tree;
  typedef std::list<tree> term_list;
  std::string const* tag_str=(std::string const*)0L;

  // a(b(), c)
  term_list l;
  l.push_back (make_term(std::string("b"), term_list (), tag_str));
  l.push_back (make_var(std::string("c")));
  tree t = make_term(std::string("a"), l, tag_str);
  
  return 0;
}

Saturday, August 9, 2014

Digits

Digits

Counting is a common problem in computer programming. I recently had a need for an algorithm to compute the number of base 10 digits of an integer (for example, there are 3 digits in the number 569). Here's what I worked out.
module type S=sig
    (**Count the number of base 10 digits of an integer*)
    val digits : int -> int 

    (**Prepend with zeroes as necessary to get the 'right'
       number of columns *)
    val digits_10 : int -> int -> string

end

module A : S = struct 
  let digits n =
    let rec loop acc n =
      let k = n / 10 in
      if k = 0 then acc else loop (acc + 1) k
    in loop 1 n

  let digits_10 i n =
    let l = digits n in
    let s = string_of_int (abs n) in
    let maybe_sign = if n < 0 then "-" else "" in
    if l >= i then (maybe_sign^s) 
    else
      let lz=string_of_int 0 in
      let rec aux acc j = 
        if j = 0 then acc 
        else 
          aux (lz^acc) (j - 1) in
      maybe_sign ^ aux s (i - l)
end

module Test_io = struct

  let test_0 () = 
    let print_line i = 
      Printf.printf "There are %d digits in %d\n" (A.digits i) i
    in
    let rec loop i n =
      if i = n then () 
      else
        let x  =
          int_of_float (10. ** (float_of_int i)) in
        print_line x; loop (i + 1) n
    in (*Range over 0 to to 10^19 *) loop 0 19

  let test_1 () =
    let print_line i = 
      Printf.printf "%s\n" i
    in
    let rec loop i n =
      if i = n then () 
      else
        let x = A.digits_10 3 i in
        print_line x; loop (i + 1) n
    in (*Range over 0 to 100, 3 columns*)loop 0 100

  let run () = test_0 (); test_1 ()

end

let () = Test_io.run ()
The program outputs
There are 1 digits in 1
There are 2 digits in 10
There are 3 digits in 100
There are 4 digits in 1000
There are 5 digits in 10000
There are 6 digits in 100000
There are 7 digits in 1000000
There are 8 digits in 10000000
There are 9 digits in 100000000
There are 10 digits in 1000000000
There are 11 digits in 10000000000
There are 12 digits in 100000000000
There are 13 digits in 1000000000000
There are 14 digits in 10000000000000
There are 15 digits in 100000000000000
There are 16 digits in 1000000000000000
There are 17 digits in 10000000000000000
There are 18 digits in 100000000000000000
There are 19 digits in 1000000000000000000
000
001
002
003
004
005
006
 .
 .
 .
097
098
099

Fun, fun, fun!

Friday, July 25, 2014

Cartesian product (redux)

Cartesian product (redux)

In this blog post we looked at programs to compute Cartesian Products. One algorithm (given here in OCaml) if you recall is
let prod l r =
  let g acc a =
    let f acc x =
      (a, x) :: acc
    in List.fold_left f acc r
  in List.fold_left g [] l |> List.rev

In that earlier post I translated the above program into C++. This time I'm doing the same straightforward translation but using the Boost.MPL library to compute such products at compile time. It looks like this:

#include <boost/mpl/pair.hpp>
#include <boost/mpl/vector.hpp>
#include <boost/mpl/push_front.hpp>
#include <boost/mpl/fold.hpp>
#include <boost/mpl/reverse.hpp>
#include <boost/mpl/placeholders.hpp>

using namespace boost::mpl::placeholders;

template <class L, class R>
struct product
{
  struct g {
    template <class AccT, class A>
    struct apply {
       typedef typename boost::mpl::fold <
         R, AccT
        , boost::mpl::push_front<_1, boost::mpl::pair<A, _2> > 
         >::type type;
    };
  };

  typedef typename boost::mpl::reverse <
    typename boost::mpl::fold <
                  L, boost::mpl::vector<>, g>::type>::type type;
};
The translation proceeds almost mechanically! Does it work though? You betcha! Here's a test we can convince ourselves with:
#include <boost/mpl/equal.hpp>
#include <boost/mpl/for_each.hpp>
#include <boost/mpl/int.hpp>

#include <iostream>

using namespace boost::mpl;

typedef vector<int_<1>, int_<2> > A;
typedef vector<int_<3>, int_<4>, int_<5> > B;
typedef product <A, B>::type result;

BOOST_MPL_ASSERT((
  equal<
    result
  , vector<
         pair<int_<1>, int_<3> >
       , pair<int_<1>, int_<4> >
       , pair<int_<1>, int_<5> >
       , pair<int_<2>, int_<3> >
       , pair<int_<2>, int_<4> >
       , pair<int_<2>, int_<5> >
      > 
  > ));

struct print_class_name {
    template <typename T>
    void operator()( T t ) const {
       std::cout << typeid(t).name() << " ";
    }
};

int main ()
{
  boost::mpl::for_each<result>(print_class_name());

  return 0;
}
The takeaway is, learning yourself some functional programming is a great way to improve your template meta-programming fu! (That of course should come as no surprise... )

Friday, July 18, 2014

Merge sort

Merge sort

Here's another divide and conquer algorithm. This one is for sorting a sequence.

Conceptually, a merge sort works like this (see http://en.wikipedia.org/wiki/Merge_sort):

  • Divide the unsorted list into n sub-lists, each containing one element (a list of one element is trivially sorted);
  • Repeatedly merge sublists to produce new sorted sub-lists until there is only one sub-list remaining : this will be the sorted list.

In the following OCaml, we are drawing on inspiration from the the Haskell standard prelude for the part of the algorithm concerned with dividing the unsorted list : functions take, drop and split.

(**{b Merge sort}, an {i O(n log n)} comparison based sorting
   algorithm *)
module type MERGESORT = sig

  (**[take n] applied to a list [xs], returns the prefix of [xs] of
     length [n] or [xs] itself if [n > List.length xs] e.g. [take 2
     [1; 2; 3]] {i = } [[1; 2]]*)
  val take : int -> 'a list -> 'a list

  (**[drop n] applied to a list [xs], returns the suffix of [xs]
     after the first [n] elements or, [[]] if [n > List.length xs]
     e.g. [drop 2 [1; 2; 3]] {i = } [[3]]*)
  val drop : int -> 'a list -> 'a list

  (**[split n xs] is equivalent to [take n xs, drop n xs] e.g.
     [split 2 [1; 2; 3]] {i = } [([1; 2], [3])]*)
  val split : int -> 'a list -> 'a list * 'a list

  (**[merge] given two {b sorted} sequences [xs] and [ys] returns a
     single sorted sequence of the elements of [xs] and [ys]
     e.g. [merge [1; 3] [2; 4]] {i = } [[1; 2; 3; 4]]*)
  val merge : 'a list -> 'a list -> 'a list

  (**[merge_sort] works by splitting a sequence into two parts,
     recursively sorting the two parts and merging the results into
     a single sorted sequence e.g. [merge_sort [1; 2; -1; 0; 3]] 
     {i = } [[-1; 0; 1; 2; 3]]*)
  val merge_sort : 'a list -> 'a list
end

module Merge_sort : MERGESORT = struct

  let rec take k l =
    match (k, l) with
    | n, _ when n <= 0 -> []
    | _, [] -> []
    | n, (x :: xs) -> x :: take (n - 1) xs

  let rec drop k l =
    match (k, l) with
    | n, xs when n <= 0 -> xs
    | _, [] -> []
    | n, (_ :: xs) -> drop (n - 1) xs

  let rec split k l = take k l, drop k l

  let rec merge l m =
    match (l, m) with
    | [], ys -> ys
    | xs, [] -> xs
    | ((x :: xs) as t), ((y :: ys) as s) -> 
      if x <= y then x :: (merge xs s) else y :: (merge t ys)
        
  let rec merge_sort l =
    let i = (List.length l) / 2 in
    if i = 0 then l
    else
      let u, v = split i l in
      let xs, ys = merge_sort u, merge_sort v in
      merge xs ys

end

We can test our little program in the top-level like this:

# let l = Merge_sort.merge_sort [-1; 2; 0; 4; 3; 5];;
val l : int list = [-1; 0; 2; 3; 4; 5]

Here are the functions in C++ phrased as range algorithms.

//"c:/users/fletch/desktop/setenv.cmd"
//cl /EHsc /Femerge.exe /I c:/project/boost_1_55_0 merge.cpp 

#include <boost/next_prior.hpp>
#include <boost/range.hpp>
#include <boost/range/algorithm/copy.hpp>

#include <vector>
#include <cstdlib>

namespace algo
{
  //take

  template <class S, class D>
  D take (std::size_t n, S src, D dst)
  {
    typedef boost::range_iterator<S>::type it_t;
    it_t curr = boost::begin (src), last = boost::end (src);
    if (n <= 0) return dst;
    if (boost::empty (src))  return dst;
    take (n-1, S (boost::next (curr), last), *dst++ = *curr);
    return dst;
  }

  //drop

  template <class S, class D>
  D drop (std::size_t n, S src, D dst)
  {
    typedef boost::range_iterator<S>::type it_t;
    it_t curr = boost::begin (src), last = boost::end (src);
    if (n <= 0) return boost::range::copy (src, dst);
    if (boost::empty (src))return dst;
    drop (n-1, S (boost::next (curr), last), dst);
    return dst;
  }

  //split

  template <class S, class D1, class D2>
  void split (int n, S src, D1 dst1, D2 dst2)
  { take (n, src, dst1); drop (n, src, dst2); }

  //merge

  template <class S1, class S2, class D>
  D merge (S1 src1, S2 src2, D dst)
  {
    typedef boost::range_iterator<S1>::type it1_t;
    typedef boost::range_iterator<S2>::type it2_t;
    typedef std::iterator_traits<it1_t>::value_type value1_t;
    typedef std::iterator_traits<it2_t>::value_type value2_t;
    if (boost::empty (src1)) return boost::copy (src2, dst);
    if (boost::empty (src2)) return boost::copy (src1, dst);
    it1_t curr1 = boost::begin (src1), last1 = boost::end (src1);
    it2_t curr2 = boost::begin (src2), last2 = boost::end (src2);
    value1_t x = *curr1;
    value2_t y = *curr2;
    if (x <= y)
      merge (S1 (boost::next (curr1), last1), src2, *dst++ = x);
    else
      merge (src1, S2 (boost::next (curr2), last2), *dst++ = y);
    return dst;
  }

  //merge_sort

  template <class S, class D>
  D merge_sort (S src, D dst)
  {
    typedef boost::range_iterator<S>::type it_t;
    typedef std::iterator_traits<it_t>::value_type value_t;
    std::size_t i = boost::size (src)/2;
    if (i == 0) return boost::range::copy (src, dst);
    std::vector<value_t> u, v;
    split (i, src, std::back_inserter (u), std::back_inserter (v));
    std::vector<value_t> xs, ys;
    merge_sort (u, std::back_inserter (xs));
    merge_sort (v, std::back_inserter (ys));
    merge (xs, ys, dst);
    return dst;
  }
  
}//namespace algo

The following usage example provides a lightweight test.

#include <boost/assign/list_of.hpp>

#include <utility>
#include <iterator>
#include <iostream>

int main ()
{
  using std::pair;
  using std::make_pair;
  using std::ostream_iterator;
  using boost::assign::list_of;

  int ord[] = {1, 2, 3, 4};

  auto src=make_pair(ord, ord + 4);
  auto dst=ostream_iterator<int>(std::cout, ", ");

  std::cout << "\ntake ():\n";

  algo::take (0u, src, dst); std::cout << std::endl;
  algo::take (1u, src, dst); std::cout << std::endl;
  algo::take (2u, src, dst); std::cout << std::endl;
  algo::take (3u, src, dst); std::cout << std::endl;
  algo::take (4u, src, dst); std::cout << std::endl;
  algo::take (5u, src, dst); std::cout << std::endl;

  std::cout << "\ndrop ():\n";

  algo::drop (5u, src, dst); std::cout << std::endl;
  algo::drop (4u, src, dst); std::cout << std::endl;
  algo::drop (3u, src, dst); std::cout << std::endl;
  algo::drop (2u, src, dst); std::cout << std::endl;
  algo::drop (1u, src, dst); std::cout << std::endl;
  algo::drop (0u, src, dst); std::cout << std::endl;

  std::cout << "\nsplit ():\n\n";

  algo::split (0u, src, dst, dst); std::cout << std::endl;
  algo::split (1u, src, dst, dst); std::cout << std::endl;
  algo::split (2u, src, dst, dst); std::cout << std::endl;
  algo::split (3u, src, dst, dst); std::cout << std::endl;
  algo::split (4u, src, dst, dst); std::cout << std::endl;
  algo::split (5u, src, dst, dst); std::cout << std::endl;

  std::cout << "\nmerge ():\n\n";

  std::vector <int> l=list_of(-1)(2), r=list_of(0)(1);
  algo::merge (l, r, dst); std::cout << std::endl;

  std::cout << "\nmerge_sort ():\n\n";

  int unord[] = {-1, 2, 0, 4, 3, 5};
  algo::merge_sort (make_pair (unord, unord + 6), dst);

  return 0;
}
The above program produces the following output.
take ():

1,
1, 2,
1, 2, 3,
1, 2, 3, 4,
1, 2, 3, 4,

drop ():


4,
3, 4,
2, 3, 4,
1, 2, 3, 4,

split ():

1, 2, 3, 4,
1, 2, 3, 4,
1, 2, 3, 4,
1, 2, 3, 4,
1, 2, 3, 4,
1, 2, 3, 4,

merge ():

-1, 0, 1, 2,

merge_sort ():

-1, 0, 2, 3, 4, 5,

Saturday, June 21, 2014

Permutations

Permutations

Combinations are selections that disregard order. The powerset P(S) of a set S is the set of all possible combinations of the n elements of S. There are 2n of these. Permutations are arrangements of sequences into orders. The number of permutations of a set is denoted n! For example, the set {x, y, z} has 6 permutations : {x, y, z}, {x, z, y}, {y, x, z}, {y, z, x}, {z, y, x} and {z, x, y}. Suppose f exists that for given k provides all permutations of S that start with sk. All permutations could then be found by f with k ranging over [0, N - 1]. But f trivially exists as it can be computed by prepending sk to the permutations of the smaller set S - { sk }. Here are two programs to compute the permutations of a set based on this idea.

Thursday, June 19, 2014

Powerset

Powerset

If S is a set, then P(S), the 'powerset' of S is the set of all subsets of S including the empty set and S itself. If S has cardinality N then the cardinality of P(S) is 2N (why?). That is, there are 2N subsets associated with S.

Here's a function to compute P(S) in OCaml. It's an instance of the 'divide and conquer' strategy of problem solving.

  let rec sets l =
    match l with
    | [] -> [[]]
    | x :: xs -> let l = sets xs in 
                   l @ (List.map (fun y -> x :: y) l)

This program translates to C++ naturally and with brevity thanks to C++ 11 lambdas.

#include <boost/utility.hpp>

#include <set>
#include <iterator>
#include <algorithm>

template <class I, class D>
D sets (I begin, I end, D dst)
{
  typedef typename std::iterator_traits<I>::value_type value_type;
  typedef std::set<value_type> set_t;

  if (begin == end)
    {
      *dst++ = set_t (); //the empty set
    }
  else
    {
      std::set<set_t> l;
      std::set<set_t>::iterator back=l.end ();
      sets (boost::next (begin), end, std::inserter (l, back));

      std::transform (l.begin (), l.end (), dst, 
        [](set_t const& s) -> set_t const& { return s; });
      std::transform (l.begin (), l.end (), dst, 
        [=](set_t s) -> set_t { s.insert (*begin); return s; });
    }

  return dst;
}

Thursday, June 12, 2014

Cartesian product

Cartesian product

For two sets $A$ and $B$, the Cartesian product denoted $A \times B$ is defined as the set of all ordered pairs $(a, b)$ where $a \in A$ and $b \in B$. The obvious algorithm to compute a Cartesian product in OCaml can is simple!
let prod l r =
  let g acc a =
    let f acc x =
      (a, x) :: acc
    in List.fold_left f acc r
  in List.fold_left g [] l |> List.rev

A straight-forward translation of the above program into C++ yields this.

#include <boost/range.hpp>
#include <numeric>
#include <iterator>

template <class T>
struct _inner
{
  T a;
  _inner (T a) : a (a) {}
  template <class ItT>
    ItT operator ()(ItT acc, T const& x) const {
    return *acc++ = std::make_pair (a, x);
  }
};
template <class T> 
  _inner<T> inner (T a) { return _inner<T> (a); }

template <class RangeT>
struct _outer
{
  RangeT r;
  _outer (RangeT r) : r (r){}
  template <class T, class ItT>
    ItT operator ()(ItT acc, T const& a) const {
      return std::accumulate (
        boost::begin (r), boost::end(r), acc, inner (a));
  }
};
template <class RangeT>
  _outer<RangeT> outer (RangeT r) { return _outer<RangeT>(r); }

template <class R1, class R2, class ItT>  
ItT prod (R1 A, R2 B, ItT dst) {
  return std::accumulate (
    boost::begin (A), boost::end (A), dst, outer (B));
}

That's a lot more code than in the original OCaml program. But wait... C++11 lambda expressions to the rescue! We can eliminate most of that 'extra' code recovering the elegance of the original.

template <class R1, class R2, class ItT>  
ItT prod (R1 A, R2 B, ItT dst) {

  typedef ItT iterator;
  typedef boost::range_value<R1>::type alpha;
  typedef boost::range_value<R2>::type  beta;

  return std::accumulate (
   boost::begin (A), boost::end (A), dst, 
    [=] (iterator acc, alpha const& a) { 
      return std::accumulate (
        boost::begin (B), boost::end (B), acc, 
        [=] (ItT acc, beta const& x) {
          return *acc++ = std::make_pair (a, x); });
     });
}

Now, my buddy Juan Alday tells me that we can expect more improvements in C++14 relating to lambdas. I hope to persuade him to write more about that here in the next couple of weeks. Stay tuned!

Update: Juan advises that with C++ 14 'generic' lambdas, we can further get it down to this.

auto prod = [](auto const& A, auto const& B, auto dst) {
   return std::accumulate(std::begin(A), std::end(A), dst,
     [&B](auto& output, auto valA) {
      return std::accumulate(std::begin(B), std::end(B), output,
       [&valA](auto& output, auto valB) { 
         std::get<0>(*output) = std::move(valA); 
         std::get<1>(*output) = std::move(valB)); 
         return ++output;});
     });
 };
That's kind of amazing... There's not even one occurrence of the template keyword in that code!

Thursday, May 15, 2014

Depth first search

Depth first search

Depth first search is an 'elementary graph algorithm'. This is a purely functional formulation.

(*depth-first-search

 "Introduction to Algorithms" - Cormen et. al., 1994

*)
module Char_map = Map.Make (Char)

type graph = (char list) Char_map.t

module type S = sig
  type state
  val string_of_state : state -> string
  val depth_first_search : graph -> state
end

module Dfs : S = struct

  type colors = White|Gray|Black

  type state = {
    d : int Char_map.t ; (*discovery time*)
    f : int Char_map.t ; (*finishing time*)
    pred : char Char_map.t ; (*predecessor*)
    color : colors Char_map.t ; (*vertex colors*)
  }

  let string_of_state {d; f; pred; color} =
    let open Printf in
    let bindings m fmt =
      let b = Char_map.bindings m in
      String.concat ", " (List.map (fun (x,y) -> sprintf fmt x y) b) in
    sprintf " d = {%s}\n f = {%s}\n pred = {%s}\n"
      (bindings d "'%c':'%d'") (bindings f "'%c':'%d'")
      (bindings pred "'%c':'%c'")

  let depth_first_search g =
    let node u (t, {d; f; pred; color}) =
      let rec dfs_visit t u {d; f; pred; color} =
        let edge (t, {d; f; pred; color}) v =
          if Char_map.find v color = White then
            dfs_visit t v {d; f; pred=(Char_map.add v u pred); color}
          else  (t, {d; f; pred; color})
        in
        let t, {d; f; pred; color} =
          let t = t + 1 in
          List.fold_left edge
            (t, {d=Char_map.add u t d; f;
                 pred; color=Char_map.add u Gray color})
            (Char_map.find u g)
        in
        let t = t + 1 in
        t , {d; f=(Char_map.add u t f); pred; color=Char_map.add u Black color}
      in
      if Char_map.find u color = White then dfs_visit t u {d; f; pred; color}
      else (t, {d; f; pred; color})
    in
    let v = List.fold_left (fun acc (x, _) -> x::acc) [] (Char_map.bindings g) in
    let initial_state= 
       {d=Char_map.empty;
        f=Char_map.empty;
        pred=Char_map.empty;
        color=List.fold_right (fun x->Char_map.add x White) v Char_map.empty}
    in
    snd (List.fold_right node v (0, initial_state))

end

(* Test *)

let () =
  let g =
       List.fold_right
          (fun (x, y) -> Char_map.add x y)
          ['u', ['v'; 'x'] ;
           'v',      ['y'] ;
           'w', ['z'; 'y'] ;
           'x',      ['v'] ;
           'y',      ['x'] ;
           'z',      ['z'] ;
          ]
          Char_map.empty
  in
  let s = Dfs.depth_first_search g in
  Printf.printf "%s\n" (Dfs.string_of_state s)

Sunday, April 27, 2014

Being normal

Being normal

Given the sophistication of today's financial markets, it's hard to imagine that until 1973, despite active trading since at least the 17th century, there was no known way of estimating the price of European-style options!

Black, Scholes and Merton changed all that with the development of a mathematical framework of a financial market in their seminal paper "The Pricing of Options and Corporate Liabilities". From that model, one can deduce the celebrated Black-Scholes formula : $p = \theta \cdot \left[ F \cdot \Phi\left( \theta \cdot \left[ \frac{\ln \left( F/K \right)}{\sigma} + \frac{\sigma}{2}\right] \right) - K \cdot \Phi\left( \theta \cdot \left[ \frac{\ln \left( F/K \right)}{\sigma} - \frac{\sigma}{2} \right] \right) \right]$ where $\theta = 1$ for call options and $\theta = -1$ for put options, $F := Se^{\left( r-d \right)T}$, $S$ is the current spot, $K$ is the strike, $r$ is a flat continuously compounded interest rate to maturity, $d$ is a flat continuously compounded dividend rate, $\sigma = \hat{\sigma} \cdot \sqrt{T}$, $\hat{\sigma}$ is the root-mean-square lognormal volatility, $T$ is the time to expiry and $\Phi(\cdot)$ is the standard cumulative normal distribution function.

Now, the only real problem with evaluating the above equation is in estimating the the values of the cumulative distribution function of the normal distribution ($\Phi$). This is a function that is defined by an integral that cannot be expressed in terms of elementary functions and is thus said to be one of a family of special functions.

One approach to obtaining a high quality, cross platform implementation of $\Phi$ in OCaml is to leverage the C++ Boost Math Toolkit library. First the C wrapper (let's write this in a file 'special_functions_c.cpp' (say)).

#include <boost/math/distributions/normal.hpp>

#include <caml/mlvalues.h>
#include <caml/alloc.h>

extern "C" value caml_norm_cdf (value v)
{
  using boost::math::cdf;
  using boost::math::normal;

  return caml_copy_double (cdf (normal (0.0, 1.0), Double_val (v)));
}
Next, the forwarding OCaml function ('special_functions.ml').
external caml_norm_cdf : float -> float = "caml_norm_cdf"

let norm_cdf x = caml_norm_cdf x
Lastly, the module interface ('special_functions_sig.mli').
module type S = sig
    val norm_cdf : float -> float
end

Et voilà. With this in hand, the Black-Scholes equation follows easily!

let black_scholes ~spot ~strike ~r ~vol ~t ~cp =
  let open Special_functions in
  let d1 = (log (spot/.strike) +. 
              (r +. 0.5*.(vol*.vol))*.t)/.(vol*.(sqrt t)) in
  let d2 = d1 -. vol*.(sqrt t) 
  in
    cp*.spot*.(norm_cdf (cp*.d1)) -. 
       cp*.strike*.(exp ((~-.r)*.t))*.(norm_cdf (cp*.d2))
Building the OCaml special functions library above follows the same procedure as covered in this blog entry. To wrap things up (and marvel at our handiwork,) here's a little test.
let _ = 
  Printf.printf "%.8f\n" 
    (black_scholes ~spot:42. ~strike:40. ~r:0.1 ~vol:0.2 ~t:0.5 ~cp:1.0) ;
  Printf.printf "%.8f\n" 
    (black_scholes ~spot:42. ~strike:40. ~r:0.1 ~vol:0.2 ~t:0.5 ~cp:(~-.1.0))
I observe the values 4.75942239 for the call and, 0.80859937 for the put, respectively.

Saturday, March 29, 2014

Power means

Power means

Before we get going, let me mention that my friends over at What does the quant say? have started a blog! This weeks post is about? You guessed it, statistics! You can check out "Getting our hands dirty with beta functions" by clicking here.

This post picks up where we left off in "Statistics", looking at different definitions for the mean of a sequence of numbers.

The geometric mean together with the arithmetic and harmonic means of the last post together are called the Pythagorean means.

let geometric_mean t =
  (*http://en.wikipedia.org/wiki/Geometric_mean

    The geometric mean is defined as the nth root (where n is the
    count of numbers) of the product of the numbers.

  *)
  let n = List.length t in
  let prod = List.fold_left (fun acc xi -> acc *. xi) 1.0 t in
  prod ** (1.0/.(float_of_int n))
Note the presence of the power function powf (operator ( ** ))! How to compute values of this function was the focus of this post.

Here is yet another common mean, the quadratic mean.

let quadratic_mean t =
  (*http://en.wikipedia.org/wiki/Root_mean_square

  In mathematics, the root mean square (abbreviated RMS or rms),
  also known as the quadratic mean, is a statistical measure of the
  magnitude of a varying quantity. It is especially useful when
  variates are positive and negative, e.g., sinusoids. RMS is used
  in various fields, including electrical engineering.  It can be
  calculated for a series of discrete values or for a continuously
  varying function. Its name comes from its definition as the square
  root of the mean of the squares of the values.

  *)
  let squares = List.fold_left 
    (fun acc xi -> acc @ [xi *. xi]) [] t in
  sqrt (arithmetic_mean squares)

The generalized mean or "power mean" includes all of the means we have considered to date as special cases.

let power_mean p t =
  (*http://en.wikipedia.org/wiki/Generalized_mean

    In mathematics, generalized means are a family of functions for
    aggregating sets of numbers, that include as special cases the
    arithmetic, geometric, and harmonic means

  *)
  let powers = List.fold_left 
    (fun acc xi -> acc @ [( ** ) xi p]) [] t in
    (arithmetic_mean powers)**(1.0/.p)
Note: Recovering the geometric mean from this definition requires looking at the limit of the expression as p -> 0 and application of L'Hopital's rule (see http://en.wikipedia.org/wiki/Generalized_mean for the details).

Sunday, March 23, 2014

Statistics

Statistics

Some more basic functions, this time focusing on sequences of numbers. First sum.
let sum t = 
(*http://en.wikipedia.org/wiki/Sum

  Summation is the operation of adding a sequence of numbers

 *)
List.fold_left (fun acc x -> acc +. x) 0.0 t
Now, arithmetic_mean follows trivially.
let arithmetic_mean t = 
(*http://en.wikipedia.org/wiki/Arithmetic_mean

Simply the mean or average when the context is clear, is the sum
of a collection of numbers divided by the number of numbers in the
collection.

*)
(sum t)/.(float_of_int (List.length t))
This formulation shows clearly the relationship between standard deviation and arithmetic mean.
let standard_deviation t =
(*http://en.wikipedia.org/wiki/Standard_deviation

For a finite set of numbers, the standard deviation is found by
taking the square root of the average of the squared differences
of the values from their average value.

 *)
 let av = arithmetic_mean t in
 let squared_diffs = List.fold_left 
  (fun acc xi -> ((xi -. av) *. (xi -. av)) :: acc) [] t
 in sqrt (arithmetic_mean squared_diffs)
The harmonic_mean function is, I find, a more interesting function.
let harmonic_mean t =
(*http://en.wikipedia.org/wiki/Harmonic_mean

The harmonic mean is the preferable method for averaging
multiples, such as the price/earning ratio, in which price is in
the numerator. If these ratios are averaged using an arithmetic
mean (a common error), high data points are given greater weights
than low data points. The harmonic mean, on the other hand, gives
equal weight to each data point.

*)
 let n = float_of_int (List.length t) in
 let denom=List.fold_left 
             (fun acc xi -> acc +. 1.0 /. xi) 0.0 t 
 in n /. denom

Saturday, March 22, 2014

Financial Modeling in Python : Windows 7, 64-bit ppf build status update

ppf build status update

Good results obtained on Windows 7 using Microsoft SDK 7.1 (msvc-10.0, x64), Boost-1.55.0, Blitz++-0.10, ActivePython-2.7.6 and Numpy (MKL-1.8).

There were a few 'tweaks' necessary to obtain the build. One thing to watch out for is the absence of these declarations in blitz-0.10/blitz/ms/bzconfig.h:

/* Macro for declaring aligned variables */
#ifndef BZ_ALIGN_VARIABLE
#  define BZ_ALIGN_VARIABLE(vartype,varname,alignment) \
     vartype varname; \
  /**/
#endif

/* Set SIMD instruction width in bytes */
#ifndef BZ_SIMD_WIDTH
#  define BZ_SIMD_WIDTH 1
#endif
I built ppf_date_time.pyd and ppf_math.pyd with a command along the lines of
bjam toolset=msvc-10.0 link=shared threading=multi variant=release address-model=64
I think the presence of these features confuses the install commands. I had to move the files into final position by hand. My user-config.jam was this:
import toolset : using ;

using msvc 
  : 10.0 
  : "c:/Program Files (x86)/Microsoft Visual Studio 10.0/VC/Bin/amd64/cl.exe" 
  ;

using python 
  : 2.7 
  : "c:/python27/python.exe"
  ;
and my site-config.jam read like this
import project ;
project.initialize $(__name__) ;
project site-config ;

alias boost_headers  : : : : 
  <include>c:/project/boost_1_55_0  
  ;

alias python_headers : : : : 
  <include>c:/python27/include 
  ;

alias numpy_headers : : : : 
  <include>c:/Python27/Lib/site-packages/numpy/core/include 
  ;

alias blitz_headers : : : : 
  <include>c:/project/blitz-0.10 
  ;

lib python 
  : 
  : <name>python27 
  : <search>c:/python27/libs 
    <link>shared 
  ;

lib boostpython 
  : 
  : <name>boost_python-vc100-mt-1_55 
    <search>c:/project/boost_1_55_0/stage/lib 
    <link>shared <variant>release 
    <toolset-msvc:version>10.0 
  ;

lib boostunittestframework 
  : 
  : <name>boost_unit_test_framework-vc100-mt-1_55 
    <search>c:/project/boost-1_55_0/stage/lib 
    <link>static <variant>release 
    <toolset-msvc:version>10.0 
  ;

lib boostdatetime 
  : 
  : <name>boost_date_time-vc100-mt-1_55 
    <search>c:/project/boost-1_55_0/stage/lib 
    <link>static 
    <variant>release 
    <toolset-msvc:version>10.0 
  ;
Lines 24-29 of ppf/ext/lib/date_time/src/register_date_more.cpp seem to trigger an ambiguous overload error - that needs looking at. I worked around it with the following for now.
  std::string to_string() const
  {
    //return this->get_override("to_string")();
    return "not implemented - fix me";
  }
After build & installation (python setup.py install), as always, don't forget to configure your environment approiately.
set PATH="c:/project/boost-1_55_0/stage/lib;%PATH%"
set PYTHONPATH="c:/python27/lib/site-packages;%PYTHONPATH%"
(of course make sure that the files c:/python27/lib/site-packages/ppf/date_time/ppf_date_time.pyd, c:/python27/lib/site-packages/ppf/math/ppf_math.pyd and c:/project/boost-1_55_0/stage/lib/boost_python-vc100-mt-1_55.dll exist).

Testing goes like this.

c:\Python27\Lib\site-packages\ppf\test>python test_all.py --verbose
python test_all.py --verbose
test_call (test_core.black_scholes_tests) ... ok
test_put (test_core.black_scholes_tests) ... ok
test (test_core.libor_rate_tests) ... ok
test (test_core.swap_rate_tests) ... ok
test_nth_imm_of_year (test_date_time.imm_tests) ... ok
test_first_imm_after (test_date_time.imm_tests) ... ok
test_first_imm_before (test_date_time.imm_tests) ... ok
test (test_date_time.business_day_tests) ... ok
test_discount_factor (test_hull_white.requestor_tests) ... ok
test_term_vol (test_hull_white.requestor_tests) ... ok
test (test_hull_white.state_tests) ... ok
test_numeraire_rebased_bond (test_hull_white.fill_tests) ... ok
test_libor (test_hull_white.fill_tests) ... ok
test_discounted_libor_rollback (test_hull_white.rollback_tests) ... ok
test_bond_option (test_hull_white.rollback_tests) ... ok
test_constant (test_hull_white.rollback_tests) ... ok
test_mean_and_variance (test_hull_white.evolve_tests) ... ok
test_bond (test_hull_white.evolve_tests) ... ok
test_explanatory_variables (test_hull_white.exercise_tests) ... ok
test_value (test_lattice_pricer.swap_tests) ... ok
test_value (test_lattice_pricer.bermudan_tests) ... ok
deep_in_the_money_test (test_lattice_pricer.moneyness_tests) ... ok
deep_out_the_money_test (test_lattice_pricer.moneyness_tests) ... ok
test (test_market.surface_tests) ... ok
test (test_math.linear_interpolation_tests) ... ok
test_list (test_math.cubic_spline_interpolation_tests) ... ok
test_array (test_math.cubic_spline_interpolation_tests) ... ok
test (test_math.solve_tridiagonal_system_tests) ... ok
test (test_math.solve_upper_diagonal_system_tests) ... ok
test (test_math.singular_value_decomposition_back_substitution_tests) ... ok
test (test_math.generalised_least_squares_fit_tests) ... ok
test1 (test_math.bisect_tests) ... ok
test2 (test_math.bisect_tests) ... ok
test (test_math.newton_raphson_tests) ... ok
test (test_math.piecewise_cubic_fitting_tests) ... ok
test1 (test_math.normal_distribution_integral_tests) ... ok
test2 (test_math.normal_distribution_integral_tests) ... ok
lognormal_martingale_test (test_math.integrator_tests) ... ok
tower_law_test (test_math.integrator_tests) ... ok
atm_option_test (test_math.integrator_tests) ... ok
no_trigger_limit_test (test_monte_carlo_pricer.tarn_tests) ... ok
trigger_after_first_flow_test (test_monte_carlo_pricer.tarn_tests) ... ok
monotonic_with_target_test (test_monte_carlo_pricer.tarn_tests) ... ok
test_value (test_monte_carlo_pricer.bermudan_tests) ... ok
deep_in_the_money_test (test_monte_carlo_pricer.moneyness_tests) ... ok
deep_out_the_money_test (test_monte_carlo_pricer.moneyness_tests) ... ok
test_lower_bound (test_utility.bound_tests) ... ok
test_upper_bound (test_utility.bound_tests) ... ok
test_equal_range (test_utility.bound_tests) ... ok
test_bound (test_utility.bound_tests) ... ok
test_bound_ci (test_utility.bound_tests) ... ok

----------------------------------------------------------------------
Ran 51 tests in 8.090s

OK

It seems there are builds now for Python 3.3 Numpy here. With these, it should be possible to produce a Python 3.3, 64-bit Windows ppf. This upgrade will require some minor changes to the ppf Python code.

Friday, March 21, 2014

Powers

Powers

bn is familiar when n is a whole integer (positive or negative). Here's an algorithm to compute it.
let rec pow (i, u) = 
  if i < 0 then 1.0 / (pow (-i, u))
  else if i = 0 then 1.0 else u * pow ((i-1), u)
It is also familiar to us when c/d = n that is, for n a rational number (then the solution is found by taking the dth root of bc). Less so though I think when x = n is some arbitrary real. That is, generally speaking how is bx computed?

Logarithms and exponentiation. Since b = e log b then bx = (elog b)x = e x (log b).
let powf (b, x) = exp (x * log b)
This only works when b is real and positive. The case negative b leads into the theory of complex powers.

“Sometimes,' said Pooh, 'the smallest things take up the most room in your heart.”