Saturday, December 13, 2014

Recognizers

In my last post, I gave an OCaml program to parse regular expressions. I intend however to show you, over the coming weeks, not just how to parse them, but how to compile them to recognizers too. Doing so will require me to share quite a lot of code that I gleaned from the book The Functional Approach to Programming by Guy Cousineau and Michel Mauny from which I learned how to do this. In doing so, I will along the way, provide updated code in modern OCaml as the book is presented in the Caml language a predecessor of today's OCaml, and not everywhere equivalent. Additionally, there will be functions of my own "filling in" gaps left as exercises in the book. That said, I don't think there's enough effort of my own to justify "plagiarizing" the material so blatantly so am determined to add at least a little value, where my limited skills allow, in order that I might redeem myself.

So today, let us start with a minimal combinator library implementing "recognizers" for recursive descent parsers, based on Cousineau and Mauny transliterated to the C++ programming language. I hope on reading the following, that you will agree, the multi-paradigm language C++-11 admits "The Functional Approach to Programming" as a "first class citizen" and the result here might be viewed as the beginning of a mini Boost.Spirit! Let's begin.

We are concerned with the problem of "recognizing" phrases belonging to a given language. We will produce programs (functions) that accept character strings as input which decide to accept or reject them. First some headers.

#include <boost/variant.hpp>
#include <boost/variant/apply_visitor.hpp>
#include <boost/utility.hpp>
#include <boost/range.hpp>
#include <boost/detail/lightweight_test.hpp>

#include <list>
#include <functional>
#include <numeric>
#include <string>

Recognizers work on lists. When recognition succeeds, part of the list is consumed and the part of the list remaining is returned or, recognition fails. So the type remaining<A> is a sum type with two cases, where A is the type of "tokens" (symbols) contained by the list (typically but not necessarily char).

//Recognition succeeded
template <class A> 
struct remains{ 
  std::list<A> left;
  template <class ItT>
    remains (ItT begin, ItT end) 
      : left (begin, end) 
    {}
};

//Recognition failed
template <class A>
struct recognition_fails {};

//Result of a recognizer. Indicates the result of attempting to
//recognize something from a list
template <class A> using remaining = 
  boost::variant<remains <A>, recognition_fails<A> >;
Recognizers then are but functions from lists of tokens to values of remaining<A>.
//A 'recognizer' is a function from a list to a value of remaining<A>
template <class A> using recognizer = 
  std::function<remaining<A>(std::list<A> const&)>;
Here's an example. This function (constant) produces a recognizer that matches the empty string. The recognizer it produces always succeeds and never consumes any input.
//A recognizer that recognizes the empty string. It always succeeds and
//no input is ever consumed
template <class A>
recognizer<A> epsilon ()
{
  return [](std::list<A> const& cs) -> remaining<A>
    {
      return remains<A> (cs.begin (), cs.end ());
    };
}
This next factory function produces recognizers that recognize elements that satisfy a user provided predicate.
//Given a predicate, 'recognizer_of_token' produces the recognizer
//associated with the elements that satisfy this predicate
template <class A, class F/*bool(A)*/>
recognizer<A> recognizer_of_token (F test)
{
  return
    [=] (std::list<A> const& cs) -> remaining<A>
    {
      if (cs.empty ())
        return recognition_fails<A> ();

      if (test (cs.front ()))
        return remains<A> (boost::next (cs.begin()), cs.end());

      return recognition_fails<A>();
    };
}
Recognizers can be composed. This function takes two recognizers and returns a recognizer that will, when presented with a list of tokens, attempt recognition via the first and if that doesn't work out, backtrack and attempt to recognize via the second. It's the "if then else" of recognizers.
//Recognizer disjunction
template <class A>
recognizer<A> compose_or (recognizer<A> p, recognizer<A> q)
{
  return [=](std::list<A> const& toks) -> remaining<A>
    {
      remaining <A> res = p (toks);
      if (remains<A>* rem = boost::get<remains<A> >(&res)) return *rem;

      return q (toks);
    };
}
This next function is the builder of the corresponding "and then" sort of recognizer.
//Recognizer conjunction
template <class A>
recognizer<A> compose_and (recognizer<A> p, recognizer<A> q)
{
  return [=](std::list<A> const& toks) -> remaining<A>
    {
      remaining <A> res = p (toks);
      if (remains<A>* rem = boost::get<remains<A> >(&res)) 
        return q (rem->left);

      return recognition_fails<A> ();
    };
}
With disjunction and conjunction of recognizers, we can implement a function to compute, given a recognizer, the corresponding Kleene star recognizer. That is, zero or more occurrences of a given phrase.
//The '*' iterator (Kleene star)
template <class A>
recognizer<A> zero_or_more (recognizer<A> p)
{
  return [=] (std::list<A> const& toks) -> remaining <A>
    {
      return compose_or (
         compose_and (p, zero_or_more<A> (p)) , epsilon<A> ()) (toks);
    };
}
Well, let's now get concrete. Here's a function to produce a recognizer of a specific token.
//A function to produce a recognizer that recognizes only the
//given character
template <A>
recognizer<A> char_ (A c)
{
  return recognizer_of_token<A>([=](A x) -> bool { return c == x; });
}
The composite recognizer for a given "string" of tokens can be constructed then like this.
//A function to produce a recognizer of a specific string
template <class C>
recognizer<typename boost::range_value<C>::type> string_ (C s)
{
  typedef typename boost::range_value<C>::type value;

  std::list <recognizer<value> > rs;

  typedef std::back_insert_iterator<std::list<recognizer<value> > > it_t;

  std::accumulate (
      boost::begin (s)
    , boost::end (s)
    , std::back_inserter (rs)
    , [](it_t dst, value c) -> it_t 
      { *dst++ = char_ (c); return dst; });

  return std::accumulate (
      boost::next (rs.begin ())
    , rs.end ()
    , rs.front ()
    , [](recognizer<value> acc, recognizer<value> r) -> recognizer<value>
      { return compose_and (acc, r); });
}
That will do for our mini-recognizer library for today. Let's turn attention to testing. First some utilities.
//Match on a remaining<A> returns 'true' if it contains a 'remains<A>'
//value with all input consumed, 'false' if it contains a 'recognition_fails<A>' value
template <class A>
struct accept_visitor
{
  typedef bool result_type;
  bool operator () (remains<A> const& r) const { return r.left.empty (); }
  bool operator () (recognition_fails<A> const& r) const { return false; }
};

//Function to determine if recognition was achieved
template <class A>
bool accept (remaining<A> const& r)
{
  return boost::apply_visitor( accept_visitor<A> (), r);
}

//Test if the provided recognizer can recognize the given string
bool parse (recognizer<char> parser, std::string const& s)
{
  return accept (parser (std::list<char>(s.begin (), s.end())));
}
Now, a little test.
int main ()
{
  //char_
  BOOST_TEST( parse (char_ ('a'), "a") );
  BOOST_TEST( !parse (char_ ('b'), "a") );
  BOOST_TEST( parse (char_ ('b'), "b") );

  //'*'
  BOOST_TEST( parse (zero_or_more (char_ ('a')), "aa") );
  BOOST_TEST( parse (zero_or_more (char_ ('a')), "") );
  BOOST_TEST( !parse (zero_or_more (char_ ('a')), "ab") );

  //string_
  BOOST_TEST (parse (string_ (std::string ("foo")), "foo") );
  BOOST_TEST (!parse (string_ (std::string ("foo")), "bar") );
  
  return boost::report_errors ();
}
Et voilĂ . L' approche fonctionnelle de C++! See you soon for more about regular expressions. Happy holidays!