Shayne Fletcher: May 2015

Saturday, May 30, 2015

Run length encoding data compression method

Functional programming and lists go together like Fred and Ginger. This little exercise is one of Werner Hett's "Ninety-Nine Prolog Problems". The idea is to implement the run length encoding data compression method.

Here's how we start. First we write a function that packs consecutive duplicates of a list into sublists e.g.

# B.pack ['a'; 'a'; 'a'; 'b'; 'c'; 'c'; 'd'] ;;
- : char list list = [['a'; 'a'; 'a']; ['b']; ['c'; 'c']; ['d'];]

Then, consecutive duplicates of elements are encoded as terms $(N, E)$ where $N$ is the number of duplicates of the element $E$ e.g.

# B.rle (B.pack ['a'; 'a'; 'a'; 'b'; 'c'; 'c'; 'd'; 'e'; 'e']) ;;
- : (int * char) list = [(3, 'a'); (1, 'b'); (2, 'c'); (1, 'd'); (2, 'e')]

We will of course require a function to decode compressed data e.g.

 # B.unrle(B.rle(
     B.pack ['a'; 'a'; 'a'; 'b'; 'c'; 'c'; 'd'; 'e'; 'e'])
    ) ;;
- : char list = ['a'; 'a'; 'a'; 'b'; 'c'; 'c'; 'd'; 'e'; 'e']

(Credit goes to Harvey Stein for the names rle and unrle by the way).

So that's it for the first iteration - here's some code that aims to implement these specifications.

module B = struct

let pack (x : α list) : α list list =
  let f (acc : α list list) (c : α) : α list list =
    match acc with
    | (((b :: _) as hd) :: tl) when c = b -> (c :: hd) :: tl
    |  _ -> [c] :: acc
  in List.fold_left f [] x

let rle (x : α list list) : (int * α) list =
  let f (acc : (int * α) list) (l : α list) : (int * α) list =
    (List.length l, List.hd l) :: acc
  in List.fold_left f [] x
  
let unrle (data : (int * α) list) =
  let repeat ((n : int), (c : α)) : α list =
    let rec aux acc i = if i = 0 then acc else aux (c :: acc) (i - 1) in
    aux [] n in
  let f (acc : α list) (elem : (int * α)) : α list =
    acc @ (repeat elem) in
  List.fold_left f [] data
  
end

Now, pack is just a device of course. We don't really need it so here's the next iteration that does away with it.

module E = struct

let rle (x : α list) : (int * α) list =
  let f (acc : (int * α) list) (c : α) : (int * α) list =
    match acc with
    | ((n, e) :: tl) when e = c -> (n + 1, c):: tl
    | _-> (1, c) :: acc
  in List.rev (List.fold_left f [] x)

let unrle (data : (int * α) list) =
  let repeat ((n : int), (c : α)) : α list =
    let rec aux acc i = if i = 0 then acc else aux (c :: acc) (i - 1) in
    aux [] n in
  let f (acc : α list) (elem : (int * α)) : α list =
    acc @ (repeat elem) in
  List.fold_left f [] data

end

Nifty!

Ok, the next idea is that when a singleton byte is encountered, we don't write the term $(1, E)$ instead, we just write $E$. Now OCaml doesn't admit heterogenous lists like Prolog appears to do so we need a sum type for the two possibilities. This then is the final version.

module F = struct

  type α t = | S of α | C of (int * α)

  let rle (bytes : α list) : α t list =
    let f (acc : α t list) (b : α) : α t list =
      match acc with
      | ((S e) :: tl) when e = b -> (C (2, e)) :: tl
      | ((C (n, e)) :: tl) when e = b -> (C (n + 1, b)) :: tl
      | _-> S b :: acc
    in List.rev (List.fold_left f [] bytes)

  let unrle (data : (α t) list) =
    let rec aux (acc : α list) (b : α) : (int -> α list) = 
      function | 0 -> acc | i -> aux (b :: acc) b (i - 1) in
    let f (acc : α list) (e : α t) : α list =
      acc @ (match e with | S b -> [b]| C (n, b) -> aux [] b n) in
    List.fold_left f [] data

end

Having worked out the details in OCaml, translation into C++ is reasonably straight-forward. One economy granted by this language is that we can do away with the data constructor S in this version.

#include <boost/variant.hpp>
#include <boost/variant/apply_visitor.hpp>
#include <boost/range.hpp>
#include <boost/range/numeric.hpp>

#include <list>

//Representation of the encoding
template <class A> struct C { std::pair <int, A> item; };
template <class A> using datum = boost::variant <A, C<A>>;
template <class A> using encoding = std::list<datum<A>>;

//Procedural function object that updates an encoding given a
//datum
template <class A>
struct update : boost::static_visitor<> {
  A c;
  encoding<A>& l;

  update (A c, encoding<A>& l) : c (c), l (l) 
  {}

  void operator ()(A e) const { 
    if (e == c) {
      l.back () = C<A>{ std::make_pair(2, c) }; 
      return;
    }
    l.push_back (c);
  }

  void operator ()(C<A> const& elem) const { 
    if (elem.item.second == c) {
      l.back () = C<A>{ std::make_pair (elem.item.first + 1, c) };
      return;
    }
    l.push_back (c);
  }
};

template <class R>
encoding<typename boost::range_value<R>::type> rle (R bytes) {
  typedef boost::range_value<R>::type A;

  auto f = [](encoding<A> acc, A b) -> encoding<A> {
    if (acc.size () == 0)
      acc.push_back (b);
    else   {
      boost::apply_visitor (update<A>(b, acc), acc.back ());
    }
    return acc;
  };

  return boost::accumulate (bytes, encoding<A> (), f);
}

I've left implementing unrle () as an exercise. Here's a little test though that confirms that we are getting savings in from the compression scheme as we hope for.

int main () {
  std::string buf=
    "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
    "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
    "c"
    "ddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd"
    "ddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd"
    "ddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd"
    "ddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd"
    "ddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd"
    "ddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd"
    "eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee"
    "z";
  std::list<char> data(buf.begin (), buf.end());
  encoding<char> compressed = rle (data);

  std::cout << sizeof (char) * (data.size ()) << std::endl;
  std::cout << sizeof (datum <char>) * (compressed.size ()) << std::endl;

  return 0;
}

On my machine, this program prints the values $484$ and $72$.

Sunday, May 24, 2015

Church Numerals

This is just a little fun. Jason Hickey in "Introduction to Objective Caml" poses some little end of chapter problems to define arithmetic operations for a type of unary (base-1) natural numbers. The type is

type num = Z | S of num

where Z represents the number zero and if i is a unary number, then S i is i + 1.

This formulation of Church numerals using a recursive type and pattern matching means in truth, the problems can be solved in less than 5 minutes or so. Of course, the real Church numerals are numbers encoded in functions

$c_{0} = \lambda s.\lambda z.\;z$
$c_{1} = \lambda s.\lambda z.\;s\;z$
$c_{2} = \lambda s.\lambda z.\;s\;(s\;z)$
$c_{3} = \lambda s.\lambda z.\;s\;(s\;(s\;z))$
$\cdots$

and their represenation including arithmetic operations can be formulated in OCaml too (and it's a good exercise but harder than what we are going to do here -- if you'd like to see more about that, have a look at this Cornell lecture).

Alright, without further ado, here we go then.

type num = Z | S of num

let scc (x : num) : num = S x
let prd : num -> num = function | S n -> n | _ -> Z

let rec add (x : num) (y : num) : num =
  match (x, y) with 
  | (Z, _) -> y
  | (_, Z) -> x
  | (S m, n) -> scc (add m n)

let rec sub (x : num) (y : num) : num =
  match (x, y) with
  | (Z, _) -> Z
  | (n, Z) -> n
  | (S m, n) -> sub m (prd n)

let rec mul (x : num) (y : num) : num = 
  match (x, y) with
  | (Z, _) -> Z
  | (_, Z) -> Z
  | (S Z, x) -> x
  | (x, S Z) -> x
  | (S m, n) -> add (mul m n) n

let rec to_int : num -> int = function | Z -> 0 | S n -> 1 + to_int n
let rec from_int (x : int)  = if x = 0 then Z else scc (from_int (x - 1))

For example, in the top-level we can write things like,

# to_int (mul (sub (from_int 23) (from_int 11)) (from_int 2));;
- : int = 24

The main thing I find fun about this little program though is how obvious its mapping to C++. Of course you need a discriminated union type my default choice being boost::variant<> (by the way, standardization of a variant type for C++ is very much under active discussion and development, see N4450 for example from April this year - it seems that support for building recursive types might not be explicitly provided though... That would be a shame in my opinion and if that's the case, I beg the relevant parties to reconsider!).

#include <boost/variant.hpp>
#include <boost/variant/apply_visitor.hpp>

#include <stdexcept>
#include <iostream>

struct Z;
struct S;

typedef boost::variant<Z, boost::recursive_wrapper<S>> num;

struct Z {};
struct S { num i; };

int to_int (num const& i);

struct to_int_visitor 
  : boost::static_visitor<int> {
  int operator ()(Z const& n) const { return 0; }
  int operator ()(S const& n) const { return 1 + to_int (n.i); }
};

int to_int (num const& i) {
  return boost::apply_visitor (to_int_visitor (), i);
}

num from_int (int i) {
  if (i == 0){
    return Z {};
  }
  else{
    return S {from_int (i - 1)};
  }
}

num add (num l, num r);

struct add_visitor : boost::static_visitor<num> {
  num operator () (Z, S s) const { return s; }
  num operator () (S s, Z) const { return s; }
  num operator () (Z, Z) const { return Z {}; }
  num operator () (S s, S t) const { return S { add (s.i, t) }; }
};

num add (num l, num r) {
  return boost::apply_visitor (add_visitor (), l, r);
}

num succ (num x) { return S{x}; }

struct prd_visitor : boost::static_visitor<num>{
  num operator () (Z z) const { return z; }
  num operator () (S s) const { return s.i; }
};

num prd (num x) {
  return boost::apply_visitor(prd_visitor (), x);
}

num sub (num x, num y);

struct sub_visitor : boost::static_visitor<num> {
  num operator () (Z, Z) const { return Z {}; }
  num operator () (Z, S) const { return Z {}; }
  num operator () (S m, Z) const { return m; }
  num operator () (S m, S n) const { return sub (m.i, prd (n)); }
};

num sub (num x, num y) {
  return boost::apply_visitor (sub_visitor (), x, y);
}

//Tests

int main () {

  num zero = Z {};
  num one = succ (zero);
  num two = succ (succ (zero));
  num three = succ (succ (succ (zero)));

  std::cout << to_int (add (two, three)) << std::endl;
  std::cout << to_int (sub (from_int (23), from_int (12))) << std::endl;

  return 0;
}

I didn't get around to implementing mul in the above. Consider it an "exercise for the reader"!