Saturday, May 30, 2015

Run length encoding data compression method

Functional programming and lists go together like Fred and Ginger. This little exercise is one of Werner Hett's "Ninety-Nine Prolog Problems". The idea is to implement the run length encoding data compression method.

Here's how we start. First we write a function that packs consecutive duplicates of a list into sublists e.g.

# B.pack ['a'; 'a'; 'a'; 'b'; 'c'; 'c'; 'd'] ;;
- : char list list = [['a'; 'a'; 'a']; ['b']; ['c'; 'c']; ['d'];]
Then, consecutive duplicates of elements are encoded as terms $(N, E)$ where $N$ is the number of duplicates of the element $E$ e.g.
# B.rle (B.pack ['a'; 'a'; 'a'; 'b'; 'c'; 'c'; 'd'; 'e'; 'e']) ;;
- : (int * char) list = [(3, 'a'); (1, 'b'); (2, 'c'); (1, 'd'); (2, 'e')]
We will of course require a function to decode compressed data e.g.
 # B.unrle(B.rle(
     B.pack ['a'; 'a'; 'a'; 'b'; 'c'; 'c'; 'd'; 'e'; 'e'])
    ) ;;
- : char list = ['a'; 'a'; 'a'; 'b'; 'c'; 'c'; 'd'; 'e'; 'e']
(Credit goes to Harvey Stein for the names rle and unrle by the way).

So that's it for the first iteration - here's some code that aims to implement these specifications.
module B = struct

let pack (x : α list) : α list list =
  let f (acc : α list list) (c : α) : α list list =
    match acc with
    | (((b :: _) as hd) :: tl) when c = b -> (c :: hd) :: tl
    |  _ -> [c] :: acc
  in List.fold_left f [] x

let rle (x : α list list) : (int * α) list =
  let f (acc : (int * α) list) (l : α list) : (int * α) list =
    (List.length l, List.hd l) :: acc
  in List.fold_left f [] x
  
let unrle (data : (int * α) list) =
  let repeat ((n : int), (c : α)) : α list =
    let rec aux acc i = if i = 0 then acc else aux (c :: acc) (i - 1) in
    aux [] n in
  let f (acc : α list) (elem : (int * α)) : α list =
    acc @ (repeat elem) in
  List.fold_left f [] data
  
end

Now, pack is just a device of course. We don't really need it so here's the next iteration that does away with it.

module E = struct

let rle (x : α list) : (int * α) list =
  let f (acc : (int * α) list) (c : α) : (int * α) list =
    match acc with
    | ((n, e) :: tl) when e = c -> (n + 1, c):: tl
    | _-> (1, c) :: acc
  in List.rev (List.fold_left f [] x)

let unrle (data : (int * α) list) =
  let repeat ((n : int), (c : α)) : α list =
    let rec aux acc i = if i = 0 then acc else aux (c :: acc) (i - 1) in
    aux [] n in
  let f (acc : α list) (elem : (int * α)) : α list =
    acc @ (repeat elem) in
  List.fold_left f [] data

end

Nifty!

Ok, the next idea is that when a singleton byte is encountered, we don't write the term $(1, E)$ instead, we just write $E$. Now OCaml doesn't admit heterogenous lists like Prolog appears to do so we need a sum type for the two possibilities. This then is the final version.

module F = struct

  type α t = | S of α | C of (int * α)

  let rle (bytes : α list) : α t list =
    let f (acc : α t list) (b : α) : α t list =
      match acc with
      | ((S e) :: tl) when e = b -> (C (2, e)) :: tl
      | ((C (n, e)) :: tl) when e = b -> (C (n + 1, b)) :: tl
      | _-> S b :: acc
    in List.rev (List.fold_left f [] bytes)

  let unrle (data : (α t) list) =
    let rec aux (acc : α list) (b : α) : (int -> α list) = 
      function | 0 -> acc | i -> aux (b :: acc) b (i - 1) in
    let f (acc : α list) (e : α t) : α list =
      acc @ (match e with | S b -> [b]| C (n, b) -> aux [] b n) in
    List.fold_left f [] data

end

Having worked out the details in OCaml, translation into C++ is reasonably straight-forward. One economy granted by this language is that we can do away with the data constructor S in this version.

#include <boost/variant.hpp>
#include <boost/variant/apply_visitor.hpp>
#include <boost/range.hpp>
#include <boost/range/numeric.hpp>

#include <list>

//Representation of the encoding
template <class A> struct C { std::pair <int, A> item; };
template <class A> using datum = boost::variant <A, C<A>>;
template <class A> using encoding = std::list<datum<A>>;

//Procedural function object that updates an encoding given a
//datum
template <class A>
struct update : boost::static_visitor<> {
  A c;
  encoding<A>& l;

  update (A c, encoding<A>& l) : c (c), l (l) 
  {}

  void operator ()(A e) const { 
    if (e == c) {
      l.back () = C<A>{ std::make_pair(2, c) }; 
      return;
    }
    l.push_back (c);
  }

  void operator ()(C<A> const& elem) const { 
    if (elem.item.second == c) {
      l.back () = C<A>{ std::make_pair (elem.item.first + 1, c) };
      return;
    }
    l.push_back (c);
  }
};

template <class R>
encoding<typename boost::range_value<R>::type> rle (R bytes) {
  typedef boost::range_value<R>::type A;

  auto f = [](encoding<A> acc, A b) -> encoding<A> {
    if (acc.size () == 0)
      acc.push_back (b);
    else   {
      boost::apply_visitor (update<A>(b, acc), acc.back ());
    }
    return acc;
  };

  return boost::accumulate (bytes, encoding<A> (), f);
}

I've left implementing unrle () as an exercise. Here's a little test though that confirms that we are getting savings in from the compression scheme as we hope for.

int main () {
  std::string buf=
    "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
    "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
    "c"
    "ddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd"
    "ddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd"
    "ddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd"
    "ddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd"
    "ddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd"
    "ddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd"
    "eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee"
    "z";
  std::list<char> data(buf.begin (), buf.end());
  encoding<char> compressed = rle (data);

  std::cout << sizeof (char) * (data.size ()) << std::endl;
  std::cout << sizeof (datum <char>) * (compressed.size ()) << std::endl;

  return 0;
}
On my machine, this program prints the values $484$ and $72$.

Sunday, May 24, 2015

Church Numerals

This is just a little fun. Jason Hickey in "Introduction to Objective Caml" poses some little end of chapter problems to define arithmetic operations for a type of unary (base-1) natural numbers. The type is

type num = Z | S of num
where Z represents the number zero and if i is a unary number, then S i is i + 1.

This formulation of Church numerals using a recursive type and pattern matching means in truth, the problems can be solved in less than 5 minutes or so. Of course, the real Church numerals are numbers encoded in functions

  • $c_{0} = \lambda s.\lambda z.\;z$
  • $c_{1} = \lambda s.\lambda z.\;s\;z$
  • $c_{2} = \lambda s.\lambda z.\;s\;(s\;z)$
  • $c_{3} = \lambda s.\lambda z.\;s\;(s\;(s\;z))$
  • $\cdots$
and their represenation including arithmetic operations can be formulated in OCaml too (and it's a good exercise but harder than what we are going to do here -- if you'd like to see more about that, have a look at this Cornell lecture).

Alright, without further ado, here we go then.

type num = Z | S of num

let scc (x : num) : num = S x
let prd : num -> num = function | S n -> n | _ -> Z

let rec add (x : num) (y : num) : num =
  match (x, y) with 
  | (Z, _) -> y
  | (_, Z) -> x
  | (S m, n) -> scc (add m n)

let rec sub (x : num) (y : num) : num =
  match (x, y) with
  | (Z, _) -> Z
  | (n, Z) -> n
  | (S m, n) -> sub m (prd n)

let rec mul (x : num) (y : num) : num = 
  match (x, y) with
  | (Z, _) -> Z
  | (_, Z) -> Z
  | (S Z, x) -> x
  | (x, S Z) -> x
  | (S m, n) -> add (mul m n) n

let rec to_int : num -> int = function | Z -> 0 | S n -> 1 + to_int n
let rec from_int (x : int)  = if x = 0 then Z else scc (from_int (x - 1))
For example, in the top-level we can write things like,
# to_int (mul (sub (from_int 23) (from_int 11)) (from_int 2));;
- : int = 24

The main thing I find fun about this little program though is how obvious its mapping to C++. Of course you need a discriminated union type my default choice being boost::variant<> (by the way, standardization of a variant type for C++ is very much under active discussion and development, see N4450 for example from April this year - it seems that support for building recursive types might not be explicitly provided though... That would be a shame in my opinion and if that's the case, I beg the relevant parties to reconsider!).

#include <boost/variant.hpp>
#include <boost/variant/apply_visitor.hpp>

#include <stdexcept>
#include <iostream>

struct Z;
struct S;

typedef boost::variant<Z, boost::recursive_wrapper<S>> num;

struct Z {};
struct S { num i; };

int to_int (num const& i);

struct to_int_visitor 
  : boost::static_visitor<int> {
  int operator ()(Z const& n) const { return 0; }
  int operator ()(S const& n) const { return 1 + to_int (n.i); }
};

int to_int (num const& i) {
  return boost::apply_visitor (to_int_visitor (), i);
}

num from_int (int i) {
  if (i == 0){
    return Z {};
  }
  else{
    return S {from_int (i - 1)};
  }
}

num add (num l, num r);

struct add_visitor : boost::static_visitor<num> {
  num operator () (Z, S s) const { return s; }
  num operator () (S s, Z) const { return s; }
  num operator () (Z, Z) const { return Z {}; }
  num operator () (S s, S t) const { return S { add (s.i, t) }; }
};

num add (num l, num r) {
  return boost::apply_visitor (add_visitor (), l, r);
}

num succ (num x) { return S{x}; }

struct prd_visitor : boost::static_visitor<num>{
  num operator () (Z z) const { return z; }
  num operator () (S s) const { return s.i; }
};

num prd (num x) {
  return boost::apply_visitor(prd_visitor (), x);
}

num sub (num x, num y);

struct sub_visitor : boost::static_visitor<num> {
  num operator () (Z, Z) const { return Z {}; }
  num operator () (Z, S) const { return Z {}; }
  num operator () (S m, Z) const { return m; }
  num operator () (S m, S n) const { return sub (m.i, prd (n)); }
};

num sub (num x, num y) {
  return boost::apply_visitor (sub_visitor (), x, y);
}

//Tests

int main () {

  num zero = Z {};
  num one = succ (zero);
  num two = succ (succ (zero));
  num three = succ (succ (succ (zero)));

  std::cout << to_int (add (two, three)) << std::endl;
  std::cout << to_int (sub (from_int (23), from_int (12))) << std::endl;

  return 0;
}
I didn't get around to implementing mul in the above. Consider it an "exercise for the reader"!