This post picks up from here which was concerned with parsing - obtaining representations of regular expressions as abstract syntax trees. The ultimate goal is, given a string representation of a regular expression $e$ , produce a 'recognizer' for the expression (referred to as compiling a regular expression). That is, a function string -> bool
that can be used to categorize strings as either belonging to the language $\mathcal{L_{e}}$ or not.
Having produced an abstract syntax tree for a regular expression $e$, the first step in compiling the expression is to compute an abstract syntax tree of the corresponding augmented regular expression $(e)\#$. This augmented regular expression is the original expression $e$ concatenated with a unique end-marker $\#$. For the given expression $e$, the accepting state for $e$ is given a transition on $\#$. This is a device that allows us to "forget" about accepting states as the computation of a recognizer proceeds; when the construction is complete, any state with a transition on $\#$ must be an accepting state.
Leaves in the abstract syntax tree of the augmented regular expression $(e)\#$ are labeled by $\epsilon$ or a symbol from from $\mathcal{A}$. For those non-$\epsilon$ leaves we attach a unique integer. Accordingly, we will need functions to generate unique integers (positions) that we will employ as we transform the AST of $e$ into the AST of the augmented expression $(e)\#$ leading to our first code example.
let reset_label, generate_label = let r = ref (-1) in ((fun () -> r := (-1)), (fun () -> r := !r + 1; !r))
As we construct the syntax tree of the $(e)\#$ we compute four functions : null_pos, first_pos, last_pos and following:
null_pos
is $true$ for a syntax-tree node $n$ if and only if the sub-expression represented by $n$ has $\epsilon$ in its language. That is, $true$ if the regular sub-expression recognizes the empty string and $false$ otherwise;first_pos
is the set of positions in the sub-tree rooted at $n$ that correspond to the first symbol of at least one string in the language of the sub-expression rooted at $n$. That is, the set of symbols that can begin a string recognized by the regular sub-expression;last_pos
is the set of positions in the sub-tree rooted at the syntax-tree node $n$ that corresponds to the last symbol of at least one string in the language of the sub-expression rooted at $n$. That is, the set of symbols that can terminate a string recognized by the regular sub-expression;following
, for a position $p$ is the set of positions $q$ in the entire syntax-tree such that there is some string $x = a_{1}a_{2} \cdots a_{n}$ in $\mathcal{L_{(e)\#}}$ such that for some $i$, there is some way to explain the membership of $x$ in $\mathcal{L_{(e)\#}}$ by matching $a_{i}$ to $p$ and $a_{i + 1}$ to a position in $q$.
following
is the last to compute as it relies upon the values of first_pos
and last_pos
. If the definition is confusing for now, don't worry about it. The rules for computing following
will come later and will be obvious at that point. We'll focus for now on null_pos
, first_pos
and last_pos
.
The results of first_pos
, last_pos
and follow
are sets of integers. Accordingly, we are going to need a type to represent these.
module Int_set : Set.S with type elt = int = Set.Make ( struct let compare = Pervasives.compare type t = int end)With this we can present the type of ASTs for augmented regular expressions.
type augmented_regexp = | Epsilon | Character of char * int | Sequence of augmented_regexp * augmented_regexp * pos | Alternative of augmented_regexp * augmented_regexp * pos | Repetition of augmented_regexp * pos | Accept of int and pos = { null:bool; first:Int_set.t; last:Int_set.t; }
For a given node $n$, the values of its pos
record depend only on the sub-expressions of that node. Assuming constructed augmented regular expression syntax trees, we can write null_pos
, first_pos
and last_pos
like this.
let (null_pos : augmented_regexp -> bool) = fun x -> match x with | Epsilon -> true | Character (_, i) -> false | Sequence (_, _, p) -> p.null | Alternative (_, _, p) -> p.null | Repetition (_, p) -> p.null | Accept _ -> false let (first_pos : augmented_regexp -> Int_set.t) = fun x -> match x with | Epsilon -> Int_set.empty | Character (_, i) -> Int_set.add i (Int_set.empty) | Alternative (_, _, p) -> p.first | Repetition (_, p) -> p.first | Sequence (_, _, p) -> p.first | Accept i -> Int_set.add i (Int_set.empty) let (last_pos : augmented_regexp -> Int_set.t) = fun x -> match x with | Epsilon -> Int_set.empty | Character (_, i) -> Int_set.add i (Int_set.empty) | Alternative (_, _, p) -> p.last | Repetition (_, p) -> p.last | Sequence (_, _, p) -> p.last | Accept i -> Int_set.add i (Int_set.empty)
Our strategy in building the syntax-tree of $(e)\#$ from the syntax tree of $e$ will be to visit each node of $e$ and invoke a function to construct the corresponding node of $(e)\#$ inductively. These functions will include the generation of unique integers for the non-$\epsilon$ leaves and encode the rules for building the pos
records:
-
null
Sequence
$(e_{1}, e_{2})$ :null_pos
$e_{1}$ andnull_pos
$e_{2}$Alternative
$(e_{1}, e_{2})$ :null_pos
$e_{1}$ ornull_pos
$e_{2}$Repetition
: $true$
-
first
Alternative
$(e_{1}, e_{2})$ :first_pos
$e_{1} \cup$first_pos
$e_{2}$Sequence
$(e_{1}, e_{2})$ : ifnull_pos
$e_{1}$ thenfirst_pos
$e_{1} \cup$first_pos
$e_{2}$ elsefirst_pos
$e_{1}$Repetition
$e$ :first_pos
$e$
-
last
Alternative
$(e_{1}, e_{2})$ :last_pos
$e_{1} \cup$last_pos
$e_{2}$Sequence
$(e_{1}, e_{2})$ : ifnull_pos
$e_{2}$ thenlast_pos
$e_{1} \cup$last_pos
$e_{2}$ elselast_pos
$e_{2}$Repetition
$e$ :last_pos
$e$
Here then are the augmented regular expression syntax-tree node constructor functions.
let (epsilon : unit -> augmented_regexp) = fun () -> Epsilon and (character : char -> augmented_regexp) = fun c -> Character (c, generate_label ()) and (repetition : augmented_regexp -> augmented_regexp) = fun e -> Repetition (e, {null=true;first=first_pos e; last=last_pos e}) and (alternative : augmented_regexp -> augmented_regexp -> augmented_regexp) = fun e1 e2 -> Alternative (e1, e2, {null=null_pos e1 || null_pos e2; first=Int_set.union (first_pos e1)(first_pos e2); last=Int_set.union (last_pos e1) (last_pos e2)}) and (sequence : augmented_regexp -> augmented_regexp -> augmented_regexp) = fun e1 e2 -> let b1 = null_pos e1 and b2 = null_pos e2 in Sequence (e1, e2, {null=b1 && b2; first= if b1 then Int_set.union (first_pos e1)(first_pos e2) else first_pos e1; last= if b2 then Int_set.union (last_pos e1) (last_pos e2) else last_pos e2}) let (accept : augmented_regexp -> augmented_regexp) = fun e -> sequence e (Accept (generate_label ()))
We are now in a position to write the function that transforms a syntax tree of the regular expression $e$ into the syntax tree of the augmented regular expression $(e)\#$.
let rec (augmented_regexp : Syntax.regular_expression -> augmented_regexp) = fun x -> match x with | Syntax.Epsilon -> epsilon () | Syntax.Character i -> character (Char.chr i) | Syntax.Sequence (x, y) -> (*Be very careful here. Evaluation order matters!*) let x' = (augmented_regexp x) and y' = (augmented_regexp y) in sequence x' y' | Syntax.Alternative (x, y) -> (*Be very careful here. Evaluation order matters!*) let x' = (augmented_regexp x) and y' = (augmented_regexp y) in alternative x' y' | Syntax.Repetition x -> repetition (augmented_regexp x)
We can wrap all of the above up in a convenience function parse_augmented_regexp
which first parses a string to build the syntax tree of the regular expression it represents and then transforms the result into the syntax tree of the corresponding augmented regular expression.
let (parse_augmented_regexp : string-> augmented_regexp * int) = fun s -> let () = reset_label () in let ast = regexp_of_string s in let re1 = augmented_regexp ast in let re2 = accept re1 in let count = generate_label () in (re2, count)Notice that this function returns a pair of the syntax-tree and the number of positions it contains.
The next step in compiling a recognizer from the expression $(e)\#$ is to compute the follow
function. To do this we "unite" the information encoded by the first_pos
and last_pos
functions. Put plainly, follow
is a function that takes each symbol (position) in the regular expression to the (set of) symbols (positions) that can follow it. The information is stored in an array of length equal to the number of symbols appearing in the regular expression. There are only two ways a position in a regular expression can follow another:
- If $n$ is a
Sequence
node with left child $c_{1}$ and right child $c_{2}$, then for every position $i$ inlastpos
$c_{1}$, all positions infirstpos
$c_{2}$ are infollow_pos
$i$ - If $n$ is a
Repition
and $i$ a position inlastpos
$n$, then all positions infirst_pos
$n$ are infollow_pos
$i$
follow
the code below also stores the association between positions and characters of the regular expression. That information goes into an array. The elements of the array have type char option
since the Accept
symbol has a position but no character associated with it.
let (compute_follow : Int_set.t array -> char option array -> augmented_regexp -> unit) = fun follow chars x -> let rec compute x = match x with | Sequence (e1, e2, p) -> compute e1; compute e2; let first2 = first_pos e2 in let f i = follow.(i) <- Int_set.union first2 (follow.(i)) in Int_set.iter f (last_pos e1) | Repetition (e, p) -> compute e; let f i = follow.(i) <- Int_set.union (p.first) (follow.(i)) in Int_set.iter f (p.last) | Alternative (e1, e2, p) -> compute e1; compute e2 | Epsilon -> () | Accept i -> chars.(i) <- None | Character (c, i) -> chars.(i) <- Some c in compute x
Now the computation of the augmented regular expression syntax-tree and all four of the associated functions together with the mapping from positions to symbols of $\mathcal{A}$ can be wrapped up in another "high-level" convenience function.
let (regexp_follow : string -> augmented_regexp * Int_set.t array * char option array) = fun s -> let re, n = parse_augmented_regexp s in let follow = Array.make n (Int_set.empty) in let chars = Array.make n None in compute_follow follow chars re; (re, follow, chars)
We're in good shape - but a hop, skip and a jump to computing a recognizer from a regular expression. We'll leave off here on this local maxima for today and finish off the job in the next post!