Here's a brief refresher on regular expressions derived from Cosineau and Mauny.

Given an alphabet $\mathcal{A}$, a regular expression $e$ over $\mathcal{A}$ is a concise way to describe a set $\mathcal{L}$ of strings of elements of $\mathcal{A}$. The set $\mathcal{L}_{e}$ denoted by an expression $e$ is termed the *regular language* denoted by $e$.

In their most basic form, regular expressions can be defined like this:

- $\epsilon$ represents the set $\mathcal{L}_{\epsilon} = \left\{\epsilon\right\}$ containing only the empty string
- A symbol $c$ of $\mathcal{A}$ denotes $\mathcal{L}_{c} = \left\{c\right\}$
- $e_{1}e_{2}$ is the set $\mathcal{L}_{e_{1}}\cdot\mathcal{L}_{e_{2}} = \left\{s_{1}s_{2} | s_{1} \in \mathcal{L}_{e_{1}}, s_{2} \in \mathcal{L}_{e_{2}}\right\}$
- $e_{1}|e_{2}$ is the set $\mathcal{L}_{e_{1}}\bigcup \mathcal{L}_{e_{2}}$
- $e^{*}$ is the union $\bigcup_{i=0}^{\infty}\mathcal{L}_{e}^{i}$ where $\mathcal{L}_{e}^{0} = \mathcal{L}_{\epsilon}$ and $\mathcal{L}_{e}^{i} = \mathcal{L}_{e}^{i-1}\cdot\mathcal{L}_{e}$

*Kleene closure operator*and $e^{*}$ equivalent to the infinite regular expression $\epsilon|e|ee|eee|\dots$

Regular expressions provide a convenient syntax for searching plain text data for lines matching specific patterns as demonstrated by the venerable Unix function 'grep' (I was taught grep is an acronym for **g**eneralized **r**egular **e**xpression **p**attern-matching). To produce a function like grep requires a program for analyzing a regular expression $e$ and computing a recognizer for $\mathcal{L}_{e}$, that is a function `string -> bool`

, that can be used to classify a string $s$ (say) in terms of whether or not $s \in \mathcal{L}_{e}$. This procedure of computing a recognizer for $\mathcal{L}_{e}$ from $e$ is called *compiling* a regular expression.

Writing a regular expression compiler is not for the faint-of-heart! Doing so requires more work than can be reasonably described in one blog-post so here, I'll focus only on the first part - parsing a regular expression. That is, given a string $s$ a program to produce an abstract syntax tree describing the regular expression $e$ represented by $s$.

To begin, we define the type of regular expressions.

type regular_expression = | Epsilon | Character of int | Sequence of regular_expression * regular_expression | Alternative of regular_expression * regular_expression | Repetition of regular_expressionWe could proceed from here by hand-crafting a recursive descent parser or we might take advantage of a parser generator system like ocamllex/ocamlyacc which is the choice I favored. On reaching that decision it occurred to me that, since ocamllex itself treats regular expressions as inputs, the source code for ocamllex itself might contain a suitable grammar for the expression type above and indeed it does! Things at this point become marvelously recursive - to build a regular expression compiler we can bootstrap (at both the source and binary levels) from an existing regular expression compiler (which itself is a bootstrap by the way)!

The code that follows is an adaptation of the parser and lexer input files for ocamllex (the copyright of Xavier Leroy).

/*parser.mly*/ %{ open Syntax %} %token <int> Tchar %token Tend Tor Tlbracket Trbracket %token Tstar Tmaybe Tplus Tlparen Trparen %left Tor %nonassoc CONCAT %nonassoc Tmaybe Tstar Tplus %nonassoc Tchar Tlbracket Tlparen %start main %type <Syntax.regular_expression> main %% main: | regexp Tend { $1 } ; regexp: | Tchar { Character $1 } | regexp Tstar { Repetition $1 } | regexp Tmaybe { Alternative (Epsilon, $1) } | regexp Tplus { Sequence (Repetition ($1), $1)} | regexp Tor regexp { Alternative ($1, $3) } | regexp regexp %prec CONCAT { Sequence ($1, $2) } | Tlparen regexp Trparen { $2 } ;Here is the corresponding lexer input file.

{ open Parser let incr_loc lexbuf delta = let pos = lexbuf.Lexing.lex_curr_p in lexbuf.Lexing.lex_curr_p <- { pos with Lexing.pos_lnum = pos.Lexing.pos_lnum + 1; Lexing.pos_bol = pos.Lexing.pos_cnum - delta; } let char_for_backslash = function | 'n' -> '\010' | 'r' -> '\013' | 'b' -> '\008' | 't' -> '\009' | c -> c } let special_chars = ['|' '*' '?' '+' '(' ')'] let backslash_escapes = ['\\' '\'' '"' 'n' 't' 'b' 'r' ' '] rule main = parse [' ' '\013' '\009' '\012' ] + { main lexbuf } | '\010' { incr_loc lexbuf 0; main lexbuf } | [^ '\\' '|' '*' '?' '+' '(' ')'] { Tchar(Char.code (Lexing.lexeme_char lexbuf 0)) } | '\\' backslash_escapes { Tchar (Char.code (char_for_backslash (Lexing.lexeme_char lexbuf 1))) } | '\\' special_chars { Tchar (Char.code (Lexing.lexeme_char lexbuf 1)) } | '|' { Tor } | '*' { Tstar } | '?' { Tmaybe } | '+' { Tplus } | '(' { Tlparen } | ')' { Trparen } | eof { Tend }The function to produce an abstract syntax tree from a string then goes like this.

let (regexp_of_string : string -> Syntax.regular_expression) = fun s -> let parse_buf lexbuf = try Parser.main Lexer.main lexbuf with | Parsing.Parse_error -> begin let curr = lexbuf.Lexing.lex_curr_p in let line = curr.Lexing.pos_lnum in let cnum = curr.Lexing.pos_cnum - curr.Lexing.pos_bol in let tok = Lexing.lexeme lexbuf in raise (Failure (Printf.sprintf "file \"\", line %d, character %d\n\ Error : Syntax error \"%s\"" line cnum tok )) end in parse_buf (Lexing.from_string s)

Now, the obvious definition for the symmetric `string_of_regexp`

function.

let rec (string_of_regexp : Syntax.regular_expression -> string) = fun re -> match re with | Syntax.Epsilon -> "Epsilon" | Syntax.Character c -> Printf.sprintf "Character '%c'" (Char.chr c) | Syntax.Sequence (p, q) -> Printf.sprintf "Sequence (%s, %s)" (string_of_regexp p) (string_of_regexp q) | Syntax.Alternative (p, q) -> Printf.sprintf "Alternative (%s, %s)" (string_of_regexp p) (string_of_regexp q) | Syntax.Repetition r -> Printf.sprintf "Repetition (%s)" (string_of_regexp r)With these, a quick little program to glimpse what the AST for an expression like $\left(a|b\right)^*abb$ looks like is easy.

Sequence ( Repetition ( Alternative ( Character 'a', Character 'b' ) ), Sequence ( Character 'a', Sequence ( Character 'b', Character 'b' ) ) )