## Sunday, December 7, 2014

### Parsing regular expressions

Here's a brief refresher on regular expressions derived from Cosineau and Mauny.

Given an alphabet $\mathcal{A}$, a regular expression $e$ over $\mathcal{A}$ is a concise way to describe a set $\mathcal{L}$ of strings of elements of $\mathcal{A}$. The set $\mathcal{L}_{e}$ denoted by an expression $e$ is termed the regular language denoted by $e$.

In their most basic form, regular expressions can be defined like this:

• $\epsilon$ represents the set $\mathcal{L}_{\epsilon} = \left\{\epsilon\right\}$ containing only the empty string
• A symbol $c$ of $\mathcal{A}$ denotes $\mathcal{L}_{c} = \left\{c\right\}$
• $e_{1}e_{2}$ is the set $\mathcal{L}_{e_{1}}\cdot\mathcal{L}_{e_{2}} = \left\{s_{1}s_{2} | s_{1} \in \mathcal{L}_{e_{1}}, s_{2} \in \mathcal{L}_{e_{2}}\right\}$
• $e_{1}|e_{2}$ is the set $\mathcal{L}_{e_{1}}\bigcup \mathcal{L}_{e_{2}}$
• $e^{*}$ is the union $\bigcup_{i=0}^{\infty}\mathcal{L}_{e}^{i}$ where $\mathcal{L}_{e}^{0} = \mathcal{L}_{\epsilon}$ and $\mathcal{L}_{e}^{i} = \mathcal{L}_{e}^{i-1}\cdot\mathcal{L}_{e}$
This last operator $*$ is called the Kleene closure operator and $e^{*}$ equivalent to the infinite regular expression $\epsilon|e|ee|eee|\dots$

Regular expressions provide a convenient syntax for searching plain text data for lines matching specific patterns as demonstrated by the venerable Unix function 'grep' (I was taught grep is an acronym for generalized regular expression pattern-matching). To produce a function like grep requires a program for analyzing a regular expression $e$ and computing a recognizer for $\mathcal{L}_{e}$, that is a function string -> bool, that can be used to classify a string $s$ (say) in terms of whether or not $s \in \mathcal{L}_{e}$. This procedure of computing a recognizer for $\mathcal{L}_{e}$ from $e$ is called compiling a regular expression.

Writing a regular expression compiler is not for the faint-of-heart! Doing so requires more work than can be reasonably described in one blog-post so here, I'll focus only on the first part - parsing a regular expression. That is, given a string $s$ a program to produce an abstract syntax tree describing the regular expression $e$ represented by $s$.

To begin, we define the type of regular expressions.

type regular_expression =
| Epsilon
| Character of int
| Sequence of regular_expression * regular_expression
| Alternative of regular_expression * regular_expression
| Repetition of regular_expression

We could proceed from here by hand-crafting a recursive descent parser or we might take advantage of a parser generator system like ocamllex/ocamlyacc which is the choice I favored. On reaching that decision it occurred to me that, since ocamllex itself treats regular expressions as inputs, the source code for ocamllex itself might contain a suitable grammar for the expression type above and indeed it does! Things at this point become marvelously recursive - to build a regular expression compiler we can bootstrap (at both the source and binary levels) from an existing regular expression compiler (which itself is a bootstrap by the way)!

The code that follows is an adaptation of the parser and lexer input files for ocamllex (the copyright of Xavier Leroy).

/*parser.mly*/

%{
open Syntax
%}

%token <int> Tchar

%token Tend Tor Tlbracket Trbracket
%token Tstar Tmaybe Tplus Tlparen Trparen

%left Tor
%nonassoc CONCAT
%nonassoc Tmaybe Tstar Tplus
%nonassoc Tchar Tlbracket Tlparen

%start main
%type <Syntax.regular_expression> main
%%

main:
| regexp Tend
{ $1 } ; regexp: | Tchar { Character$1 }
| regexp Tstar
{ Repetition $1 } | regexp Tmaybe { Alternative (Epsilon,$1) }
| regexp Tplus
{ Sequence (Repetition ($1),$1)}
| regexp Tor regexp
{ Alternative ($1,$3) }
| regexp regexp %prec CONCAT
{ Sequence ($1,$2) }
| Tlparen regexp Trparen
{ $2 } ;  Here is the corresponding lexer input file. { open Parser let incr_loc lexbuf delta = let pos = lexbuf.Lexing.lex_curr_p in lexbuf.Lexing.lex_curr_p <- { pos with Lexing.pos_lnum = pos.Lexing.pos_lnum + 1; Lexing.pos_bol = pos.Lexing.pos_cnum - delta; } let char_for_backslash = function | 'n' -> '\010' | 'r' -> '\013' | 'b' -> '\008' | 't' -> '\009' | c -> c } let special_chars = ['|' '*' '?' '+' '(' ')'] let backslash_escapes = ['\\' '\'' '"' 'n' 't' 'b' 'r' ' '] rule main = parse [' ' '\013' '\009' '\012' ] + { main lexbuf } | '\010' { incr_loc lexbuf 0; main lexbuf } | [^ '\\' '|' '*' '?' '+' '(' ')'] { Tchar(Char.code (Lexing.lexeme_char lexbuf 0)) } | '\\' backslash_escapes { Tchar (Char.code (char_for_backslash (Lexing.lexeme_char lexbuf 1))) } | '\\' special_chars { Tchar (Char.code (Lexing.lexeme_char lexbuf 1)) } | '|' { Tor } | '*' { Tstar } | '?' { Tmaybe } | '+' { Tplus } | '(' { Tlparen } | ')' { Trparen } | eof { Tend }  The function to produce an abstract syntax tree from a string then goes like this. let (regexp_of_string : string -> Syntax.regular_expression) = fun s -> let parse_buf lexbuf = try Parser.main Lexer.main lexbuf with | Parsing.Parse_error -> begin let curr = lexbuf.Lexing.lex_curr_p in let line = curr.Lexing.pos_lnum in let cnum = curr.Lexing.pos_cnum - curr.Lexing.pos_bol in let tok = Lexing.lexeme lexbuf in raise (Failure (Printf.sprintf "file \"\", line %d, character %d\n\ Error : Syntax error \"%s\"" line cnum tok )) end in parse_buf (Lexing.from_string s)  Now, the obvious definition for the symmetric string_of_regexp function. let rec (string_of_regexp : Syntax.regular_expression -> string) = fun re -> match re with | Syntax.Epsilon -> "Epsilon" | Syntax.Character c -> Printf.sprintf "Character '%c'" (Char.chr c) | Syntax.Sequence (p, q) -> Printf.sprintf "Sequence (%s, %s)" (string_of_regexp p) (string_of_regexp q) | Syntax.Alternative (p, q) -> Printf.sprintf "Alternative (%s, %s)" (string_of_regexp p) (string_of_regexp q) | Syntax.Repetition r -> Printf.sprintf "Repetition (%s)" (string_of_regexp r)  With these, a quick little program to glimpse what the AST for an expression like$\left(a|b\right)^*abb\$ looks like is easy.
Sequence (
Repetition (
Alternative (
Character 'a',
Character 'b'
)
),
Sequence (
Character 'a',
Sequence (
Character 'b',
Character 'b'
)
)
)