Folks--
This is a cool note from comp.lang.ml; it came out of a question of
"what are higher-order functions good for" and presents a lexer
generator in under a page of code. I thought it was neat, anyway.
best,
wb
--
Will Benton | "The correct method in philosophy would really be
willb@xxxxxxx | this: to say nothing except what can be said...."
To: comp-lang-ml@xxxxxxxxxxxxxxxxxx
Subject: lexer generator in four functions
From: Thant Tessman <thant@xxxxxxx>
Date: Mon, 07 Oct 2002 08:08:53 -0600
Newsgroups: comp.lang.ml
Organization: XMission http://www.xmission.com/
Resent-Date: Mon, 07 Oct 2002 12:29:48 -0400
Resent-From: Leaf Eames Petersen <Leaf_Eames_Petersen@xxxxxxxxxxxxxxxxxxxx>
Resent-To: sml-redistribution@xxxxxxxxxx
Okay, I finally spent a couple of hours this weekend reassembling my
favorite concise example of why higher-order functions are so cool. It
is a lexer generator built, at its core, from only four small functions.
The programmer assembles these functions into regular-expression
pattern-matching functions.
The idea is that a pattern-matcher function takes a list of streams and
returns a new list of streams, advanced in every way the pattern
allows. Note that the number of streams returned by the function
typically won't match the number of streams passed in. If the pattern
doesn't match at all, the empty list is returned. In this
implementation, a stream is simply a tuple containing a list of
characters consumed by the pattern matcher and a list of characters not
yet consumed.
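Concretely, here is what a stream looks like for the input "abc" after
one character has been consumed (the names are mine, for illustration;
as the code below shows, consumed characters end up stored
most-recent-first):

```sml
(* A stream: (characters consumed so far, characters remaining).
   The consumed list is kept in reverse order because matchers
   cons each new character onto its front. *)
val s : char list * char list = ([#"a"], [#"b", #"c"])
```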
I've added a lot of explanatory comments, but it's worth noting that
without them this thing fits on a single page.
(* The first function 'tok' builds a pattern matcher function
that matches a single specified token (character). *)
fun tok t =
    let fun f (s, h::r) = if h = t then SOME (h::s, r) else NONE
          | f _ = NONE
    in List.mapPartial f
    end
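For instance, applying a single-token matcher to a fresh stream for
"abc" (assuming the definition of 'tok' above is in scope; the
bindings s0, s1, s2 are just for illustration):

```sml
val s0 = [([], String.explode "abc")]

(* tok #"a" matches the leading 'a' and advances the stream. *)
val s1 = tok #"a" s0
(* s1 = [([#"a"], [#"b", #"c"])] *)

(* No match at all: the empty list comes back. *)
val s2 = tok #"x" s0
(* s2 = [] *)
```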
(* This matches a sequence of patterns. *)
fun seq rules streams =
    foldl (fn (f, s) => f s) streams rules
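A quick check of 'seq' (assuming 'tok' and 'seq' as above; note how
the consumed characters accumulate in reverse order):

```sml
val r = seq [tok #"a", tok #"b"] [([], String.explode "abc")]
(* r = [([#"b", #"a"], [#"c"])] *)
```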
(* This matches any of a list of patterns. It's analogous to
a series of patterns separated by the '|' in traditional
regular expressions. *)
fun bar rules streams =
    List.concat (map (fn f => f streams) rules)
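'bar' applies each alternative to the same input streams and merges
whatever survives (again assuming the definitions above):

```sml
val r = bar [tok #"a", tok #"b"] [([], String.explode "abc")]
(* Only tok #"a" matches here, so r = [([#"a"], [#"b", #"c"])] *)
```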
(* Kleene closure. Analogous to '*' *)
fun star rule streams =
    let fun loop streams =
            case (rule streams) of
                [] => []
              | results => results :: (loop results)
    in List.concat (streams :: (loop streams))
    end
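'star' is where "every combination" shows up: zero, one, and two
repetitions of the pattern all survive as separate streams (assuming
the definitions above):

```sml
val r = star (tok #"a") [([], String.explode "aab")]
(* r = [([], [#"a", #"a", #"b"]),
        ([#"a"], [#"a", #"b"]),
        ([#"a", #"a"], [#"b"])] *)
```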
(* The rest of these are built from the previous four and
are provided for convenience. *)
(* Positive closure. Analogous to '+' *)
fun pos rule = seq [rule, star rule]
(* Optional pattern. Analogous to '?' *)
fun opt rule = bar [rule, fn x => x]
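'opt' keeps the original stream alive even when the pattern fails,
which is exactly what '?' should do (assuming the definitions above):

```sml
val r = opt (tok #"-") [([], String.explode "5")]
(* tok #"-" fails, but the identity alternative passes the
   stream through untouched: r = [([], [#"5"])] *)
```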
(* Range of characters. Analogous to the dash within
a character class '[]' *)
fun range (a, b) =
    if b < a then range (b, a)
    else
        let fun loop i = if i <= Char.ord b
                         then (tok (Char.chr i)) :: loop (i + 1)
                         else []
        in bar (loop (Char.ord a))
        end
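'range' expands into a 'bar' over one 'tok' per character in the range
(assuming the definitions above):

```sml
val r = range (#"0", #"9") [([], String.explode "42")]
(* Only tok #"4" matches, so r = [([#"4"], [#"2"])] *)
```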
(* Matches a literal string specified by 's' *)
fun lit s = seq (map tok (String.explode s))
(* Matches any of a set of characters. *)
fun set s = bar (map tok (String.explode s))
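And the two string-driven combinators in action (assuming the
definitions above):

```sml
val r1 = lit "ab" [([], String.explode "abc")]
(* r1 = [([#"b", #"a"], [#"c"])] *)

val r2 = set "xyz" [([], String.explode "yes")]
(* Only tok #"y" matches: r2 = [([#"y"], [#"e", #"s"])] *)
```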
(* The next two functions are for demonstrating the use of
the above functions. This first function takes the resulting
streams produced by the application of a pattern on a stream
(or streams) and selects the longest match. *)
fun longest streams =
    let
        fun loop [] = (0, ([], []))
          | loop ((eaten, food)::rest) =
            let
                val (max, stream) = loop rest
                val l = List.length eaten
            in
                if l < max then (max, stream)
                else (l, (List.rev eaten, food))
            end
        val (_, stream) = loop streams
    in
        stream
    end
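Note that 'longest' also reverses the consumed list back into reading
order. A small check (assuming the definition above):

```sml
val r = longest [([#"a"], [#"b", #"c"]), ([#"b", #"a"], [#"c"])]
(* The two-character match wins: r = ([#"a", #"b"], [#"c"]) *)
```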
(* This takes a rule and a string, turns the string into
a list of streams (containing one stream), applies the
rule, and returns the longest match. *)
fun lex rule s = longest (rule [([],String.explode s)])
(* A demonstration of the above. Here's a pattern to match
floating point numbers. *)
(* "-"?(([0-9]+(\.[0-9]+)?)|(\.[0-9]+))([eE][+-]?[0-9]+)? *)
val digit = range (#"0",#"9")
val digits = (pos digit)
val mantissa = seq [tok #".", digits]
val exp = seq [set "eE", opt (set "+-"), digits]
val real = seq [opt (tok #"-"),
                seq [bar [seq [digits, opt mantissa], mantissa],
                     opt exp]]
With the above defined, you can do this:
- lex real ".45e-5";
val it = ([#".",#"4",#"5",#"e",#"-",#"5"],[]) : char list * char list
- lex real "hi there";
val it = ([],[]) : char list * char list
And so on...
I think a full-fledged lexer generator in a single page of code is a
good example of the power of functional programming over imperative/OO.
-thant