SBP: the Scannerless Boolean Parser
16-Aug: A new snapshot is here.
Update: [29-July-2006]
Error handling has been massively improved. Here's an example parsing
from a substantial portion of the Java 1.5
grammar. The input is missing a
closing angle-bracket on a generic type definition. Click on the
image to view full size. Type
make java15 after a checkout to try it yourself.
|
Update: [05-July-2006]
The reflective grammar-to-java bindings are complete, so SBP is now
vastly easier to use. You can find example code here
and the companion grammar here.
|
What is it?
The Scannerless Boolean Parser (SBP) is a scannerless parser for boolean
grammars (a superset of context-free grammars). It is written in
Java and emits Java source code.
What is interesting about it?
SBP deliberately sacrifices performance in favor of ease of extensibility.
Since it is an implementation of the (modified) Lang-Tomita
GLR algorithm, SBP supports all context-free languages.
It is scannerless
(does not require a lexer). This allows it to easily handle languages
which have non-regular lexical structure or lack a clear lexer-parser
distinction, such as TeX, XML, RFC1738 (URLs), ASN.1, SMTP headers,
and Wiki markup.
In addition to the juxtaposition and union operators provided in
context-free languages, SBP supports grammars which use the
intersection operator (conjunctive
grammars) and the complement operator (boolean
grammars).
What features does it have?
Features fully implemented are in green;
those partially implemented are in orange;
those unimplemented (but planned) are in red.
- An implementation of the Lang-Tomita GLR parsing algorithm
- Including Johnstone & Scott's RNGLR algorithm for epsilon-productions
- Visser's extensions
for scannerless parsing
- Follow, Avoid, Prefer, Reject constraints
- Character ranges
- Automatic insertion of whitespace/comments
- Any topological space can be
used as an alphabet (need not be discrete)
- Associativity constraints on n-ary operators
- Ability to parse a wide variety of grammars in
O(n3) time:
- all context-free grammars
- epsilon productions, included in the parse forest
- circularities, included in the parse forest.
- Regular expression operators (
*,
?,
+
)
- conjunctive grammars
(intersection operator)
- boolean grammars (intersection, intersect-with-complement, and
generalized-complement)
- Facilitates experimenting with grammars
- Interpreted mode, in which the
parse table is interpreted directly, eliminating the
need for a compiler and making it easier for grammars
to operate on grammars.
- Simple
API
makes it easy to generate, analyze, and modify grammars
programmatically.
- Components of a grammar (nonterminals,
productions, etc) represented as objects
- composite elements implement Iterable<T>
- Compiled mode, in which Java
source code is emitted; compiling this code yields a
parser. The resulting parser is much faster.
What is it deliberately missing?
- Semantic actions; the only option is to return a parse forest.
- This keeps the grammar specification language-neutral.
- A grammar can, however, indicate that certain parts of the parse tree should be dropped.
What features would be nice to have?
-
Drop Farshi's algorithm and use GRMLR.
Done!
- An implementation of the McPeak-Necula
optimization for bounded-depth determinism.
- Lazy parse trees, to decrease the space requirements from
o(n) to o(1) [but still O(n)].
- Consider implementing
Aycock-Horspool unrolling. Improves performance with
only highly localized increase in algorithmic complexity.
Subsumes many other optimizations.
What are the long term goals?
As we come to a more mature understanding of the pragmatic aspects of
boolean grammars, a long-term goal is to migrate support for these
features to existing high-performance GLR implementations (Elkhound, bison-glr).
Where can I read more about it?
- The README file is the best place to start
- After that, be sure to read jargon.txt
- The javadoc
is the best description of the API
- There's a tentative metagrammar,
written in itself.
- You can also get slides
from my talk at the OSQ Lunch on 02-Nov-2005, though some of
the stuff (specifically what SBP can and cannot do) is
outdated.
- A preprint of one of my conference
submissions.
Where can I get it?
The color coding above accurately reflects the state of the
implementation (11-Dec-2005). However, in its current state it is a
bit messy, and may require a bit of fiddling to get it to do what you
want. This situation should improve in the next few weeks as I am
done adding features (for now) and am currently focusing on
reliability, cleanliness, and performance.
SBP is available under the BSD license.
You can download a snapshot (11-Dec-2005) here. The parser-generator
requires Java 1.5 or later; the Java code it emits should run on any Java 1.1+ JVM. After unpacking
the archive, simply type make to compile SBP and run the
regression tests.
|