SBP: the Scannerless Boolean Parser

16-Aug: A new snapshot is here.

Update: [29-July-2006]

Error handling has been massively improved. Here's an example parsing from a substantial portion of the Java 1.5 grammar. The input is missing a closing angle-bracket on a generic type definition. Click on the image to view full size. Type make java15 after a checkout to try it yourself.

Update: [22-July-2006]

The API has been finalized and includes a decent example/mini-tutorial.

Update: [17-July-2006]

There is now a mailing list.

Update: [05-July-2006]

The reflective grammar-to-java bindings are complete, so SBP is now vastly easier to use. You can find example code here and the companion grammar here.

What is it?

The Scannerless Boolean Parser (SBP) is a scannerless parser for boolean grammars (a superset of context-free grammars). It is written in Java and emits Java source code.

What is interesting about it?

SBP deliberately sacrifices performance in favor of ease of extensibility.

Since it is an implementation of the (modified) Lang-Tomita GLR algorithm, SBP supports all context-free languages.

It is scannerless (does not require a lexer). This allows it to easily handle languages which have non-regular lexical structure or lack a clear lexer-parser distinction, such as TeX, XML, RFC1738 (URLs), ASN.1, SMTP headers, and Wiki markup.

In addition to the juxtaposition and union operators provided in context-free languages, SBP supports grammars which use the intersection operator (conjunctive grammars) and the complement operator (boolean grammars).

What features does it have?

Features fully implemented are in green; those partially implemented are in orange; those unimplemented (but planned) are in red.

An implementation of the Lang-Tomita GLR parsing algorithm
- Including Johnstone & Scott's RNGLR algorithm for epsilon-productions
- Visser's extensions for scannerless parsing
  - Follow, Avoid, Prefer, Reject constraints
  - Character ranges
  - Automatic insertion of whitespace/comments
- Any topological space can be used as an alphabet (need not be discrete)
  - Unicode
  - Trees
- Associativity constraints on n-ary operators
Ability to parse a wide variety of grammars in O(n³) time:
- all context-free grammars
- epsilon productions, included in the parse forest
- circularities, included in the parse forest.
- Regular expression operators ( *, ?, + )
- conjunctive grammars (intersection operator)
- boolean grammars (intersection, intersect-with-complement, and generalized-complement)
Facilitates experimenting with grammars
- Interpreted mode, in which the parse table is interpreted directly, eliminating the need for a compiler and making it easier for grammars to operate on grammars.
- Simple API makes it easy to generate, analyze, and modify grammars programmatically.
  - Components of a grammar (nonterminals, productions, etc) represented as objects
  - composite elements implement Iterable<T>
- Compiled mode, in which Java source code is emitted; compiling this code yields a parser. The resulting parser is much faster.

What is it deliberately missing?

Semantic actions; the only option is to return a parse forest.
- This keeps the grammar specification language-neutral.
- A grammar can, however, indicate that certain parts of the parse tree should be dropped.

What features would be nice to have?

~~Drop Farshi's algorithm and use GRMLR~~. Done!
An implementation of the McPeak-Necula optimization for bounded-depth determinism.
Lazy parse trees, to decrease the space requirements from o(n) to o(1) [but still O(n)].
Consider implementing Aycock-Horspool unrolling. Improves performance with only highly localized increase in algorithmic complexity. Subsumes many other optimizations.

What are the long term goals?

As we come to a more mature understanding of the pragmatic aspects of boolean grammars, a long-term goal is to migrate support for these features to existing high-performance GLR implementations (Elkhound, bison-glr).

Where can I read more about it?

The README file is the best place to start
After that, be sure to read jargon.txt
The javadoc is the best description of the API
There's a tentative metagrammar, written in itself.
You can also get slides from my talk at the OSQ Lunch on 02-Nov-2005, though some of the stuff (specifically what SBP can and cannot do) is outdated.
A preprint of one of my conference submissions.

Where can I get it?

The color coding above accurately reflects the state of the implementation (11-Dec-2005). However, in its current state it is a bit messy, and may require a bit of fiddling to get it to do what you want. This situation should improve in the next few weeks as I am done adding features (for now) and am currently focusing on reliability, cleanliness, and performance.

SBP is available under the BSD license.

You can download a snapshot (11-Dec-2005) here. The parser-generator requires Java 1.5 or later; the Java code it emits should run on any Java 1.1+ JVM. After unpacking the archive, simply type make to compile SBP and run the regression tests.