doc/sbp.html

   1 <html>
   2 <head><title>The Scannerless Boolean Parser (SBP)</title>
   3 <style>
   4
   5     H1 {
   6         margin-left: -20px;
   7         margin-top: 30px;
   8         font-family: helvetica, verdana, arial, sans-serif;
   9         font-size: 14pt;
  10         font-weight: bold;
  11         text-align: left;
  12         width : 100%;
  13         border-top-width: 2pt;
  14         border-top-style: solid;
  15     }
  16
  17     H2 {
  18         font-family: helvetica, verdana, arial, sans-serif;
  19         font-size: 12pt;
  20         font-weight: bold;
  21     }
  22
  23     H3 {
  24         margin-left: -10px;
  25         font-family: helvetica, verdana, arial, sans-serif;
  26         font-size: 12pt;
  27         font-weight: bold;
  28     }
  29
  30     TH, TD, P, LI {
  31         font-family: helvetica, verdana, arial, sans-serif;
  32         font-size: 13px;
  33         text-decoration:none;
  34     }
  35
  36     LI { margin-top: 5px; }
  37
  38 </style>
  39 </head>
  40 <body>
  41 <center><table><tr><td width=600>
  42
  43 <center>
  44 <font style='font-size:24pt; font-family:helvetica, verdana, arial, sans-serif'>
  45 <b>SBP: the Scannerless Boolean Parser</b></font>
  46 </center>
  47
  48 <h1>What is it?</h1>
  49
  50 The Scannerless Boolean Parser (SBP) is a scannerless parser for <a
  51 href=http://www.cs.queensu.ca/home/okhotin/boolean/>boolean
  52 grammars</a> (a superset of context-free grammars).  It is written in
  53 Java and emits Java source code.
  54
  55 <h1>What is interesting about it?</h1>
  56
  57 SBP deliberately sacrifices performance in favor of ease of extensibility.
  58 <p>
  59
  60 Since it is an implementation of the (modified) <a
  61 href=http://www.program-transformation.org/Sdf/GeneralizedLR>Lang-Tomita
  62 GLR algorithm</a>, SBP supports all context-free languages.
  63 <p>
  64
  65 It is <a
  66 href=http://en.wikipedia.org/wiki/Lexerless_parsing>scannerless</a>
  67 (does not require a lexer).  This allows it to easily handle languages
  68 which have non-regular lexical structure or lack a clear lexer-parser
  69 distinction, such as TeX, XML, RFC1738 (URLs), ASN.1, SMTP headers,
  70 and Wiki markup.
  71 <p>
  72
   73 In addition to the juxtaposition and union operators provided by
   74 context-free grammars, SBP supports grammars which use the
  75 intersection operator (<a
  76 href=http://www.cs.queensu.ca/home/okhotin/conjunctive/>conjunctive
  77 grammars</a>) and the complement operator (<a
  78 href=http://www.cs.queensu.ca/home/okhotin/boolean/>boolean
  79 grammars</a>).
  80
  81 <h1>What features does it have?</h1>
  82
  83 Features fully implemented are in <font color=green>green</font>;
  84 those partially implemented are in <font color=orange>orange</font>;
  85 those unimplemented (but planned) are in <font color=red>red</font>.
  86
  87 <ul> <li> <b>An implementation of the Lang-Tomita GLR parsing algorithm</b>
  88      <ul>
   89           <li> Including <font color=green>Johnstone &amp; Scott's RNGLR algorithm</font> for epsilon-productions
  90
  91           <li> <a href=http://citeseer.ist.psu.edu/vandenbrand02disambiguation.html><font color=green>Visser's</font> extensions</a>
  92                for <font color=green>scannerless parsing</font>
  93                <ul> <li> <font color=green>Follow</font>, <font color=green>Avoid, Prefer</font>, <font color=green>Reject</font> constraints
  94                     <li> <font color=green>Character ranges</font>
  95                     <li> Automatic insertion of <font color=green>whitespace/comments</font>
  96                </ul>
  97
  98           <li> <font color=green>Any topological space</font> can be
  99                used as an alphabet (need not be discrete)
 100           <ul> <li> <font color=green>Unicode</font>
 101                <li> <font color=orange>Trees</font>
 102           </ul>
 103
 104           <li> <font color=green>Associativity constraints</font> on <font color=green><i>n</i>-ary operators</font>
 105
 106      </ul>
 107
 108      <li> <b>Ability to parse a wide variety of grammars in
 109           </b> O(n<sup>3</sup>) time:
 110
 111      <ul>
 112           <li> <font color=green>all context-free grammars</font>
 113
 114           <li> <font color=green>epsilon productions</font>, <font
 115                color=green>included in the parse forest</font>
 116
 117           <li> <font color=green>circularities</font>, <font
  118                color=red>included in the parse forest</font>
 119
 120           <li> Regular expression operators (
 121                <tt><font color=green>*</font></tt>,
 122                <tt><font color=green>?</font></tt>,
 123                <tt><font color=green>+</font></tt>
 124                )
 125
 126           <li> <font color=green>conjunctive grammars</font>
 127                (<font color=green>intersection</font> operator)
 128
 129           <li> <font color=orange>boolean grammars</font> (<font
 130                color=green>intersection</font>, <font
 131                color=green>intersect-with-complement</font>, and
 132                <font color=orange>generalized-complement</font>)
 133      </ul>
 134
 135
 136      <li> <b>Facilitates experimenting with grammars</b>
 137
 138      <ul>
 139           <li> <font color=green>Interpreted mode</font>, in which the
 140                parse table is interpreted directly, eliminating the
 141                need for a compiler and making it easier for grammars
 142                to operate on grammars.
 143
 144           <li> <font color=green>Simple
 145                <a href=api/edu/berkeley/sbp/package-summary.html>API</a></font>
 146                makes it easy to generate, analyze, and modify grammars
 147                programmatically.
 148
 149           <ul>
 150               <li> Components of a grammar (nonterminals,
  151                    productions, etc.) <font
  152                    color=green>represented as objects</font>
  153               <li> Composite elements implement <font color=green><tt>Iterable&lt;T&gt;</tt></font>
 154           </ul>
 155
 156           <li> <font color=red>Compiled mode</font>, in which Java
 157                source code is emitted; compiling this code yields a
  158                parser.  The resulting parser is <i>much</i> faster than interpreted mode.
 159      </ul>
 160
 161
 162 </ul>
 163
 164 <h1>What is it deliberately missing?</h1>
 165
 166 <ul> <li> Semantic actions; the only option is to return a parse forest.
 167      <ul>
 168            <li> This keeps the grammar specification language-neutral.
 169            <li> A grammar can, however, indicate that certain parts of the parse tree should be dropped.
 170      </ul>
 171 </ul>
 172
 173 <h1>What features would be nice to have?</h1>
 174
 175 <ul>
 176     <li> <strike>Drop Farshi's algorithm and use <a
 177          href=http://doi.ieeecomputersociety.org/10.1109/HICSS.2002.994495>GRMLR</a></strike>.
 178          <font color=green>Done!</font>
 179
 180     <li> An implementation of the <a
 181          href=http://www.cs.berkeley.edu/~smcpeak/elkhound/sources/elkhound/algorithm.html>McPeak-Necula
 182          optimization</a> for bounded-depth determinism.
 183
 184     <li> Lazy parse trees, to decrease the space requirements from
  185          &Omega;(n) to &Omega;(1) [but still O(n) in the worst case].
 186
 187     <li> Consider implementing <a
 188          href=http://www.cs.uvic.ca/~nigelh/Publications/cc99-paper.pdf>
  189          Aycock-Horspool</a> unrolling.  This improves performance with
  190          only a highly localized increase in algorithmic complexity,
  191          and subsumes many other optimizations.
 192
 193 </ul>
 194
 195 <h1>What are the long term goals?</h1>
 196
 197 As we come to a more mature understanding of the pragmatic aspects of
 198 boolean grammars, a long-term goal is to migrate support for these
 199 features to existing high-performance GLR implementations (<a
 200 href=http://www.cs.berkeley.edu/~smcpeak/elkhound/>Elkhound</a>, <a
 201 href=http://www.delorie.com/gnu/docs/bison/bison_90.html>bison-glr</a>).
 202
 203 <h1>Where can I read more about it?</h1>
 204
 205 <ul> <li> The <a href=../README>README</a> file is the best place to start
 206      <li> After that, be sure to read <a href=jargon.txt>jargon.txt</a>
 207      <li> The <a
 208           href=api/edu/berkeley/sbp/package-summary.html>javadoc</a>
 209           is the best description of the API
 210      <li> There's a <a href=../tests/meta.g>tentative metagrammar</a>,
 211           written in itself.
 212      <li> You can also get <a href=osq.lunch.talk.pdf>slides</a>
 213           from my talk at the OSQ Lunch on 02-Nov-2005, though some of
  214          the material (specifically what SBP can and cannot do) is
 215           outdated.
 216      <li> A <a href=preprint.pdf>preprint</a> of one of my conference
 217           submissions.
 218 </ul>
 219
 220 <h1>Where can I get it?</h1>
 221
  222 The color coding above accurately reflects the state of the
  223 implementation (<font color=green>11-Dec-2005</font>).  However, the code is currently
  224 somewhat messy and may require some fiddling to get it to do what you
  225 want.  This should improve over the next few weeks: I am done adding
  226 features (for now) and am currently focusing on reliability,
  227 cleanliness, and performance.
 228 <p>
 229
 230 SBP is available under the BSD license.
 231 <p>
 232
 233 You can download a snapshot (<font color=green>11-Dec-2005</font>) <a
 234 href=../../sbp/edu.berkeley.sbp.tar.gz>here</a>.  The parser-generator
 235 requires Java 1.5 or later; the Java code it emits <font
 236 color=orange>should run on any Java 1.1+ JVM</font>.  After unpacking
 237 the archive, simply type <tt>make</tt> to compile SBP and run the
 238 regression tests.
 239
 240 </td></tr></table></center>
 241 </body>
 242 </html>