Skip to content

The Tomita GLR parser, originally in the comp.compilers archive, and updates. The project's evolution from 1993-on is being set up here as a timeline.

Notifications You must be signed in to change notification settings

RockBrentwood/Tomita

Repository files navigation

tom -- A Demo for the Tomita Parsing Algorithm

(0) Introduction
This is a free demo for the Tomita Parsing algorithm.  It is released to the public domain free of any restrictions.  You may use it for any purpose as you see fit.

This directory contains C source for the Tomita Natural Language parsing algorithm.  This project has a long history starting in 10/88 based on a study of the Tomita algorithm done in 7/87 and reviewed in 9/88.  The source code, however, was written from scratch between 5/20/93 and 5/23/93.

Possible extensions to this include higher LR(k) parsing ability, incremental parsing table generation, optimization, table-packing, actions/disambiguation rules, lexical rules, etc.  Another possibility is the incorporation of Quantum Parsing.

This implementation of the Tomita Algorithm has not been formally verified with respect to cyclic grammars, but hours of searching for counter-examples has failed to turn up any negative results.

To install this software, make sure you have an ANSI-C compiler.  Edit the Makefile, if necessary, and compile by typing "make".

The following files are included:
                The source code ......... Makefile tom.c
                Documentation ........... tomita.txt README
                The Algorithm ........... tomita.txt
                Sample grammars ......... tom*.in
                Sample parsing tables ... tom*.ex
                Sample input & output ... tom2x.in tom2x.ex
The reference, tomita.txt, may also be found in the comp.compilers archives from May 1993.

The file tom2.ex was created with the command line:
                   tom -c tom2.in >tom2.ex
and typing an empty line to terminate parsing; similarly for the other files tom*.ex.  The file "tom2x.ex" was created with the command line:
                   tom -s tom2.in <tom2x.in >tom2x.ex

(1) Command line arguments
The command line has the following format:
                      tom -c -s grammar
The listed file, "grammar", should contain a grammar listed in a form suitable for processing by the parser generator.  The options, -s and -c, if you use them, will have the following effects:
         -c Writes the LR(0) parsing tables for the grammar file to standard output.
         -s Displays the parsing stack of each parse to standard output.
Parses are accepted from standard input and the results are displayed to standard output.

Example:
                      tom -s tom2.in <tom2x.in >tom2x.ex
will process the grammar listed in "tom2.in", read input from "tom2x.in" and produce a listing of the parse results and parse stacks to the file "tom2x.ex".

(2) Grammars
The file tom2.in contains a small demo grammar.  In it, you will see rules of the following types:
             (1) Productions, e.g., NP = d n | NP PP.
             (2) Lexical Classificatiom Rules, e.g., p: in with.
             (3) Start Symbol Declaration, * S.
Also, as a special case of (2), one can list a rule such as:
                       d.
which is equated to the lexical rule:
                     d: "d".
Items may be enclosed in double quotes, and MUST be if they consist of any symbols other than the alphanumeric symbols, or if they start in a digit.

A Start Symbol can only be declared once, and if it is not declared then the symbol on the left-hand side of the first rule in the grammar file will be treated as the start symbol.

The entries on the right hand side of lexical classification rules are treated as literals.  These are the actual items that the parser will expect to see when parsing an input.  They are distinct from the other parsing symbols so that for instance if one has a rule such as:
                 with = with1 | with2.
this "with" is treated as a distinct item.

To avoid confusion, avoid using the same names for lexical entries and parsing symbols.

(3) Input Format
Input is assumed to consist of a list of items separated by spaces, all in the same line.  Each line is treated as a separate entry for the parser.  If you are using the parser in the interactive mode, then typing an empty line will end the program.

Sample files are included with this program.

(4) Output Format
(a) Parsing Tables
With the -c option, the LR(0) parsing table will be listed in standard output before parsing the input.  The reason for this option is that the parsing stack makes reference to parsing states, so this listing is provided as a cross-reference.  The parsing table for the file tom2.in is listed in tom2.ex.  The listing consists of the following:

State:
	list of items
        ...
For example, the start state (state 0) in tom2.ex is listed as:
0:
	S => 1
	NP => 2
	d => 3
Items can be of any of the following forms examplified by excerpts from tom2.ex:
      (i) ACCEPT
      1:
	      accept
This indicates that state 1 is the accepting state of the LR(0) parser.  The accepting state will always be state 1.
      (ii) SHIFTS and GOTOS
      3:
	      n => 8
This example indicates a shift (or goto) on item "n" from state 3 to state 8.  The difference between shifts and gotos in LR(k) parsing has no bearing on the software here, so they are listed in a common format.
      (iii) REDUCTIONS
      4:
	      [NP -> NP PP]
This indicates that in state 4, a reduction action may be applied whereby the last two items on the parsing stack (which will always be nodes of types PP and NP) are to be combined into a new node of type NP.  A goto on the item NP is then executed from the state located two levels back in the stack to a new state.

(b) Parsing Stack
The output corresponding to the file "tom2x.in" (processed using grammar tom2.in), is listed in "tom2x.ex".  Included in this listing is the following listing of the parsing stack:

Parse Stack:
 v_0_0
 v_1_3 <=	 [ d_0_1 ] <= v_0_0
 v_2_8 <=	 [ n_1_2 ] <= v_1_3
 v_2_2 <=	 [ NP_0_2 ] <= v_0_0
 v_3_7 <=	 [ v_2_3 ] <= v_2_2
 v_4_3 <=	 [ d_3_4 ] <= v_3_7
 v_5_8 <=	 [ n_4_5 ] <= v_4_3
 v_5_11 <=	 [ NP_3_5 ] <= v_3_7
 v_5_5 <=	 [ VP_2_5 ] <= v_2_2
 v_5_1 <=	 [ S_0_5 ] <= v_0_0

The stack is actually a graph consisting of nodes labeled v_N_S, and with edges labeled by nodes of the form [ X_M_N ].  All connections are of the form:
                 v_N_T  <=	 [ X_M_N ] <= v_M_S1  v_M_S2 ...
and the following properties will always hold:
              (a) SHIFT(S1, X) = SHIFT(S2, X) = ... = T
              (b) M <= N.
The meaning of the labels are as follows:
                  v_N_S: N = the parsing position in the input.
                         S = the parsing state.
                  X_M_N: X = the parsing item.
                         M, N = the start and end positions of the item in the input.
If the input is labeled as follows:
0   1   2   3   4    5
 the boy saw the girl
then the edge:
                     v_5_11 <=	 [ NP_3_5 ] <= v_3_7
indicates that the parser went from state 7 at position 3 to state 11 at position 5 by shifting an item of type NP parsed from the items: "the girl".  More than one edge may be labeled with the same node.

(b) Parse Forest
Also in the file "tom2x.ex" is the parse forest corresponding to the input:
                          the boy saw the girl
processed with grammar tom2.in.  This forest is listed in equational form as a grammar which can even be input to the parser as a grammar, itself.  This grammar is characterized as the largest sub-grammar of the input grammar that generates the one element set { the input }.  In fact, if this grammar is used to parse the same input, the result will be the same parse forest and parse stack, except for relabeling the nodes and vertices.

What this illustrates, among other things, is that this algorithm is actually a special case of a more general algorithm that computes the intersection of a context free grammar with a regular grammar.

About

The Tomita GLR parser, originally in the comp.compilers archive, and updates. The project's evolution from 1993-on is being set up here as a timeline.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published