qiuwei/ltagextract

ltagextract

Extract a lexicalized tree-adjoining grammar (LTAG) from a treebank.

Introduction

This project extracts Tree Adjoining Grammars (TAGs), aligned with their semantics, from the KBGen corpus.

Software dependencies

Howto

To reproduce our current results, either run bin/run.sh or follow the pipeline below step by step:

  1. Handle the conjunctions that occur in the syntactic trees.
  2. Parse the sentences with the Stanford parser. We use the unlexicalized parser with head information in its output.
  3. Normalize the syntactic trees produced in step 2.
  4. Extract the TAG from the output of step 3.
  5. Assign semantics to the output of step 4.
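
The pipeline above can also be sketched as a small Python driver. This is only an illustration assembled from the commands listed in the step sections of this README; it is not the contents of bin/run.sh, and the names PIPELINE and run_pipeline are hypothetical. By default it only returns the commands it would run.

```python
import os
import subprocess

# Commands copied from the step sections of this README.
# Steps 4 and 5 are covered by a single extractor invocation.
PIPELINE = [
    ("aggregate coordinations",
     ["java", "-jar", "bin/aggregation-0.1.1-SNAPSHOT-standalone.jar",
      "input/triples/", "output/aggregated/"]),
    ("parse with the Stanford parser",
     ["bin/parse.sh", "input/sentences/", "output/parsed/"]),
    ("normalize trees",
     ["java", "-jar", "bin/grook-0.1.0-SNAPSHOT-standalone.jar",
      "output/parsed/", "output/fixed/"]),
    ("extract TAG with semantics",
     ["python2", "bin/extract/extractor.py",
      "output/fixed/", "input/alignments/", "output/final.gram",
      "--verbose", "output/grammar-verbose/"]),
]


def run_pipeline(dry_run=True):
    """Return the command lines in order; execute them unless dry_run is set."""
    # The extractor expects the bundled NLTK on PYTHONPATH (see steps 4 and 5).
    env = dict(os.environ)
    env["PYTHONPATH"] = "utilities/nltk-2.0.4/" + os.pathsep + env.get("PYTHONPATH", "")
    lines = []
    for _name, cmd in PIPELINE:
        lines.append(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True, env=env)
    return lines


for line in run_pipeline(dry_run=True):
    print(line)
```

Calling run_pipeline(dry_run=False) from the repository root would execute the steps in order, assuming Java, python2 and the Stanford parser are set up as the individual steps require.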

Step 1

To aggregate the coordinations, run

java -jar bin/aggregation-0.1.1-SNAPSHOT-standalone.jar \
  input/triples/ output/aggregated/

Step 2

To parse the corpus using the Stanford parser, run

bin/parse.sh input/sentences/ output/parsed/

Step 3

To normalize the syntactic trees, run

java -jar bin/grook-0.1.0-SNAPSHOT-standalone.jar \
  output/parsed/ output/fixed/

Steps 4 and 5

To extract the TAG with aligned semantics, run

PYTHONPATH="utilities/nltk-2.0.4/:$PYTHONPATH" python2 bin/extract/extractor.py \
  output/fixed/ input/alignments/ output/final.gram \
  --verbose output/grammar-verbose/

For more details, see the extractor's help message:

python2 extractor.py -h
usage: extractor.py [-h] [--verbose VERBOSE] corpus alignment [outfile]

positional arguments:
  corpus             corpus path which should be a directroy
  alignment          alignment path which should be a directory
  outfile            outputfile for extracted grammar

optional arguments:
  -h, --help         show this help message and exit
  --verbose VERBOSE  output raw gammar extracted for each sentence. This
                     parameter should be a directory
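
To make the argument layout concrete, here is a minimal argparse sketch that accepts the same arguments as the usage line above. It is a reconstruction for illustration only, not the source of extractor.py; the function name build_parser and the None default for a missing outfile are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented extractor.py interface."""
    parser = argparse.ArgumentParser(prog="extractor.py")
    parser.add_argument("corpus",
                        help="corpus path, which should be a directory")
    parser.add_argument("alignment",
                        help="alignment path, which should be a directory")
    # nargs="?" makes the positional optional, shown as [outfile] in the usage.
    parser.add_argument("outfile", nargs="?",
                        help="output file for the extracted grammar")
    parser.add_argument("--verbose",
                        help="directory for the raw grammar of each sentence")
    return parser


# Parse the same arguments as the command shown for steps 4 and 5.
args = build_parser().parse_args([
    "output/fixed/", "input/alignments/", "output/final.gram",
    "--verbose", "output/grammar-verbose/",
])
print(args.corpus, args.alignment, args.outfile, args.verbose)
```

In this sketch, outfile and --verbose may both be omitted, which matches the square brackets in the usage line.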

Other

We also provide a small tool for visualizing the TAG extracted in step 4 or step 5. To see its options, run

python2 grammarviewer.py -h
usage: grammarviewer.py [-h] [filename]

Draw the tree according to grammar file

positional arguments:
  filename    The name of grammar file, stdin will be used if left open

optional arguments:
  -h, --help  show this help message and exit

As a by-product, the package provides an s-expression parser for Python. You can use it to reconstruct an NLTK ParentedTree from the plain-text representation of a TAG.
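
For readers unfamiliar with s-expressions, the following is a minimal, self-contained sketch of how a bracketed tree string can be parsed into nested Python lists. It is not the parser shipped with this package: the function name parse_sexpr is hypothetical, and the sample tree is a made-up example rather than the project's grammar file format.

```python
import re

# A token is an opening bracket, a closing bracket, or a run of other characters.
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def parse_sexpr(text):
    """Parse a single s-expression into nested lists of strings."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            # Collect children until the matching closing bracket.
            while pos < len(tokens) and tokens[pos] != ")":
                items.append(read())
            if pos >= len(tokens):
                raise ValueError("missing closing parenthesis")
            pos += 1  # consume ")"
            return items
        if tok == ")":
            raise ValueError("unexpected closing parenthesis")
        return tok  # a plain atom such as a label or a word

    tree = read()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the first expression")
    return tree


tree = parse_sexpr("(S (NP (NN cell)) (VP (VBZ divides)))")
print(tree)
```

Each list holds a node label followed by its children, so a structure like this can be walked recursively to build whatever tree class is needed.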

Description of the files

  • ./bin contains all runnable programs and scripts
  • ./src contains all the source code
  • ./output contains the intermediate results generated by the programs
  • ./input contains the original corpus and the annotated data
    • ./input/alignment contains our annotation result
    • ./input/heads-fixed
    • ./input/aggregation
  • ./report contains our report
