mcoavoux/multilingual_disco_data

Multilingual Discontinuous Data

This repository contains scripts that generate data in the input format of the mtg parser. To process the three corpora, run:

bash generate_english_data.sh
bash generate_tiger_data.sh
bash generate_negra_data.sh

Dependencies:

  • python3
  • java (>= 1.8)
  • discodop
  • treetools (install the Python 2 version, since the Python 3 version seems to have a bug in the transform option)
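
As a quick sanity check before running the generation scripts, the dependencies above can be looked up on the PATH. This is only a minimal sketch: the helper `missing_commands` is hypothetical (not part of this repository), and it assumes that discodop and treetools are installed as executables named `discodop` and `treetools`. It does not check the Java version.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each command name given as an argument
# that cannot be found on the PATH.
missing_commands() {
    local cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            printf '%s\n' "$cmd"
        fi
    done
}

# Assumed executable names for the dependencies listed above.
missing=$(missing_commands python3 java discodop treetools)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing dependencies:" $missing >&2
else
    echo "All dependencies found."
fi
```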

Data required (and not included):

  • English:
  • German (Tiger): corpus_data/GERMAN_SPMRL.tar.gz (SPMRL version of TiGer corpus)
  • German (Negra): corpus_data/negra-corpus.tar.gz
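
Since the corpora are not distributed with the repository, it can help to confirm that the archives are in place before launching a script. The sketch below checks only the two German archives whose paths are given above, relative to the repository root; the helper `missing_files` is hypothetical and not part of this repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each path given as an argument
# that is not an existing regular file.
missing_files() {
    local path
    for path in "$@"; do
        [ -f "$path" ] || printf '%s\n' "$path"
    done
}

# Paths taken from the list above.
missing=$(missing_files \
    corpus_data/GERMAN_SPMRL.tar.gz \
    corpus_data/negra-corpus.tar.gz)
if [ -n "$missing" ]; then
    echo "Missing corpus archives:" >&2
    printf '%s\n' "$missing" >&2
else
    echo "All corpus archives found."
fi
```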

For English, the script uses the Stanford parser to convert the PTB to CoNLL dependency trees.

For the Negra corpus, the script uses a modified version of depsy to convert it to dependency trees (the modification only ensures that depsy does not change the tokenization).
