This repository has been archived by the owner on May 28, 2024. It is now read-only.



RobertNLP Enhanced UD Parser


This repository contains the companion code for RobertNLP, the Enhanced Universal Dependencies parser described in the paper

Stefan Grünewald and Annemarie Friedrich (2020): RobertNLP at the IWPT 2020 Shared Task: Surprisingly Simple Enhanced UD Parsing for English. Proceedings of IWPT 2020.

The code allows users to reproduce and extend the results reported in the study. Please cite the above paper when using our code, and direct any questions or feedback regarding our parser to Stefan Grünewald.

Disclaimer: Purpose of the Project

This software is a research prototype, solely developed for and published as part of the publication cited above. It will neither be maintained nor monitored in any way.

Parser architecture

Project Structure

The repository has the following directory structure:

    configs/                            Example configuration files for the parser
    data/
        corpora/                        Folder to put training/validation UD corpora
            ewt/                        Folder for EWT corpus files
                vocab/                  Vocabulary files for EWT corpus
                Script for downloading EWT corpus files
        Script for replacing lexical material in dependency labels with placeholders in an English UD corpus
        pretrained-embeddings/          Folder for pre-trained word embeddings
        Script for downloading pre-trained word embeddings
        saved_models/                   Folder for saved models
            Script for downloading trained parser models (forthcoming)
    src/
        data_handling/                  Code for processing dependency-annotated sentence data
        logger/                         Code for logging (--> boring)
        models/                         Actual model code (parser, classifier)
            embeddings/                 Code for handling contextualized word embeddings
        pre_post_processing/            Dependency graph pre-/post-processing (as used in IWPT 2020 Shared Task)
        trainer/                        Training logic
        Initialization of model, trainer, and data loaders
        Main script for parsing corpora using a trained model
        Main script for training models
    environment.yml                     Conda environment file for RobertNLP
    Script for parsing from raw text with heuristic post-processing (as used in Shared Task)

Using the Parser


RobertNLP itself requires the following dependencies:

If you also want to perform the pre- and post-processing steps outlined in our paper, you will additionally need the following dependencies:

With the exception of UDify, you can install all of the above dependencies easily using Conda and the environment.yml file provided by us:

conda env create -f environment.yml
conda activate robertnlp

For UDify, clone their repository and download their trained model (udify-model.tar.gz), each into a location of your choice. Then see the section "Reproducing Full IWPT Approach" below for how to integrate the system into the parsing pipeline.

Downloading Training Data and Pre-trained Embeddings

For simplicity, we have added scripts for downloading training/validation corpora as well as the pre-trained language models used in our system.

For the training/validation data (EWT corpus), use:

cd data/corpora/ewt/

For the pre-trained language models, use:

cd data/pretrained-embeddings/

Note: Since the model files are quite large, the downloads might take a long time depending on your internet connection. You may want to edit the download scripts in order to download only the particular models you are actually interested in (see comments in the respective scripts).

Training Your Own Model

To train your own parser model, run python src/ [CONFIG_PATH].
Example: python src/ configs/roberta-base-768d.json.

If you are using one of the configuration files provided in this repository, the model checkpoints will be written to saved_models/models/[MODEL_NAME]/[TIMESTAMP]/checkpoint_epochX.pth (where X is the epoch number). The same folder will also contain the config.json that can then be used to load the model.
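
When a run has written several checkpoints, a small helper can pick the one from the latest epoch. The following is a minimal sketch that assumes only the checkpoint_epochX.pth naming described above; the function name latest_checkpoint and the choice of "highest epoch number" as the selection rule are illustrative assumptions, not part of this repository.

```python
import re
from pathlib import Path

# Matches the checkpoint naming described above, e.g. checkpoint_epoch12.pth
CHECKPOINT_RE = re.compile(r"checkpoint_epoch(\d+)\.pth")


def latest_checkpoint(run_dir):
    """Return the checkpoint with the highest epoch number in run_dir, or None.

    Hypothetical helper: run_dir is a [TIMESTAMP] folder of a training run.
    """
    best_epoch, best_path = -1, None
    for path in Path(run_dir).iterdir():
        match = CHECKPOINT_RE.fullmatch(path.name)
        if match is None:
            continue  # skip config.json and any other files
        epoch = int(match.group(1))  # compare numerically, so 10 > 9
        if epoch > best_epoch:
            best_epoch, best_path = epoch, path
    return best_path
```

Comparing the epoch as an integer rather than sorting file names avoids ranking checkpoint_epoch9.pth above checkpoint_epoch10.pth.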

Parsing a Pre-processed Corpus

To parse a corpus of (segmented, pre-tokenized) sentences, run one of the following commands from the root directory of this repository:

  1. If your corpus is in text format (whitespace-tokenized, one sentence per line): python src/ [CONFIG_PATH] [SAVED_MODEL_PATH] [CORPUS_PATH]
  2. If your corpus is in CoNLL-U format: python src/ --conllu [CONFIG_PATH] [SAVED_MODEL_CHECKPOINT_PATH] [CORPUS_PATH]

Parsed sentences will be written (in CoNLL-U format) to stdout.

Example: If you've trained a model based on the roberta-large-1024d.json config and want to parse the development section of the EWT corpus using it, run:

python src/ --conllu data/saved_models/models/RoBERTa_large_1024d/[TIMESTAMP]/config.json

Note 1: Only the enhanced dependencies column will contain actual parser output. The basic dependency columns will be filled with placeholders, and the other columns (e.g. POS, lemma) will be left blank.
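
Since only the enhanced dependencies column carries parser output, downstream code will usually read just that column. The sketch below assumes the standard CoNLL-U layout (ten tab-separated columns, with enhanced dependencies in the ninth column, DEPS, written as head:relation pairs joined by "|"); the function name read_enhanced_deps is a hypothetical helper and not part of this repository.

```python
def read_enhanced_deps(conllu_lines):
    """Yield each sentence as a list of (token_id, form, [(head, relation), ...]).

    Hypothetical helper. IDs and heads are kept as strings because
    CoNLL-U allows empty-node IDs such as "8.1".
    """
    sentence = []
    for line in conllu_lines:
        line = line.rstrip("\n")
        if not line:  # a blank line ends the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):  # sentence-level comment line
            continue
        cols = line.split("\t")
        token_id, form, deps = cols[0], cols[1], cols[8]
        edges = []
        if deps != "_":
            for item in deps.split("|"):
                # Split on the first colon only: relations may contain
                # colons themselves, e.g. "nsubj:xsubj".
                head, relation = item.split(":", 1)
                edges.append((head, relation))
        sentence.append((token_id, form, edges))
    if sentence:  # input may lack a trailing blank line
        yield sentence
```

Parser output redirected from stdout into a file can be passed in directly as an open file object, since the function only iterates over lines.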

Note 2: Parsing corpora with this script will not perform any pre-processing (tokenization and sentence segmentation); the script expects tokenized and segmented text as input. Neither will it perform the heuristic post-processing "graph repair" steps described in our paper, which rely on the output of an external parser. A small number of the resulting dependency graphs may therefore be structurally invalid (i.e., contain nodes which cannot be reached from the root of the sentence). To reproduce our post-processing steps for the IWPT test data, please refer to the corresponding section below.
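
To find out which output graphs are structurally invalid in the sense described above, one can check for each sentence whether every node is reachable from the root. The following is a minimal sketch using a breadth-first search; it is not the graph repair procedure from our paper, and the function name unreachable_nodes, the edge representation, and the use of "0" as the artificial root ID (following the CoNLL-U convention) are assumptions made for illustration.

```python
from collections import defaultdict, deque


def unreachable_nodes(edges, node_ids):
    """Return the set of node_ids that cannot be reached from the root "0".

    Hypothetical helper. edges is an iterable of (head, dependent) pairs
    taken from the enhanced dependency graph of one sentence.
    """
    children = defaultdict(list)
    for head, dependent in edges:
        children[head].append(dependent)

    # Breadth-first search starting at the artificial root node.
    seen, queue = {"0"}, deque(["0"])
    while queue:
        for child in children[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return set(node_ids) - seen
```

An empty result means the graph is connected from the root; a non-empty result lists the tokens that would need repair. Nodes that only point at each other in a cycle are correctly reported as unreachable.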

Reproducing Full IWPT Approach

To exactly reproduce the approach used for our official submission to the IWPT 2020 Shared Task (including pre-processing and post-processing), follow these steps:

  • Train a model using the roberta-large-1024d.json config.
  • Open the raw-text parsing script in the main directory of this repository and set the values of UDIFY_PATH and UDIFY_ARCHIVE_PATH to the respective locations on your hard drive.
  • Run the script, providing a config file, a model checkpoint, and the raw text file you wish to parse, e.g.:
    ./ data/saved_models/models/RoBERTa_large_1024d/[TIMESTAMP]/config.json data/saved_models/models/RoBERTa_large_1024d/[TIMESTAMP]/model_best.pth [RAW_TEXT_FILE]

The script will first tokenize and sentence-segment the input text file using StanfordNLP, then parse the tokenized text using your trained model, and finally post-process the resulting dependency graphs using UDify. Output of the parsing process will be written (in CoNLL-U format) to a new file parsed.conllu.


The RobertNLP parser is open-sourced under the BSD-3-Clause license. See the LICENSE file for details.

For a list of other open source components included in RobertNLP, see the file 3rd-party-licenses.txt.

The software, including its dependencies, may be covered by third party rights, including patents. You should not execute this code unless you have obtained the appropriate rights, which the authors are not purporting to give.

