This repository contains the code for our 2021 publication, adapted from an existing implementation.
You can use our pre-trained models, which are available in the GitHub releases. The simplest way to use the models for inference is via torchserve: download the models and refer to the torchserve section of this README.
If you base your work on our models, the incremental code, or the German dataset processing, please cite our paper:
@inproceedings{schroeder-etal-2021-coref,
    title = {Neural End-to-end Coreference Resolution for {G}erman in Different Domains},
    author = {Schröder, Fynn and Hatzel, Hans Ole and Biemann, Chris},
    year = {2021},
    booktitle = {Proceedings of the 17th Conference on Natural Language Processing},
    address = {Düsseldorf, Germany}
}
Most of the code in this repository is not language-specific and can easily be adapted to many other languages. mBERT models performed on par with some older German-specific models in our experiments, so even if you are dealing with a lower-resource language it may be possible to train a decent model.
The basic end-to-end coreference model (as found in the original implementation) is a PyTorch re-implementation of the TensorFlow model, following similar preprocessing (see this repository).
The code was extended to handle incremental coreference resolution and separate mention pre-training.
Files:
- run.py: training and evaluation
- run_mentions.py: training and evaluation of mentions
- model.py: the coreference model
- higher_order.py: higher-order inference modules
- analyze.py: result analysis
- preprocess.py: converting CoNLL files to examples
- tensorize.py: tensorizing examples
- conll.py, metrics.py: CoNLL-related files, taken unchanged from the original repository
- experiments.conf: different model configurations
- split_droc.py: create train/dev/test splits for the German literature "DROC" dataset
- split_gerdracor.py: create train/dev/test splits for the German Drama dataset GerDraCor
- split_tuebadz_10.py: create train/dev/test splits for the TüBa-D/Z dataset version 10
- split_tuebadz_11.py: create train/dev/test splits for the TüBa-D/Z dataset version 11
Set up environment and data for training and evaluation:
- Install PyTorch for your platform
- Install Python3 dependencies:
pip install -r requirements.txt
- All data and config files are placed relative to base_dir = /path/to/project set in local.conf, so change it to point to the root of this repo
- All splits created using the split_* Python scripts need to be processed with preprocess.py before they can be used as training input for the model. For example, to split the DROC dataset run:
python split_droc.py --type-system-xml /path/to/DROC-Release/droc/src/main/resources/CorefTypeSystem.xml /path/to/DROC-Release/droc/DROC-xmi data/german.droc_gold_conll
python preprocess.py --input_dir data/droc_full --output_dir data/droc_full --seg_len 512 --language german --tokenizer_name german-nlp-group/electra-base-german-uncased --input_suffix droc_gold_conll --input_format conll-2012 --model_type electra
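As a sanity check after preprocessing, you can inspect one of the generated documents. This is a minimal sketch only: it assumes preprocess.py writes jsonlines files (one JSON document per line) into the output directory, and the exact file and field names may differ.

```python
# Minimal sketch: inspect a preprocessed document (file and field names are assumptions).
import json
from pathlib import Path

# preprocess.py is assumed to write one JSON document per line (jsonlines) into --output_dir.
path = next(Path("data/droc_full").glob("*.jsonlines"))
with open(path, encoding="utf-8") as f:
    doc = json.loads(f.readline())

print(path.name)
print(sorted(doc.keys()))  # e.g. doc_key, sentences, clusters, ... (exact keys may differ)
```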
If you want to use the official evaluator, download and unzip the official CoNLL-2012 scorer into your specified data_dir directory.
Evaluate a model on the dev/test set:
- Download the corresponding model file (.mar), extract model*.bin from it, and place it in data_dir/<experiment_id>/
python evaluate.py [config] [model_id] [gpu_id] ([output_file])
- e.g. News, SemEval-2010, ELECTRA uncased (base):
python evaluate.py se10_electra_uncased tuba10_electra_uncased_Apr30_08-52-00_56879 0
Train a model:
python run.py [config] [gpu_id] (--model model_name)
- [config] can be any configuration in experiments.conf
- Log file will be saved at data_dir/[config]/log_XXX.txt
- Models will be saved at data_dir/[config]/model_XXX.bin
- Tensorboard is available at data_dir/tensorboard
- Optional --model model_name can be specified to start training with weights from an existing model
Some important configurations in experiments.conf:
- data_dir: the full path to the directory containing datasets, models, and log files
- coref_depth and higher_order: control the higher-order inference module
- incremental: if true, the incremental approach is used, otherwise the c2f model is used
- incremental_singleton: gives the model the explicit option to discard new mentions, which enables it to output singletons
- incremental_teacherforcing: whether to use teacher forcing when creating and updating entity representations; greatly improves convergence speed
- evict: whether to evict entity representations in the incremental model from the active entity pool after a period with no new mentions
- unconditional_eviction_limit: the distance (with no mentions) after which an entity is evicted
- singleton_eviction_limit: the distance (with no mentions) after which a singleton entity is evicted
- bert_pretrained_name_or_path: the name/path of the pretrained BERT model (HuggingFace BERT models)
- max_training_sentences: the maximum number of segments to use when a document is too long
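To inspect a configuration programmatically before training, the file can be parsed with pyhocon. This is a minimal sketch assuming experiments.conf is HOCON (as in the original implementation); the config name used here is just the one from the evaluation example above.

```python
# Minimal sketch: load a configuration from experiments.conf with pyhocon.
# Assumes the configs are HOCON, as in the original implementation; the config
# name below is the one used in the evaluation example of this README.
from pyhocon import ConfigFactory

conf = ConfigFactory.parse_file("experiments.conf")
cfg = conf["se10_electra_uncased"]

# Print a few of the options described above.
for key in ("data_dir", "incremental", "coref_depth", "bert_pretrained_name_or_path"):
    print(key, "=", cfg.get(key, None))
```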
Using the model name from experiments.conf and the relative path to the model binary, create a model archive.
Optionally supply 'c2f' or 'incremental' as the model type (defaults to incremental).
In order to archive models you need to install the model archiver: pip install torch-model-archiver
./archive-model.sh <model_name> <path_to_model/model.bin> [MODEL_VARIANT]
First install torchserve, which is not part of our requirements.txt: pip install torchserve
Models archived in this manner can be served using torchserve, e.g.:
torchserve --models droc_incremental=<model_name>.mar --model-store . --foreground
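Once torchserve is running, you can check that the server is healthy and the model is registered. A minimal sketch using torchserve's standard ping and management endpoints on their default ports (these ports are torchserve defaults, not something this repo configures):

```python
# Minimal sketch: check that torchserve is up and the model is registered.
# Uses torchserve's default ports (8080 inference, 8081 management).
import requests

print(requests.get("http://127.0.0.1:8080/ping").json())    # expected: {"status": "Healthy"}
print(requests.get("http://127.0.0.1:8081/models").json())  # lists registered models
```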
Since the native dependencies may cause issues on some systems, we provide a custom torchserve Docker image in docker/.
The model handlers essentially provide the HTTP API; there are two modes of operation for our handlers:
- Using raw text
- Using pretokenized text
Raw text is useful for direct visualization (example requests made using httpie); in this context you may also want to try the 'raw' output mode for relatively human-friendly text.
http http://127.0.0.1:8080/predictions/<model_name> output_format=raw text="Die Organisation gab bekannt sie habe Spenden veruntreut."
In the context of a larger language pipeline, pretokenization is often desirable:
http http://127.0.0.1:8080/predictions/<model_name> output_format=conll tokenized_sentences:='[["Die", "Organisation", "gab", "bekannt", "sie", "habe", "Spenden", "veruntreut", "."], ["Next", "sentence", "goes", "here", "!"]]'
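The same request can also be made from Python. A minimal sketch using the requests library, with the model name placeholder and default torchserve port from the examples above:

```python
# Minimal sketch: query the served model from Python (model name and port as in the examples above).
import requests

payload = {
    "output_format": "conll",
    "tokenized_sentences": [
        ["Die", "Organisation", "gab", "bekannt", "sie", "habe", "Spenden", "veruntreut", "."],
    ],
}
response = requests.post("http://127.0.0.1:8080/predictions/<model_name>", json=payload)
response.raise_for_status()
print(response.text)
```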