Coreference Resolution Systems: PyTorch Implementation

PyTorch implementation of the independent model with BERT and SpanBERT proposed in BERT for Coreference Resolution: Baselines and Analysis and SpanBERT: Improving Pre-training by Representing and Predicting Spans.

This implementation contains additional scripts and configurations I used in the context of my master's thesis. These include the use of other pre-trained language models, training the model on various German datasets, and improving its performance for low-resource languages by leveraging transfer learning. For the vanilla versions of the fundamental coreference resolution models see the Repository Overview.

This implementation is based upon the original implementation by the papers' authors. The model and misc packages are written from scratch, whereas some scripts in the setup package and the entire eval package are borrowed with almost no changes from the original implementation. The optimization for mention pruning is inspired by Ethan Yang.

Repository Overview

This repository contains three additional branches with different PyTorch models. These are reimplementations of models originally implemented with TensorFlow. The following table gives an overview of the branches, the corresponding papers, and the original implementations.

| Branch | Paper | Implementation |
| --- | --- | --- |
| e2e-coref | End-to-end Neural Coreference Resolution | GitHub |
| c2f-coref | Higher-order Coreference Resolution with Coarse-to-fine Inference | GitHub |
| bert-coref | BERT for Coreference Resolution: Baselines and Analysis | GitHub |

Requirements

This project was written with Python 3.8.5 and PyTorch 1.7.1. For installation details regarding PyTorch please visit the official website. Further requirements are listed in the requirements.txt and can be installed via pip: pip install -r requirements.txt
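
For example, to install the dependencies into a fresh virtual environment (a standard Python workflow, not specific to this repository):

$ python -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt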

Setup

Hint: Run setup.sh in an environment with Python 2.7 so that the CoNLL-2012 scripts are executed by the correct interpreter.

To obtain all necessary data for training, evaluation, and inference in English run setup.sh with the path to the OntoNotes 5.0 folder (often named ontonotes-release-5.0).

e.g. $ ./setup.sh /path/to/ontonotes-release-5.0

Run python setup/bert_tokenize.py -c <conf> to tokenize and segment the data before training, evaluating, or testing with that specific configuration. See coref.conf for the available configurations.
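
For example, to tokenize and segment the data for the bert-base configuration (one of the configurations listed in the Training section):

$ python setup/bert_tokenize.py -c bert-base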

The misc folder contains scripts to convert the German datasets Tüba-D/Z v10/11, SemEval-2010, and DIRNDL into the required CoNLL-2012 format. For SemEval-2010, make sure to remove the singletons in order to get comparable results. Use minimize.py and bert_tokenize.py to obtain the final files for training.

Training

Run the training with python train.py -c <config> -f <folder> --cpu --amp --check --split. Use config to select one of the four available configurations (bert-base, bert-large, spanbert-base, spanbert-large). The parameter folder names the folder that the snapshots taken during training are saved to. If the given folder already exists and contains at least one snapshot, training is resumed from the latest snapshot. The optional flags cpu and amp can be set to train exclusively on the CPU or to use automatic mixed precision training. Gradient checkpointing can be enabled with the option check to further reduce GPU memory usage, or the model can even be split across two GPUs with split.
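
For example, to train the spanbert-base configuration with mixed precision and gradient checkpointing, saving snapshots to a folder named spanbert-base (the folder name is only an illustration):

$ python train.py -c spanbert-base -f spanbert-base --amp --check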

Fine-tuning on German

To fine-tune multilingual models trained on the OntoNotes 5.0 dataset on German datasets, adapt the configuration in coref.conf and place the latest snapshot into the same folder the fine-tuned model should write its snapshots to. Then start training as described above.

For easier hyperparameter tuning, use train_fine.py and write a shell script that programmatically passes the learning rates and epochs into the training, as sketched below.
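
A minimal sketch of such a script, assuming train_fine.py takes the configuration and snapshot folder like train.py and accepts the learning rate and number of epochs via command-line arguments (the --lr and --epochs flag names are hypothetical and must be adapted to the actual interface of train_fine.py; bert-multilingual-base is used here only as an example of a multilingual configuration):

# hypothetical sweep over learning rates and epoch counts for German fine-tuning
for lr in 1e-5 2e-5 3e-5; do
  for ep in 10 20; do
    python train_fine.py -c bert-multilingual-base -f fine_lr${lr}_ep${ep} --lr ${lr} --epochs ${ep}
  done
done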

Adversarial Cross-lingual Training

To redo the adversarial training described in the thesis, run python train_adv.py. The only configuration set up for this training is bert-multilingual-base. Make sure the adv_data_file has been created in addition to the English data before training.

To evaluate the trained model on German, comment in the desired dataset in coref.conf. To check whether the training brought English and German embeddings closer together as intended, use the analyze_emb_similarity.py script.

Evaluation

Run the evaluation with python evaluate.py -c <conf> -p <pattern> --cpu --amp --split. All snapshots in the data/ckpt folder that match the given pattern are evaluated. This works with plain snapshots (.pt) as well as with snapshots that contain additional metadata (.pt.tar). See Training for details on the remaining parameters.
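
For example, to evaluate all spanbert-base snapshots with metadata using mixed precision (the pattern below is only an illustration; how patterns are matched against the files in data/ckpt depends on evaluate.py):

$ python evaluate.py -c spanbert-base -p "spanbert-base*.pt.tar" --amp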

To evaluate previous predictions dumped during training or evaluation use the eval_dumped_preds.py script.
