This repository describes the RIGA team submission to MultiCoNER II.
- Create a new environment:
  ```
  python -m venv venv
  ```
- Install dependencies:
  ```
  pip install -r requirements.txt
  ```
- Now your environment is ready. The next step is to get the data from the MultiCoNER download page and put it in the `data` directory.
- Convert the data using the `parse_conll.py` script:
  ```
  python parse_conll.py --source_path {specify a path to dataset in CoNLL format}
  ```
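As a rough illustration of what the conversion involves (the actual `parse_conll.py` and its output format may differ), a minimal CoNLL-style reader could look like this:

```python
# Illustrative sketch only: reads CoNLL-style NER data, where each line
# holds a token and its tag, and blank lines separate sentences.
def read_conll(lines):
    """Yield (tokens, tags) pairs, one per sentence."""
    tokens, tags = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # sentence boundary or comment
            if tokens:
                yield tokens, tags
                tokens, tags = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])   # surface token in the first column
        tags.append(parts[-1])    # NER tag in the last column
    if tokens:                    # flush the final sentence
        yield tokens, tags

sample = [
    "Riga B-LOC",
    "is O",
    "nice O",
    "",
]
print(list(read_conll(sample)))
```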
- Start gathering context using the `get_context.py` script. You'll need to specify your own API key and the dataset split to use; the `TODO` comments in the file will help you find the right places. On this step each context is collected separately, for easier navigation and to avoid querying the same sentences multiple times in case of an error.
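The per-sentence caching idea behind that step can be sketched as follows. Here `fetch_context` is a hypothetical stand-in for whatever API call `get_context.py` actually makes (the real keys, endpoints, and file layout are what the `TODO` comments point at):

```python
import json
import os

def save_context(sentence_id, sentence, fetch_context, out_dir="contexts"):
    """Query context for one sentence and cache it in its own file.

    `fetch_context` is a placeholder for the external API call (which
    needs your API key). Because each sentence gets its own file, a
    crashed run can resume without re-querying sentences that already
    succeeded.
    """
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{sentence_id}.json")
    if os.path.exists(path):  # already collected -> skip the API call
        return path
    context = fetch_context(sentence)
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"sentence": sentence, "context": context}, f)
    return path
```

On a re-run, any sentence whose file already exists is skipped, so only failed or missing queries are retried.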
- Merge all of the collected contexts into a single file using the `merge_context.py` script. You'll also need to change the dataset split in order to merge contexts for all of the train/dev/test datasets.
- The last step is NER model fine-tuning. You can run
  ```
  python train.py --help
  ```
  to get the full list of arguments. During the competition we mainly used either the `distilbert-base-uncased` (66M parameters) or the `xlm-roberta-large` (558M parameters) model.
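For the merge step above, combining the per-sentence context files amounts to a few lines; here is a sketch assuming JSON files named by sentence id (the real `merge_context.py` may use a different layout):

```python
import glob
import json
import os

def merge_contexts(context_dir, out_path):
    """Combine per-sentence context files into one JSON file keyed by
    sentence id (illustrative; the actual script's layout may differ)."""
    merged = {}
    for path in glob.glob(os.path.join(context_dir, "*.json")):
        sent_id = os.path.splitext(os.path.basename(path))[0]
        with open(path, encoding="utf-8") as f:
            merged[sent_id] = json.load(f)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False)
    return merged
```

Running this once per dataset split would mirror the README's instruction to change the split when merging train/dev/test contexts.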