This repo contains the new Tweebank-NER dataset and the off-the-shelf Twitter-Stanza pipeline for state-of-the-art Tweet NLP, as described in *Annotating the Tweebank Corpus on Named Entity Recognition and Building NLP Models for Social Media Analysis*:

- Tweebank-NER V1.0 is the annotated NER dataset based on Tweebank V2 (TB2), the main UD treebank for English Twitter NLP tasks.
- The Twitter-Stanza pipeline provides pre-trained Tweet NLP models (NER, tokenization, lemmatization, POS tagging, and dependency parsing) with state-of-the-art or competitive performance. The models are fully compatible with Stanza and provide both Python and command-line interfaces.
If you use this repository in your research, please cite our paper as well as the Stanza papers:
```bibtex
@article{jiang2022tweetnlp,
  title={Annotating the Tweebank Corpus on Named Entity Recognition and Building NLP Models for Social Media Analysis},
  author={Jiang, Hang and Hua, Yining and Beeferman, Doug and Roy, Deb},
  journal={arXiv preprint arXiv:2201.07281},
  year={2022}
}
```
```sh
# please install from the source
pip install -e ./twitter-stanza
pip install pythainlp

# download glove and pre-trained models
sh download_twitter_resources.sh
```
Note that Stanza accepts both BIO- and BIOES-scheme annotations for the training/dev/test data, but its predictions are decoded in BIOES only. During evaluation, tags are merged back into entity spans, so the choice of scheme does not affect the performance scores.
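To illustrate the difference between the two schemes, here is a small sketch (a hypothetical helper, not part of this repo) that converts a BIO tag sequence to BIOES:

```python
def bio_to_bioes(tags):
    """Convert a BIO tag sequence to BIOES.

    A B-X that starts a single-token entity becomes S-X; the last I-X of a
    multi-token entity becomes E-X; all other tags are unchanged.
    """
    bioes = []
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if tag.startswith("B-"):
            # Single-token entity unless the next tag continues it
            bioes.append(("B-" if nxt == "I-" + tag[2:] else "S-") + tag[2:])
        elif tag.startswith("I-"):
            # Entity ends here unless the next tag continues it
            bioes.append(("I-" if nxt == "I-" + tag[2:] else "E-") + tag[2:])
        else:
            bioes.append(tag)
    return bioes

print(bio_to_bioes(["B-PER", "I-PER", "O", "B-LOC"]))
# ['B-PER', 'E-PER', 'O', 'S-LOC']
```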
```python
import stanza

# config for the `en_tweet` models (models trained only on Tweebank)
config = {
    'processors': 'tokenize,lemma,pos,depparse,ner',
    'lang': 'en',
    'tokenize_pretokenized': True,  # disable tokenization
    'tokenize_model_path': './twitter-stanza/saved_models/tokenize/en_tweet_tokenizer.pt',
    'lemma_model_path': './twitter-stanza/saved_models/lemma/en_tweet_lemmatizer.pt',
    'pos_model_path': './twitter-stanza/saved_models/pos/en_tweet_tagger.pt',
    'depparse_model_path': './twitter-stanza/saved_models/depparse/en_tweet_parser.pt',
    'ner_model_path': './twitter-stanza/saved_models/ner/en_tweet_nertagger.pt',
}

# initialize the pipeline using the configuration dict
stanza.download("en")
nlp = stanza.Pipeline(**config)
doc = nlp("Oh ikr like Messi better than Ronaldo but we all like Ronaldo more")
print(doc)  # look at the result
```
We provide two pre-trained Stanza NER models:

- `en_tweetwnut17`: trained on TB2+WNUT17
- `en_tweet`: trained on TB2
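To use the `en_tweetwnut17` NER model instead, only the NER model path changes. A configuration sketch, with the filename following the `${shorthand}_nertagger.pt` naming used by the prediction commands below:

```python
# Sketch: pipeline config selecting the TB2+WNUT17 NER model
config = {
    'processors': 'tokenize,ner',
    'lang': 'en',
    'tokenize_model_path': './twitter-stanza/saved_models/tokenize/en_tweet_tokenizer.pt',
    # NER model trained on TB2+WNUT17 rather than TB2 alone
    'ner_model_path': './twitter-stanza/saved_models/ner/en_tweetwnut17_nertagger.pt',
}
```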
NER performance comparison among spaCy, FLAIR, and Twitter-Stanza:

| Models | Training data | NER (F-micro) |
|---|---|---|
| spaCy | TB2 | 52.20 |
| spaCy | TB2+WNUT17 | 53.89 |
| FLAIR | TB2 | 62.12 |
| FLAIR | TB2+WNUT17 | 59.08 |
| Stanza | TB2 | 60.14 |
| Stanza | TB2+WNUT17 | 62.53 |
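The micro-F1 in the table is entity-level: a prediction counts as correct only when both its span and its type match a gold entity. A minimal sketch of the computation over `(start, end, type)` tuples (an illustration, not the repository's actual scorer):

```python
def micro_f1(gold_spans, pred_spans):
    """Entity-level micro-F1 over sets of (start, end, type) tuples."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)  # exact span+type matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [(0, 2, "PER"), (5, 6, "LOC")]
pred = [(0, 2, "PER"), (5, 6, "ORG")]  # second entity has the wrong type
print(round(micro_f1(gold, pred), 2))  # 0.5
```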
```sh
source twitter-stanza/scripts/config.sh
cd twitter-stanza

shorthand=en_tweetwnut17
python stanza/utils/training/run_ner.py ${shorthand} \
    --mode predict \
    --score_test \
    --wordvec_file ../data/wordvec/English/en.twitter100d.xz \
    --eval_file data/ner/en_tweet.test.json \
    --save_dir ./saved_models/ner \
    --save_name ${shorthand}_nertagger.pt \
    --scheme bio
```
We provide two pre-trained models for each NLP task; please specify one of the following shorthands:

- `en_tweetewt`: the model trained on TB2+UD-English-EWT
- `en_tweet`: the model trained on TB2
Syntactic NLP performance comparison among spaCy, FLAIR, and Twitter-Stanza:

| Models | Training data | Tokens | Lemmas | UPOS | UAS | LAS |
|---|---|---|---|---|---|---|
| spaCy | TB2 | 98.57 | --- | 86.72 | 66.93 | 58.79 |
| spaCy | TB2+EWT | 95.57 | --- | 88.84 | 72.06 | 63.84 |
| FLAIR | TB2 | --- | 96.18 | 87.85 | --- | --- |
| FLAIR | TB2+EWT | --- | 84.54 | 88.19 | --- | --- |
| Stanza | TB2 | 98.64 | 98.65 | 93.20 | 79.28 | 74.34 |
| Stanza | TB2+EWT | 98.59 | 85.45 | 93.53 | 82.13 | 77.82 |
```sh
shorthand=en_tweet
python stanza/utils/training/run_tokenizer.py ${shorthand} \
    --mode predict \
    --score_test \
    --txt_file data/tokenize/en_tweet.test.txt \
    --label_file data/tokenize/en_tweet-ud-test.toklabels \
    --no_use_mwt
```
```sh
shorthand=en_tweet
python stanza/utils/training/run_lemma.py ${shorthand} \
    --mode predict \
    --score_test \
    --gold_file data/lemma/en_tweet.test.gold.conllu \
    --eval_file data/lemma/en_tweet.test.in.conllu
```
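The lemma score compares the LEMMA column of the predicted and gold CoNLL-U files token by token. A minimal accuracy sketch (a hypothetical helper, not the repo's scorer; real CoNLL-U token lines have 10 tab-separated columns, of which LEMMA is the third):

```python
def lemma_accuracy(pred_conllu, gold_conllu):
    """Token-level lemma accuracy between two token-aligned CoNLL-U strings."""
    def lemmas(text):
        # Skip '#' comment lines and blank sentence separators;
        # LEMMA is the third tab-separated column (index 2).
        return [line.split("\t")[2]
                for line in text.splitlines()
                if line and not line.startswith("#")]
    pred, gold = lemmas(pred_conllu), lemmas(gold_conllu)
    assert len(pred) == len(gold), "files must be token-aligned"
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

# Toy examples with only the first three columns (ID, FORM, LEMMA) shown
gold = "# text = cats run\n1\tcats\tcat\n2\trun\trun\n"
pred = "# text = cats run\n1\tcats\tcats\n2\trun\trun\n"
print(lemma_accuracy(pred, gold))  # 0.5
```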
```sh
shorthand=en_tweetewt
python stanza/utils/training/run_pos.py ${shorthand} \
    --mode predict \
    --score_test \
    --eval_file data/pos/en_tweet.test.in.conllu \
    --gold_file data/pos/en_tweet.test.gold.conllu \
    --wordvec_file ../data/wordvec/English/en.twitter100d.xz \
    --load_name ./saved_models/pos/${shorthand}_tagger.pt
```
```sh
shorthand=en_tweetewt
python stanza/utils/training/run_depparse.py ${shorthand} \
    --mode predict \
    --score_test \
    --wordvec_file ../data/wordvec/English/en.twitter100d.xz \
    --eval_file data/depparse/en_tweet.test.in.conllu \
    --gold_file data/depparse/en_tweet.test.gold.conllu
```
Please refer to TRAIN_README.md for instructions on training the Twitter-Stanza neural pipeline.

The Twitter-Stanza pipeline is a friendly fork of the Stanza library with a few modifications to adapt it to tweets, and the repository is fully compatible with Stanza. This research project is funded by the MIT Center for Constructive Communication (CCC). The main contributors to this repository are Yining Hua (@ningkko) and Hang Jiang (@hjian42).