TweebankNLP


This repo contains the new Tweebank-NER dataset and the off-the-shelf Twitter-Stanza pipeline for state-of-the-art Tweet NLP, as described in Annotating the Tweebank Corpus on Named Entity Recognition and Building NLP Models for Social Media Analysis:

  • Tweebank-NER V1.0 is an annotated NER dataset based on Tweebank V2, the main UD treebank for English Twitter NLP tasks.
  • The Twitter-Stanza pipeline provides pre-trained Tweet NLP models (NER, tokenization, lemmatization, POS tagging, dependency parsing) with state-of-the-art or competitive performance. The models are fully compatible with Stanza and expose both Python and command-line interfaces.

References

If you use this repository in your research, please cite our paper as well as the Stanza papers.

@article{jiang2022tweetnlp,
    title={Annotating the Tweebank Corpus on Named Entity Recognition and Building NLP Models for Social Media Analysis},
    author={Jiang, Hang and Hua, Yining and Beeferman, Doug and Roy, Deb},
    journal={arXiv preprint arXiv:2201.07281},
    year={2022}
}

Installation

# please install from source
pip install -e ./twitter-stanza
pip install pythainlp

# download glove and pre-trained models
sh download_twitter_resources.sh
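
To sanity-check the installation, you can confirm that the locally installed Stanza fork imports correctly (a minimal check; assumes the pip install above succeeded):

python -c "import stanza; print(stanza.__version__)"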

Python Interface for Twitter-Stanza

Note that Stanza accepts both BIO- and BIOES-schemed annotations for the training/dev/test data, but predictions are always decoded in the BIOES scheme. During evaluation, tags are merged back into entity spans, so the choice of scheme does not affect the performance scores.
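
For illustration, here is the same three-token entity under each scheme (a toy example, not drawn from the dataset):

# "New York City" tagged as a single LOC entity
bio_tags   = ["B-LOC", "I-LOC", "I-LOC"]  # BIO: Begin / Inside
bioes_tags = ["B-LOC", "I-LOC", "E-LOC"]  # BIOES adds End (E-) and Single (S-) tags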

import stanza

# config for the `en_tweet` models (models trained only on Tweebank)
config = {
          'processors': 'tokenize,lemma,pos,depparse,ner',
          'lang': 'en',
          'tokenize_pretokenized': True, # assume input is already whitespace-tokenized; disables the tokenizer
          'tokenize_model_path': './twitter-stanza/saved_models/tokenize/en_tweet_tokenizer.pt',
          'lemma_model_path': './twitter-stanza/saved_models/lemma/en_tweet_lemmatizer.pt',
          "pos_model_path": './twitter-stanza/saved_models/pos/en_tweet_tagger.pt',
          "depparse_model_path": './twitter-stanza/saved_models/depparse/en_tweet_parser.pt',
          "ner_model_path": './twitter-stanza/saved_models/ner/en_tweet_nertagger.pt',
}

# Initialize the pipeline using a configuration dict
stanza.download("en")
nlp = stanza.Pipeline(**config)
doc = nlp("Oh ikr like Messi better than Ronaldo but we all like Ronaldo more")
print(doc) # Look at the result
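
Once the pipeline has run, the predictions can be read off the returned Document with the standard Stanza accessors, for example:

# inspect recognized entities and per-word lemma/POS/dependency predictions
for ent in doc.ents:
    print(ent.text, ent.type)
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.upos, word.head, word.deprel)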

Command-line Interface for Twitter-Stanza

NER

We provide two pre-trained Stanza NER models:

  • en_tweetwnut17: trained on TB2+WNUT17
  • en_tweet: trained on TB2

NER performance comparison among spaCy, FLAIR, and Twitter-Stanza:

Models   Training data   NER (micro F1)
spaCy    TB2             52.20
spaCy    TB2+WNUT17      53.89
FLAIR    TB2             62.12
FLAIR    TB2+WNUT17      59.08
Stanza   TB2             60.14
Stanza   TB2+WNUT17      62.53

To evaluate a pre-trained NER model on the Tweebank-NER test set, run:

source twitter-stanza/scripts/config.sh
cd twitter-stanza

shorthand=en_tweetwnut17
python stanza/utils/training/run_ner.py ${shorthand} \
--mode predict \
--score_test \
--wordvec_file ../data/wordvec/English/en.twitter100d.xz \
--eval_file data/ner/en_tweet.test.json \
--save_dir ./saved_models/ner \
--save_name ${shorthand}_nertagger.pt \
--scheme bio
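
The same checkpoint can also be loaded from the Python interface by pointing ner_model_path at it. The path below is an assumption, derived from the --save_dir/--save_name convention above:

# reuse the `config` dict from the Python interface section;
# checkpoint path assumed from --save_dir and --save_name
config["ner_model_path"] = "./twitter-stanza/saved_models/ner/en_tweetwnut17_nertagger.pt"
nlp = stanza.Pipeline(**config)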

Syntactic NLP Models

We provide two pre-trained models for each NLP task; specify one of the following shorthands:

  • en_tweetewt: the model trained on TB2+UD-English-EWT
  • en_tweet: the model trained on TB2

Syntactic NLP performance comparison among spaCy, FLAIR, and Twitter-Stanza:

Models   Training data   Tokens   Lemmas   UPOS    UAS     LAS
spaCy    TB2             98.57    ---      86.72   66.93   58.79
spaCy    TB2+EWT         95.57    ---      88.84   72.06   63.84
FLAIR    TB2             ---      96.18    87.85   ---     ---
FLAIR    TB2+EWT         ---      84.54    88.19   ---     ---
Stanza   TB2             98.64    98.65    93.20   79.28   74.34
Stanza   TB2+EWT         98.59    85.45    93.53   82.13   77.82
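
The TB2+EWT syntactic models can likewise be used from the Python interface by swapping the en_tweet checkpoints for their en_tweetewt counterparts. The file names below are inferred from the ${shorthand}_* naming convention in the commands that follow and may need adjusting to your local layout:

import stanza

# assumed checkpoint names, following the ${shorthand}_* pattern
config = {
    'processors': 'tokenize,lemma,pos,depparse',
    'lang': 'en',
    'tokenize_model_path': './twitter-stanza/saved_models/tokenize/en_tweetewt_tokenizer.pt',
    'lemma_model_path': './twitter-stanza/saved_models/lemma/en_tweetewt_lemmatizer.pt',
    'pos_model_path': './twitter-stanza/saved_models/pos/en_tweetewt_tagger.pt',
    'depparse_model_path': './twitter-stanza/saved_models/depparse/en_tweetewt_parser.pt',
}
nlp = stanza.Pipeline(**config)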

1. Tokenization

shorthand=en_tweet 
python stanza/utils/training/run_tokenizer.py ${shorthand} \
--mode predict \
--score_test \
--txt_file data/tokenize/en_tweet.test.txt \
--label_file  data/tokenize/en_tweet-ud-test.toklabels \
--no_use_mwt 

2. Lemmatization

shorthand=en_tweet
python stanza/utils/training/run_lemma.py ${shorthand} \
--mode predict \
--score_test \
--gold_file data/lemma/en_tweet.test.gold.conllu \
--eval_file data/lemma/en_tweet.test.in.conllu 

3. POS Tagging

shorthand=en_tweetewt
python stanza/utils/training/run_pos.py ${shorthand} \
--mode predict \
--score_test \
--eval_file data/pos/en_tweet.test.in.conllu \
--gold_file data/pos/en_tweet.test.gold.conllu \
--wordvec_file ../data/wordvec/English/en.twitter100d.xz \
--load_name ./saved_models/pos/${shorthand}_tagger.pt

4. Dependency Parsing

shorthand=en_tweetewt
python stanza/utils/training/run_depparse.py ${shorthand} \
--mode predict \
--score_test \
--wordvec_file ../data/wordvec/English/en.twitter100d.xz \
--eval_file data/depparse/en_tweet.test.in.conllu \
--gold_file data/depparse/en_tweet.test.gold.conllu 

Training Twitter-Stanza

Please refer to the TRAIN_README.md for training the Twitter-Stanza neural pipeline.

Acknowledgement

The Twitter-Stanza pipeline is a friendly fork of the Stanza library with a few modifications to adapt it to tweets. The repository is fully compatible with Stanza. This research project is funded by the MIT Center for Constructive Communication (CCC). The main contributors to this repository are Yining Hua (@ningkko) and Hang Jiang (@hjian42).
