NLP Toolkit

Library containing state-of-the-art models (PyTorch) for Natural Language Processing tasks. *Note: this repo is continually being updated (see the To do list).

Contents

Tasks:

  1. Classification
  2. Automatic Speech Recognition
  3. Text Summarization
  4. Machine Translation
  5. Natural Language Generation
  6. Punctuation Restoration
  7. Named Entity Recognition

Benchmark Results
References

Pre-requisites

torch==1.2.0 ; spacy==2.1.8 ; torchtext==0.4.0 ; seqeval==0.0.12

** Pre-trained models (XLNet, BERT, GPT-2) are courtesy of huggingface (https://github.com/huggingface/pytorch-transformers)

1) Classification

The goal of classification is to assign documents to appropriate classes based on their text content. Currently, the classification toolkit uses the following models:

  1. Text-based Graph Convolution Networks (GCN)
  2. Bidirectional Encoder Representations from Transformers (BERT)
  3. XLNet

Format of dataset files

The training data (default: train.csv) should be formatted into two columns, “text” and “label”, with one row per document. “text” contains the raw text and “label” contains the corresponding label (an integer 0, 1, 2, … depending on the number of classes).

The infer data (default: infer.csv) should contain at least one column, “text”, holding the raw text, again with one row per document. An optional “label” column can be added and the --train_test_split argument set to 1 to use infer.csv as the test set for model verification.
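
As an illustration only (a minimal sketch with placeholder texts, labels and paths, not part of the repo), the two files could be prepared with pandas:

# Hypothetical sketch: build train.csv / infer.csv in the two-column format above.
import os
import pandas as pd

os.makedirs("./data", exist_ok=True)

train = pd.DataFrame({
    "text": ["this movie was fantastic", "terrible plot and acting"],
    "label": [1, 0],  # integer class ids 0, 1, 2, ...
})
train.to_csv("./data/train.csv")  # row index serves as the document index

infer = pd.DataFrame({
    "text": ["an average film with some good moments"],
    # an optional "label" column plus --train_test_split 1 turns this into a test set
})
infer.to_csv("./data/infer.csv")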

Running the model

Run classify.py with the arguments below.

classify.py [-h]
	[--train_data TRAIN_DATA (default: "./data/train.csv")]
	[--infer_data INFER_DATA (default: "./data/infer.csv")]
	[--max_vocab_len MAX_VOCAB_LEN (default: 7000)]
	[--hidden_size_1 HIDDEN_SIZE_1 (default: 330)]
	[--hidden_size_2 HIDDEN_SIZE_2 (default: 130)]
	[--tokens_length TOKENS_LENGTH (default: 200)]
	[--num_classes NUM_CLASSES (default: 2)]
	[--train_test_split TRAIN_TEST_SPLIT (default: 0)]
	[--test_ratio TEST_RATIO (default: 0.1)]
	[--batch_size BATCH_SIZE (default: 32)]
	[--gradient_acc_steps GRADIENT_ACC_STEPS (default: 1)]
	[--max_norm MAX_NORM (default: 1)]
	[--num_epochs NUM_EPOCHS (default: 1700)]
	[--lr LR (default: 0.0031)]
	[--model_no MODEL_NO (default: 0 (0: Graph Convolution Network (GCN), 1: BERT, 2: XLNet))]
	[--train TRAIN (default: 1)]
	[--infer INFER (default: 0 (Infer input sentence labels from trained model))]

The script outputs a results.csv file containing the indexes of the documents in infer.csv and their corresponding predicted labels.
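
The predictions can then be inspected with pandas; a sketch, where the results.csv location and its column names are assumptions rather than documented facts:

# Hypothetical sketch: load the predictions written by classify.py.
import pandas as pd

results = pd.read_csv("./data/results.csv")  # path is an assumption
print(results.head())  # document index alongside its predicted label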

2) Automatic Speech Recognition

Automatic Speech Recognition (ASR) aims to convert audio signals into text. This library contains the following models for ASR:

  1. Speech-Transformer
  2. Listen-Attend-Spell (LAS)

Format of dataset files

The folder containing the dataset should have the structure folder/speaker/chapter. Within each chapter subdirectory, the audio files (in .flac format) are named speaker-chapter-file_id, with file_id in running order. The transcript .txt file for the files within a chapter should be located in the same chapter subdirectory. Each row of the transcript file should consist of the speaker-chapter-file_id, followed by a space and the transcript.
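
A minimal sketch of this layout (LibriSpeech-style names; the speaker/chapter ids and transcript filename below are placeholders, not prescribed by the repo):

# Hypothetical sketch: create one speaker/chapter of the expected folder structure.
from pathlib import Path

root = Path("train-clean-5")            # passed via --folder
chapter = root / "19" / "198"           # folder/speaker/chapter
chapter.mkdir(parents=True, exist_ok=True)

# Audio files live here as <speaker>-<chapter>-<file_id>.flac, file_id in running order:
#   train-clean-5/19/198/19-198-0000.flac, 19-198-0001.flac, ...

# Transcript file in the same chapter folder, one row per utterance:
#   <speaker>-<chapter>-<file_id> <transcript>
(chapter / "19-198.trans.txt").write_text(
    "19-198-0000 THIS IS THE FIRST UTTERANCE\n"
    "19-198-0001 AND THIS IS THE SECOND\n"
)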

Running the model

Run speech.py with the arguments below.

speech.py [-h]
	[--folder FOLDER (default: "train-clean-5")]
	[--level LEVEL (default: "word")]
	[--use_lg_mels USE_LG_MELS (default: 1)]
	[--use_conv USE_CONV (default: 1)]
	[--n_mels N_MELS (default: 80)]
	[--n_mfcc N_MFCC (default: 13)]
	[--n_fft N_FFT (default: 25)]
	[--hop_length HOP_LENGTH (default: 10)]
	[--max_frame_len MAX_FRAME_LEN (default: 1000)]
	[--d_model D_MODEL (default: 64)]
	[--ff_dim FF_DIM (default: 128)]
	[--num NUM (default: 6)]
	[--n_heads N_HEADS (default: 4)]
	[--batch_size BATCH_SIZE (default: 30)]
	[--num_epochs NUM_EPOCHS (default: 8000)]
	[--lr LR (default: 0.003)]
	[--gradient_acc_steps GRADIENT_ACC_STEPS (default: 4)]
	[--max_norm MAX_NORM (default: 1)]
	[--model_no MODEL_NO (default: 0 (0: Transformer, 1: LAS))]
	[--train TRAIN (default: 1)]
	[--infer INFER (default: 0 (Infer input sentence labels from trained model))]

3) Text Summarization

Text summarization aims to distil a paragraph chunk into a few sentences that capture the essential information. This library contains the following models for text summarization:

  1. Convolutional Transformer
  2. Seq2Seq (LAS architecture)

Format of dataset files

One .csv file for each text/summary pair. Within each text/summary .csv file, the text is followed by its summary, with each summary point annotated by @highlights.
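
A minimal sketch of one such file (placeholder content and filename, not prescribed by the repo; the @highlights markers separate the summary points):

# Hypothetical sketch: one text/summary pair, article text followed by
# @highlights-annotated summary points.
import os

os.makedirs("data", exist_ok=True)
story = (
    "The quick brown fox jumped over the lazy dog near the river. "
    "Witnesses said the dog did not react at all.\n\n"
    "@highlights\n\n"
    "Fox jumps over dog\n\n"
    "@highlights\n\n"
    "Dog does not react\n"
)
with open("data/story_0001.csv", "w", encoding="utf-8") as f:  # filename is arbitrary
    f.write(story)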

Running the model

Run summarize.py with the arguments below.

summarize.py [-h]
	[--data_path DATA_PATH]
	[--level LEVEL (default: "bpe")]
	[--bpe_word_ratio BPE_WORD_RATIO (default: 0.7)]
	[--bpe_vocab_size BPE_VOCAB_SIZE (default: 7000)]
	[--max_features_length MAX_FEATURES_LENGTH (default: 200)]
	[--d_model D_MODEL (default: 128)]
	[--ff_dim FF_DIM (default: 128)]
	[--num NUM (default: 6)]
	[--n_heads N_HEADS (default: 4)]
	[--LAS_embed_dim LAS_EMBED_DIM (default: 128)]
	[--LAS_hidden_size LAS_HIDDEN_SIZE (default: 128)]
	[--batch_size BATCH_SIZE (default: 30)]
	[--num_epochs NUM_EPOCHS (default: 8000)]
	[--lr LR (default: 0.003)]
	[--gradient_acc_steps GRADIENT_ACC_STEPS (default: 4)]
	[--max_norm MAX_NORM (default: 1)]
	[--model_no MODEL_NO (default: 0 (0: Transformer, 1: LAS))]
	[--train TRAIN (default: 1)]
	[--infer INFER (default: 0 (Infer input sentence labels from trained model))]

4) Machine Translation

The goal of machine translation is to translate text from one language to another. This library contains the following models to accomplish this:

  1. Transformer

Format of dataset files

A source .txt file with each line containing the text/sentence to be translated, and a target .txt file with each line containing the corresponding translated text/sentence.
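
For example (a sketch with placeholder sentences; the file names are simply whatever --src_path and --trg_path point to):

# Hypothetical sketch: line-aligned source/target files for translation.
import os

os.makedirs("data", exist_ok=True)
src = ["I love apples.", "The weather is nice today."]
trg = ["J'aime les pommes.", "Il fait beau aujourd'hui."]

with open("data/src.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(src) + "\n")
with open("data/trg.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(trg) + "\n")
# Line i of trg.txt is the translation of line i of src.txt.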

Running the model

Run translate.py with the arguments below.

translate.py [-h]
	[--src_path SRC_PATH]
	[--trg_path TRG_PATH]
	[--batch_size BATCH_SIZE (default: 50)]
	[--d_model D_MODEL (default: 512)]
	[--ff_dim FF_DIM (default: 2048)]
	[--num NUM (default: 6)]
	[--n_heads N_HEADS (default: 8)]
	[--max_encoder_len MAX_ENCODER_LEN (default: 80)]
	[--max_decoder_len MAX_DECODER_LEN (default: 80)]
	[--num_epochs NUM_EPOCHS (default: 500)]
	[--lr LR (default: 0.0001)]
	[--gradient_acc_steps GRADIENT_ACC_STEPS (default: 1)]
	[--max_norm MAX_NORM (default: 1)]
	[--model_no MODEL_NO (default: 0 (0: Transformer))]
	[--train TRAIN (default: 1)]
	[--evaluate EVALUATE (default: 0)]
	[--infer INFER (default: 0)]

5) Natural Language Generation

Natural Language Generation (NLG) aims to generate text based on past context. For instance, a chatbot can generate text replies based on the context of the chat history. We currently have the following models for NLG:

  1. Generative Pre-trained Transformer (GPT/GPT-2)

Format of dataset files

  1. Generate free text from GPT/GPT-2 pre-trained models

Running the model

Run generate.py

generate.py
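
To illustrate what such generation involves (a sketch using the huggingface pytorch-transformers package mentioned in the pre-requisites directly, not the repo's generate.py):

# Hypothetical sketch: greedy free-text generation with a pre-trained GPT-2 model.
import torch
from pytorch_transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = torch.tensor([tokenizer.encode("The weather today is")])
with torch.no_grad():
    for _ in range(30):                       # generate 30 tokens greedily
        logits = model(input_ids)[0]          # (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax().view(1, 1)
        input_ids = torch.cat([input_ids, next_id], dim=1)

print(tokenizer.decode(input_ids[0].tolist()))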

6) Punctuation Restoration

Given unpunctuated (and perhaps un-capitalized) text, punctuation restoration aims to restore the punctuation of the text for easier readability. Applications include punctuating raw transcripts from audio speech data etc.

Format of dataset files

Currently only supports the TED talk transcripts format, whereby the punctuated text is annotated by transcript tags; the tagged text is then extracted and used for training.
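
To illustrate the training signal (a sketch only, not the repo's actual preprocessing), an unpunctuated input can be derived from a punctuated transcript sentence:

# Hypothetical sketch: derive an (unpunctuated input, punctuated target) pair.
import re

target = "So, here's the thing. Punctuation really does matter."
source = re.sub(r"[^\w\s']", "", target).lower()  # strip punctuation, lowercase
print(source)  # "so here's the thing punctuation really does matter"
print(target)  # the model learns to map source back to target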

Running the model

Run punctuate.py

punctuate.py [-h]
	[--data_path DATA_PATH]
	[--level LEVEL (default: "bpe")]
	[--bpe_word_ratio BPE_WORD_RATIO (default: 0.7)]
	[--bpe_vocab_size BPE_VOCAB_SIZE (default: 7000)]
	[--batch_size BATCH_SIZE (default: 32)]
	[--d_model D_MODEL (default: 512)]
	[--ff_dim FF_DIM (default: 2048)]
	[--num NUM (default: 6)]
	[--n_heads N_HEADS (default: 8)]
	[--max_encoder_len MAX_ENCODER_LEN (default: 80)]
	[--max_decoder_len MAX_DECODER_LEN (default: 80)]
	[--LAS_embed_dim LAS_EMBED_DIM (default: 512)]
	[--LAS_hidden_size LAS_HIDDEN_SIZE (default: 512)]
	[--num_epochs NUM_EPOCHS (default: 500)]
	[--lr LR (default: 0.0005)]
	[--gradient_acc_steps GRADIENT_ACC_STEPS (default: 2)]
	[--max_norm MAX_NORM (default: 1.0)]
	[--T_max T_MAX (default: 5000)]
	[--model_no MODEL_NO (default: 0 (0: Transformer))]
	[--train TRAIN (default: 1)]
	[--infer INFER (default: 0 (Infer input sentence labels from trained model))]

7) Named Entity Recognition

In Named Entity Recognition (NER), the task is to recognise entities in text, such as persons and organisations. Current models for this task: BERT

Format of dataset files
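
The expected format is not documented here. Since the benchmark below uses the CoNLL-2003 dataset, one reasonable assumption is that --train_path and --test_path point to files in the standard CoNLL-2003 layout, with one token per line followed by its tags and blank lines separating sentences, e.g.:

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O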

Running the model

Run ner.py

ner.py [-h] 
	[--train_path TRAIN_PATH] 
	[--test_path TEST_PATH]
	[--train TRAIN (default: 1)]
	[--evaluate EVALUATE (default: 0)]

Benchmark Results

1) Classification (IMDB dataset – 25000 train, 25000 test data points)

Fine-tuned XLNet English Model (12-layer, 768-hidden, 12-heads, 110M parameters)

Fine-tuned BERT English Model (12-layer, 768-hidden, 12-heads, 110M parameters)

7) Named Entity Recognition (CoNLL-2003 dataset)

Fine-tuned BERT English Model (uncased, 12-layer, 768-hidden, 12-heads, 110M parameters)

References

  1. Attention Is All You Need, Vaswani et al., https://arxiv.org/abs/1706.03762
  2. Graph Convolutional Networks for Text Classification, Liang Yao et al., https://arxiv.org/abs/1809.05679
  3. Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition, Linhao Dong et al., https://ieeexplore.ieee.org/document/8462506
  4. Listen, Attend and Spell, William Chan et al., https://arxiv.org/abs/1508.01211

To do list

  • Include benchmark results for all tasks
  • Include more models for punctuation restoration, translation, NER
  • Include demo for trained models for ASR, punctuation restoration, NER, summarization, translation, classification