
BERT Based Model for Punctuation and Capitalization Restoration

Features:

  • Uses the Hugging Face Transformers library for the base transformer architecture.
  • PyTorch Lightning is used for training and checkpointing.
  • Config-based model description for easy experimentation and research.
  • Can be exported as a PyTorch quantized model for faster inference on CPU (see the sketch after this list).
  • Includes helper functions for data preparation, text normalization, and offline sentence augmentation specific to punctuation and capitalization restoration.
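
The quantized export mentioned above generally corresponds to PyTorch dynamic quantization. A minimal, generic sketch is shown below; the model name, label count, and output path are illustrative, and this is not the repo's actual export code:

import torch
from transformers import AutoModelForTokenClassification

# Load a BERT-style token-classification model (names/paths are placeholders)
model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased", num_labels=8)
model.eval()

# Dynamically quantize the Linear layers to int8 for faster CPU inference
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

torch.save(quantized_model.state_dict(), "model_quantized.pt")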

Quick guide:

# Install requirements:
pip install -r requirements.txt

# Download raw text corpus from Tatoeba for the English language
bash download_tatoeba_en_sent.sh

# Preprocess raw text data. Check config file for more details
python preprocess_raw_text_data.py --config="example_configs/preprocess_config_en.yaml"

# Merge multiple data files into one, apply sent augmentation, and tokenization. Check config file for more details
python merge_and_tokenize_datasets.py --config="example_configs/model_config_en.yaml"

# Train the punctuation and capitalization model. Check config file for more details
python train_punct_and_capit_model.py --config="example_configs/model_config_en.yaml"
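
The preprocessing and tokenization steps above ultimately produce word-level punctuation and capitalization labels for token classification. The snippet below is a rough illustration of that idea only; the function and label names are hypothetical and do not reflect this repo's actual helpers:

# Illustrative only: deriving punctuation/capitalization labels from a punctuated sentence
PUNCT_LABELS = {".": "PERIOD", ",": "COMMA", "?": "QUESTION"}

def make_labels(sentence: str):
    words, punct, capit = [], [], []
    for token in sentence.split():
        trailing = token[-1] if token[-1] in PUNCT_LABELS else None
        word = token.rstrip(".,?")
        words.append(word.lower())                        # model input: lowercased, unpunctuated text
        punct.append(PUNCT_LABELS.get(trailing, "O"))     # punctuation to restore after this word
        capit.append("U" if word[:1].isupper() else "O")  # whether the word should be capitalized
    return words, punct, capit

print(make_labels("How are you?"))
# (['how', 'are', 'you'], ['O', 'O', 'QUESTION'], ['U', 'O', 'O'])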

For inference:

from transformer_punct_and_capit.models import TransformerPunctAndCapitModel

model_path = "experiments/model.pcm" # pcm_checkpoint path
model = TransformerPunctAndCapitModel.restore_model(model_path, device='cuda')

model.predict("how are you") # Single example
# Output: ["How are you?"]

model.predict_batch(["how are you"], batch_size=64, show_pbar=True) # Batch example
# Output: ["How are you?"]
