Quantyca/deepitalian

Italian Language Model for Fast.ai ULMFiT

This is the repository for the pre-trained Italian language model for fast.ai ULMFiT (see the ULMFiT models at https://nlp.fast.ai/), based on an Italian Wikipedia dump.

Resources available:

  • Two parametric notebooks (tested with fastai v1 rev. 51) to tokenize the dataset and to train the model (in this repo).
  • The base CSVs with data derived from Wikipedia, created using the official fast.ai process with 400M tokens (step 0 of https://github.com/fastai/fastai/tree/master/courses/dl2/imdb_scripts): train csv and val csv
  • The merged CSV for language model training: merged csv
  • A serialized loader with a corpus of 100M tokens and a vocabulary of 60,000 words (we downsample the merged file via `.use_partial_data(p_in_partial_data_pct, seed=42)` with a pct of .25 in the data block API): corpus
  • The corresponding itos file, mapping integer ids to tokens (to be used in steps 2 and 3 of the ULMFiT approach): itos
  • The trained language model (26.8 perplexity on the validation set): model. With this model we achieved 96.5% accuracy on sentiment classification of restaurant reviews.
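To make the numbers above concrete, here is a minimal, self-contained sketch (plain Python, no fastai dependency) of two things the list refers to: how a reported perplexity relates to the per-token cross-entropy loss, and how an itos vocabulary list is used to map tokens to ids. The file path and the toy vocabulary are illustrative stand-ins, not files from this repo.

```python
import math

# Perplexity is exp(cross-entropy), so the reported validation
# perplexity of 26.8 corresponds to a per-token loss of ln(26.8).
val_loss = math.log(26.8)
print(f"validation loss ~= {val_loss:.2f} nats/token")  # ~3.29

# An itos file is a pickled list mapping ids -> tokens; the inverse
# stoi dict is rebuilt from it when fine-tuning on a downstream task.
# itos = pickle.load(open("itos.pkl", "rb"))  # hypothetical local path
itos = ["xxunk", "xxpad", "il", "la", "ristorante"]  # toy stand-in vocab
stoi = {tok: i for i, tok in enumerate(itos)}

# Out-of-vocabulary tokens fall back to id 0 (the xxunk token).
ids = [stoi.get(tok, 0) for tok in ["il", "ristorante", "ottimo"]]
print(ids)  # [2, 4, 0]
```

The same id mapping is why steps 2 and 3 of ULMFiT need the itos file shipped with the weights: the fine-tuning vocabulary must line up with the embedding rows of the pre-trained model.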

This work is heavily inspired by https://github.com/tchambon/deepfrench and made with ❤️ by the Quantyca Analytics Team.
