Project for the Language Processing exam.
The repository is divided into 2 parts:
- Grammar Design (grammar_design folder)
- NLP project (NLP folder)
Grammar Design
Grammar Design Duccio Meconcelli
The original text of the assignment: Using lark, implement a parser for the definition of functions, with the following rules:
- the functions are defined as: function name(par1,par2,…) { return par1 op par2 op par3…; }
where name is the function name with the usual restrictions (an alphanumeric string beginning with a letter), par1.. are the function parameters, whose names follow the same rules as variable names, and op is + or * (sum or product). The function body contains only the return instruction, which involves the parameters.
- assume that only one function can be defined
- after the function definition, there are the calls, whose syntax is: "name(cost1,cost2,…);" where name is the name of a defined function and cost1,… are numeric constants, as many as the function arguments
- print the result of each function call
Grammar Design Sofia Albini
Using lark, implement a parser for managing the “switch” statement in a simplified version.
- the variable used in the switch is one integer variable in a predefined set of two variables x, y. The values of x, y are assigned before the switch statement (assume 0 if there is no assignment)
x = 1; y = 2;
- the switch instruction has the following syntax
switch(var) { case 0: z=cost0; break;
….. case N: z=costN; break;
default: z=costD; break; }
- the instruction contains only the assignment of a constant value to the variable z
- at the end print the value of the variable z
This project aims to develop a system for transcribing audio into text using the transformers library. The primary models used are wav2vec2 XLSR and Whisper, both widely used for speech recognition tasks.
The wav2vec2 XLSR model is well suited to multilingual speech recognition, as it has been pretrained on speech from many languages. Whisper, on the other hand, is an encoder-decoder (seq2seq) model trained on a large multilingual corpus, which makes it a good choice for improving transcription accuracy in challenging scenarios.
The code is runnable on Colab at these links:
- CTC:
- Model Wav2Vec2 base:
- Model Wav2Vec2 XLS-R (53 languages):
- XLS-R model with 300M parameters trained on 128 languages (including Italian):
To train and evaluate these models, we use the Common Voice 11, Common Voice 16, and FLEURS datasets. These datasets contain multilingual audio recordings with their corresponding transcriptions.
Datasets used:
- PolyAI/minds14
- Mozilla/common_voice (two versions were used: 11 and 16)
- Google/fleurs
For training you need to define several options, either through environment variables or through command line options. There are 2 main files:
- NLP/script/main_ctc.py -> script to train/test CTC based networks (Wav2Vec2)
- NLP/script/main_seq2seq.py -> script to train/test SEQ2SEQ based networks (Whisper)
The command always needs the --output_dir option, which indicates the path where the information and the model are saved.
To train or evaluate a seq2seq model, these 2 command line options are mandatory if you want to get the metrics:
--predict_with_generate
--generation_max_length="250"
All environment variables:
- DATASET_NAME = "data": name of the dataset (other examples: "google/fleurs", "mozilla-foundation/common_voice_16_0", ...)
- DATASET_CONFIG_NAME = "it_it"
- TRAIN_SPLIT_NAME = "train"
- EVAL_SPLIT_NAME = "validation"
- TEXT_COLUMN_NAME = "transcription": column name in the dataset corresponding to the transcription
- LANGUAGE = "Italian": language of the output transcription
- HUGGING_FACE_TOKEN = *your_huggingface_token* (for datasets that require authentication)
- ATTENTION_DROPOUT = 0.01
- ACTIVATION_DROPOUT = 0.05
- HIDDEN_DROPOUT = 0.0
- FINAL_DROPOUT = 0.1
- FREEZE_FEATURE_ENCODER = True
- PREPROCESSING_NUM_WORKERS = 2
- MAX_DURATION_IN_SECONDS = 40: maximum duration in seconds for an audio clip to be accepted for training
- MIN_DURATION_IN_SECONDS = 0: minimum duration in seconds for an audio clip to be accepted for training
- MAX_STEPS = 2000: number of training steps to run; otherwise use NUM_TRAIN_EPOCHS
- PER_DEVICE_TRAIN_BATCH_SIZE = 16
- PER_DEVICE_EVAL_BATCH_SIZE = 8
- GRADIENT_ACCUMULATION_STEPS = 2
- LEARNING_RATE = 0.0001
- WARMUP_STEPS = 500
- LENGTH_COLUMN_NAME = "input_length"
- SAVE_STEPS = 200
- EVAL_STEPS = 100
- SAVE_TOTAL_LIMIT = 2
- GRADIENT_CHECKPOINTING = True
- GROUP_BY_LENGTH = True
- FP16 = True
- RESUME_FROM_CHECKPOINT = True
- (MODEL = "your_model_name"): I advise against setting it as an environment variable; pass it as a command line option instead (e.g. --model_name=facebook/wav2vec2-xls-r-300m)
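As an illustration of how such variables interact, the sketch below reads a few of them with fallbacks and derives two quantities that depend on them. It is not the code in NLP/script: the helper names are hypothetical, the values listed above are assumed to be the defaults, and the inclusive duration bounds are an assumption.

```python
import os


def env_value(name, default, env=None):
    """Read `name` from `env` (os.environ by default), cast to the type of `default`."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default
    if isinstance(default, bool):  # environment values are strings, so parse booleans by hand
        return raw.strip().lower() in ("1", "true", "yes")
    return type(default)(raw)


def load_config(env=None):
    """Collect a subset of the variables listed above (assumed defaults)."""
    defaults = {
        "MAX_DURATION_IN_SECONDS": 40.0,
        "MIN_DURATION_IN_SECONDS": 0.0,
        "PER_DEVICE_TRAIN_BATCH_SIZE": 16,
        "GRADIENT_ACCUMULATION_STEPS": 2,
        "LEARNING_RATE": 0.0001,
        "FP16": True,
    }
    return {name: env_value(name, default, env) for name, default in defaults.items()}


def effective_batch_size(config, num_devices=1):
    """Examples contributing to one optimizer step."""
    return (
        config["PER_DEVICE_TRAIN_BATCH_SIZE"]
        * config["GRADIENT_ACCUMULATION_STEPS"]
        * num_devices
    )


def duration_is_accepted(seconds, config):
    """Assumed filter: keep clips whose duration lies within the configured bounds."""
    return config["MIN_DURATION_IN_SECONDS"] <= seconds <= config["MAX_DURATION_IN_SECONDS"]


# An explicit dict stands in for os.environ so the example is reproducible.
config = load_config({"GRADIENT_ACCUMULATION_STEPS": "4", "FP16": "false"})
print(effective_batch_size(config))        # 16 * 4 * 1 = 64
print(duration_is_accepted(45.0, config))  # False: longer than 40 seconds
```

With the values listed above (batch size 16, accumulation 2), one optimizer step on a single device sees 32 examples.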
Example usage for training the whisper-tiny model on the google/fleurs dataset and evaluating it every EVAL_STEPS:
python NLP/script/main_seq2seq.py --model_name="openai/whisper-tiny" --output_dir="training_whisper-tiny" --predict_with_generate --generation_max_length="250" --do_eval --do_train
The proper environment variables must also be set to download and configure the dataset and the training procedure.
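One way to keep the command and its environment variables together is a small launcher. The sketch below only builds the command and the environment; `build_run` and `launch` are hypothetical helpers, and the variable values are examples taken from the list above rather than a tested configuration.

```python
import os
import subprocess
import sys


def build_run(model_name, output_dir, overrides):
    """Return the training command and an environment with the given overrides."""
    command = [
        sys.executable,
        "NLP/script/main_seq2seq.py",
        f"--model_name={model_name}",
        f"--output_dir={output_dir}",
        "--predict_with_generate",
        "--generation_max_length=250",
        "--do_train",
        "--do_eval",
    ]
    env = dict(os.environ)
    # Environment variables are always strings.
    env.update({name: str(value) for name, value in overrides.items()})
    return command, env


def launch(command, env):
    """Run the training script; raises CalledProcessError if it fails."""
    subprocess.run(command, env=env, check=True)


command, env = build_run(
    "openai/whisper-tiny",
    "training_whisper-tiny",
    {
        "DATASET_NAME": "google/fleurs",
        "DATASET_CONFIG_NAME": "it_it",
        "MAX_STEPS": 2000,
        "EVAL_STEPS": 100,
    },
)
print(" ".join(command[1:]))
# launch(command, env)  # uncomment to actually start training
```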
Model | Train examples | Training Loss | Epoch | Step | Validation Loss | WER |
---|---|---|---|---|---|---|
seq2seq tiny minds | 300 | 0.07186 | 105 | 1000 | 0.0001 | 0.5296 |
ctc minds | 500 | 0.0175 | 99.46 | 2900 | 0.5244 | 0.3319 |
seq2seq small minds | 500 | No log | 24.71 | 420 | 0.6700 | 0.2459 |
seq2seq small fleurs | 3000 | 0.0003 | 21.05 | 2000 | 0.3992 | 0.1394 |
seq2seq base common11 | 7000 | 0.0046 | 9.13 | 2000 | 0.6976 | 0.3188 |
ctc common11 | 7000 | 0.0967 | 9.13 | 2000 | 0.2142 | 0.1859 |
seq2seq tiny common16 | 7000 | 0.0059 | 9.13 | 2000 | 0.8071 | 0.3797 |
ctc common16 | 7000 | 0.1011 | 9.13 | 2000 | 0.2198 | 0.1835 |
seq2seq tiny custom | 40 | 0.108733 | 533 | 800 | 3.42*10^-5 | 0.002 |
Dataset | Examples | Model | WER |
---|---|---|---|
custom | 40 | openai/whisper-tiny | 1.0026 |
custom | 40 | openai/whisper-base | 0.7810 |
custom | 40 | openai/whisper-small | 0.6030 |
custom | 40 | openai/whisper-large | 0.5052 |
custom | 40 | seq2seq_tiny_common16_7000 | 1.1213 |
custom | 40 | seq2seq_tiny_MINDS_300 | 0.9311 |
custom | 40 | seq2seq_base_common11_7000 | 0.8621 |
custom | 40 | seq2seq_small_fleurs_3000 | 0.6876 |
custom | 40 | seq2seq_small_MINDS | 0.7016 |
custom | 40 | ctc_300M_common11_7000 | 0.6323 |
custom | 40 | ctc_common16_7000 | 0.5983 |
fleurs | 300 | whispertiny-custom | 0.63 |
fleurs | 300 | openai/whisper-tiny | 0.44 |
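For reading the WER columns: word error rate is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, so it can exceed 1 when the hypothesis contains many extra words, as in the rows above 1.0. The sketch below is a plain-Python illustration of that definition. It is not necessarily the implementation used by the training scripts, and it applies no text normalization.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j]: distance between the reference words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


print(word_error_rate("ciao come stai", "ciao come stai"))  # 0.0
print(word_error_rate("a b c d", "a x c"))                  # 0.5
print(word_error_rate("ciao mondo", "ciao x y z"))          # 1.5
```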
We have implemented a Telegram bot that runs inference and transcribes audio files into text.
The bot is started from the Telegram folder through the code in bot.py; requirementx.txt inside the telegram folder lists only the libraries needed to make the bot work.
The bot uses the PyTelegramBot library to instantiate the bot with the appropriate token, then uses ASR models (seq2seq) for transcription.
For each message, the bot checks whether the file is an audio file; if so, it checks the length of the audio:
- If the audio lasts more than 30 seconds, the Faster-Whisper model is used to quickly transcribe even very long audio. The transcription is then summarized with a T5 model (by default a fine-tuned version: it5-base-summarization). The summary is sent to the user as a message, while the full transcript is written to a Telegraph page and sent to the user as a link.
- If the audio lasts less than 30 seconds, speculative inference is used with 2 models (they can be chosen via environment variables, but one must be larger than the other). This saves execution time: the smaller model produces a first transcription, and the larger model verifies it and handles the more difficult parts.
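The routing described above can be sketched as follows. This is not the code in bot.py: the function names are hypothetical, the model calls are passed in as plain callables, and treating an audio of exactly 30 seconds as short is an assumption, since the text leaves that case open.

```python
LONG_AUDIO_THRESHOLD_SECONDS = 30


def choose_pipeline(duration_seconds):
    """Pick the transcription strategy from the audio duration."""
    if duration_seconds > LONG_AUDIO_THRESHOLD_SECONDS:
        return "long"   # Faster-Whisper, then summary and Telegraph page
    return "short"      # speculative inference with a small and a large model


def handle_audio(duration_seconds, audio, *, transcribe_long, transcribe_short,
                 summarize, publish):
    """Return the message to send and, for long audio, a link to the full transcript."""
    if choose_pipeline(duration_seconds) == "long":
        transcript = transcribe_long(audio)
        return {"message": summarize(transcript), "link": publish(transcript)}
    return {"message": transcribe_short(audio), "link": None}


# Stub callables stand in for the real models and the Telegraph client.
reply = handle_audio(
    95,
    b"fake-audio-bytes",
    transcribe_long=lambda audio: "a very long transcript",
    transcribe_short=lambda audio: "a short transcript",
    summarize=lambda text: "summary of: " + text,
    publish=lambda text: "link-to-full-transcript",
)
print(reply["message"])
```

Passing the models in as callables keeps the routing testable without loading any weights.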
The following environment variables must be set for the bot to work correctly:
- TELEGRAM_TOKEN = *your_token*
- TELEGRAM_BIG_MODEL = "*name_or_path_main_model*" (e.g. openai/whisper-large)
- TELEGRAM_SMALL_MODEL = "*name_or_path_assistant_model*" (e.g. openai/whisper-tiny)
- TELEGRAM_FASTER_MODEL = "medium" (possible options: tiny, base, small, medium, large, large-v2, large-v3)
- TELEGRAM_SUMMARY_MODEL = "*name_or_path_summary_model*" (e.g. efederici/it5-base-summarization)
- Fine-tune on custom datasets:
- State of the art and optimizations: