Language Processing Project

Project for the Language Processing exam.

The repository is divided into 2 parts:

Grammar Design (grammar_design folder)
NLP project (NLP folder)

Grammar Design

Grammar Design Duccio Meconcelli

The original text of the Assignment:

Using lark implement a parser for the definition of functions, with the following rules

the functions are defined as: function name(par1,par2,…) { return par1 op par2 op par3…; }

where name is the function name with the usual restrictions (an alphanumeric string beginning with a letter), par1.. are the function parameters whose names follow the same rules as variables names, op is + or * (sum or product). The function body contains only the return instruction that involves the parameters.

assume that only one function can be defined

after the function definition, there are the calls whose syntax is: "name(cost1,cost2,…);" where name is the name of a defined function, cost1,… are numeric constants in the same number as the function arguments.

print the result of each function call

Grammar Design Sofia Albini

Using lark implement a parser for managing the “switch” statement in a simplified version.

the variable used in the switch is one integer variable in a predefined set of two variables x, y. The values of x and y are assigned before the switch statement (assume 0 if there is no assignment)

x = 1; y = 2;
the switch instruction has the following syntax

switch(var) { case 0: z=cost0; break;

….. case N: z=costN; break;

default: z=costD; break; }
the instruction contains only the assignment of a constant value to the variable z
at the end print the value of the variable z

NLP Project

This project aims to develop a system for transcribing audio into text using the Hugging Face transformers library. The main models used are wav2vec2 XLSR and Whisper, two widely used models for automatic speech recognition (ASR).

The wav2vec2 XLSR model is well suited to multilingual speech recognition, as it was pretrained on speech in a diverse range of languages. Whisper, on the other hand, is an encoder-decoder (seq2seq) model trained on a large, multilingual collection of labelled audio, which helps transcription accuracy in challenging scenarios.

The code can be run on Colab at this link:

ASR Models

CTC:
- Model Wav2Vec2 base:
  - Wav2vec2-base, pretrained on only a few languages (not Italian)
  - Wav2vec2-base-it-voxpopuli-v2 (base model fine-tuned on an Italian dataset)
- Model Wav2Vec2 XLSR-53 (53 languages):
  - Original Wav2vec2-large-xlsr-53 checkpoint (pretrained only, not fine-tuned)
  - Large model fine-tuned on Italian
  - XLS-R model with 300M parameters, trained on 128 languages (including Italian):
    - pretrained 300M base model
    - model fine-tuned on Italian
SEQ2SEQ:
- Model Whisper:

Dataset

To train and evaluate these models, we use the MINDS-14, Common Voice 11, Common Voice 16 and FLEURS datasets. These datasets are large collections of multilingual audio recordings with their corresponding transcriptions.

Datasets used:

PolyAI/minds14
Mozilla/common_voice, in two versions:
- Version 11
- Version 16
Google/fleurs

Training Procedure:

For training you need to define several options and you can do this either using environment variables or command line options. There are 2 main files:

NLP/script/main_ctc.py -> script to train/test CTC based networks (Wav2Vec2)
NLP/script/main_seq2seq.py -> script to train/test SEQ2SEQ based networks (Whisper)

The command always needs the --output_dir option, which sets the path where the model and the training information are saved.

To train or evaluate a seq2seq model and get the metrics, these two command line options are mandatory:

--predict_with_generate
--generation_max_length="250"

ALL Environment variables

DATASET_NAME = "data": name of the dataset (others e.g. : "google/fleurs", "mozilla-foundation/common_voice_16_0", ...)
DATASET_CONFIG_NAME = "it_it"
TRAIN_SPLIT_NAME = "train"
EVAL_SPLIT_NAME = "validation"
TEXT_COLUMN_NAME= "transcription" : column name in the dataset corresponding to the transcription
LANGUAGE = "Italian" : language of the output transcription
HUGGING_FACE_TOKEN = *your_huggingface_token* (for dataset with required authentication)
ATTENTION_DROPOUT = 0.01
ACTIVATION_DROPOUT = 0.05
HIDDEN_DROPOUT= 0.0
FINAL_DROPOUT = 0.1
FREEZE_FEATURE_ENCODER = True
PREPROCESSING_NUM_WORKERS = 2
MAX_DURATION_IN_SECONDS = 40: maximum duration in seconds for an audio clip to be used in training
MIN_DURATION_IN_SECONDS = 0: minimum duration in seconds for an audio clip to be used in training
MAX_STEPS = 2000: number of training steps; otherwise use NUM_TRAIN_EPOCHS
PER_DEVICE_TRAIN_BATCH_SIZE = 16
PER_DEVICE_EVAL_BATCH_SIZE = 8
GRADIENT_ACCUMULATION_STEPS = 2
LEARNING_RATE = 0.0001
WARMUP_STEPS = 500
LENGTH_COLUMN_NAME="input_length"
SAVE_STEPS=200
EVAL_STEPS=100
SAVE_TOTAL_LIMIT= 2
GRADIENT_CHECKPOINTING = True
GROUP_BY_LENGTH = True
FP16 = True
RESUME_FROM_CHECKPOINT=True
(MODEL = "your_model_name"): it is better to pass the model as a command line option than as an environment variable (e.g. --model_name=facebook/wav2vec2-xls-r-300m)
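
The split between environment variables and command line options can be sketched as follows. This is not the code in main_ctc.py or main_seq2seq.py: the `load_settings` helper is hypothetical, only a few of the variables are shown, and treating the values listed above as defaults is an assumption.

```python
import argparse
import os


def load_settings(argv=None, environ=os.environ):
    """Merge command line options with environment variables (illustrative)."""
    parser = argparse.ArgumentParser()
    # The model falls back to the MODEL environment variable when not passed.
    parser.add_argument("--model_name", default=environ.get("MODEL"))
    parser.add_argument("--output_dir", required=True)
    args = parser.parse_args(argv)

    return {
        "model_name": args.model_name,
        "output_dir": args.output_dir,
        # Environment values are strings, so each one is cast explicitly.
        "dataset_name": environ.get("DATASET_NAME", "data"),
        "max_steps": int(environ.get("MAX_STEPS", 2000)),
        "learning_rate": float(environ.get("LEARNING_RATE", 0.0001)),
        "fp16": str(environ.get("FP16", "True")).lower() == "true",
    }


if __name__ == "__main__":
    settings = load_settings(
        ["--model_name", "openai/whisper-tiny", "--output_dir", "training_whisper-tiny"]
    )
    print(settings)
```

Passing `argv` and `environ` explicitly makes the helper easy to exercise without touching the real process environment.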

Example usage for training the whisper-tiny model on the google/fleurs dataset and evaluating it every EVAL_STEPS steps:

  python NLP/script/main_seq2seq.py --model_name="openai/whisper-tiny" --output_dir="training_whisper-tiny" --predict_with_generate --generation_max_length="250" --do_eval --do_train

The appropriate environment variables must also be set so that the dataset is downloaded and configured and the training procedure is defined.

Training results:

| Model | Train examples | Training loss | Epoch | Step | Validation loss | WER |
|---|---|---|---|---|---|---|
| seq2seq tiny minds | 300 | 0.07186 | 105 | 1000 | 0.0001 | 0.5296 |
| ctc minds | 500 | 0.0175 | 99.46 | 2900 | 0.5244 | 0.3319 |
| seq2seq small minds | 500 | No log | 24.71 | 420 | 0.6700 | 0.2459 |
| seq2seq small fleurs | 3000 | 0.0003 | 21.05 | 2000 | 0.3992 | 0.1394 |
| seq2seq base common11 | 7000 | 0.0046 | 9.13 | 2000 | 0.6976 | 0.3188 |
| ctc common11 | 7000 | 0.0967 | 9.13 | 2000 | 0.2142 | 0.1859 |
| seq2seq tiny common16 | 7000 | 0.0059 | 9.13 | 2000 | 0.8071 | 0.3797 |
| ctc common16 | 7000 | 0.1011 | 9.13 | 2000 | 0.2198 | 0.1835 |
| seq2seq tiny custom | 40 | 0.108733 | 533 | 800 | 3.42*10^-5 | 0.002 |
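
The Epoch column can be related to the number of steps with a little arithmetic. The sketch below assumes a single device and the PER_DEVICE_TRAIN_BATCH_SIZE = 16 and GRADIENT_ACCUMULATION_STEPS = 2 values listed among the environment variables; whether every run used exactly these settings is an assumption, and `epochs_for_steps` is a hypothetical helper.

```python
import math


def epochs_for_steps(num_examples, max_steps,
                     per_device_batch_size=16,
                     gradient_accumulation_steps=2):
    """Estimate how many epochs a fixed number of optimizer steps covers."""
    # Batches needed to see every example once (the last one may be partial).
    batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
    # One optimizer step is taken every `gradient_accumulation_steps` batches.
    steps_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
    return max_steps / steps_per_epoch


if __name__ == "__main__":
    # 7000 examples and 2000 steps, as in the Common Voice rows.
    print(round(epochs_for_steps(7000, 2000), 2))
```

Under these assumptions, 7000 examples give 438 batches and 219 optimizer steps per epoch, so 2000 steps cover about 9.13 epochs, in line with the Common Voice rows. Other rows may have been produced with different settings or example counts.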

Test results:

| Dataset | Examples | Model | WER |
|---|---|---|---|
| custom | 40 | openai/whisper-tiny | 1.0026 |
| custom | 40 | openai/whisper-base | 0.7810 |
| custom | 40 | openai/whisper-small | 0.6030 |
| custom | 40 | openai/whisper-large | 0.5052 |
| custom | 40 | seq2seq_tiny_common16_7000 | 1.1213 |
| custom | 40 | seq2seq_tiny_MINDS_300 | 0.9311 |
| custom | 40 | seq2seq_base_common11_7000 | 0.8621 |
| custom | 40 | seq2seq_small_fleurs_3000 | 0.6876 |
| custom | 40 | seq2seq_small_MINDS | 0.7016 |
| custom | 40 | ctc_300M_common11_7000 | 0.6323 |
| custom | 40 | ctc_common16_7000 | 0.5983 |
| fleurs | 300 | whispertiny-custom | 0.63 |
| fleurs | 300 | openai/whisper-tiny | 0.44 |
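
Some WER values in the table are above 1, which can look surprising. The sketch below computes word error rate as the word-level edit distance divided by the number of reference words, and shows that extra inserted words can push the value past 1. It is a plain-Python illustration, not the metric implementation used by the training scripts, and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


if __name__ == "__main__":
    print(word_error_rate("ciao mondo", "ciao mondo"))
    # Three inserted words against a two-word reference.
    print(word_error_rate("ciao mondo", "ciao a tutti nel mondo"))
```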

Telegram:

We have implemented a Telegram bot that runs inference and transcribes audio files into text. The bot is started from the telegram folder by running bot.py; requirementx.txt, in the same folder, lists only the libraries the bot needs. The bot uses the PyTelegramBot library, instantiated with the appropriate token, and then uses ASR models (seq2seq) for transcription.

For each message, the bot checks whether it contains an audio file and, if so, checks the length of the audio:

If the audio lasts more than 30 seconds, the Faster-Whisper model is used to quickly transcribe even very long audio. The transcription is then summarized with a T5 model (by default a fine-tuned version, it5-base-summarization). The summary is sent to the user as a message, while the full transcript is written to a Telegraph page and sent to the user as a link.

If the audio lasts less than 30 seconds, speculative inference with 2 models is used (they can be chosen via environment variables, but one must be larger than the other). The smaller model produces a first transcription and the larger model handles the more difficult parts, which saves execution time.
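
The routing described above can be sketched as a small decision function. The names `TranscriptionPlan` and `plan_for_audio` are hypothetical and do not come from bot.py; the text does not say what happens at exactly 30 seconds, so treating that case as short audio is an assumption.

```python
from dataclasses import dataclass

LONG_AUDIO_THRESHOLD_SECONDS = 30


@dataclass(frozen=True)
class TranscriptionPlan:
    transcriber: str          # which transcription strategy to run
    summarize: bool           # whether a summary is generated and sent
    publish_full_text: bool   # whether the full transcript goes to Telegraph


def plan_for_audio(duration_seconds):
    """Choose the processing path from the audio duration."""
    if duration_seconds > LONG_AUDIO_THRESHOLD_SECONDS:
        # Long audio: fast transcription, summary as message, full text as link.
        return TranscriptionPlan("faster-whisper", True, True)
    # Short audio (assumed to include exactly 30 seconds): two-model inference.
    return TranscriptionPlan("speculative", False, False)


if __name__ == "__main__":
    print(plan_for_audio(12.5))
    print(plan_for_audio(95.0))
```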

Environment variables for Telegram:

The following variables must be set for the bot to work correctly:

TELEGRAM_TOKEN = *your_token*
TELEGRAM_BIG_MODEL = "*name_or_path_main_model*" (e.g. openai/whisper-large)
TELEGRAM_SMALL_MODEL = "*name_or_path_assistant_model*" (e.g. openai/whisper-tiny)
TELEGRAM_FASTER_MODEL = "medium" (possible options: tiny, base, small, medium, large, large-v2, large-v3)
TELEGRAM_SUMMARY_MODEL = "*name_or_path_summary_model*" (e.g. efederici/it5-base-summarization)
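
Since the bot depends on all of these variables, a start-up check along the following lines could report missing ones before any model is loaded. The `missing_variables` helper is a hypothetical illustration and is not part of bot.py.

```python
import os

REQUIRED_VARIABLES = (
    "TELEGRAM_TOKEN",
    "TELEGRAM_BIG_MODEL",
    "TELEGRAM_SMALL_MODEL",
    "TELEGRAM_FASTER_MODEL",
    "TELEGRAM_SUMMARY_MODEL",
)


def missing_variables(environ=os.environ):
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("All Telegram variables are set.")
```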

Authors

Resources

Official ASR GitHub repo
Automatic Speech Recognition Hugging Face
Recording audio in Google Colab
Example of how to use audio files for inference
Fine-tuning Wav2Vec2 XLSR
Fine-tuning Whisper Model (seq2seq)
COLAB: Fine-tuning Whisper Model (seq2seq)
1 script for XLSR and Whisper
Fine-tune on Custom Datasets:
- Make your Own Audio Dataset
- HuggingSound
State of Art and Optimizations!:
