Kotoba-Whisper Training

Reproducing Kotoba-Whisper models requires completing the following seven stages in order:

  1. Setup
  2. Download Dataset
  3. Generate Labels
  4. Filter Dataset
  5. Initialize Distil-Whisper
  6. Train Model
  7. Evaluate Model

To reproduce the Kotoba-Whisper models, refer to the following scripts for each stage.

1. Setup

Clone the repo and configure your Hugging Face environment.

Install the dependencies:

```sh
git clone [email protected]:kotoba-tech/kotoba-whisper.git
cd kotoba-whisper
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```

Configure accelerate and log in to the Hugging Face Hub:

```sh
accelerate config
huggingface-cli login
```
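
After logging in, a quick sanity check of the environment can look like the following (a sketch; it only verifies that a CUDA GPU is visible, which flash-attn needs, and that the Hub login succeeded):

```python
import torch
from huggingface_hub import whoami

# flash-attn requires a CUDA-capable GPU.
assert torch.cuda.is_available(), "a CUDA GPU is required for flash-attn"

# Confirms that `huggingface-cli login` stored a valid token.
print(whoami()["name"])
```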

2. Download Dataset

Although ReazonSpeech is available on the Hugging Face Hub, the hosted repository has stability issues (raising TimeOutError for the larger subsets such as large or all), so we instead download the source files locally first and use our manual data loader to load the dataset in the next label-generation step. The Python script reazonspeech_manual_downloader.py downloads the source files of ReazonSpeech locally.

```sh
python reazonspeech_manual_downloader.py [--target TARGET] [-p POOL] [-s START_QUE] [-e END_QUE]
```

The argument --target should be one of tiny/small/medium/large/all. The full dataset all is very large, so you may want to download it in small chunks, which can be done by specifying the range of raw-file indices with -s/--start and -e/--end, as below.

```sh
python reazonspeech_manual_downloader.py --target all -p 100 -s 0 -e 50
```

ReazonSpeech all has 4096 files in total, and we use the last file to create our held-out test set, so we ran the above command up to -e 4095 with a reasonable chunk size (we used 50).
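
If you automate the chunked download, a driver loop like the following may help (a sketch; it assumes -e is an exclusive upper bound, matching the -s 0 -e 50 example above, so check the downloader's --help before relying on it):

```python
import subprocess

CHUNK = 50          # the chunk size we used
TOTAL_FILES = 4096  # ReazonSpeech "all" has 4096 raw files

for start in range(0, TOTAL_FILES, CHUNK):
    # Stop at 4095 so the last file stays held out for the test set.
    end = min(start + CHUNK, TOTAL_FILES - 1)
    subprocess.run(
        ["python", "reazonspeech_manual_downloader.py",
         "--target", "all", "-p", "100", "-s", str(start), "-e", str(end)],
        check=True,
    )
```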

3. Generate Labels by Teacher Whisper Model

The Python script run_pseudo_labelling.py is a flexible inference script that can be used to generate pseudo-labels under a range of settings, including both greedy and beam-search decoding. To generate labels from the teacher model on the locally downloaded ReazonSpeech dataset, run the following command, which generates labels for all the audio and uploads them to the Hugging Face Hub in the audio-dataset format, at the repository specified by --hub_model_id.

```sh
accelerate launch run_pseudo_labelling.py \
  --model_name_or_path "openai/whisper-large-v3" \
  --dataset_name "${PWD}/reazonspeech_manual_dataloader.py" \
  --dataset_config_name "tiny" \
  --dataset_split_name "train" \
  --text_column_name "transcription" \
  --id_column_name "name" \
  --per_device_eval_batch_size 4 \
  --dataloader_num_workers 32 \
  --preprocessing_num_workers 32 \
  --logging_steps 100 \
  --max_label_length 128 \
  --language "ja" \
  --return_timestamps \
  --attn_type "flash_attn" \
  --generation_num_beams 1 \
  --decode_token_ids False \
  --overwrite_output_dir \
  --output_dir "output" \
  --wandb_project "wandb" \
  --hub_model_id "{your-hf-org}/{your-dataset-name}"
```

Note that while we use our custom data loader here, any Hugging Face audio dataset can be used in the above script.
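
For instance, the manual loader can be inspected directly with the datasets library (a sketch; the config and split names follow the command above, and recent versions of datasets may require trust_remote_code=True to load a script):

```python
from datasets import load_dataset

# Load through the custom loader script, exactly as run_pseudo_labelling.py does.
ds = load_dataset("./reazonspeech_manual_dataloader.py", "tiny", split="train")

# Any dataset exposing an audio column plus the columns named by
# --text_column_name and --id_column_name in the command above works too.
print(ds.column_names)
```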

4. Filter Dataset

The original Distil-Whisper paper proposes filtering the dataset by the word error rate (WER) between the reference and the predicted transcription, to preserve label quality for distillation. We follow the same procedure and drop any example whose WER exceeds 10% (see the sketch after the command below). The following script takes the dataset with the Whisper labels generated in the previous step, drops the examples with WER above 10%, transforms the waveforms into Mel spectrograms, and uploads the dataset to the Hugging Face Hub in the audio-dataset format.

```sh
python run_data_filtering.py \
  -d "your-hf-org/dataset_name" \
  --dataset_config_name "tiny" \
  --wer_threshold 10 \
  --text_column_name "transcription" \
  --preprocessing_num_workers 64 \
  --max_label_length 128
```
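
The filtering criterion is essentially the following per-example check (a sketch using the evaluate library; the text normalization applied in the actual run_data_filtering.py may differ):

```python
import evaluate

wer_metric = evaluate.load("wer")

def keep_example(reference: str, prediction: str, wer_threshold: float = 10.0) -> bool:
    # evaluate's WER is a fraction; scale to a percentage to match --wer_threshold.
    wer = 100 * wer_metric.compute(references=[reference], predictions=[prediction])
    return wer <= wer_threshold

print(keep_example("hello world", "hello world"))    # True  (WER 0%)
print(keep_example("hello world", "goodbye world"))  # False (WER 50%)
```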

5. Initialize Distil-Whisper

The script create_student_model.py can be used to initialise a small student model from a large teacher model. When initialising a student model with fewer layers than the teacher, the student is initialised by copying maximally spaced layers from the teacher, as per the DistilBart recommendations. First, we need to create a model repository on the Hugging Face Hub. This repository will contain all the files required to reproduce the training run, alongside model weights, training logs, and a README.md card. You can create the model repository directly on the Hugging Face Hub at https://huggingface.co/new, or via the CLI, as we show here:

```sh
huggingface-cli repo create {your-hf-org}/{your-model-name}
```

Let's clone the repository so that we can place our training scripts and model weights inside:

```sh
git lfs install
git clone "https://huggingface.co/{your-hf-org}/{your-model-name}"
```

We can now copy the relevant training scripts to the repository:

```sh
cp create_student_model.py {your-hf-org}/{your-model-name}
cp run_distillation.py {your-hf-org}/{your-model-name}
cd {your-hf-org}/{your-model-name} || exit
```

The following command demonstrates how to initialise a student model from the Whisper checkpoint, with all 32 encoder layers and 2 decoder layers. The 2 student decoder layers are copied from teacher layers 1 and 32 respectively, these being the maximally spaced layers:

```sh
python create_student_model.py \
  --teacher_checkpoint "openai/whisper-large-v3" \
  --encoder_layers 32 \
  --decoder_layers 2 \
  --save_dir "{your-hf-org}/{your-model-name}-init"
```

The initialised model will be saved to the directory given by --save_dir inside our model repository.
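
The maximally-spaced copying rule can be pictured in a few lines (a sketch of the selection rule, not the actual create_student_model.py code):

```python
import numpy as np

def maximally_spaced_layers(n_teacher: int, n_student: int) -> list[int]:
    # Pick n_student indices spread as evenly as possible across the teacher stack.
    return np.linspace(0, n_teacher - 1, num=n_student, dtype=int).tolist()

# whisper-large-v3 has 32 decoder layers; a 2-layer student copies the first
# and last ones (0-indexed [0, 31], i.e. teacher layers 1 and 32).
print(maximally_spaced_layers(32, 2))  # [0, 31]
```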

6. Train Model

The script run_distillation.py is an end-to-end script that loads one or more datasets, a student model, and a teacher model, and performs teacher-student distillation. It uses the loss formulation from the Distil-Whisper paper, a weighted sum of the cross-entropy and KL-divergence loss terms (sketched after the command below). The following command takes the ReazonSpeech dataset that was pseudo-labelled in stage 3 and trains the 2-layer decoder model initialized in the previous step.

```sh
accelerate launch run_distillation.py \
  --model_name_or_path "{your-hf-org}/{your-model-name}-init" \
  --teacher_model_name_or_path "openai/whisper-large-v3" \
  --train_dataset_name "{your-hf-org}/{your-dataset-name}.wer_10.0.vectorized" \
  --train_dataset_config_name "tiny" \
  --language "ja" \
  --max_label_length 128 \
  --train_split_name "train" \
  --save_steps 2500 \
  --warmup_steps "50" \
  --learning_rate 0.0001 \
  --lr_scheduler_type "constant_with_warmup" \
  --logging_steps 50 \
  --save_total_limit 1 \
  --per_device_train_batch_size 16 \
  --gradient_accumulation_steps 2 \
  --preprocessing_num_workers 64 \
  --dataloader_num_workers 1 \
  --dtype "bfloat16" \
  --output_dir "./" \
  --wandb_project "wandb" \
  --gradient_checkpointing \
  --freeze_encoder \
  --push_to_hub \
  --do_train \
  --overwrite_output_dir \
  --num_train_epochs 8
```
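
For reference, the objective is roughly as follows (a minimal PyTorch sketch; the weighting coefficients and temperature are illustrative assumptions, so see the Distil-Whisper paper and run_distillation.py for the exact formulation):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      alpha_ce=0.8, alpha_kl=1.0, temperature=2.0):
    # Cross-entropy of the student against the pseudo-labels.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,  # padded label positions are masked out
    )
    # KL divergence between temperature-softened teacher and student distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    # Weighted sum of the two terms.
    return alpha_ce * ce + alpha_kl * kl
```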

7. Evaluate Model

We evaluate our models in the short-form setting, i.e. on audio samples less than 30s in duration. The script run_short_form_eval.py can be used to run the evaluation on an audio-transcription paired dataset. The following example runs evaluation on japanese-asr/ja_asr.reazonspeech_test, the held-out test split from ReazonSpeech.

```sh
python run_eval_pipeline.py -m "{your-hf-org}/{your-model-name}" -d "japanese-asr/ja_asr.reazonspeech_test"
```
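
Once trained, the model can also be spot-checked with the standard transformers pipeline (a sketch; the chunk length and the sample.wav path are assumptions):

```python
from transformers import pipeline

# Load the distilled model for Japanese ASR.
pipe = pipeline(
    "automatic-speech-recognition",
    model="{your-hf-org}/{your-model-name}",
    chunk_length_s=15,
)

# Transcribe a local audio file in Japanese.
result = pipe("sample.wav", generate_kwargs={"language": "ja", "task": "transcribe"})
print(result["text"])
```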

Ablation Study

While developing the Kotoba-Whisper models, we experimented with different splits of ReazonSpeech for distillation, and all the models and datasets from this ablation study can be found at https://huggingface.co/japanese-asr. The following tables summarize WER and CER for the distilled models trained on different sizes of ReazonSpeech, compared against the OpenAI Whisper models (the model names follow distil-whisper-large-v3-ja-reazonspeech-{size of reazonspeech}).

WER

| Model | common_voice_8_0 | jsut_basic5000 | reazonspeech_test |
|---|---:|---:|---:|
| japanese-asr/distil-whisper-large-v3-ja-reazonspeech-large | 59.27 | 64.36 | 56.62 |
| japanese-asr/distil-whisper-large-v3-ja-reazonspeech-medium | 64.38 | 72.02 | 62.99 |
| japanese-asr/distil-whisper-large-v3-ja-reazonspeech-small | 85.1 | 94.18 | 82.18 |
| japanese-asr/distil-whisper-large-v3-ja-reazonspeech-tiny | 99.96 | 100 | 99.05 |
| openai/whisper-large-v3 | 55.41 | 59.34 | 60.23 |
| openai/whisper-medium | 63.64 | 69.52 | 76.04 |
| openai/whisper-small | 74.21 | 82.02 | 82.99 |
| openai/whisper-tiny | 93.78 | 97.72 | 94.85 |

CER

| Model | common_voice_8_0 | jsut_basic5000 | reazonspeech_test |
|---|---:|---:|---:|
| japanese-asr/distil-whisper-large-v3-ja-reazonspeech-large | 9.44 | 8.48 | 12.6 |
| japanese-asr/distil-whisper-large-v3-ja-reazonspeech-medium | 10.89 | 11.25 | 16.37 |
| japanese-asr/distil-whisper-large-v3-ja-reazonspeech-small | 30.48 | 38.96 | 42.29 |
| japanese-asr/distil-whisper-large-v3-ja-reazonspeech-tiny | 94.69 | 95.32 | 95.82 |
| openai/whisper-large-v3 | 8.52 | 7.18 | 15.18 |
| openai/whisper-medium | 11.34 | 9.87 | 29.56 |
| openai/whisper-small | 15.26 | 14.22 | 34.29 |
| openai/whisper-tiny | 46.86 | 35.69 | 96.69 |
