AI4Bharat/IndicVoices


IndicVoices

We present INDICVOICES, a dataset of natural and spontaneous speech containing a total of 12000 hours of read (8%), extempore (76%) and conversational (15%) audio from 22563 speakers covering 208 Indian districts and 22 languages. Of these 12000 hours, 3200 hours have already been transcribed, with a median of 122 hours per language. Through this paper, we share our journey of capturing the cultural, linguistic and demographic diversity of India to create a one-of-its-kind inclusive and representative dataset. More specifically, we share an open-source blueprint for data collection at scale, comprising standardised protocols, centralised tools, a repository of engaging questions, prompts and conversation scenarios spanning multiple domains and topics of interest, quality control mechanisms, comprehensive transcription guidelines and transcription tools. We hope that this open-source blueprint will serve as a comprehensive starter kit for data collection efforts in other multilingual regions of the world. Using INDICVOICES, we build IndicASR, the first ASR model to support all 22 languages listed in the Eighth Schedule of the Constitution of India.

Explore and download IndicVoices: https://ai4bharat.iitm.ac.in/indicvoices

IndicVoices paper: https://arxiv.org/abs/2403.01926

Collection Platform

Kathbath is an open-source, crowdsourcing framework designed to facilitate the large-scale collection of audio data. It includes an Android frontend that allows users to easily complete microtasks on their phones. Kathbath is built on top of Karya, Microsoft's open-source crowdsourcing platform toolkit.

https://github.com/AI4Bharat/Karya

Transcription Platform

Shoonya is an open-source platform to annotate and label data at scale, built with a vision to enhance the digital presence of under-represented languages in India. Shoonya supports multiple data types and labeling tasks (e.g., parallel datasets, OCR, ASR, TTS).

https://github.com/AI4Bharat/Shoonya

IndicASR

Checkpoint:

IndicASR: A multilingual model trained on IndicVoices.

Data Preparation

  1. After downloading the data, extract the tar files so that the TGZ folder and the language-specific folders are at the same level:

    ROOT_FOLDER
     |- TGZ
     |- Assamese
     |- Nepali
     |- ...
     |- ...
     |- Kashmiri
    
  2. Run the following command to resample the audio files in place to 16 kHz mono. Set $NCPUS to control the number of parallel ffmpeg processes.

    find . -type f \( -name "*.wav" \) -print0 | xargs -0 -I {} -P "$NCPUS" bash -c 'ffmpeg -y -loglevel warning -hide_banner -stats -i "$1" -ar "$2" -ac "$3" "${1%.*}_${2}.wav" && rm "$1" && mv "${1%.*}_${2}.wav" "$1"' -- {} 16000 1
  3. Run create_indicvoices.py to build a chunked version of IndicVoices.

     python create_indicvoices.py /path/to/input/directory /path/to/output/directory 
  4. Run create_manifest.sh to create manifest files from the processed dataset.

     bash create_manifest.sh /path/to/folder/containing/language/wise/data/folders
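
As a rough illustration of what a manifest for this pipeline can look like, the sketch below writes a JSON-lines file with one entry per audio file, using the `audio_filepath`, `duration` and `text` keys that NeMo ASR manifests conventionally use. It is not the logic of create_manifest.sh, whose exact output fields may differ. The `build_manifest` function and the `transcripts` mapping are hypothetical names introduced here, and the sketch assumes PCM WAV files readable by Python's `wave` module. It also skips any file that is not 16 kHz mono, which doubles as a sanity check on step 2.

```python
import json
import wave
from pathlib import Path

TARGET_RATE = 16000  # Hz, the rate produced by the ffmpeg step


def wav_info(path):
    """Return (sample_rate, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate


def build_manifest(lang_dir, manifest_path, transcripts):
    """Write one JSON object per line for every 16 kHz mono WAV under lang_dir.

    `transcripts` maps a file stem to its text. This mapping is hypothetical;
    the real dataset stores transcripts in its own format.
    Returns the list of files skipped for having the wrong rate or channels.
    """
    skipped = []
    with open(manifest_path, "w", encoding="utf-8") as out:
        for wav_path in sorted(Path(lang_dir).rglob("*.wav")):
            rate, channels, duration = wav_info(wav_path)
            if rate != TARGET_RATE or channels != 1:
                skipped.append(wav_path)
                continue
            entry = {
                "audio_filepath": str(wav_path),
                "duration": round(duration, 3),
                "text": transcripts.get(wav_path.stem, ""),
            }
            # ensure_ascii=False keeps Indic scripts readable in the file
            out.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return skipped
```

A non-empty return value means some files were not resampled, in which case step 2 may need to be rerun before training.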

Model Training

  1. Install NeMo

  2. Training command

    python ${RUNNER_PATH}/speech_to_text_hybrid_rnnt_ctc_bpe.py \
     --config-path=${CONFIG_PATH} \
     --config-name=${CONFIG_NAME} \
     trainer.devices=-1 \
     trainer.accelerator="gpu" \
     trainer.strategy="ddp_find_unused_parameters_true" \
     exp_manager.name=${EXP_NAME} \
     exp_manager.wandb_logger_kwargs.name=${EXP_NAME} \
     exp_manager.explicit_log_dir=${LOG_DIR}
    
  3. Inference

    OMP_NUM_THREADS=64 python ${RUNNER_PATH}/transcribe_speech.py \
         model_path=$MODEL_PATH \
         dataset_manifest=$MANIFEST_PATH.json \
         output_filename=$SAVE_FILE_NAME \
         langid=$LANGID \
         batch_size=64 \
         compute_timestamps=False \
         compute_langs=False \
         cuda=$GPU_ID \
         amp=True \
         append_pred=False 
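
Once inference has written its output file, a corpus-level word error rate can be estimated with a few lines of plain Python. The sketch below is an illustration, not part of this repository: it assumes the output is a JSON-lines file in which each entry carries the reference under `text` and the hypothesis under `pred_text`. Those key names are assumptions, so pass different ones if your output differs. `edit_distance` and `corpus_wer` are hypothetical helper names.

```python
import json


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(manifest_path, ref_key="text", hyp_key="pred_text"):
    """Total word errors divided by total reference words over a JSON-lines file."""
    errors = words = 0
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            entry = json.loads(line)
            ref = entry[ref_key].split()
            hyp = entry[hyp_key].split()
            errors += edit_distance(ref, hyp)
            words += len(ref)
    return errors / words if words else 0.0
```

The sketch splits on whitespace and applies no text normalisation, so scores will differ from any evaluation that normalises punctuation or script variants first.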
    
