IndicTrans2 HF Compatible Models


In this section, we provide details on converting our IndicTrans2 models, which were originally trained with fairseq, to HuggingFace Transformers for inference. Our scripts for HuggingFace-compatible models are adapted from the M2M100 repository.

Note: The tokenizer for the HF-compatible IndicTrans2 models has been migrated to IndicTransToolkit and will be maintained separately there from now on. It is installed automatically when you run the install.sh script in huggingface_interface.

Setup

To get started, follow these steps to set up the environment:

# Clone the github repository and navigate to the project directory.
git clone https://github.com/AI4Bharat/IndicTrans2
cd IndicTrans2/huggingface_interface

# Install all the dependencies and requirements associated with the project for running HF compatible models.
source install.sh

Note: The install.sh script in this directory is specifically for running HF compatible models for inference.

Converting

To convert a fairseq checkpoint to a PyTorch checkpoint compatible with HuggingFace Transformers, use the following command:

python3 convert_indictrans_checkpoint_to_pytorch.py --fairseq_path <fairseq_checkpoint_best.pt> --pytorch_dump_folder_path <hf_output_dir>
  • <fairseq_checkpoint_best.pt>: path to the fairseq checkpoint_best.pt that is to be converted
  • <hf_output_dir>: path to the output directory where the HF-compatible model will be saved
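
For example, a hypothetical invocation with illustrative paths (adjust these to your own checkpoint and output locations) could be:

python3 convert_indictrans_checkpoint_to_pytorch.py --fairseq_path checkpoints/en-indic/checkpoint_best.pt --pytorch_dump_folder_path hf_models/en-indic-1B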

Models

The following 🤗 HuggingFace checkpoints are available:

  • En-Indic: ai4bharat/indictrans2-en-indic-1B
  • Indic-En: ai4bharat/indictrans2-indic-en-1B
  • Distilled En-Indic: ai4bharat/indictrans2-en-indic-dist-200M
  • Distilled Indic-En: ai4bharat/indictrans2-indic-en-dist-200M
  • Indic-Indic (Stitched): ai4bharat/indictrans2-indic-indic-1B
  • Distilled Indic-Indic (Stitched): ai4bharat/indictrans2-indic-indic-dist-320M

Inference

With the conversion complete, you can now perform inference using HuggingFace Transformers.

You can start with the provided example.py script and customize it for your specific translation use case:

python3 example.py

Feel free to modify the example.py script to suit your translation needs.
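
As a rough guide, a minimal translation sketch might look like the following. The checkpoint choice, the IndicProcessor import path, the language codes, and the generation settings here are illustrative assumptions; example.py remains the canonical reference.

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit import IndicProcessor  # assumed import path; may differ across IndicTransToolkit releases

ckpt_dir = "ai4bharat/indictrans2-en-indic-dist-200M"  # illustrative checkpoint choice
tokenizer = AutoTokenizer.from_pretrained(ckpt_dir, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt_dir, trust_remote_code=True)

ip = IndicProcessor(inference=True)

# add language tags and normalize the input before tokenization
sentences = ["This is a test sentence."]
batch = ip.preprocess_batch(sentences, src_lang="eng_Latn", tgt_lang="hin_Deva")
inputs = tokenizer(batch, padding="longest", truncation=True, return_tensors="pt")

with torch.inference_mode():
    generated = model.generate(**inputs, num_beams=5, max_length=256)

# decode the generated tokens and strip the language tags / normalization
decoded = tokenizer.batch_decode(generated, skip_special_tokens=True)
print(ip.postprocess_batch(decoded, lang="hin_Deva"))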

Fine-tuning with LoRA

Before fine-tuning the IndicTrans2 models, you will need to restructure the training data into the following format.

en-indic-exp
├── train
│   ├── eng_Latn-asm_Beng
│   │   ├── train.eng_Latn
│   │   └── train.asm_Beng
│   ├── eng_Latn-ben_Beng
│   │   └── ...
│   └── {src_lang}-{tgt_lang}
│       ├── train.{src_lang}
│       └── train.{tgt_lang}
└── dev
    ├── eng_Latn-asm_Beng
    │   ├── dev.eng_Latn
    │   └── dev.asm_Beng
    ├── eng_Latn-ben_Beng
    │   └── ...
    └── {src_lang}-{tgt_lang}
        ├── dev.{src_lang}
        └── dev.{tgt_lang}

Once your data is ready in the format specified above, use the following command.

bash train_lora.sh <data_dir> <model_name> <output_dir> <direction> <src_lang_list> <tgt_lang_list>

We recommend referring to train_lora.sh for the default fine-tuning arguments. Please note that the specified hyperparameters may not be optimal and might require tuning for your use case.
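
For example, a hypothetical En-Indic run over the directory layout above might look like the line below; all argument values here are illustrative placeholders, so check train_lora.sh for the accepted values.

bash train_lora.sh en-indic-exp ai4bharat/indictrans2-en-indic-1B output/en-indic-lora en-indic eng_Latn asm_Beng,ben_Beng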

Inference with LoRA

You can load the LoRA adapters with the base model for inference by modifying the model initialization in the example.py script.

from transformers import AutoModelForSeq2SeqLM
from peft import PeftModel

base_ckpt_dir = "ai4bharat/indictrans2-en-indic-1B"  # change as per your use case
lora_ckpt_dir = "<path_to_lora_adapter_dir>"  # directory produced by train_lora.sh

base_model = AutoModelForSeq2SeqLM.from_pretrained(base_ckpt_dir, trust_remote_code=True)
lora_model = PeftModel.from_pretrained(base_model, lora_ckpt_dir)
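
The resulting lora_model can then be used in place of the base model in example.py. If you prefer a single standalone model, peft's merge_and_unload folds the adapter weights back into the base model:

# optional: merge the adapter weights into the base model for standard inference
model = lora_model.merge_and_unload()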

Note: Please feel free to open an issue on the GitHub repository for any queries or problems.

Citation

@article{gala2023indictrans,
  title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
  author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2023},
  url={https://openreview.net/forum?id=vfT4YuzAYA},
  note={}
}