LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model [Paper]
[1/15] Our model and training code are released.
[1/5] Our code is currently undergoing an internal review and will be released shortly (expected next week).
- Clone this repository and navigate to the llava-phi folder
git clone https://github.com/zhuyiche/llava-phi.git
cd llava-phi
- Install Package
conda create -n llava_phi python=3.10 -y
conda activate llava_phi
pip install --upgrade pip # enable PEP 660 support
pip install -e .
#Todo
#Todo
Below is the latest training configuration for LLaVA v1.5. For legacy models, please refer to the README of this version for now. We'll add them in a separate doc later.
LLaVA training consists of two stages: (1) feature alignment stage: use our 558K subset of the LAION-CC-SBU dataset to connect a frozen pretrained vision encoder to a frozen LLM; (2) visual instruction tuning stage: use 150K GPT-generated multimodal instruction-following data (with VQA data from academic-oriented tasks) to teach the model to follow multimodal instructions.
LLaVA is trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the `per_device_train_batch_size` and increase the `gradient_accumulation_steps` accordingly. Always keep the global batch size the same: `per_device_train_batch_size` x `gradient_accumulation_steps` x `num_gpus`.
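As a minimal sketch of the rule above, the helper below solves for the accumulation steps needed to keep the global batch size fixed when the GPU count or per-device batch size changes. The function name and the example value of 256 are assumptions chosen for illustration; they are not part of the training scripts.

```python
def gradient_accumulation_steps(global_batch_size: int,
                                per_device_train_batch_size: int,
                                num_gpus: int) -> int:
    """Return the accumulation steps that keep the global batch size fixed."""
    per_step = per_device_train_batch_size * num_gpus
    if per_step <= 0:
        raise ValueError("per-device batch size and GPU count must be positive")
    if global_batch_size % per_step != 0:
        raise ValueError(
            f"global batch size {global_batch_size} is not divisible by "
            f"{per_device_train_batch_size} x {num_gpus}"
        )
    return global_batch_size // per_step


# Example values (assumed for illustration): a global batch size of 256.
print(gradient_accumulation_steps(256, 32, 8))  # 8 GPUs -> 1
print(gradient_accumulation_steps(256, 16, 4))  # 4 GPUs -> 4
```

Raising an error when the numbers do not divide evenly is a deliberate choice here: silently rounding would change the global batch size, which is exactly what the rule is meant to prevent.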
We use a similar set of hyperparameters as Vicuna in finetuning. The hyperparameters used in both pretraining and finetuning are provided below. Note that they may differ from those reported in the arXiv paper, as this is an ongoing project and we are making frequent changes to our code.
- Pretraining
Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
---|---|---|---|---|---|
LLaVA-Phi | 256 | 1e-3 | 1 | 2048 | 0 |
- Finetuning
Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
---|---|---|---|---|---|
LLaVA-Phi | 128 | 2e-5 | 1 | 2048 | 0 |
Our base model is phi-2; you should download the weights from here.
Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions we use in the paper here.
Training script with DeepSpeed ZeRO-2: pretrain.sh.
- `--mm_projector_type mlp2x_gelu`: the two-layer MLP vision-language connector.
- `--vision_tower openai/clip-vit-large-patch14-336`: CLIP ViT-L/14 336px.
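To make the two options above more concrete, here is a rough sketch of what they imply structurally. It assumes the projector name follows an `mlp<depth>x_gelu` pattern (inferred from the flag value, not confirmed by this README), and it assumes feature widths of 1024 for CLIP ViT-L/14 and 2560 for phi-2. The helper names are hypothetical and only describe layer shapes; they do not build a real model.

```python
import re


def projector_layers(projector_type: str, vision_dim: int, llm_dim: int):
    """Describe the layers implied by a projector type such as 'mlp2x_gelu'.

    Returns a list of ("linear", in_dim, out_dim) and ("gelu",) tuples.
    The naming pattern is an assumption based on the flag value.
    """
    match = re.fullmatch(r"mlp(\d+)x_gelu", projector_type)
    if match is None:
        raise ValueError(f"unknown projector type: {projector_type}")
    depth = int(match.group(1))
    layers = [("linear", vision_dim, llm_dim)]
    for _ in range(1, depth):
        layers.append(("gelu",))
        layers.append(("linear", llm_dim, llm_dim))
    return layers


def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches a ViT sees per image."""
    return (image_size // patch_size) ** 2


# Assumed dimensions: 1024 for CLIP ViT-L/14 features, 2560 for phi-2.
print(projector_layers("mlp2x_gelu", 1024, 2560))
print(num_patch_tokens(336, 14))  # 24 x 24 = 576
```

Under these assumptions, each 336px image contributes 576 patch features, and each one passes through two linear layers with a GELU in between before reaching the language model.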
- Prepare data
Please download the annotation of the final mixture of our instruction tuning data, llava_v1_5_mix665k.json, and download the images from the constituent datasets:
- COCO: train2017
- GQA: images
- OCR-VQA: download script
- TextVQA: train_val_images
- VisualGenome: part1, part2
After downloading all of them, organize the data as follows in `./playground/data`:
├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2
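Before launching training, it can help to confirm that the layout above is in place. The following is a small sketch of such a check; the function name and the `EXPECTED_DIRS` constant are hypothetical and not part of the repository.

```python
from pathlib import Path

# Sub-directories listed in the tree above, relative to the data root.
EXPECTED_DIRS = [
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
]


def missing_data_dirs(root):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_data_dirs("./playground/data")
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Data layout looks complete.")
```

The check only verifies that the directories exist; it does not validate that every image referenced by the annotation file is present.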
- Start training!
You may download our pretrained projectors in Model Zoo. It is not recommended to use legacy projectors, as they may have been trained with a different version of the codebase, and if any option is off, the model will not function or train as expected.
Training script with DeepSpeed ZeRO-3: finetune.sh.
New options to note:
- `--mm_projector_type mlp2x_gelu`: the two-layer MLP vision-language connector.
- `--vision_tower openai/clip-vit-large-patch14-336`: CLIP ViT-L/14 336px.
- `--image_aspect_ratio pad`: this pads the non-square images to square, instead of cropping them; it slightly reduces hallucination.
- `--group_by_modality_length True`: this should only be used when your instruction tuning dataset contains both language (e.g. ShareGPT) and multimodal (e.g. LLaVA-Instruct). It makes the training sampler only sample a single modality (either image or language) during training, which we observe to speed up training by ~25%, and does not affect the final outcome.
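The idea behind `--group_by_modality_length` can be illustrated with a simplified sketch: split the samples by modality, form batches within each group, then shuffle the batches. This is not the sampler used by the codebase, which may also group by sequence length as the flag name suggests; the function name and toy data below are assumptions for illustration only.

```python
import random


def single_modality_batches(is_multimodal, batch_size, seed=0):
    """Group sample indices into batches that never mix modalities.

    is_multimodal: list of bools, True if the sample contains an image.
    Returns a shuffled list of batches (lists of indices).
    """
    rng = random.Random(seed)
    image_idx = [i for i, flag in enumerate(is_multimodal) if flag]
    text_idx = [i for i, flag in enumerate(is_multimodal) if not flag]
    rng.shuffle(image_idx)
    rng.shuffle(text_idx)

    batches = []
    for group in (image_idx, text_idx):
        for start in range(0, len(group), batch_size):
            batches.append(group[start:start + batch_size])
    rng.shuffle(batches)  # interleave image and text batches
    return batches


# Toy dataset: 5 image samples followed by 3 text-only samples.
flags = [True] * 5 + [False] * 3
for batch in single_modality_batches(flags, batch_size=2):
    kinds = {flags[i] for i in batch}
    assert len(kinds) == 1  # every batch is single-modality
    print(batch)
```

Keeping each batch to one modality means text-only batches never wait on image processing, which is one plausible reason for the speed-up reported above.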
To ensure reproducibility, we evaluate the models with greedy decoding.
See [Evaluation.md].
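Greedy decoding is reproducible because it always picks the highest-scoring next token and involves no sampling. The sketch below shows the idea on a toy lookup table; `toy_model` is a hypothetical stand-in for a real language model, and none of these names come from the evaluation scripts.

```python
def greedy_decode(next_token_scores, prompt, eos_token, max_new_tokens=16):
    """Repeatedly append the highest-scoring next token.

    next_token_scores: callable mapping a token list to a dict {token: score}.
    """
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        # Sort keys first so ties are broken deterministically.
        best = max(sorted(scores), key=scores.get)
        tokens.append(best)
        if best == eos_token:
            break
    return tokens


def toy_model(tokens):
    """Hypothetical stand-in for a language model's next-token scores."""
    table = {
        "a": {"cat": 0.6, "dog": 0.4},
        "cat": {"sat": 0.7, "<eos>": 0.3},
        "sat": {"<eos>": 0.9, "down": 0.1},
    }
    return table.get(tokens[-1], {"<eos>": 1.0})


print(greedy_decode(toy_model, ["a"], "<eos>"))
```

Running the function twice on the same prompt yields the same token sequence, which is the property the evaluation relies on.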
This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses. This project is licensed permissively under the Apache 2.0 license and does not impose any additional constraints.
If you find LLaVA-Phi useful for your research and applications, please cite using this BibTeX:
@article{zhu2024llava,
title={LLaVA-$\phi$: Efficient Multi-Modal Assistant with Small Language Model},
author={Zhu, Yichen and Zhu, Minjie and Liu, Ning and Ou, Zhicai and Mou, Xiaofeng and Tang, Jian},
journal={arXiv preprint arXiv:2401.02330},
year={2024}
}
We build our project based on
- LLaVA: an amazing open-source project for vision-language assistants
- LLaMA-Factory: we use this codebase to finetune the Phi model