A very small man can cast a very large shadow.
——George R.R. Martin, A Clash of Kings
[Technical report (coming soon)] [Demo] [Huggingface]
This repository contains the official training/evaluation code of the Imp project, which aims to provide a family of strong multimodal small language models (MSLMs). Our imp-v1-3b is a strong MSLM with only 3B parameters, built upon a small yet powerful SLM, Phi-2 (2.7B), and a powerful visual encoder, SigLIP (0.4B), and trained on the LLaVA-v1.5 training set.
As shown in the Evaluation section, imp-v1-3b significantly outperforms counterparts of similar model size, and even achieves slightly better performance than the strong LLaVA-7B model on various multimodal benchmarks.
We also release the model weights and a running example of imp-v1-3b on Huggingface. The technical report will be released soon. We will continue to improve our model and release new versions to further improve performance :)
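As a quick start, the snippet below sketches how the released checkpoint might be loaded with the `transformers` library. The prompt template, the `image_preprocess` helper, and the `images` argument of `generate` are assumed to come from the custom code shipped with the Huggingface repository (loaded via `trust_remote_code=True`); please check the model card for the exact interface.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the released checkpoint; the custom multimodal code is pulled in
# via trust_remote_code=True.
model = AutoModelForCausalLM.from_pretrained(
    "MILVLG/imp-v1-3b",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("MILVLG/imp-v1-3b", trust_remote_code=True)

# Prompt format and image interface are assumptions based on the model card.
prompt = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions. "
    "USER: <image>\nWhat is in the image? ASSISTANT:"
)
image = Image.open("example.jpg")  # any local test image

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
image_tensor = model.image_preprocess(image)  # custom helper provided by the repo code (assumed)

output_ids = model.generate(
    input_ids,
    max_new_tokens=100,
    images=image_tensor,  # custom keyword handled by the repo code (assumed)
    use_cache=True,
)[0]
print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
```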
- February 9, 2024: Training and evaluation code of imp-v1-3b is released.
- Clone this repository and navigate to the folder
git clone https://github.com/MILVLG/imp.git
cd imp
- Install Package
We recommend using Anaconda to create a new environment for the project, and install the requirements with the following commands:
conda create -n imp python=3.10 -y
conda activate imp
pip install -r requirements.txt
pip install flash-attn==2.4.2 --no-build-isolation
- (Optional) Manually download the pretrained model repositories, i.e., google/siglip-so400m-patch14-384 and microsoft/phi-2, to your local directories and modify the corresponding paths in the training and evaluation scripts accordingly.
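For instance, both repositories can be fetched with `huggingface_hub` (the local directories below are only examples; point the scripts to wherever you store them):

```python
from huggingface_hub import snapshot_download

# Example target directories; adjust to your own layout and update the
# corresponding paths in the training/evaluation scripts.
snapshot_download(repo_id="google/siglip-so400m-patch14-384",
                  local_dir="./base_models/siglip-so400m-patch14-384")
snapshot_download(repo_id="microsoft/phi-2",
                  local_dir="./base_models/phi-2")
```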
The training pipeline and datasets of imp-v1-3b are directly inherited from LLaVA-v1.5. The training consists of two stages:
- Multimodal pretraining: train a projector on a subset of ∼558K image-text pairs to connect a frozen pretrained vision encoder and a frozen LLM.
- Multimodal instruction tuning: fine-tune the projector and LoRA modules in the LLM with multimodal instruction data and VQA-formatted data to give the MSLM the ability to follow multimodal instructions.
Imp is trained on 8 A100 (40G) GPUs. You can reduce the per_device_train_batch_size and increase the gradient_accumulation_steps to match your resources, but always keep the global batch size the same: global_batch_size = per_device_train_batch_size × gradient_accumulation_steps × num_gpus.
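For example (illustrative numbers only), per_device_train_batch_size=16 with gradient_accumulation_steps=2 on 8 GPUs and per_device_train_batch_size=8 with gradient_accumulation_steps=4 on 8 GPUs both yield the same global batch size: 16 × 2 × 8 = 8 × 4 × 8 = 256.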
Training scripts
Please download the caption annotations blip_laion_cc_sbu_558k.json and the images from here. Move the downloaded files to the ./datasets folder, with the image folder unzipped and renamed to pretrain_images. Then run the following command to start the training process:
bash scripts/pretrain.sh
After that, a checkpoint file of the multimodal projector will be stored in ./checkpoints/imp-v1-3b-pretrain.
Please download the annotation file of the mixed instruction tuning data llava_v1_5_mix665k.json, and download the images from constituting datasets:
- COCO: train2017
- GQA: images
- OCR-VQA: download script; save all files as .jpg (a conversion sketch is given after this list)
- TextVQA: train_val_images
- VisualGenome: part1, part2
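If some of the downloaded OCR-VQA files are not already .jpg, a small script along the following lines can re-save them (a sketch only, assuming the images live under ./datasets/finetune_images/ocr_vqa/images):

```python
import os
from PIL import Image

# Assumed location of the downloaded OCR-VQA images; adjust as needed.
image_dir = "./datasets/finetune_images/ocr_vqa/images"

for name in os.listdir(image_dir):
    stem, ext = os.path.splitext(name)
    if ext.lower() in (".jpg", ".jpeg"):
        continue  # already in the expected format
    path = os.path.join(image_dir, name)
    # Convert other formats (e.g., .png, .gif) to RGB and save with a .jpg extension.
    Image.open(path).convert("RGB").save(os.path.join(image_dir, stem + ".jpg"))
```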
After downloading all of them, organize the data as follows:
datasets
├── llava_v1_5_mix665k.json
└── finetune_images
├── coco
│ └── train2017
├── gqa
│ └── images
├── ocr_vqa
│ └── images
├── textvqa
│ └── train_images
└── vg
├── VG_100K
└── VG_100K_2
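Before launching the instruction tuning stage, a quick sanity check such as the following (a hypothetical helper, not part of this repository) can confirm that the layout above is in place:

```python
import os

# Expected layout of the instruction tuning data, relative to the repo root.
expected = [
    "datasets/llava_v1_5_mix665k.json",
    "datasets/finetune_images/coco/train2017",
    "datasets/finetune_images/gqa/images",
    "datasets/finetune_images/ocr_vqa/images",
    "datasets/finetune_images/textvqa/train_images",
    "datasets/finetune_images/vg/VG_100K",
    "datasets/finetune_images/vg/VG_100K_2",
]

for path in expected:
    status = "ok" if os.path.exists(path) else "MISSING"
    print(f"{status:7s} {path}")
```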
Then, you can start the training process with the following command:
bash scripts/finetune_lora.sh
# bash scripts/finetune.sh # fully finetuning is not recommended
You will get a trained model (a LoRA diff if you use finetune_lora.sh) under ./checkpoints/ when the training is done.
We follow the evaluation of LLaVA-v1.5 and conduct experiments on 9 commonly-used benchmarks, including 5 academic VQA benchmarks and 4 popular MLLM benchmarks. All evaluation scripts are placed in the scripts/eval folder.
Before preparing task-specific data, you should download eval.zip and unzip it to ./playground/data/eval. For more specific instructions, please refer to LLaVA's Evaluation.md.
You can evaluate either your own trained checkpoints or our released imp-v1-3b model on the Huggingface Hub. For more detailed evaluation scripts, please see Evaluation.md.
Using the provided model checkpoints, you can reproduce the following results. Our imp-v1-3b model significantly outperforms existing MSLMs of similar model size, and is comparable with the strong LLaVA-v1.5-7b model.
Models | VQAv2 | GQA | VizWiz | SQA(IMG) | TextVQA | POPE | MME(P) | MMB | MM-Vet |
---|---|---|---|---|---|---|---|---|---|
LLaVA-v1.5-lora (7B) | 79.10 | 63.00 | 47.80 | 68.40 | 58.20 | 86.40 | 1476.9 | 66.10 | 30.2 |
TinyGPT-V (3B) | - | 33.60 | 24.80 | - | - | - | - | - | - |
LLaVA-Phi (3B) | 71.40 | - | 35.90 | 68.40 | 48.60 | 85.00 | 1335.1 | 59.80 | 28.9 |
MobileVLM (3B) | - | 59.00 | - | 61.00 | 47.50 | 84.90 | 1288.9 | 59.60 | - |
MC-LLaVA (3B) | 64.24 | 49.60 | 24.88 | - | 38.59 | 80.59 | - | - | - |
Imp-v1 (3B, ours) | 79.45 | 58.55 | 50.09 | 69.96 | 59.38 | 88.02 | 1434.0 | 66.49 | 33.1 |
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
This project is maintained by MILVLG@Hangzhou Dianzi University (HDU), led by Prof. Zhou Yu and Jun Yu, and is mainly developed by Zhenwei Shao and Xuecheng Ouyang. We hope our model may serve as a strong baseline to inspire future research on MSLMs, as well as derivative applications on mobile devices and robots.
If you use our model or refer to our work in your studies, please cite:
@misc{imp2024,
author = {Shao, Zhenwei and Ouyang, Xuecheng and Gai, Zhenbiao and Yu, Zhou and Yu, Jun},
title = {Imp: An empirical study of multimodal small language models},
year = {2024},
url = {https://huggingface.co/MILVLG/imp-v1-3b}
}