- Clone this repository and navigate to the llava-next-video folder:
git clone https://code.byted.org/ic-research/llava-next-video.git
cd llava-next-video
- Install the package:
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip # Enable PEP 660 support.
pip install -e .
- Example model: liuhaotian/llava-v1.6-vicuna-7b
- Prompt mode: vicuna_v1 (use mistral_direct for liuhaotian/llava-v1.6-34b)
- Sampled frames: 32 (Defines how many frames to sample from the video.)
- Spatial pooling stride: 2 (With original tokens for one frame at 24x24, if stride=2, then the tokens for one frame are 12x12.)
- Local video path: ./data/llava_video/video-chatgpt/evaluation/Test_Videos/v_Lf_7RurLgp0.mp4
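To make the two numeric settings concrete, here is a minimal Python sketch of the arithmetic behind them. It assumes that spatial pooling divides each side of the 24x24 token grid by the stride, that no extra separator tokens are added, and that frames are sampled at evenly spaced positions. These are illustrative assumptions, and `tokens_per_frame`, `total_visual_tokens`, and `uniform_frame_indices` are hypothetical helper names, not functions from this repository.

```python
def tokens_per_frame(grid_side: int = 24, stride: int = 2) -> int:
    """Tokens left for one frame after pooling (assumes each side shrinks by `stride`)."""
    pooled_side = grid_side // stride
    return pooled_side * pooled_side


def total_visual_tokens(num_frames: int = 32, grid_side: int = 24, stride: int = 2) -> int:
    """Visual tokens across all sampled frames (assumes no extra separator tokens)."""
    return num_frames * tokens_per_frame(grid_side, stride)


def uniform_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Evenly spaced indices from the first to the last frame (an assumed sampling scheme)."""
    if num_samples <= 0 or total_frames <= 0:
        return []
    if num_samples == 1:
        return [0]
    return [(i * (total_frames - 1)) // (num_samples - 1) for i in range(num_samples)]


if __name__ == "__main__":
    print(tokens_per_frame())             # 144, i.e. a 12x12 grid
    print(total_visual_tokens())          # 4608 for 32 frames at stride 2
    print(uniform_frame_indices(100, 5))  # [0, 24, 49, 74, 99]
```

Under these assumptions, doubling the stride cuts the per-frame token count by a factor of four, which is the main lever for fitting more frames into the same context length.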
To run a demo, execute:
bash scripts/video/demo/video_demo.sh ${Example model} ${Prompt mode} ${Sampled frames} ${Spatial pooling stride} True ${Video path at local}
Example:
bash scripts/video/demo/video_demo.sh liuhaotian/llava-v1.6-vicuna-7b vicuna_v1 32 2 True ./data/llava_video/video-chatgpt/evaluation/Test_Videos/v_Lf_7RurLgp0.mp4
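Because the demo script takes purely positional arguments, it is easy to misorder them. The Python sketch below only assembles the command line shown above; `build_demo_command` is a hypothetical helper, and the fifth argument is passed through as `True` exactly as in the example, since its meaning is not described here.

```python
import shlex


def build_demo_command(
    model: str,
    prompt_mode: str,
    sampled_frames: int,
    pooling_stride: int,
    video_path: str,
    script: str = "scripts/video/demo/video_demo.sh",
) -> list[str]:
    """Return the argv list for the demo script, in the order used by the example."""
    return [
        "bash",
        script,
        model,
        prompt_mode,
        str(sampled_frames),
        str(pooling_stride),
        "True",  # copied verbatim from the example; its meaning is not documented here
        video_path,
    ]


if __name__ == "__main__":
    cmd = build_demo_command(
        model="liuhaotian/llava-v1.6-vicuna-7b",
        prompt_mode="vicuna_v1",
        sampled_frames=32,
        pooling_stride=2,
        video_path="./data/llava_video/video-chatgpt/evaluation/Test_Videos/v_Lf_7RurLgp0.mp4",
    )
    # Print the command for inspection instead of running it.
    print(shlex.join(cmd))
    # To launch it from the repository root:
    # import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids quoting problems if the video path contains spaces.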
Please download the evaluation data and its metadata from the following links:
Organize the downloaded data into the following structure:
LLaVA-NeXT-Video
├── llava
├── scripts
└── data
    └── llava_video
        ├── video-chatgpt
        │   ├── Test_Videos
        │   ├── consistency_qa.json
        │   ├── consistency_qa_test.json
        │   └── consistency_qa_train.json
        ├── video_detail_description
        │   └── Test_Human_Annotated_Captions
        └── ActivityNet-QA
            ├── all_test
            ├── test_a.json
            └── test_b.json
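Before launching an evaluation, it can help to confirm that the data landed where the scripts expect it. The following Python sketch checks for the entries listed in the tree above; `EXPECTED` and `missing_entries` are hypothetical names introduced for illustration, and the script assumes it is run from the repository root.

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
EXPECTED = [
    "data/llava_video/video-chatgpt/Test_Videos",
    "data/llava_video/video-chatgpt/consistency_qa.json",
    "data/llava_video/video-chatgpt/consistency_qa_test.json",
    "data/llava_video/video-chatgpt/consistency_qa_train.json",
    "data/llava_video/video_detail_description/Test_Human_Annotated_Captions",
    "data/llava_video/ActivityNet-QA/all_test",
    "data/llava_video/ActivityNet-QA/test_a.json",
    "data/llava_video/ActivityNet-QA/test_b.json",
]


def missing_entries(repo_root: str) -> list[str]:
    """Return the expected paths that do not exist under `repo_root`."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_entries(".")
    if missing:
        print("Missing entries:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected evaluation data is in place.")
```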
Example for video detail description evaluation (additional scripts are available in scripts/eval):
bash scripts/video/eval/video_detail_description_eval_shard.sh ${Example model} ${Prompt mode} ${Sampled frames} ${Spatial pooling stride} True 8
Example:
bash scripts/video/eval/video_detail_description_eval_shard.sh liuhaotian/llava-v1.6-vicuna-7b vicuna_v1 32 2 True 8
Assuming you have pred.json (model-generated predictions) for model llava-v1.6-vicuna-7b at ./work_dirs/eval_video_detail_description/llava-v1.6-vicuna-7b_vicuna_v1_frames_32_stride_2:
bash scripts/video/eval/video_description_eval_only.sh llava-v1.6-vicuna-7b_vicuna_v1_frames_32_stride_2
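The directory name passed to this script appears to combine the model name (without its `liuhaotian/` prefix), the prompt mode, the frame count, and the pooling stride. The Python sketch below reconstructs it under that assumption. The pattern is inferred from the single example above rather than read from the scripts, and `run_name` and `results_dir` are hypothetical helpers.

```python
def run_name(model: str, prompt_mode: str, frames: int, stride: int) -> str:
    """Build the run directory name (pattern inferred from the example, not confirmed)."""
    short_model = model.rsplit("/", 1)[-1]  # drop an organization prefix such as "liuhaotian/"
    return f"{short_model}_{prompt_mode}_frames_{frames}_stride_{stride}"


def results_dir(
    model: str,
    prompt_mode: str,
    frames: int,
    stride: int,
    work_dir: str = "./work_dirs/eval_video_detail_description",
) -> str:
    """Directory where pred.json is assumed to live for a given configuration."""
    return f"{work_dir}/{run_name(model, prompt_mode, frames, stride)}"


if __name__ == "__main__":
    print(run_name("liuhaotian/llava-v1.6-vicuna-7b", "vicuna_v1", 32, 2))
    # llava-v1.6-vicuna-7b_vicuna_v1_frames_32_stride_2
```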