LLaVA-NeXT: A Strong Zero-shot Video Understanding Model

Install

Clone this repository and navigate to the llava-next-video folder:

git clone https://code.byted.org/ic-research/llava-next-video.git
cd llava-next-video

Install the package:

conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # Enable PEP 660 support.
pip install -e .

Demo

Example model: liuhaotian/llava-v1.6-vicuna-7b
Prompt mode: vicuna_v1 (use mistral_direct for liuhaotian/llava-v1.6-34b)
Sampled frames: 32 (Defines how many frames to sample from the video.)
Spatial pooling stride: 2 (Each frame originally yields a 24x24 token grid; with stride=2, the grid is pooled down to 12x12 tokens per frame.)
Local video path: ./data/llava_video/video-chatgpt/evaluation/Test_Videos/v_Lf_7RurLgp0.mp4
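
To make the effect of the pooling stride concrete, here is a minimal sketch of the token arithmetic implied by the settings above. It assumes the per-frame grid is square and that the stride divides the grid side evenly; the function names are hypothetical and are not part of the llava package.

```python
# Hypothetical helpers illustrating the token arithmetic; not part of the repository.

def tokens_per_frame(grid_side: int = 24, stride: int = 2) -> int:
    """Tokens left for one frame after pooling a square grid by `stride`."""
    if grid_side % stride != 0:
        # Assumption: this sketch only covers strides that divide the grid side.
        raise ValueError("stride must divide the grid side in this sketch")
    pooled_side = grid_side // stride
    return pooled_side * pooled_side


def video_tokens(num_frames: int = 32, grid_side: int = 24, stride: int = 2) -> int:
    """Total visual tokens for all sampled frames."""
    return num_frames * tokens_per_frame(grid_side, stride)


if __name__ == "__main__":
    print(tokens_per_frame(24, 2))   # 144 (a 12x12 grid)
    print(video_tokens(32, 24, 2))   # 4608
    print(video_tokens(32, 24, 1))   # 18432 (no pooling)
```

With 32 frames, a stride of 2 cuts the visual token count to a quarter of the unpooled total, which is the main reason to raise the stride when sampling more frames.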

To run a demo, execute:

bash scripts/video/demo/video_demo.sh ${Example model} ${Prompt mode} ${Sampled frames} ${Spatial pooling stride} True ${Local video path}

Example:

bash scripts/video/demo/video_demo.sh liuhaotian/llava-v1.6-vicuna-7b vicuna_v1 32 2 True ./data/llava_video/video-chatgpt/evaluation/Test_Videos/v_Lf_7RurLgp0.mp4
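
If you launch the demo from Python rather than a shell, the positional arguments can be assembled as below. This is only a sketch: `build_demo_command` is a hypothetical helper, and the fifth argument is passed through unchanged as `True` because the README does not say what it controls.

```python
import shlex

# Hypothetical helper; the script path and argument order are copied from the example above.
DEMO_SCRIPT = "scripts/video/demo/video_demo.sh"


def build_demo_command(model, prompt_mode, frames, stride, video_path, fifth_arg="True"):
    """Return the demo command as an argv list, in the order the script expects."""
    return [
        "bash", DEMO_SCRIPT,
        model, prompt_mode, str(frames), str(stride),
        fifth_arg,  # meaning not documented in the README; kept as in the example
        video_path,
    ]


if __name__ == "__main__":
    cmd = build_demo_command(
        "liuhaotian/llava-v1.6-vicuna-7b", "vicuna_v1", 32, 2,
        "./data/llava_video/video-chatgpt/evaluation/Test_Videos/v_Lf_7RurLgp0.mp4",
    )
    print(shlex.join(cmd))  # prints the command line without running it
```

The returned list can be handed to `subprocess.run(cmd, check=True)` from the repository root.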

Evaluation

Preparation

Please download the evaluation data and its metadata from the following links:

video-chatgpt: here.
video_detail_description: here.
activity_qa: here and here.

Organize the downloaded data into the following structure:

LLaVA-NeXT-Video
├── llava
├── scripts
└── data
    └── llava_video
        ├── video-chatgpt
        │   ├── Test_Videos
        │   ├── consistency_qa.json
        │   ├── consistency_qa_test.json
        │   └── consistency_qa_train.json
        ├── video_detail_description
        │   └── Test_Human_Annotated_Captions
        └── ActivityNet-QA
            ├── all_test
            ├── test_a.json
            └── test_b.json
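
Before running evaluation, it can help to confirm that the files landed where the tree above expects them. The following is a minimal sketch, assuming it is run from the repository root; `missing_entries` and the `EXPECTED` list are hypothetical names transcribed from the tree, not something the repository ships. Note that the demo path earlier places Test_Videos under an evaluation/ subfolder, so adjust the list if your download is laid out that way.

```python
from pathlib import Path

# Paths transcribed from the directory tree above, relative to data/llava_video.
EXPECTED = [
    "video-chatgpt/Test_Videos",
    "video-chatgpt/consistency_qa.json",
    "video-chatgpt/consistency_qa_test.json",
    "video-chatgpt/consistency_qa_train.json",
    "video_detail_description/Test_Human_Annotated_Captions",
    "ActivityNet-QA/all_test",
    "ActivityNet-QA/test_a.json",
    "ActivityNet-QA/test_b.json",
]


def missing_entries(root="data/llava_video"):
    """Return the expected relative paths that do not exist under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED if not (base / rel).exists()]


if __name__ == "__main__":
    missing = missing_entries()
    if missing:
        print("Missing entries:")
        for rel in missing:
            print("  " + rel)
    else:
        print("Data layout matches the expected structure.")
```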

Inference and Evaluation

Example for video detail description evaluation (additional scripts are available in scripts/video/eval):

bash scripts/video/eval/video_detail_description_eval_shard.sh ${Example model} ${Prompt mode} ${Sampled frames} ${Spatial pooling stride} True 8

Example:

bash scripts/video/eval/video_detail_description_eval_shard.sh liuhaotian/llava-v1.6-vicuna-7b vicuna_v1 32 2 True 8

GPT Evaluation Example (Optional if the above step is completed)

Assuming you have pred.json (model-generated predictions) for model llava-v1.6-vicuna-7b at ./work_dirs/eval_video_detail_description/llava-v1.6-vicuna-7b_vicuna_v1_frames_32_stride_2:

bash scripts/video/eval/video_description_eval_only.sh llava-v1.6-vicuna-7b_vicuna_v1_frames_32_stride_2
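
The run directory name passed to the script appears to combine the model name, prompt mode, frame count, and stride. The sketch below reconstructs it under that assumption; the pattern is inferred from the single example above, not confirmed against the scripts, and `run_name` and `pred_path` are hypothetical helpers.

```python
from pathlib import Path

# Output root taken from the example path above.
WORK_DIR = Path("work_dirs/eval_video_detail_description")


def run_name(model_id, prompt_mode, frames, stride):
    """Build the run directory name, assuming the pattern seen in the example."""
    model_name = model_id.rstrip("/").split("/")[-1]  # drop the "liuhaotian/" prefix
    return f"{model_name}_{prompt_mode}_frames_{frames}_stride_{stride}"


def pred_path(model_id, prompt_mode, frames, stride):
    """Location of pred.json, assuming it sits directly inside the run directory."""
    return WORK_DIR / run_name(model_id, prompt_mode, frames, stride) / "pred.json"


if __name__ == "__main__":
    name = run_name("liuhaotian/llava-v1.6-vicuna-7b", "vicuna_v1", 32, 2)
    print(name)
    path = pred_path("liuhaotian/llava-v1.6-vicuna-7b", "vicuna_v1", 32, 2)
    print(path, "exists" if path.exists() else "not found")
```

Checking that pred.json exists before starting the GPT evaluation avoids launching it against an empty or misnamed run directory.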
