- This is the official repository of VALOR, which provides training & testing code and pretraining checkpoints.
- The VALOR-32K dataset (annotations) can be downloaded from the project page. Raw videos can be downloaded from YouTube.
- VALOR-1M will be released after the paper is accepted.
- VALOR is implemented with PyTorch. We use pytorch-1.9.0 and cuda-11.1; other versions may also be compatible.
```
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
```
- Build apex.
```
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```
- Install the other needed packages.
```
sh preinstall.sh
```
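As an optional sanity check (not part of the official setup), you can confirm that the expected PyTorch/CUDA stack is visible before training:

```python
# Optional environment check; versions follow the install command above.
import torch, torchvision, torchaudio

print("torch:", torch.__version__)              # expect 1.9.0+cu111
print("torchvision:", torchvision.__version__)  # expect 0.10.0+cu111
print("torchaudio:", torchaudio.__version__)    # expect 0.9.0
print("CUDA available:", torch.cuda.is_available())
print("CUDA runtime:", torch.version.cuda)      # expect 11.1
```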
- Put the pretrained_weights dir (BERT, CLIP, VideoSwin) under the main path (VALOR/pretrained_weights).
- Put VALOR-base and VALOR-large under the output dir (VALOR/output/VALOR-base, VALOR/output/VALOR-large).
| Model | Pretrained Ckpt | Finetuned Ckpt on MSRVTT-Retrieval | Finetuned Ckpt on MSRVTT-Caption |
|---|---|---|---|
| VALOR-B | VALOR-base | VALOR_base_msr_ret.pt | VALOR_base_msr_cap.pt |
| VALOR-L | VALOR-large | VALOR_large_msr_ret.pt | VALOR_large_msr_cap.pt |
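As an optional check that a downloaded checkpoint is intact, you can load it on CPU and list a few parameter names. This is a hedged sketch: the path and the dictionary layout (e.g. whether weights sit under a 'model' key) depend on where you saved the file and on the training code.

```python
# Hedged sketch: inspect a downloaded checkpoint; adjust the path to where you saved it.
import torch

ckpt = torch.load("output/VALOR-base/VALOR_base_msr_ret.pt", map_location="cpu")
# Some checkpoints store weights directly, others nest them under a key such as 'model'.
state = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt
print(f"{len(state)} entries")
for name in list(state)[:5]:
    print(name)
```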
VALOR is pretrained and tested on multiple vision-language, audio-language and audiovisual-language datasets, e.g.:

PRETRAIN:
- VALOR-1M
- WebVid-2.5M
- CC-3M (VALOR-base)

TEST:
- VALOR-32K
- MSRVTT
- MSVD
- DiDeMo
- LSMDC
- ActivityNet
- VATEX
- AudioCaps
- ClothoV1
- TGIF-Frame
- MSCOCO
- VQAV2
- ...
We take MSRVTT as an example to show the data processing procedure; other datasets are processed in a similar way.
- Make the dataset dir `VALOR/datasets/MSRVTT`.
- Download the raw videos from the website and put them in `MSRVTT/raw_videos`.
- Extract video frames (.jpg) and audio files (.wav) using `utils/extract_frame_and_wav_multiprocess.py`. (Note: VALOR trains and tests on these offline-extracted frames and audio files because of their fast I/O speed. You may switch to reading raw videos via the decord library instead, which requires changing the VideoMapper and AudioMapper classes in `data/data.py`.) A rough sketch of such an extraction script is shown after the directory layout below.
- Prepare the id files (standardsplit_train_id.json, standardsplit_test_id.json, 1KAsplit_train_id.json, 1KAsplit_test_id.json). The format is List(Str): `['video0', 'video1', ...]`. The former two are for video captioning and video QA, while the latter two are for video retrieval.
- Prepare txt_mapper.json. txt_mapper files map video IDs to their descriptions. Format: `{'video0':['desc1','desc2',...,'desc20']}`. For the VideoQA task, the format is `{'video0':[{'question':'what color is ...?', 'answer':'red'},{'question':'Is the boy ...?', 'answer':'yes'}]}`.
- Prepare caption_annotation.json. This file is used for computing caption metrics. Format: `[{'video_id':'video0','caption':'A boy is ...'}, {'video_id':'video1','caption':'A girl is ...'}]`. (A sketch for building these annotation files is given after the directory layout below.)
The processed dataset path should be as follows:
```
├── datasets
│   ├── msrvtt
│   │   ├── raw_videos
│   │   │   ├── video0.mp4
│   │   │   └── video1.mp4
│   │   ├── frames_fps4
│   │   │   ├── video0
│   │   │   │   ├── img_0001.jpg
│   │   │   │   └── img_0002.jpg
│   │   │   └── video1
│   │   │       ├── img_0001.jpg
│   │   │       └── img_0002.jpg
│   │   ├── audio_22050hz
│   │   │   ├── video1.wav
│   │   │   └── video3.wav
│   │   ├── standardsplit_train_id.json
│   │   ├── standardsplit_test_id.json
│   │   ├── 1KAsplit_train_id.json
│   │   ├── 1KAsplit_test_id.json
│   │   ├── txt_mapper.json
│   │   ├── txt_mapper_1kAsplit_test.json
│   │   ├── txt_mapper_vqa.json
│   │   └── caption_annotation.json
```
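The snippet below is a rough sketch of what an offline extraction script like `utils/extract_frame_and_wav_multiprocess.py` might do, assuming ffmpeg is installed. The frame rate (4 fps) and audio sample rate (22050 Hz) follow the frames_fps4 / audio_22050hz directory names above, but the actual script in utils/ may differ in details.

```python
# Hedged sketch of offline frame/audio extraction with ffmpeg; the official
# utils/extract_frame_and_wav_multiprocess.py may differ in naming and options.
import os
import subprocess
from multiprocessing import Pool

ROOT = "datasets/msrvtt"

def extract_one(video_file):
    vid = os.path.splitext(video_file)[0]                      # e.g. 'video0'
    src = os.path.join(ROOT, "raw_videos", video_file)
    frame_dir = os.path.join(ROOT, "frames_fps4", vid)
    os.makedirs(frame_dir, exist_ok=True)
    os.makedirs(os.path.join(ROOT, "audio_22050hz"), exist_ok=True)
    # Sample frames at 4 fps, named img_0001.jpg, img_0002.jpg, ...
    subprocess.run(["ffmpeg", "-y", "-i", src, "-vf", "fps=4",
                    os.path.join(frame_dir, "img_%04d.jpg")],
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    # Extract mono 22050 Hz wav; some videos have no audio track, so this step may produce nothing.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-vn", "-ac", "1", "-ar", "22050",
                    os.path.join(ROOT, "audio_22050hz", vid + ".wav")],
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

if __name__ == "__main__":
    videos = [f for f in os.listdir(os.path.join(ROOT, "raw_videos")) if f.endswith(".mp4")]
    with Pool(8) as pool:
        pool.map(extract_one, videos)
```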
We provide processed json files for most finetuning datasets here, so you only need to download and extract the raw videos of each dataset.
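If you need to build these json files for a new dataset yourself, the sketch below writes them in the formats described above. It assumes you already have a mapping from video ids to captions taken from the dataset's own annotations; the captions shown here are illustrative placeholders.

```python
# Hedged sketch: writing the id / txt_mapper / caption_annotation files in the
# formats described above. Captions are placeholders; fill them from the
# dataset's own annotation files.
import json, os

ROOT = "datasets/msrvtt"
annotations = {                       # video_id -> list of captions (illustrative)
    "video0": ["A boy is playing football.", "Kids play on a field."],
    "video1": ["A girl is singing on stage."],
}

# id file: a plain list of video ids (each split gets its own file)
with open(os.path.join(ROOT, "standardsplit_train_id.json"), "w") as f:
    json.dump(sorted(annotations), f)

# txt_mapper.json: video id -> list of descriptions
with open(os.path.join(ROOT, "txt_mapper.json"), "w") as f:
    json.dump(annotations, f)

# caption_annotation.json: list of {'video_id': ..., 'caption': ...} entries,
# used for computing captioning metrics
caption_annotation = [{"video_id": vid, "caption": cap}
                      for vid, caps in annotations.items() for cap in caps]
with open(os.path.join(ROOT, "caption_annotation.json"), "w") as f:
    json.dump(caption_annotation, f)
```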
- Finetune retrieval tasks:
```
sh scripts/finetune_ret.sh $pretrain_path(output/VALOR_base)
```
- Finetune captioning tasks:
```
sh scripts/finetune_cap.sh $pretrain_path(output/VALOR_base)
```
- Finetune QA tasks:
```
sh scripts/finetune_qa.sh $pretrain_path(output/VALOR_base)
```
The finetuning output path will be a subdirectory of $pretrain_path. For example, the command for finetuning the retrieval model in scripts/finetune_ret.sh is as follows:
```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 --master_port 32711 ./train.py \
--pretrain_dir $basedir \
--config ./config/fast-retrieval-msrvtt.json \
--output_dir $basedir'/ret-msrvtt-lr2e-5-bs64-epoch5' \
--learning_rate 2e-5 \
--train_video_sample_num 4 \
--test_video_sample_num 8 \
--save_best true \
```
If you want to test a model, just add the following two lines to the command:
```
--zero_shot \
--checkpoint $checkpoint_save_path(.pt)
```
To launch pretraining, run:
```
sh scripts/pretrain.sh
```
VALOR's framework is easy to extend to new tasks/datasets. What you need to do is:
- Prepare the dataset as illustrated above.
- Write a config file (copy an existing config file and change 'data_cfg').
- During development, you can simply override the config file from the command line (an example is shown after this list). The most important args are:
  - --learning_rate
  - --train_batch_size
  - --train_video_sample_num
  - --test_video_sample_num
  - --train_audio_sample_num
  - --test_audio_sample_num
  - --video_resolution
  - --train_epoch
  - --train_task
  - --test_task
- To control the task and the modality groups used, set train_task to 'task%modality_group1%modality_group2'. For example, use 'ret%ta' for finetuning text-to-audio retrieval, and 'ret%tv' or 'ret%tva' for finetuning text-to-video retrieval.
- Other settings:
  - --fp16 (default: True)
  - --checkpointing (default: False)
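As a concrete illustration of overriding the config from the command line, the sketch below assembles the documented flags and launches train.py programmatically. The flag names come from the list above; the paths, values, and output_dir name are illustrative, and in practice you would normally just append these flags to the shell command shown earlier.

```python
# Hedged sketch: launching train.py with command-line overrides of the config.
# Flag names come from the list above; paths and values are illustrative.
import subprocess

overrides = {
    "--learning_rate": "2e-5",
    "--train_batch_size": "64",
    "--train_video_sample_num": "4",
    "--test_video_sample_num": "8",
    "--train_epoch": "5",
    "--train_task": "ret%tva",   # retrieval using the text, video and audio modality groups
    "--test_task": "ret%tva",
}

cmd = ["python", "-m", "torch.distributed.launch", "--nproc_per_node=8",
       "--master_port", "32711", "./train.py",
       "--pretrain_dir", "output/VALOR_base",
       "--config", "./config/fast-retrieval-msrvtt.json",
       "--output_dir", "output/VALOR_base/ret-msrvtt-dev"]
for flag, value in overrides.items():
    cmd += [flag, value]

subprocess.run(cmd, check=True)
```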
If you find this code useful for your research, please consider citing:
```
@article{chen2023valor,
  title={VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset},
  author={Chen, Sihan and He, Xingjian and Guo, Longteng and Zhu, Xinxin and Wang, Weining and Tang, Jinhui and Liu, Jing},
  journal={arXiv preprint arXiv:2304.08345},
  year={2023}
}
```
This project is released under the MIT license.