Reproducing the pretrain results on COCO+VG+CC+SBU #124

Closed
dyashuni opened this issue Jan 11, 2023 · 7 comments

@dyashuni

Hi @LiJunnan1992, thank you for the great work!

I'm trying to reproduce the pretraining on the CC + COCO + SBU + VG dataset. I get higher losses than the ones you reported in #19 (comment).
I use the following datasets:

  1. COCO (https://storage.googleapis.com/sfr-vision-language-research/datasets/coco_karpathy_train.json)
  2. Visual Genome (https://storage.googleapis.com/sfr-vision-language-research/datasets/vg_caption.json)
  3. CC3M+CC12M+SBU: Filtered web caption (https://storage.googleapis.com/sfr-vision-language-research/BLIP/datasets/ccs_filtered.json)
  4. CC3M+CC12M+SBU: Filtered synthetic caption by ViT-L (https://storage.googleapis.com/sfr-vision-language-research/BLIP/datasets/ccs_synthetic_filtered_large.json)

I didn't balance these datasets. I took the pretrain yaml config from https://github.com/salesforce/BLIP/blob/main/configs/pretrain.yaml and added the new datasets to the training.
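Concretely, the change amounts to extending the annotation list in the config. A rough sketch of what I mean (the train_file key is how I remember the repo's pretrain.yaml being laid out, and the paths are placeholders for wherever the json files were downloaded, not my exact setup):

```yaml
# configs/pretrain.yaml (excerpt, sketch) — only the annotation list is extended
train_file:
  - '/path/to/annotations/coco_karpathy_train.json'           # COCO
  - '/path/to/annotations/vg_caption.json'                    # Visual Genome
  - '/path/to/annotations/ccs_filtered.json'                  # CC3M+CC12M+SBU, filtered web captions
  - '/path/to/annotations/ccs_synthetic_filtered_large.json'  # CC3M+CC12M+SBU, synthetic captions (ViT-L)
```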

Could you please share your yaml config for pretraining on the CC + COCO + SBU + VG dataset?

@LiJunnan1992
Contributor

Hi @dyashuni, thanks for your interest.
Could you take a look at our LAVIS library? https://github.com/salesforce/LAVIS
It supports BLIP pre-training, among other features.

@dyashuni
Author

@LiJunnan1992 thank you, I will take a look at LAVIS.

@dyashuni
Author

dyashuni commented Jan 13, 2023

Hi @LiJunnan1992!
I finetuned 3 pretrained models on the COCO captioning task using train_caption.py. I used 32 GPUs for pretraining.

  1. BLIP w/ ViT-B 14M https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_14M.pth
  2. BLIP w/ ViT-B 129M https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth
  3. BLIP w/ ViT-B and CapFilt-L 129M https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_capfilt_large.pth

I got the following metrics:

  1. BLIP w/ ViT-B 14M images: "val_Bleu_4": 0.403 "val_CIDEr": 1.324
  2. BLIP w/ ViT-B 129M images: "val_Bleu_4": 0.397 "val_CIDEr": 1.318
  3. BLIP w/ ViT-B and CapFilt-L 129M images: "val_Bleu_4": 0.403 "val_CIDEr": 1.338

The BLIP w/ ViT-B 14M model performs almost the same as BLIP w/ ViT-B and CapFilt-L 129M, which contradicts the published results...

How is that possible?

@LiJunnan1992
Contributor

Could you reproduce BLIP's fine-tuning result if you use the same setting?
"I used 32 GPUs for pretraining." -> I assume you mean "finetuning"?

@dyashuni
Author

I used your caption_coco.yaml config for finetuning, so I used your parameters.
How many GPUs did you use for finetuning?

"I used 32 GPUs for pretraining." -> I assume you mean "finetuning"? Yes, I meant finetuning, thank you.

@LiJunnan1992
Contributor

I used 8 GPUs. With 32 GPUs, you should set batch_size=8 so that the total batch size remains 256.
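In other words, batch_size in the yaml config is the per-GPU batch size, so the effective batch size is num_GPUs × batch_size. A minimal sketch of the change, assuming the released caption_coco.yaml uses 32 per GPU on 8 GPUs:

```yaml
# configs/caption_coco.yaml (excerpt, sketch)
# effective batch = num_gpus x batch_size
#   assumed released setting:  8 GPUs x 32 per GPU = 256
#   on a 32-GPU cluster:      32 GPUs x  8 per GPU = 256
batch_size: 8   # per-GPU batch size when finetuning on 32 GPUs
```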

@dyashuni
Author

Thank you! I will try it.
