Reproducing the pretrain results on COCO+VG+CC+SBU #124

Closed
dyashuni opened this issue Jan 11, 2023 · 7 comments

@dyashuni

Hi @LiJunnan1992, thank you for the great work!

I'm trying to reproduce the pretraining on the CC + COCO + SBU + VG dataset. I get higher losses than the ones you reported in #19 (comment).
I use the following datasets:

  1. COCO (https://storage.googleapis.com/sfr-vision-language-research/datasets/coco_karpathy_train.json)
  2. Visual Genome (https://storage.googleapis.com/sfr-vision-language-research/datasets/vg_caption.json)
  3. CC3M+CC12M+SBU: Filtered web caption (https://storage.googleapis.com/sfr-vision-language-research/BLIP/datasets/ccs_filtered.json)
  4. CC3M+CC12M+SBU: Filtered synthetic caption by ViT-L (https://storage.googleapis.com/sfr-vision-language-research/BLIP/datasets/ccs_synthetic_filtered_large.json)

I didn't balance these datasets. I took the pretrain yaml config from https://github.com/salesforce/BLIP/blob/main/configs/pretrain.yaml and added the new datasets to the training.
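Concretely, the change amounts to extending the annotation list in the config. A rough sketch of what I mean (the train_file key is how I remember the repo's pretrain.yaml being laid out, and the paths are placeholders for wherever the json files were downloaded, not my exact setup):

```yaml
# configs/pretrain.yaml (excerpt, sketch) — only the annotation list is extended
train_file:
  - '/path/to/annotations/coco_karpathy_train.json'           # COCO
  - '/path/to/annotations/vg_caption.json'                    # Visual Genome
  - '/path/to/annotations/ccs_filtered.json'                  # CC3M+CC12M+SBU, filtered web captions
  - '/path/to/annotations/ccs_synthetic_filtered_large.json'  # CC3M+CC12M+SBU, synthetic captions (ViT-L)
```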

Could you please share your yaml config for pretraining on the CC + COCO + SBU + VG dataset?

@LiJunnan1992
Contributor

Hi @dyashuni, thanks for your interest.
Could you take a look at our LAVIS library? https://github.com/salesforce/LAVIS
It supports BLIP pre-training, among other features.

@dyashuni
Author

@LiJunnan1992 thank you, I will take a look at LAVIS.

@dyashuni
Author

dyashuni commented Jan 13, 2023

Hi @LiJunnan1992!
I finetuned 3 pretrained models on the COCO captioning task using train_caption.py. I used 32 GPUs for pretraining.

  1. BLIP w/ ViT-B 14M https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_14M.pth
  2. BLIP w/ ViT-B 129M https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth
  3. BLIP w/ ViT-B and CapFilt-L 129M https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_capfilt_large.pth

I got the following metrics:

  1. BLIP w/ ViT-B 14M images: "val_Bleu_4": 0.403 "val_CIDEr": 1.324
  2. BLIP w/ ViT-B 129M images: "val_Bleu_4": 0.397 "val_CIDEr": 1.318
  3. BLIP w/ ViT-B and CapFilt-L 129M images: "val_Bleu_4": 0.403 "val_CIDEr": 1.338

The BLIP w/ ViT-B 14M model performs almost the same as BLIP w/ ViT-B and CapFilt-L 129M, which contradicts the published results...

How is that possible?

@LiJunnan1992
Contributor

Could you reproduce BLIP's fine-tuning result if you use the same setting?
"I used 32 GPUs for pretraining." -> I assume you mean "finetuning"?

@dyashuni
Author

I used your caption_coco.yaml config for finetuning, so I used your parameters.
How many GPUs did you use for finetuning?

"I used 32 GPUs for pretraining." -> I assume you mean "finetuning"? Yes, I meant finetuning, thank you.

@LiJunnan1992
Contributor

I used 8 GPUs. With 32 GPUs, you should set batch_size=8 so that the total batch size remains 256.
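In other words, batch_size in the yaml config is the per-GPU batch size, so the effective batch size is num_GPUs × batch_size. A minimal sketch of the change, assuming the released caption_coco.yaml uses 32 per GPU on 8 GPUs:

```yaml
# configs/caption_coco.yaml (excerpt, sketch)
# effective batch = num_gpus x batch_size
#   assumed released setting:  8 GPUs x 32 per GPU = 256
#   on a 32-GPU cluster:      32 GPUs x  8 per GPU = 256
batch_size: 8   # per-GPU batch size when finetuning on 32 GPUs
```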

@dyashuni
Author

Thank you! I will try it.
