
Feature extraction for image retrieval and fine tuning #25

Open · enrico310786 opened this issue Mar 5, 2022 · 3 comments

@enrico310786
Hi, congratulations on the results.

My questions concern the correct use of the output features for the retrieval task and the fine-tuning phase.

  1. In the colab notebook, in the 'Feature extraction' section, the model has three possible outputs: multimodal_feature, image_feature and text_feature. Are they the outputs of, respectively, the image-grounded text encoder, the image encoder and the text encoder? So, if I want to check whether two image+text pairs are similar, I have to measure the distance between their multimodal_features, right? If, on the other hand, I just want to verify the similarity between images only or texts only, I have to use just the image_feature or the text_feature, right? (For reference, a sketch of the extraction calls follows this list.)
  2. To perform the feature extraction, I see that the colab notebook uses model_base.pth. May I use the models already finetuned for Image-Text Retrieval (COCO), i.e. BLIP w/ ViT-B or BLIP w/ ViT-L? Are they able to extract the multimodal_feature, image_feature and text_feature in the same way as model_base?
  3. Instead of fine-tuning the Image-Text Retrieval model on the COCO dataset, may I use the same script with another dataset, for example with the image captions in Italian instead of English? If the language changes, is it necessary to change parameters in the BERT language model, or does the model adapt to Italian during fine-tuning?
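
For reference, here is roughly what those extraction calls look like, modeled on the repo's demo notebook; this is a sketch, and the checkpoint path, example image and preprocessing constants are assumptions taken from the demo, not verified here:

```python
import torch
from PIL import Image
from torchvision import transforms
from models.blip import blip_feature_extractor  # from the BLIP repo

device = 'cuda' if torch.cuda.is_available() else 'cpu'
image_size = 224

# Preprocessing as in the demo notebook (normalization constants assumed)
transform = transforms.Compose([
    transforms.Resize((image_size, image_size),
                      interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                         (0.26862954, 0.26130258, 0.27577711)),
])
image = transform(Image.open('example.jpg').convert('RGB')).unsqueeze(0).to(device)
caption = 'a picture of a dog'

model = blip_feature_extractor(pretrained='model_base.pth',
                               image_size=image_size, vit='base')
model.eval().to(device)

with torch.no_grad():
    # The 'mode' argument selects which encoder produces the features
    multimodal_feature = model(image, caption, mode='multimodal')[0, 0]  # image-grounded text encoder
    image_feature = model(image, caption, mode='image')[0, 0]            # image encoder (ViT)
    text_feature = model(image, caption, mode='text')[0, 0]              # text encoder
```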

Many thanks, Enrico

@LiJunnan1992
Contributor

Hi, thanks for your questions:

  1. About computing image-text similarity, we provide two ways in the colab notebook: (a) using the multimodal feature with the image-text matching (ITM) head, and (b) using the unimodal features and computing their cosine similarity (see the sketch after this list).
  2. Yes, you can use the finetuned models.
  3. You can still use our pre-trained BERT, but the tokenizer may not be the most suitable for Italian, and the model is also not pre-trained on non-English languages.
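
A minimal sketch of the two options, modeled on the demo's `blip_itm` usage; the checkpoint name and the `match_head` values are assumptions taken from the repo's demo, with `image` and `caption` prepared as in the sketch above:

```python
import torch
import torch.nn.functional as F
from models.blip_itm import blip_itm  # from the BLIP repo

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = blip_itm(pretrained='model_base_retrieval_coco.pth',
                 image_size=384, vit='base')
model.eval().to(device)

with torch.no_grad():
    # (a) ITM head on the multimodal feature: a 2-way classifier;
    #     the softmax probability of the "match" class is the score
    itm_output = model(image, caption, match_head='itm')
    itm_score = F.softmax(itm_output, dim=1)[:, 1]

    # (b) Cosine similarity of the projected unimodal features (ITC)
    itc_score = model(image, caption, match_head='itc')
```

The ITC score is cheaper to compute (no cross-attention pass), which is why, in the paper's retrieval setup, it is used to shortlist candidates that the slower but more accurate ITM head then re-ranks.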

@enrico310786
Author

enrico310786 commented Mar 6, 2022

Hi, thanks for the hint.

If I want to use an Italian BERT with the correct Italian tokenizer, in order to pre-train the model on the pre-training datasets (with the captions translated into Italian), could I use the pre-training script?

python -m torch.distributed.run --nproc_per_node=8 pretrain.py --config ./configs/Pretrain.yaml --output_dir output/Pretrain

After that, I could fine-tune the pre-trained model on the COCO dataset, again with the captions translated into Italian. Would that make sense?

@LiJunnan1992
Contributor

Yes, you could do that.
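
As a starting point, here is a sketch of the change this implies. In the repo, `init_tokenizer()` in models/blip.py is where `bert-base-uncased` is loaded; the Italian checkpoint name below is an assumption (any Italian BERT from the HuggingFace hub should work the same way):

```python
from transformers import BertTokenizer

def init_tokenizer():
    # Swap the English checkpoint for an Italian one
    # ('dbmdz/bert-base-italian-uncased' is an assumed example)
    tokenizer = BertTokenizer.from_pretrained('dbmdz/bert-base-italian-uncased')
    # BLIP adds its own special tokens on top of the base vocabulary
    tokenizer.add_special_tokens({'bos_token': '[DEC]'})
    tokenizer.add_special_tokens({'additional_special_tokens': ['[ENC]']})
    tokenizer.enc_token_id = tokenizer.additional_special_tokens_ids[0]
    return tokenizer
```

Note that, if I read the pre-training code correctly, blip_pretrain.py also initializes the text encoder weights from 'bert-base-uncased' via BertModel.from_pretrained, so that checkpoint name (and the vocab size in the med_config) would need the same swap; changing only the tokenizer would leave the model's embeddings mismatched with the new vocabulary.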
