Clarification on Usage of Targets in CLIP Implementation #12

Open
junghye01 opened this issue Aug 22, 2023 · 1 comment

@junghye01

Hello,

I hope this message finds you well. I've been exploring the implementation of the CLIP model based on the paper 'Learning Transferable Visual Models From Natural Language Supervision'. In my review of the pseudocode provided in the paper, I noticed that the 'texts_loss' and 'images_loss' are calculated using binary matrices as targets, with values of 0 and 1. However, in the code available in this repository (https://github.com/moein-shariatnia/OpenAI-CLIP/blob/master/CLIP.py), the targets are computed by taking the softmax over the average of the text similarity and image similarity matrices.
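To make the difference concrete, here is a minimal PyTorch sketch of the two target schemes as I understand them (the function and variable names are mine, not taken verbatim from the repository, and any temperature scaling is omitted for simplicity):

```python
import torch
import torch.nn.functional as F

def paper_targets(batch_size: int) -> torch.Tensor:
    # Paper pseudocode: the i-th image matches the i-th text, so each row's
    # target is a one-hot (0/1) label on the diagonal.
    return torch.eye(batch_size)

def soft_targets(image_embeddings: torch.Tensor,
                 text_embeddings: torch.Tensor) -> torch.Tensor:
    # Repository-style targets, as I read them: softmax over the average of
    # the image-image and text-text similarity matrices.
    images_similarity = image_embeddings @ image_embeddings.T
    texts_similarity = text_embeddings @ text_embeddings.T
    return F.softmax((images_similarity + texts_similarity) / 2, dim=-1)

# Either target matrix is then used in a cross-entropy against the
# image-text logits (text_embeddings @ image_embeddings.T).
```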

I wanted to inquire about the rationale behind this difference in target calculation between the pseudocode in the paper and the implementation in your code. Could you kindly shed some light on why the targets are derived from the average of text and image similarities, followed by a softmax operation in the code?

I greatly appreciate your insights on this matter. I'm striving to gain a better understanding of the implementation, and any clarification or references you could provide would be immensely helpful.

Thank you for your time and consideration.

@RicRicci22

I have the same question. It seems to me that this choice allows the model not to fully penalize a caption that also fits another, similar image in the batch, and vice versa for text.
My question is: is this loss meant to be used only for fine-tuning CLIP? It seems to me that if the text encoder and image encoder are trained from scratch, the initial softmax will produce targets that are spread across all the images in the batch, preventing the model from learning the correct associations.
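To illustrate the concern, with randomly initialized encoders the soft targets are already spread across the batch. A toy example, reusing the target definitions sketched above (not code from the repository):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, dim = 8, 256

# Random normalized embeddings stand in for untrained image/text encoders.
image_embeddings = F.normalize(torch.randn(batch, dim), dim=-1)
text_embeddings = F.normalize(torch.randn(batch, dim), dim=-1)

images_similarity = image_embeddings @ image_embeddings.T
texts_similarity = text_embeddings @ text_embeddings.T
soft = F.softmax((images_similarity + texts_similarity) / 2, dim=-1)
hard = torch.eye(batch)  # the paper's 0/1 targets

# Each row of `soft` still favors the diagonal slightly (self-similarity is 1),
# but most of the probability mass is spread over the other items in the batch,
# whereas `hard` puts all of the mass on the matching pair.
print(soft)
print(hard)
```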
