Clarification on Usage of Targets in CLIP Implementation #12
Comments
I have the same question. It seems to me that the likely rationale is that soft targets allow the model not to fully penalize a caption that also fits another, similar image in the batch, and vice versa for the images with respect to similar captions.
Hello,
I hope this message finds you well. I've been exploring the implementation of the CLIP model based on the paper 'Learning Transferable Visual Models From Natural Language Supervision'. In my review of the pseudocode provided in the paper, I noticed that the 'texts_loss' and 'images_loss' are calculated using binary matrices as targets, with values of 0 and 1. However, I observed that in the code available at this repository: https://github.com/moein-shariatnia/OpenAI-CLIP/blob/master/CLIP.py , the targets are computed by taking the softmax over the average of text similarity and image similarity.
I wanted to inquire about the rationale behind this difference in target calculation between the pseudocode in the paper and the implementation in your code. Could you kindly shed some light on why the targets are derived from the average of text and image similarities, followed by a softmax operation in the code?
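To make the difference concrete, here is a minimal sketch of the two target constructions as I understand them (variable names, sizes, and the temperature handling are illustrative, not the repository's exact code):

```python
import torch
import torch.nn.functional as F

# Toy batch of L2-normalized image/text embeddings (sizes are illustrative).
batch_size, dim, temperature = 4, 256, 1.0
image_embeddings = F.normalize(torch.randn(batch_size, dim), dim=-1)
text_embeddings = F.normalize(torch.randn(batch_size, dim), dim=-1)

# Text-to-image logits, used in both variants.
logits = (text_embeddings @ image_embeddings.T) / temperature

# (1) Paper pseudocode: hard targets -- the i-th caption matches only the i-th image,
#     i.e. the target matrix is the identity.
labels = torch.arange(batch_size)
hard_loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

# (2) Repository-style soft targets: the target distribution is the softmax of the
#     averaged image-image and text-text similarities, so a caption that also fits a
#     near-duplicate image in the batch is not fully penalized.
images_similarity = image_embeddings @ image_embeddings.T
texts_similarity = text_embeddings @ text_embeddings.T
targets = F.softmax((images_similarity + texts_similarity) / 2 / temperature, dim=-1)

texts_loss = (-targets * F.log_softmax(logits, dim=-1)).sum(dim=-1)
images_loss = (-targets.T * F.log_softmax(logits.T, dim=-1)).sum(dim=-1)
soft_loss = ((texts_loss + images_loss) / 2).mean()
```

With identical, perfectly separated pairs the two variants coincide, since the averaged similarity matrix then softmaxes toward the identity; they differ only when distinct items in the batch are similar to each other.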
I greatly appreciate your insights on this matter. I'm trying to gain a better understanding of the implementation, and any clarification or references you could provide would be immensely helpful.
Thank you for your time and consideration.