Clarification on Usage of Targets in CLIP Implementation #12

Open
junghye01 opened this issue Aug 22, 2023 · 1 comment

@junghye01

Hello,

I hope this message finds you well. I've been exploring the implementation of the CLIP model based on the paper 'Learning Transferable Visual Models From Natural Language Supervision'. In my review of the pseudocode provided in the paper, I noticed that the 'texts_loss' and 'images_loss' are calculated using binary matrices as targets, with values of 0 and 1. However, in the code available in this repository (https://github.com/moein-shariatnia/OpenAI-CLIP/blob/master/CLIP.py), the targets are computed by taking the softmax over the average of the text similarity and image similarity matrices.
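To make the difference concrete, here is a minimal PyTorch sketch of the two target schemes as I understand them (the function and variable names are mine, not taken verbatim from the repository, and any temperature scaling is omitted for simplicity):

```python
import torch
import torch.nn.functional as F

def paper_targets(batch_size: int) -> torch.Tensor:
    # Paper pseudocode: the i-th image matches the i-th text, so each row's
    # target is a one-hot (0/1) label on the diagonal.
    return torch.eye(batch_size)

def soft_targets(image_embeddings: torch.Tensor,
                 text_embeddings: torch.Tensor) -> torch.Tensor:
    # Repository-style targets, as I read them: softmax over the average of
    # the image-image and text-text similarity matrices.
    images_similarity = image_embeddings @ image_embeddings.T
    texts_similarity = text_embeddings @ text_embeddings.T
    return F.softmax((images_similarity + texts_similarity) / 2, dim=-1)

# Either target matrix is then used in a cross-entropy against the
# image-text logits (text_embeddings @ image_embeddings.T).
```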

I wanted to inquire about the rationale behind this difference in target calculation between the pseudocode in the paper and the implementation in your code. Could you kindly shed some light on why the targets are derived from the average of text and image similarities, followed by a softmax operation in the code?

I greatly appreciate your insights on this matter. I'm striving to gain a better understanding of the implementation, and any clarification or references you could provide would be immensely helpful.

Thank you for your time and consideration.

@RicRicci22

I have the same question. It seems to me that this choice allows the model not to fully penalize a caption that also fits another, similar image in the batch, and vice versa for text.
My question is: is this loss meant to be used only for fine-tuning CLIP? It seems to me that if the text encoder and image encoder are trained from scratch, the initial softmax will produce targets that are spread across all the images in the batch, preventing the model from learning the correct associations.
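To illustrate the concern, with randomly initialized encoders the soft targets are already spread across the batch. A toy example, reusing the target definitions sketched above (not code from the repository):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, dim = 8, 256

# Random normalized embeddings stand in for untrained image/text encoders.
image_embeddings = F.normalize(torch.randn(batch, dim), dim=-1)
text_embeddings = F.normalize(torch.randn(batch, dim), dim=-1)

images_similarity = image_embeddings @ image_embeddings.T
texts_similarity = text_embeddings @ text_embeddings.T
soft = F.softmax((images_similarity + texts_similarity) / 2, dim=-1)
hard = torch.eye(batch)  # the paper's 0/1 targets

# Each row of `soft` still favors the diagonal slightly (self-similarity is 1),
# but most of the probability mass is spread over the other items in the batch,
# whereas `hard` puts all of the mass on the matching pair.
print(soft)
print(hard)
```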
