Inference code for video-text retrieval on MSRVTT! #18
Comments
We will release code for video-text tasks soon, thanks. |
I would like to build my own video-text retrieval demo, but I am not sure how to begin. Can you give me some idea of how to start? Given a text and a video, would simply averaging the per-frame image embeddings be good enough? |
I concatenate the frame embeddings as cross-attention input to the text encoder. |
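To make the two options in this exchange concrete, here is a minimal shape-level sketch rather than the repository's actual code. It assumes each frame has already been encoded into a `(num_tokens, dim)` array by the visual encoder; the function names, the 8 frames, and the 768-dimensional features are illustrative assumptions, and random arrays stand in for real embeddings.

```python
import numpy as np

def mean_pool_frames(frame_embeds):
    """Average over frames: (num_frames, num_tokens, dim) -> (1, num_tokens, dim)."""
    return frame_embeds.mean(axis=0, keepdims=True)

def concat_frames(frame_embeds):
    """Concatenate frames along the token axis:
    (num_frames, num_tokens, dim) -> (1, num_frames * num_tokens, dim)."""
    num_frames, num_tokens, dim = frame_embeds.shape
    return frame_embeds.reshape(1, num_frames * num_tokens, dim)

# Stand-in for visual encoder output: 8 frames, 197 tokens each, 768-dim (assumed sizes).
rng = np.random.default_rng(0)
frame_embeds = rng.standard_normal((8, 197, 768)).astype(np.float32)

pooled = mean_pool_frames(frame_embeds)    # averaging baseline from the question
video_tokens = concat_frames(frame_embeds) # concatenation described in the reply

print(pooled.shape)        # (1, 197, 768)
print(video_tokens.shape)  # (1, 1576, 768)
```

With concatenation, the sequence the text encoder cross-attends to grows linearly with the number of frames, whereas averaging keeps it at a single frame's length but discards per-frame detail.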
Thank you. If I understand correctly, you do the following:
Are these the correct steps? |
Thanks for the great work! I'm also trying to process multiple images paired with one text. However, I've found that GPU memory becomes an issue when the concatenated sequence gets too long. Do you take all of the patches (length 197) as the frame embedding, or only the [CLS] token's feature? |
Yes, those are the correct steps, @nikky4D. |
Hi @tongyao-zhu, I took all the patches as the frame embedding. The GPU memory issue is more likely due to the increased computation in the visual encoder. Memory usage can be reduced by sparse frame sampling or by enabling gradient_checkpoint (a change in the config file). |
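As a rough illustration of sparse frame sampling, the sketch below picks a small number of evenly spaced frame indices from a video, so that only those frames pass through the visual encoder. It is a generic uniform sampler, not necessarily the strategy used in the repository; `sample_frame_indices` and the 300-frame video are hypothetical, and 197 is the per-frame patch count mentioned above.

```python
def sample_frame_indices(total_frames, num_samples):
    """Return evenly spaced frame indices (the midpoint of each equal-length segment)."""
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    # Never ask for more frames than the video contains.
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]

# Hypothetical 300-frame video reduced to 8 frames.
indices = sample_frame_indices(total_frames=300, num_samples=8)
print(indices)  # [18, 56, 93, 131, 168, 206, 243, 281]

# Concatenated sequence length if every patch of every sampled frame is kept.
tokens_per_frame = 197
print(len(indices) * tokens_per_frame)  # 1576
```

Fewer sampled frames reduce both the number of visual encoder forward passes and the length of the concatenated sequence.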
Hi all, zero-shot video-text retrieval code has been added, please check the updated readme for instructions. Thanks! |
Thank you for your great open-source code. I am excited by the outstanding zero-shot performance on video-text retrieval. Could you share the inference code for video-text retrieval on MSRVTT? Thanks!