
Inference code for video-text retrieval on MSRVTT! #18

Open
xmu-xiaoma666 opened this issue Feb 17, 2022 · 8 comments

Comments

@xmu-xiaoma666

Thank you for the great open-source code; I am excited about the outstanding zero-shot performance on video-text retrieval. Could you share the inference code for video-text retrieval on MSRVTT? Thanks!

@LiJunnan1992
Contributor

We will release code for video-text tasks soon, thanks.

@nikky4D

nikky4D commented Feb 18, 2022

I would like to build my own video-text retrieval demo, but I am not sure how to begin. Can you give me some idea of how to start? Given a text and a video, would simply averaging the per-frame image embeddings be good enough?

@LiJunnan1992
Contributor

I concatenate the frame embeddings as cross-attention input to the text encoder.
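
A minimal sketch of what that could look like (illustrative names such as `visual_encoder`, `text_encoder`, and `match_video_text` are assumptions, not the actual BLIP API): encode each frame, concatenate the frame embeddings along the sequence dimension, and let the text encoder cross-attend to the concatenated sequence.

```python
import torch

def match_video_text(visual_encoder, text_encoder, tokenizer, frames, caption, device="cuda"):
    # frames: (num_frames, 3, H, W), already resized and normalized
    frame_embeds = [visual_encoder(f.unsqueeze(0)) for f in frames]   # each (1, L, D): patch tokens + [CLS]
    video_embeds = torch.cat(frame_embeds, dim=1)                     # (1, num_frames * L, D), one long sequence
    video_atts = torch.ones(video_embeds.shape[:2], dtype=torch.long, device=device)

    text = tokenizer(caption, return_tensors="pt").to(device)
    # The text encoder cross-attends to the concatenated frame embeddings.
    out = text_encoder(input_ids=text.input_ids,
                       attention_mask=text.attention_mask,
                       encoder_hidden_states=video_embeds,
                       encoder_attention_mask=video_atts)
    return out.last_hidden_state[:, 0]   # [CLS] feature, e.g. fed to a matching head
```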

@nikky4D

nikky4D commented Feb 18, 2022

Thank you. If I understand correctly, you do the following:

  • Sample the video to get frames
  • Pass each frame through BLIP individually to get frame embeddings
  • Concatenate these frame embeddings (question: into a single sequence, or stacked into a block matrix?)
  • Pass the concatenated frame embeddings to the text encoder, e.g. blip_model(frame_embedding, text_embedding, ...)

Are these the correct steps?

@tongyao-zhu

Thanks for the great work! I'm also trying to process multiple images paired with one text. However, I've found that GPU memory becomes an issue when the concatenated sequence gets too long. Do you take all of the patch tokens (length 197) as the frame embedding, or only the [CLS] token's feature?

@LiJunnan1992
Contributor

Thank you. If I understand correctly, you do the following:

  • Sample the video to get frames
  • Pass each frame through BLIP individually to get frame embeddings
  • Concatenate these frame embeddings (question: into a single sequence, or stacked into a block matrix?)
  • Pass the concatenated frame embeddings to the text encoder, e.g. blip_model(frame_embedding, text_embedding, ...)

Are these the correct steps?

Yes, those are the correct steps @nikky4D: the frame embeddings are concatenated into a single sequence.
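
For reference, a short usage sketch of these steps (hypothetical names; `match_video_text` refers to the sketch in the earlier comment, and the video is assumed to be decoded elsewhere):

```python
import torch

def sample_frames(video, num_frames=8):
    # video: (T, 3, H, W) decoded frames; pick num_frames uniformly spaced frames
    idx = torch.linspace(0, video.shape[0] - 1, steps=num_frames).long()
    return video[idx]

frames = sample_frames(video, num_frames=8)
feat = match_video_text(visual_encoder, text_encoder, tokenizer,
                        frames, "a person is cooking in the kitchen")
```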

@LiJunnan1992
Contributor

Thanks for the great work! I'm also trying to process multiple images paired with one text. However, I've found that GPU memory becomes an issue when the concatenated sequence gets too long. Do you take all of the patch tokens (length 197) as the frame embedding, or only the [CLS] token's feature?

Hi @tongyao-zhu, I took all the patches as the frame embedding. The GPU memory issue is more likely due to the increased computation in the visual encoder. Memory can be reduced by sparse frame sampling or by using gradient checkpointing (change the config file).
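
To illustrate the sparse-sampling suggestion (not code from the repo; names are made up): fewer frames mean a shorter concatenated sequence, and for zero-shot inference the per-frame forward passes can run under `torch.no_grad()` so activations are not retained. Gradient checkpointing, as mentioned above, is a config-file option for reducing memory when fine-tuning.

```python
import torch

@torch.no_grad()
def encode_video_sparse(visual_encoder, video, num_frames=4):
    # video: (T, 3, H, W); sparser sampling -> shorter concatenated sequence
    idx = torch.linspace(0, video.shape[0] - 1, steps=num_frames).long()
    frame_embeds = [visual_encoder(video[i].unsqueeze(0)) for i in idx]  # one frame per forward pass
    return torch.cat(frame_embeds, dim=1)   # (1, num_frames * L, D)
```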

@LiJunnan1992
Contributor

LiJunnan1992 commented Feb 27, 2022

Hi all, zero-shot video-text retrieval code has been added; please check the updated README for instructions. Thanks!
