
Inference code for video-text retrieval on MSRVTT! #18

Open
xmu-xiaoma666 opened this issue Feb 17, 2022 · 8 comments

Comments

@xmu-xiaoma666

Thank you for the great open-source code; I am excited about the outstanding zero-shot performance on video-text retrieval. Could you share the inference code for video-text retrieval on MSRVTT? Thanks!

@LiJunnan1992
Contributor

We will release code for video-text tasks soon, thanks.

@nikky4D

nikky4D commented Feb 18, 2022

I would like to build my own video-text retrieval demo, but I am not sure how to begin. Can you give me some idea of how to start? Given a text and a video, would simply averaging the per-frame image embeddings be good enough?

@LiJunnan1992
Contributor

I concatenate the frame embeddings as cross-attention input to the text encoder.
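
A minimal sketch of what that could look like (illustrative names such as `visual_encoder`, `text_encoder`, and `match_video_text` are assumptions, not the actual BLIP API): encode each frame, concatenate the frame embeddings along the sequence dimension, and let the text encoder cross-attend to the concatenated sequence.

```python
import torch

def match_video_text(visual_encoder, text_encoder, tokenizer, frames, caption, device="cuda"):
    # frames: (num_frames, 3, H, W), already resized and normalized
    frame_embeds = [visual_encoder(f.unsqueeze(0)) for f in frames]   # each (1, L, D): patch tokens + [CLS]
    video_embeds = torch.cat(frame_embeds, dim=1)                     # (1, num_frames * L, D), one long sequence
    video_atts = torch.ones(video_embeds.shape[:2], dtype=torch.long, device=device)

    text = tokenizer(caption, return_tensors="pt").to(device)
    # The text encoder cross-attends to the concatenated frame embeddings.
    out = text_encoder(input_ids=text.input_ids,
                       attention_mask=text.attention_mask,
                       encoder_hidden_states=video_embeds,
                       encoder_attention_mask=video_atts)
    return out.last_hidden_state[:, 0]   # [CLS] feature, e.g. fed to a matching head
```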

@nikky4D

nikky4D commented Feb 18, 2022

Thank you. If I understand correctly, you do the following:

  • Sample the video to get frames
  • Pass each frame through BLIP individually to get frame embeddings
  • Concatenate these frame embeddings (question: into a single sequence, or stacked into a block matrix?)
  • Pass the concatenated frame embeddings to the text encoder, e.g. blip_model(frame_embedding, text_embedding, ...)

Are these the correct steps?

@tongyao-zhu

Thanks for the great work! I'm also trying to process multiple images paired with one text. However, I've found that GPU memory becomes an issue when the concatenated sequence gets too long. Do you take all of the patch tokens (length 197) as the frame embedding, or only the [CLS] token's feature?

@LiJunnan1992
Contributor

Thank you. If I understand correctly, you do the following:

  • Sample the video to get frames
  • Pass each frame through BLIP individually to get frame embeddings
  • Concatenate these frame embeddings (question: into a single sequence, or stacked into a block matrix?)
  • Pass the concatenated frame embeddings to the text encoder, e.g. blip_model(frame_embedding, text_embedding, ...)

Are these the correct steps?

Yes, those are the correct steps @nikky4D: the frame embeddings are concatenated into a single sequence.
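
For reference, a short usage sketch of these steps (hypothetical names; `match_video_text` refers to the sketch in the earlier comment, and the video is assumed to be decoded elsewhere):

```python
import torch

def sample_frames(video, num_frames=8):
    # video: (T, 3, H, W) decoded frames; pick num_frames uniformly spaced frames
    idx = torch.linspace(0, video.shape[0] - 1, steps=num_frames).long()
    return video[idx]

frames = sample_frames(video, num_frames=8)
feat = match_video_text(visual_encoder, text_encoder, tokenizer,
                        frames, "a person is cooking in the kitchen")
```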

@LiJunnan1992
Contributor

Thanks for the great work! I'm also trying to process multiple images paired with one text. However, I've found that GPU memory becomes an issue when the concatenated sequence gets too long. Do you take all of the patch tokens (length 197) as the frame embedding, or only the [CLS] token's feature?

Hi @tongyao-zhu, I took all the patches as the frame embedding. The GPU memory issue is more likely due to the increased computation in the visual encoder. Memory can be reduced by sparse frame sampling or by using gradient checkpointing (change the config file).
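
To illustrate the sparse-sampling suggestion (not code from the repo; names are made up): fewer frames mean a shorter concatenated sequence, and for zero-shot inference the per-frame forward passes can run under `torch.no_grad()` so activations are not retained. Gradient checkpointing, as mentioned above, is a config-file option for reducing memory when fine-tuning.

```python
import torch

@torch.no_grad()
def encode_video_sparse(visual_encoder, video, num_frames=4):
    # video: (T, 3, H, W); sparser sampling -> shorter concatenated sequence
    idx = torch.linspace(0, video.shape[0] - 1, steps=num_frames).long()
    frame_embeds = [visual_encoder(video[i].unsqueeze(0)) for i in idx]  # one frame per forward pass
    return torch.cat(frame_embeds, dim=1)   # (1, num_frames * L, D)
```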

@LiJunnan1992
Contributor

LiJunnan1992 commented Feb 27, 2022

Hi all, zero-shot video-text retrieval code has been added; please check the updated README for instructions. Thanks!
