Details of finetuning an Image-to-Video model #211

Closed · tedfeng424 opened this issue Aug 30, 2024 · 5 comments
Labels: good first issue (Good for newcomers)

@tedfeng424
Really awesome work!

The Appendix of the paper mentions how to finetune an Image-to-Video model from a T2V model. Similar to SVD, a noised version of the condition image is concatenated channel-wise.

Is the condition image also fed into the model through other methods? (For instance, SVD additionally replaces the text embeddings with CLIP image embeddings; does CogVideoX I2V also replace the original text embeddings with some kind of image embeddings?)

Thank you.

[Screenshot of the relevant appendix section attached]

@tedfeng424 (Author)

And just a follow-up question: approximately how much compute is needed to finetune an I2V model from a T2V model?

Thanks!

@tengjiayan20 (Contributor)

SVD replaces the text embeddings with CLIP image embeddings, so the model no longer supports text prompt input, which is something people don't want.
As for the optimal amount of training, we are still in the exploration stage.
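
(For context, this is roughly what the SVD-style conditioning mentioned above looks like. It is a minimal sketch, not code from the CogVideoX or SVD repositories; the model name and shapes are illustrative:)

```python
# Illustrative sketch of SVD-style image conditioning (NOT CogVideoX's approach):
# CLIP image embeddings replace text embeddings as the cross-attention context,
# which is why such a model can no longer take a text prompt.
import torch
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
image_encoder = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def image_conditioning(pil_image):
    pixel_values = processor(images=pil_image, return_tensors="pt").pixel_values
    image_embeds = image_encoder(pixel_values=pixel_values).image_embeds  # (1, 768)
    # Fed to the denoiser's cross-attention where text embeddings would normally go.
    return image_embeds.unsqueeze(1)  # (1, 1, 768): a single-token context
```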

@tedfeng424 (Author)

Thank you for the reply. And just to clarify: the first frame is only concatenated to the input channel-wise after the 3D VAE, and nothing else?

@tengjiayan20 (Contributor)

> Thank you for the reply. And just to clarify: the first frame is only concatenated to the input channel-wise after the 3D VAE, and nothing else?

Yes.
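
(For readers following along, a minimal sketch of that confirmed pipeline. The tensor names and shapes here are assumptions for illustration, not the actual CogVideoX code:)

```python
import torch

# Hypothetical shapes for illustration: B batch, T latent frames, C latent channels.
B, T, C, H, W = 1, 13, 16, 60, 90

noisy_latents = torch.randn(B, T, C, H, W)       # noised video latents from the 3D VAE
first_frame_latent = torch.randn(B, 1, C, H, W)  # 3D-VAE encoding of the condition image

# Pad the single-frame condition to T frames with zeros, then concat channel-wise.
image_cond = torch.cat([first_frame_latent,
                        torch.zeros(B, T - 1, C, H, W)], dim=1)  # (B, T, C, H, W)
model_input = torch.cat([noisy_latents, image_cond], dim=2)      # (B, T, 2*C, H, W)
```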

@yzy-thu added the good first issue label Sep 1, 2024
@hw-liang commented Sep 24, 2024

Great work! I noticed that in your implementation there are some options for adding the image condition:

```python
if self.noised_image_all_concat:
    image = image.repeat(1, x.shape[1], 1, 1, 1)
else:
    image = torch.concat([image, torch.zeros_like(x[:, 1:])], dim=1)
```

By default, self.noised_image_all_concat = False. Did you also try setting self.noised_image_all_concat = True? How is the performance?
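
(For reference, the two branches differ only in how the single condition-frame latent is expanded across the frame axis before the channel-wise concat. A small shape check with dummy tensors, where `x` is assumed to be the (B, T, C, H, W) latent and `image` the single-frame condition latent, as in the snippet above:)

```python
import torch

B, T, C, H, W = 1, 13, 16, 60, 90
x = torch.randn(B, T, C, H, W)      # noisy video latents
image = torch.randn(B, 1, C, H, W)  # condition-frame latent

# noised_image_all_concat = True: repeat the condition latent at every frame position.
all_concat = image.repeat(1, x.shape[1], 1, 1, 1)                      # (B, T, C, H, W)

# noised_image_all_concat = False (default): condition only at frame 0, zeros elsewhere.
first_only = torch.concat([image, torch.zeros_like(x[:, 1:])], dim=1)  # (B, T, C, H, W)

# Either way, the result is concatenated with x channel-wise before the transformer.
```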
