Details of finetuning an Image-to-Video model #211
Comments
And just a follow-up question: approximately how much compute is needed to finetune an I2V model from a T2V model? Thanks!
SVD replaces the text embeddings with CLIP image embeddings, so the model no longer supports text prompt input, which is something people don't want to see.
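To illustrate the trade-off described above: in the SVD-style design the cross-attention context is either the text embeddings or the CLIP image embeddings, never both, which is why swapping them in drops prompt support. A minimal sketch with dummy arrays (the shapes are illustrative assumptions, not the real SVD/CogVideoX dimensions):

```python
import numpy as np

def cross_attn_context(text_emb, image_emb, use_image=False):
    """SVD-style swap: the cross-attention context is EITHER the text
    embeddings or the CLIP image embeddings.  Replacing text with image
    embeddings (use_image=True) removes text-prompt support entirely."""
    return image_emb if use_image else text_emb

text_emb = np.random.randn(77, 768).astype(np.float32)   # assumed CLIP text shape
image_emb = np.random.randn(1, 768).astype(np.float32)   # assumed CLIP image shape

ctx = cross_attn_context(text_emb, image_emb, use_image=True)
assert ctx.shape == (1, 768)   # text conditioning path is gone
```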
Thank you for the reply. And just to clarify, the first frame is only concatenated to the input channel-wise after the 3D VAE, and nothing else?
Yes.
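For anyone following along, the scheme confirmed above can be sketched with dummy tensors. This is a sketch under assumed shapes, not the author's actual code; the real CogVideoX latent dimensions will differ:

```python
import numpy as np

# Illustrative shapes (assumptions, not the real CogVideoX config):
# after the 3D VAE, video latents are [T, C, H, W].
T, C, H, W = 13, 16, 60, 90

video_latents = np.random.randn(T, C, H, W).astype(np.float32)

# The first frame is encoded by the same 3D VAE; unused time steps
# of the conditioning tensor are zero-padded.
first_frame_latent = np.random.randn(1, C, H, W).astype(np.float32)
image_cond = np.concatenate(
    [first_frame_latent, np.zeros((T - 1, C, H, W), dtype=np.float32)],
    axis=0,
)

# Channel-wise concatenation: the denoiser's input channels double
# from C to 2C; the text-conditioning path is left untouched.
model_input = np.concatenate([video_latents, image_cond], axis=1)
assert model_input.shape == (T, 2 * C, H, W)
```

The only architectural change is widening the denoiser's first convolution to accept 2C input channels.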
Great work! I noticed that in your implementation there are several options for adding the image condition. By default, self.noised_image_all_concat = False. Did you also try setting self.noised_image_all_concat = True? How was the performance?
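A guess at what this flag controls, based purely on its name (the flag comes from the thread; the behavior sketched here is an assumption, not confirmed by the authors): whether the first-frame conditioning latent is tiled across every frame or placed only at frame 0 with zero-padding elsewhere.

```python
import numpy as np

def build_image_condition(first_frame_latent, num_frames, all_concat=False):
    """Hypothetical behavior of `noised_image_all_concat`: tile the
    first-frame latent across all time steps (all_concat=True), or
    place it at frame 0 and zero-pad the rest (all_concat=False)."""
    if all_concat:
        return np.repeat(first_frame_latent, num_frames, axis=0)
    pad = np.zeros((num_frames - 1,) + first_frame_latent.shape[1:],
                   dtype=first_frame_latent.dtype)
    return np.concatenate([first_frame_latent, pad], axis=0)

frame = np.ones((1, 16, 60, 90), dtype=np.float32)
cond_default = build_image_condition(frame, num_frames=13, all_concat=False)
cond_all = build_image_condition(frame, num_frames=13, all_concat=True)

assert cond_default.shape == cond_all.shape == (13, 16, 60, 90)
assert cond_default[1:].sum() == 0.0   # later frames zero-padded
assert cond_all[1:].sum() > 0          # latent repeated on every frame
```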
Really awesome work!
The Appendix of the paper describes how to finetune an Image-to-Video model from a T2V model. Similar to SVD, a noised version of the condition image is concatenated channel-wise.
Is the condition image added to the model through any other methods? (For instance, SVD also replaces the text embeddings with CLIP image embeddings. Does CogVideoX I2V likewise replace the original text embeddings with some kind of image embeddings?)
Thank you.