Details of finetuning an Image-to-Video model #211

Closed · tedfeng424 opened this issue Aug 30, 2024 · 5 comments
Labels: good first issue (Good for newcomers)

@tedfeng424
Really awesome work!

The Appendix of the paper mentions how to finetune an Image-to-Video model from a T2V model. Similar to SVD, a noised version of the condition image is concatenated channel-wise.

Is the condition image also fed into the model through other methods? (For instance, SVD additionally replaces the text embeddings with CLIP image embeddings; does CogVideoX I2V also replace the original text embeddings with some kind of image embeddings?)

Thank you.

[Screenshot of the relevant appendix section attached]

@tedfeng424 (Author)

And just a follow-up question: approximately how much compute is needed to finetune an I2V model from a T2V model?

Thanks!

@tengjiayan20 (Contributor)

SVD replaces the text embeddings with CLIP image embeddings, so the model no longer supports text prompt input, which is something people don't want.
As for the optimal amount of training, we are still in the exploration stage.
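
(For context, this is roughly what the SVD-style conditioning mentioned above looks like. It is a minimal sketch, not code from the CogVideoX or SVD repositories; the model name and shapes are illustrative:)

```python
# Illustrative sketch of SVD-style image conditioning (NOT CogVideoX's approach):
# CLIP image embeddings replace text embeddings as the cross-attention context,
# which is why such a model can no longer take a text prompt.
import torch
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
image_encoder = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def image_conditioning(pil_image):
    pixel_values = processor(images=pil_image, return_tensors="pt").pixel_values
    image_embeds = image_encoder(pixel_values=pixel_values).image_embeds  # (1, 768)
    # Fed to the denoiser's cross-attention where text embeddings would normally go.
    return image_embeds.unsqueeze(1)  # (1, 1, 768): a single-token context
```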

@tedfeng424 (Author)

Thank you for the reply. And just to clarify: the first frame is only concatenated to the input channel-wise after the 3D VAE, and nothing else?

@tengjiayan20 (Contributor)

> Thank you for the reply. And just to clarify: the first frame is only concatenated to the input channel-wise after the 3D VAE, and nothing else?

Yes.
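
(For readers following along, a minimal sketch of that confirmed pipeline. The tensor names and shapes here are assumptions for illustration, not the actual CogVideoX code:)

```python
import torch

# Hypothetical shapes for illustration: B batch, T latent frames, C latent channels.
B, T, C, H, W = 1, 13, 16, 60, 90

noisy_latents = torch.randn(B, T, C, H, W)       # noised video latents from the 3D VAE
first_frame_latent = torch.randn(B, 1, C, H, W)  # 3D-VAE encoding of the condition image

# Pad the single-frame condition to T frames with zeros, then concat channel-wise.
image_cond = torch.cat([first_frame_latent,
                        torch.zeros(B, T - 1, C, H, W)], dim=1)  # (B, T, C, H, W)
model_input = torch.cat([noisy_latents, image_cond], dim=2)      # (B, T, 2*C, H, W)
```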

@yzy-thu added the good first issue label Sep 1, 2024
@hw-liang commented Sep 24, 2024

Great work! I noticed that in your implementation there are some options for adding the image condition:

```python
if self.noised_image_all_concat:
    image = image.repeat(1, x.shape[1], 1, 1, 1)
else:
    image = torch.concat([image, torch.zeros_like(x[:, 1:])], dim=1)
```

By default, self.noised_image_all_concat = False. Did you also try setting self.noised_image_all_concat = True? How is the performance?
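
(For reference, the two branches differ only in how the single condition-frame latent is expanded across the frame axis before the channel-wise concat. A small shape check with dummy tensors, where `x` is assumed to be the (B, T, C, H, W) latent and `image` the single-frame condition latent, as in the snippet above:)

```python
import torch

B, T, C, H, W = 1, 13, 16, 60, 90
x = torch.randn(B, T, C, H, W)      # noisy video latents
image = torch.randn(B, 1, C, H, W)  # condition-frame latent

# noised_image_all_concat = True: repeat the condition latent at every frame position.
all_concat = image.repeat(1, x.shape[1], 1, 1, 1)                      # (B, T, C, H, W)

# noised_image_all_concat = False (default): condition only at frame 0, zeros elsewhere.
first_only = torch.concat([image, torch.zeros_like(x[:, 1:])], dim=1)  # (B, T, C, H, W)

# Either way, the result is concatenated with x channel-wise before the transformer.
```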
