Add video text to text docs #33164
Conversation
Caused by #31292, will work on it.
Very nice, thanks for adding! 🔥
Remember to add the doc to the toctree!
Co-authored-by: Steven Liu <[email protected]>
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
LGTM - thanks for adding! ❤️
Yay, thanks for adding this! Looks good, but I was thinking of adding an inference example with pure VideoLLMs, WDYT?
Now we can preprocess the inputs.

This model has a prompt template that looks like the following. First we'll put all sampled frames into one list. Since we have eight frames in each video, we will insert 12 `<image>` tokens into our prompt. Note that we are adding `assistant` at the end to trigger the model to give answers. We can then preprocess.
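For reference, a rough sketch of what the quoted passage describes, assuming a llava-interleave processor and model are already loaded (see the loading sketch further down) and that `video_1_frames` / `video_2_frames` are illustrative names for the lists of sampled frames; the Qwen-style chat format is taken from the model card, so treat the exact string as an assumption:

```python
# Sketch only: `processor`, `model`, and the two frame lists are assumed to exist already.
frames = video_1_frames + video_2_frames      # put all sampled frames into one list
image_tokens = "<image>" * len(frames)        # one <image> token per frame
question = "Are these two videos showing the same activity?"
prompt = f"<|im_start|>user {image_tokens}\n{question}<|im_end|><|im_start|>assistant"

inputs = processor(text=prompt, images=frames, return_tensors="pt").to(model.device, model.dtype)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```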
Hmm, seems to be a typo: 8 frames per video, but a total of 12?
Sorry, I later changed to the 7B model, so I should've modified this.
- chat fine-tuned models for conversation
- instruction fine-tuned models

This guide focuses on inference with an instruction-tuned model, [llava-hf/llava-interleave-qwen-7b-hf](https://huggingface.co/llava-hf/llava-interleave-qwen-7b-hf), which can take in interleaved data. Alternatively, you can try [llava-interleave-qwen-0.5b-hf](https://huggingface.co/llava-hf/llava-interleave-qwen-0.5b-hf) if your hardware doesn't allow running a 7B model.
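For context, a minimal loading sketch for the checkpoints linked above; fp16 and `device_map="auto"` are convenient assumptions here, not requirements:

```python
import torch
from transformers import LlavaProcessor, LlavaForConditionalGeneration

# Use the 0.5B checkpoint instead if the 7B model doesn't fit on your hardware.
model_id = "llava-hf/llava-interleave-qwen-7b-hf"
processor = LlavaProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
```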
Maybe we could/should add an example with a pure VideoLLM where we don't have to manually replicate the image token several times and where the model has special treatment for videos, like extra pooling layers. llava-next-video or video-llava could be an option for that.
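Something like the following sketch of that suggestion (not part of the PR); the checkpoint name and the USER/ASSISTANT prompt format are taken from the llava-next-video model card, and the random array just stands in for real sampled frames:

```python
import numpy as np
import torch
from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration

model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"
processor = LlavaNextVideoProcessor.from_pretrained(model_id)
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Stand-in for a real clip: (num_frames, height, width, channels) uint8 frames.
video = np.random.randint(0, 255, size=(8, 336, 336, 3), dtype=np.uint8)

# The processor handles the video as a whole, so a single <video> token is enough.
prompt = "USER: <video>\nWhat is happening in this video? ASSISTANT:"
inputs = processor(text=prompt, videos=video, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=60)
print(processor.decode(output[0], skip_special_tokens=True))
```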
I thought most models are coming out as interleaved, so actually using an interleaved example is good since they're harder to get started with. I can add a simple VideoLLM example separately with chat templates, though.
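As a rough illustration of that chat-template idea, the prompt could be built with the processor's chat template instead of being written by hand (reusing the hypothetical processor, model, and video from the previous sketch):

```python
# Describe the turn as structured content; the template inserts the <video> token for us.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": "What is happening in this video?"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(text=prompt, videos=video, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=60)
print(processor.decode(output[0], skip_special_tokens=True))
```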
Yes, they are mostly interleaved. The difference with llava-interleave is that we didn't add a new model for it, so it's kind of an image LLM used for video. For all the others I am trying to make two separate processors, for images and for videos, each with their own special tokens.
Okay, I'll add a video-only one and modify it when you make the processors. Does that sound good?
yep, thanks :)
@zucchini-nlp re: Slack discussions, I'd say we merge this and edit when the processors are out.
Yes, sounds good to me. We'll let users discover how each model expects the inputs from the model card, as there's no single standard yet and we don't natively support video-only LLMs.
Approved, thanks! 💛
--------- Co-authored-by: Steven Liu <[email protected]>
Adding video-text-to-text task guide
@zucchini-nlp @NielsRogge @amyeroberts @stevhliu