Add video text to text docs #33164
Conversation
Caused by #31292, will work on it.
Very nice, thanks for adding! 🔥
Remember to add the doc to the toctree!
Co-authored-by: Steven Liu <[email protected]>
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
LGTM - thanks for adding! ❤️
Yay, thanks for adding this! Looks good, but I was thinking of adding an inference example with pure VideoLLMs, WDYT?
Now we can preprocess the inputs.

This model has a prompt template that looks like the following. First we'll put all sampled frames into one list. Since we have eight frames in each video, we will insert 12 `<image>` tokens into our prompt. Note that we are adding `assistant` at the end to trigger the model to give answers. We can then preprocess.
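For reference, a rough sketch of what the quoted passage describes, assuming a llava-interleave processor and model are already loaded (see the loading sketch further down) and that `video_1_frames` / `video_2_frames` are illustrative names for the lists of sampled frames; the Qwen-style chat format is taken from the model card, so treat the exact string as an assumption:

```python
# Sketch only: `processor`, `model`, and the two frame lists are assumed to exist already.
frames = video_1_frames + video_2_frames      # put all sampled frames into one list
image_tokens = "<image>" * len(frames)        # one <image> token per frame
question = "Are these two videos showing the same activity?"
prompt = f"<|im_start|>user {image_tokens}\n{question}<|im_end|><|im_start|>assistant"

inputs = processor(text=prompt, images=frames, return_tensors="pt").to(model.device, model.dtype)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```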
Hmm, seems to be a typo: 8 frames per video, but a total of 12?
Sorry, I later changed to the 7B model, so I should've modified this.
- chat fine-tuned models for conversation
- instruction fine-tuned models

This guide focuses on inference with an instruction-tuned model, [llava-hf/llava-interleave-qwen-7b-hf](https://huggingface.co/llava-hf/llava-interleave-qwen-7b-hf), which can take in interleaved data. Alternatively, you can try [llava-interleave-qwen-0.5b-hf](https://huggingface.co/llava-hf/llava-interleave-qwen-0.5b-hf) if your hardware doesn't allow running a 7B model.
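For context, a minimal loading sketch for the checkpoints linked above; fp16 and `device_map="auto"` are convenient assumptions here, not requirements:

```python
import torch
from transformers import LlavaProcessor, LlavaForConditionalGeneration

# Use the 0.5B checkpoint instead if the 7B model doesn't fit on your hardware.
model_id = "llava-hf/llava-interleave-qwen-7b-hf"
processor = LlavaProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
```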
Maybe we could/should add an example with a pure VideoLLM where we don't have to manually replicate the image token several times and where the model has special treatment for videos, like extra pooling layers. llava-next-video or video-llava could be an option for that.
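Something like the following sketch of that suggestion (not part of the PR); the checkpoint name and the USER/ASSISTANT prompt format are taken from the llava-next-video model card, and the random array just stands in for real sampled frames:

```python
import numpy as np
import torch
from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration

model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"
processor = LlavaNextVideoProcessor.from_pretrained(model_id)
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Stand-in for a real clip: (num_frames, height, width, channels) uint8 frames.
video = np.random.randint(0, 255, size=(8, 336, 336, 3), dtype=np.uint8)

# The processor handles the video as a whole, so a single <video> token is enough.
prompt = "USER: <video>\nWhat is happening in this video? ASSISTANT:"
inputs = processor(text=prompt, videos=video, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=60)
print(processor.decode(output[0], skip_special_tokens=True))
```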
I thought most models are coming out as interleaved, so actually using an interleaved example is good since they're harder to get started with. I can add a simple VideoLLM example separately with chat templates, though.
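As a rough illustration of that chat-template idea, the prompt could be built with the processor's chat template instead of being written by hand (reusing the hypothetical processor, model, and video from the previous sketch):

```python
# Describe the turn as structured content; the template inserts the <video> token for us.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": "What is happening in this video?"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(text=prompt, videos=video, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=60)
print(processor.decode(output[0], skip_special_tokens=True))
```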
Yes, they are mostly interleaved. The difference with llava-interleave is that we didn't add a new model for it, so it's kind of an image LLM used for video. For all the others I am trying to make two separate processors, for images and for videos, each with their own special tokens.
Okay, I'll add a video-only one and modify it when you make the processors. Does that sound good?
yep, thanks :)
@zucchini-nlp re: Slack discussions, I'd say we merge this and edit when the processors are out.
Yes, sounds good to me. We'll let users discover how each model expects the inputs from the model card, as there's no single standard yet and we don't natively support video-only LLMs.
Approved, thanks! 💛
--------- Co-authored-by: Steven Liu <[email protected]>
Adding video-text-to-text task guide
@zucchini-nlp @NielsRogge @amyeroberts @stevhliu