
Add video text to text docs #33164

Merged (12 commits, Sep 1, 2024)

Conversation

merveenoyan (Contributor)

Adding video-text-to-text task guide

@zucchini-nlp @NielsRogge @amyeroberts @stevhliu

merveenoyan (Contributor, Author)

There seems to be an issue with Llava: in the latest stable release some image tokens are cropped, i.e. when I pass 12 downsampled frames and 12 `<image>` tokens it tells me to add 13 image tokens instead. It's fixed on main, but now I get `TypeError: LlavaForConditionalGeneration.forward() got an unexpected keyword argument 'num_logits_to_keep'` from `result = self._sample(input_ids, logits_processor=prepared_logits_processor)`.

zucchini-nlp (Member)

Caused by #31292, will work on it
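As an aside, the "downsampled frames" mentioned in the bug report above typically come from uniformly sampling frame indices across the clip. A minimal sketch of such a sampler; `sample_frame_indices` is a hypothetical helper for illustration, not code from the guide:

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread evenly across a video.

    Hypothetical helper: real pipelines usually do this with a video
    decoding library (e.g. reading only the selected frames).
    """
    if num_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(i * step) for i in range(num_samples)]

# e.g. 12 evenly spaced frames out of a 120-frame clip
indices = sample_frame_indices(total_frames=120, num_samples=12)
```

Each sampled frame then gets one matching `<image>` placeholder in the prompt, which is why a mismatch between frame count and token count triggers the error above.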

stevhliu (Member) left a comment

Very nice, thanks for adding! 🔥

Remember to add the doc to the toctree!

HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

amyeroberts (Collaborator) left a comment

LGTM - thanks for adding! ❤️

zucchini-nlp (Member) left a comment

Yay, thanks for adding this! Looks good, but I was thinking of adding an inference example with pure VideoLLMs. WDYT?


Now we can preprocess the inputs.

This model has a prompt template that looks like the following. First we'll put all sampled frames into one list. Since we have eight frames in each video, we will insert 12 `<image>` tokens into our prompt. Note that we add `assistant` at the end to trigger the model to give answers. We can then preprocess.
Member

Hmm, seems to be a typo: with 8 frames in each video, why does the prompt get 12 `<image>` tokens?

Contributor Author

Sorry, I later switched to the 7B model and should have updated this.
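The prompt construction discussed in this thread can be sketched as plain string building: one `<image>` placeholder per sampled frame, then the question, then an `assistant` turn to trigger generation. The chat markers below are an assumption based on the Qwen-style template that llava-interleave checkpoints use, not the exact text of the guide:

```python
def build_prompt(num_frames: int, question: str) -> str:
    """Sketch of an interleaved Llava prompt: one <image> token per frame.

    The <|im_start|>/<|im_end|> markers are assumed from the Qwen chat
    format used by llava-interleave checkpoints.
    """
    image_tokens = "<image>" * num_frames
    return (
        "<|im_start|>user "
        + image_tokens
        + "\n"
        + question
        + "<|im_end|><|im_start|>assistant"
    )

prompt = build_prompt(num_frames=8, question="What are these videos about?")
```

The resulting string would then be passed to the processor together with the sampled frames; the model errors out if the number of `<image>` tokens does not match the number of frames, which is the mismatch flagged earlier in this conversation.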

- chat fine-tuned models for conversation
- instruction fine-tuned models

This guide focuses on inference with an instruction-tuned model, [llava-hf/llava-interleave-qwen-7b-hf](https://huggingface.co/llava-hf/llava-interleave-qwen-7b-hf), which can take in interleaved data. Alternatively, you can try [llava-interleave-qwen-0.5b-hf](https://huggingface.co/llava-hf/llava-interleave-qwen-0.5b-hf) if your hardware doesn't allow running a 7B model.
Member

Maybe we could/should add an example with a pure VideoLLM, where we don't have to manually replicate the image token several times and where the model has special treatment for videos, like extra pooling layers.

llava-next-video or video-llava could be options for that.

Contributor Author

I thought most models are coming out as interleaved, so using an interleaved example is actually good since they're harder to get started with. I can add a simple VideoLLM example with chat templates separately, though.

Member

Yes, they are mostly interleaved. The difference with llava-interleave is that we didn't add a new model for it, so it's essentially an image LLM used for video. For all the others I am trying to make two separate processors, one for images and one for videos, each with their own special tokens.

Contributor Author

Okay, I'll add a video-only one and update it when you make the processors. Does that sound good?

Member

yep, thanks :)
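The pure-VideoLLM flow discussed in this thread (e.g. llava-next-video) typically expresses the whole clip as a single video entry in a chat-template conversation, instead of manually repeating `<image>` tokens. A rough sketch of that conversation structure; the field names follow the common Transformers chat-template convention, and the `apply_chat_template` call is left as a comment since it needs a downloaded processor:

```python
# One "video" content entry stands in for all sampled frames; the
# processor's chat template expands it into the model's special tokens,
# so no manual "<image>" replication is needed. Structure assumed from
# the Transformers multimodal chat-template convention.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Why is this video funny?"},
            {"type": "video"},
        ],
    }
]

# With a real processor (e.g. for llava-next-video) this would become the
# model-specific prompt string:
# prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
```

This is the kind of simpler entry point suggested above: the video-specific token handling lives in the processor rather than in user code.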

merveenoyan (Contributor, Author)

@zucchini-nlp re: the Slack discussion, I'd say we merge this and edit when the processors are out.

zucchini-nlp (Member) left a comment

Yes, sounds good to me. We'll let users discover how each model expects its inputs via the model card, as there's no single standard yet and we don't natively support video-only LLMs.

Approved, thanks! 💛

@merveenoyan merveenoyan merged commit 2e3f8f7 into huggingface:main Sep 1, 2024
8 checks passed
itazap pushed a commit to NielsRogge/transformers that referenced this pull request Sep 20, 2024