Does llava-next-video deploy only focus on first frames? #510

LetheRiver0 · 2024-06-07T06:33:34Z

I'm trying to deploy llava-next-video with sglang, and it runs successfully. However, the model seems to focus only on the first frame of the input. For example, if I input 10 frames and ask the model to describe them, the generated text only contains information from the first frame. Does anyone know what is happening? Thanks~
Also, where can I print the input tokens passed to the model? I want to check whether all frames are actually being fed to the model.

AmazDeng · 2024-06-08T14:14:51Z

I'm having a similar problem.

1. I deployed sglang and loaded the llava-next-image model, but sglang can only handle a single inference request. If I run batch inference, for example with batch_size=10, sglang only processes the first 5 inputs; the last 5 get stuck and never produce a result.
2. I'm trying to load the llava-next-model model for inference, but sglang fails to produce a result.

Luodian · 2024-07-23T16:29:09Z

Indeed, the first version of our code patch has the issue described above. We will send a new PR along with our new models to fix these issues. Sorry to keep you waiting.
