
llama : add multimodal support (LLaVA) #3332

Closed
aiaicode opened this issue Sep 25, 2023 · 9 comments · Fixed by #3436
@aiaicode

Now that OpenAI is adding voice and image to ChatGPT, multimodality will probably become the new norm. Wouldn't it be a good idea for llama.cpp to add this to the roadmap as well, if possible?

@jagtesh
Contributor

jagtesh commented Sep 26, 2023

It would depend on having access to high-quality multimodal models. I don't know if one in the same league as LLaMA exists yet.

@aiaicode
Author

Hopefully Llama 3 will be that.

@monatis
Collaborator

monatis commented Sep 27, 2023

Yesterday LLaVA-RLHF was announced. It's the first open-source RLHF-trained multimodal model, and we previously had Idefics from HF. After the introduction of GGUF support in clip.cpp, it's now possible to implement multimodal inference by combining it with llama.cpp. Architecturally LLaVA is much simpler than Idefics, but if Idefics' performance is considerably better than LLaVA-RLHF's, I could start with that instead. WDYT?
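
For context on why LLaVA is the architecturally simpler option: it maps the CLIP vision tower's patch embeddings into the LLM's token-embedding space with a single learned linear projection and feeds them in ahead of the text embeddings. Below is a minimal self-contained sketch of that projection step; names and dimensions are illustrative, not the actual clip.cpp or llama.cpp API.

#include <cstddef>
#include <vector>

// Illustrative sketch of LLaVA's multimodal adapter: a single linear
// layer that projects CLIP patch embeddings (n_patches x d_clip) into
// the LLM's embedding space (n_patches x d_llm). All buffers are
// row-major; W is d_llm x d_clip and b has d_llm entries. Hypothetical
// names, not the clip.cpp / llama.cpp API.
std::vector<float> llava_project(const std::vector<float> & patches,
                                 const std::vector<float> & W,
                                 const std::vector<float> & b,
                                 size_t n_patches, size_t d_clip, size_t d_llm) {
    std::vector<float> out(n_patches * d_llm);
    for (size_t p = 0; p < n_patches; ++p) {
        for (size_t i = 0; i < d_llm; ++i) {
            float acc = b[i];
            for (size_t j = 0; j < d_clip; ++j) {
                acc += W[i * d_clip + j] * patches[p * d_clip + j];
            }
            out[p * d_llm + i] = acc;
        }
    }
    // The caller evaluates these rows as if they were ordinary token
    // embeddings, before the text prompt.
    return out;
}

Everything after the projection is standard LLM decoding, which is why gluing clip.cpp onto llama.cpp is enough for LLaVA-style inference.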

@ggerganov
Owner

We should make a PoC (either as a separate repo or as an example in this repo) to implement LLaVA.

@monatis
Collaborator

monatis commented Oct 2, 2023

I started to work on LLaVA in another repo, but it's extremely difficult to manage llama.cpp and clip.cpp together while depending on two different versions of ggml, so it would be much easier for me if it's OK to implement it in this repo.

@ggerganov ggerganov changed the title Adding Multimodal Support in the Roadmap llama : add multimodal support (LLaVA) Oct 3, 2023
@Green-Sky
Collaborator

pr: #3436

@aiaicode
Author

Thank you @monatis ! You legend.

@ChrisW-priv

Hi, do I understand correctly that multimodal support has now been added? How do I run such a model from the CLI? Say I have a photo to analyze and have downloaded the zhiqings/LLaVA-RLHF-7b-v1.5-224 model from Hugging Face.

I am really new to the field; I recently compiled llama.cpp locally and played around with it. Can you point me to some materials/tutorials?

PS. When I first saw the project I was quickly overwhelmed. I could work on documentation of how to use it, but I am so new. Do contributors meet to discuss development or something?

@svenstaro

@ChrisW-priv Not sure this is still relevant to you, but this is actually documented in the original PR:

./bin/llava -m ggml-model-q5_k.gguf --mmproj mmproj-model-f16.gguf --image path/to/an/image.jpg
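
The llava example shares llama.cpp's common argument parsing, so the usual generation options should apply too. As a sketch (flag names may differ between builds, so check --help), a custom prompt can be passed with -p:

./bin/llava -m ggml-model-q5_k.gguf --mmproj mmproj-model-f16.gguf --image path/to/an/image.jpg -p "Describe the image in detail."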
