
llama : add multimodal support (LLaVA) #3332

Closed
aiaicode opened this issue Sep 25, 2023 · 9 comments · Fixed by #3436
@aiaicode

Now that OpenAI is adding voice and image to ChatGPT, multimodality will probably become the new norm. Wouldn't it be a good idea for llama.cpp to add this to the roadmap as well, if possible?

@jagtesh
Contributor

jagtesh commented Sep 26, 2023

It would depend on having access to high-quality multimodal models. I don't know if one in the same league as LLaMA exists yet.

@aiaicode
Author

Hopefully Llama 3 will be that.

@monatis
Collaborator

monatis commented Sep 27, 2023

Yesterday LLaVA-RLHF was announced. It's the first open-source RLHF-trained multimodal model, and we previously had Idefics from HF. After the introduction of GGUF support in clip.cpp, it's now possible to implement multimodal inference by combining it with llama.cpp. Architecturally LLaVA is much simpler than Idefics, but if Idefics' performance is considerably better than LLaVA-RLHF's, I could start with that instead. WDYT?
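
For context on why LLaVA is the architecturally simpler option: it maps the CLIP vision tower's patch embeddings into the LLM's token-embedding space with a single learned linear projection and feeds them in ahead of the text embeddings. Below is a minimal self-contained sketch of that projection step; names and dimensions are illustrative, not the actual clip.cpp or llama.cpp API.

#include <cstddef>
#include <vector>

// Illustrative sketch of LLaVA's multimodal adapter: a single linear
// layer that projects CLIP patch embeddings (n_patches x d_clip) into
// the LLM's embedding space (n_patches x d_llm). All buffers are
// row-major; W is d_llm x d_clip and b has d_llm entries. Hypothetical
// names, not the clip.cpp / llama.cpp API.
std::vector<float> llava_project(const std::vector<float> & patches,
                                 const std::vector<float> & W,
                                 const std::vector<float> & b,
                                 size_t n_patches, size_t d_clip, size_t d_llm) {
    std::vector<float> out(n_patches * d_llm);
    for (size_t p = 0; p < n_patches; ++p) {
        for (size_t i = 0; i < d_llm; ++i) {
            float acc = b[i];
            for (size_t j = 0; j < d_clip; ++j) {
                acc += W[i * d_clip + j] * patches[p * d_clip + j];
            }
            out[p * d_llm + i] = acc;
        }
    }
    // The caller evaluates these rows as if they were ordinary token
    // embeddings, before the text prompt.
    return out;
}

Everything after the projection is standard LLM decoding, which is why gluing clip.cpp onto llama.cpp is enough for LLaVA-style inference.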

@ggerganov
Owner

We should make a PoC (either as a separate repo or as an example in this repo) to implement LLaVA.

@monatis
Collaborator

monatis commented Oct 2, 2023

I started to work on LLaVA in another repo, but it's extremely difficult to manage llama.cpp and clip.cpp together while depending on two different versions of ggml, so it would be much easier for me if it's OK to implement it in this repo.

@ggerganov ggerganov changed the title Adding Multimodal Support in the Roadmap llama : add multimodal support (LLaVA) Oct 3, 2023
@Green-Sky
Collaborator

pr: #3436

@aiaicode
Author

Thank you @monatis ! You legend.

@ChrisW-priv

Hi, do I understand correctly that multimodal support has now been added? How do I run such a model from the CLI? Say I have a photo to analyze and have downloaded the zhiqings/LLaVA-RLHF-7b-v1.5-224 model from Hugging Face.

I am really new to the field; I recently compiled llama.cpp locally and played around with it. Can you point me to some materials/tutorials?

PS. When I first saw the project I was quickly overwhelmed. I could work on documentation of how to use it, but I am so new. Do contributors meet to discuss development or something?

@svenstaro

@ChrisW-priv Not sure this is still relevant to you, but this is actually documented in the original PR:

./bin/llava -m ggml-model-q5_k.gguf --mmproj mmproj-model-f16.gguf --image path/to/an/image.jpg
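
The llava example shares llama.cpp's common argument parsing, so the usual generation options should apply too. As a sketch (flag names may differ between builds, so check --help), a custom prompt can be passed with -p:

./bin/llava -m ggml-model-q5_k.gguf --mmproj mmproj-model-f16.gguf --image path/to/an/image.jpg -p "Describe the image in detail."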
