Support for Yi-VL and a templating addon/fix for mobileVLM #5093

Merged
merged 5 commits into ggerganov:master on Jan 27, 2024

Conversation

cmp-nct (Contributor) commented Jan 23, 2024

mobileVLM support was recently added; the readme says the following:
./llava-cli -m MobileVLM-1.7B/ggml-model-q4_k.gguf \ --mmproj MobileVLM-1.7B/mmproj-model-f16.gguf \ --image path/to/an/image.jpg \ -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWho is the author of this book? Answer the question using a single word or phrase. ASSISTANT:"
@XiaotaoChen This can't work: the prompt in llava-cli is used as the user-question prompt, not as the full template.
So if you use it like that on master you'll get a Vicuna system prompt, then the image embedding, then your entire templated prompt, followed by a doubled "ASSISTANT:ASSISTANT:".
With this PR it should work, though I've not tested mobileVLM yet.

What this does is look for <image> in the prompt; if it's present, it splits the prompt into a system prompt and a user prompt and injects the image embeddings in between.
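
To make the splitting concrete, here is a minimal sketch of the idea (illustrative only, not the actual llava-cli code; the function name is made up):

```cpp
#include <string>
#include <utility>

// Illustrative sketch: split a combined prompt at the first "<image>" marker.
// Text before the marker becomes the system prompt, text after it becomes the
// user prompt; the image embeddings are evaluated in between the two parts.
static std::pair<std::string, std::string> split_prompt_at_image(const std::string & prompt) {
    const std::string marker = "<image>";
    const size_t pos = prompt.find(marker);
    if (pos == std::string::npos) {
        // no marker: fall back to treating the whole prompt as the user question
        return { std::string(), prompt };
    }
    return { prompt.substr(0, pos), prompt.substr(pos + marker.size()) };
}
```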

For Yi-VL-6B, example:
.\bin\Debug\llava-cli.exe -m Q:\models\llava\Yi-VL-6B\ggml-model-f16.gguf --mmproj Q:\models\llava\Yi-VL-6B\vit\mmproj-model-f16.gguf --image C:\temp\license_demo.jpg -p "This is a chat between an inquisitive human and an AI assistant. Assume the role of the AI assistant. Read all the images carefully, and respond to the human's questions with informative, helpful, detailed and polite answers. 这是一个好奇的人类和一个人工智能助手之间的对话。假设你扮演这个AI助手的角 色。仔细阅读所有的图像,并对人类的问题做出信息丰富、有帮助、详细的和礼貌的回答。 \n\n### Human: <image>\nProvide a complete representation of what is in this image. Respond in JSON-pretty-print syntax for database insert.\n### Assistant:" -ngl 50 --temp 0 -n 500 -c 2048 -e
"Provide a complete representation of what is in this image. Respond in JSON-pretty-print syntax for database insert." is the Question

Yi-VL support:
Yi-VL uses a layer norm in addition to the projector, and it uses a larger 448x448 image with a "huge" ViT (twice the size of llava-1.5's).
Sadly, in my tests it hallucinated strongly, yet its image VQA is still SOTA for llava-based multimodals. A strange combination.
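
For context, the part that distinguishes this from the plain llava MLP projector is the layer norm; over a single embedding vector the math is just the following (a plain C++ sketch for illustration, not the ggml graph that clip.cpp actually builds):

```cpp
#include <cmath>
#include <vector>

// Layer norm over one embedding vector: normalize to zero mean and unit
// variance, then scale and shift with the learned weight and bias tensors.
static std::vector<float> layer_norm(const std::vector<float> & x,
                                     const std::vector<float> & w,
                                     const std::vector<float> & b,
                                     float eps = 1e-5f) {
    const size_t n = x.size();
    float mean = 0.0f;
    for (float v : x) mean += v;
    mean /= n;
    float var = 0.0f;
    for (float v : x) var += (v - mean) * (v - mean);
    var /= n;
    std::vector<float> y(n);
    for (size_t i = 0; i < n; ++i) {
        y[i] = (x[i] - mean) / std::sqrt(var + eps) * w[i] + b[i];
    }
    return y;
}
```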

I added:

  • PROJECTOR_TYPE_MLP_NORM, which is switched to automatically if the additional projector tensors are found.
  • A NULL init for the projector tensors, so it is safe to check them when they are not initialized.
  • A couple of try/catch blocks; there are other ways to do it. The get_tensor function generally throws an exception, so maybe we should always catch it? (A sketch of the pattern follows below.)
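
For the last point, the pattern is roughly the sketch below. The names are hypothetical stand-ins; in the real code the lookup goes through the GGUF loading helpers in clip.cpp, which throw when a tensor is missing:

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Toy stand-in for the model's tensor table. In clip.cpp the lookup goes
// through the GGUF context and throws when a tensor name is not found.
struct toy_tensor { std::string name; };

static std::map<std::string, toy_tensor> g_tensors = {
    { "mm.0.weight", { "mm.0.weight" } },
};

static toy_tensor * get_tensor_or_throw(const std::string & name) {
    auto it = g_tensors.find(name);
    if (it == g_tensors.end()) {
        throw std::runtime_error("tensor not found: " + name);
    }
    return &it->second;
}

// Optional tensors: initialized to NULL and only filled in when present,
// which is what allows switching to PROJECTOR_TYPE_MLP_NORM automatically.
static toy_tensor * try_get_tensor(const std::string & name) {
    try {
        return get_tensor_or_throw(name);
    } catch (const std::exception &) {
        return nullptr;
    }
}
```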

cmp-nct (Contributor, Author) commented Jan 23, 2024

Demo of Yi-VL-6B is here: #5092

cmp-nct mentioned this pull request on Jan 23, 2024
cmp-nct marked this pull request as ready for review on January 23, 2024 04:55
Review comments were left on examples/llava/clip.cpp and examples/llava/llava-cli.cpp (outdated, resolved).
Commit pushed: bugfix for new conversions
ggerganov (Owner) left a comment

Sadly in my tests it hallucinated strongly, it's image detection are still SOTA for llava-based multimodals.

What does "image detection" mean in this context?

cmp-nct (Contributor, Author) commented Jan 25, 2024

Sadly in my tests it hallucinated strongly, it's image detection are still SOTA for llava-based multimodals.

What does "image detection" mean in this context?

My English writing degrades as the day goes on; I think that was in the early morning after a 16-hour day ;)
I just meant that Yi-VL-34B's image analysis capabilities (VQA) were quite good for llava-type models: it detected some finer nuances in the images I use for tests and used good language to describe them, but it hallucinates a lot.

cmp-nct (Contributor, Author) commented Jan 26, 2024

Just in general, and not related to this PR, this is the claim Yi-VL makes:
(screenshot: Yi-VL's published benchmark table comparing it with GPT-4V, CogVLM and other VLMs)

Based on their benchmark they are the direct competitor to GPT-4V, and CogVLM is the worst of all.
The performance of Yi-VL on my license demo reflected the professionalism of their benchmarking skills.
In reality CogVLM is the competitor to GPT-4V (sadly not supported in llama.cpp).

The training time is also fishy: they claim they trained on "128 A800 GPUs with 80GB each" for 10 days.
Similar llava-based models would finish their entire training + finetuning in an hour on that setup.

The Yi LLMs were quite good; somehow they forgot that professionalism when doing the VL variant.

ggerganov (Owner) commented

There is also a chance that the current implementation in llama.cpp has a problem. Have you tried running the reference implementation on your tests?

Is it just the license photo that you are testing with? That seems like a very small sample to make a call on the overall performance.

cmp-nct (Contributor, Author) commented Jan 26, 2024

There is also chance that the current implementation in llama.cpp has a problem. Have you tried running the reference implementation on your tests?

Is it just the license photo that you are testing with? Seems like a very small sample to make a call of the overall performance

I agree there could be an implementation problem, though it works pretty well as it is and I didn't find anything obvious.
I've tested it on a range of images; it was not that bad, but the hallucinations make it quite untrustworthy imho.
The license demo is indeed a difficult and specific picture to test, though I've tested it with all models and they all did better than Yi-VL: better OCR, better instruction handling and less hallucination.
CogVLM and GPT-4V produce a full JSON with 1-2 minor mistakes; llava-1.5/ShareGPT produce a low-featured output with moderate hallucinations.
If anyone claims to beat all others, I'd expect them to ace the license picture as well.

I think the PR is OK to merge even if an issue remains: it adds support for layer-norm llava models and the templating. From there we can fix/add further details.

Below is the cats reference example. The output looks good to me, but it is slightly different, which could also be attributed to the image preprocessing used in llama.cpp compared to the higher-quality one in Python.
The same goes for the template in use: I've not debugged it in the Python code but used the one they state is in use. If even a single space or newline differs, the output changes quite a bit.
They have quite a few typos on their official page; it all appears rushed to me. So maybe there is a difference here as well.

(photo: the cats reference image, cats eating from metal bowls on a stone floor)

Reference:

----------
Since the temperature is set to 0.2 by default, the output is not always the same. An example output is:
question: Describe the cats and what they are doing in detail.
outputs: In the image, there are three cats situated on a stone floor. The first cat, with a mix of black, orange, and white fur, is actively eating from a metal bowl. The second cat, which is entirely black, is also engaged in eating from a separate metal bowl. The third cat, a mix of gray and white, is not eating but is instead looking off to the side, seemingly distracted from the food. The bowls are positioned close to each other, and the cats are all within a similar proximity to the bowls. The scene captures a typical moment of feline behavior, with some cats enjoying their meal while others appear indifferent or distracted.
----------

Q5K + fp16:

...................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   368.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   112.00 MiB
llama_new_context_with_model: KV self size  =  480.00 MiB, K (f16):  240.00 MiB, V (f16):  240.00 MiB
llama_new_context_with_model:  CUDA_Host input buffer size   =    18.01 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   270.00 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   270.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    14.00 MiB
llama_new_context_with_model: graph splits (measure): 5

encode_image_with_clip: image encoded in   287.93 ms by CLIP (    0.28 ms per image patch)
system_prompt: This is a chat between an inquisitive human and an AI assistant. Assume the role of the AI assistant. Read all the images carefully, and respond to the human's questions with informative, helpful, detailed and polite answers. 这是一个 好奇的人类和一个人工智能助手之间的对话。假设你扮演这个AI 助手的角色。仔细阅读所有的图像,并对人类的问题做出信息丰富、有帮助、详细的和礼貌的回答。

### Human:
user_prompt:
Describe the cats and what they are doing in detail.
### Assistant:

 In the image, there are two cats situated on a stone floor. The cat on the left is a calico cat with white, orange, and black fur. It appears to be observing something on the ground, possibly intrigued by it. On the right side of the image, there's a black cat that seems to be eating from a metal bowl placed nearby. Both cats are facing away from the camera, giving us a clear view of their actions. The stone floor they are standing on adds a rustic touch to the scene. There is no text present in the image.

Q3K + Q6K:
..................................................................................................

llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   368.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   112.00 MiB
llama_new_context_with_model: KV self size  =  480.00 MiB, K (f16):  240.00 MiB, V (f16):  240.00 MiB
llama_new_context_with_model:  CUDA_Host input buffer size   =    18.01 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   270.00 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   270.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    14.00 MiB
llama_new_context_with_model: graph splits (measure): 5

encode_image_with_clip: image encoded in   144.66 ms by CLIP (    0.14 ms per image patch)
system_prompt: This is a chat between an inquisitive human and an AI assistant. Assume the role of the AI assistant. Read all the images carefully, and respond to the human's questions with informative, helpful, detailed and polite answers. 这是一个 好奇的人类和一个人工智能助手之间的对话。假设你扮演这个AI 助手的角色。仔细阅读所有的图像,并对人类的问题做出信息丰富、有帮助、详细的和礼貌的回答。

### Human:
user_prompt:
Describe the cats and what they are doing in detail.
### Assistant:

In the image, there are two cats situated on a stone floor. The cat on the left is a tabby with a mix of black and orange fur. It appears to be observing something on the ground, possibly an object or another animal. On the right side of the image, there's a black and white cat. This cat seems to be engaged in eating from a bowl placed nearby. The two cats are positioned close to each other but are facing different directions, indicating they might be focused on separate activities.

With a higher temperature I also got this:

In the image, there are three cats situated on a stone floor. The first cat is positioned on the left side of the frame and is characterized by its brown and black fur. It appears to be observing something with interest.

The second cat is located in the center of the image and has gray and white fur. This cat seems to be engaged in an activity, possibly related to food or play.

The third cat is situated on the right side of the image and is black in color. It appears to be eating from a bowl placed on the floor, which could suggest that it's being fed by someone.

All three cats are facing towards the bottom right corner of the image, indicating that their attention might be drawn to something outside of the frame. The precise nature of this object or event remains unknown from the image alone.

For comparison, this is ShareGPT4V-13B:

 In the image, there are three cats of different colors and sizes. The first cat is a large black cat sitting on the ground with its head in a bowl, eating food from it. The second cat is a small orange tabby cat standing behind the first cat, also eating from a bowl. The third cat is a small gray tabby cat, which is hiding behind the other two cats and peeking out. All three cats are engaged in eating their meal from separate bowls placed on the ground.

It has one hallucination: the 3rd cat is described as also eating.

(What we really need is CogVLM support, but that needs a custom llama architecture: it uses separate attention for the visual tokens.)

cmp-nct (Contributor, Author) commented Jan 26, 2024

I uploaded a couple of GGUF variants of both Yi-VL models on HF, in case anyone wants to try them: https://huggingface.co/cmp-nct

chigkim commented Jan 27, 2024

I uploaded a couple GGUF variants for both Yi-VL on HF: https://huggingface.co/cmp-nct If anyone wants to try

Thanks for these! Just wondering, are you planning to release 34B with higher quants?

cmp-nct (Contributor, Author) commented Jan 27, 2024

I uploaded a couple GGUF variants for both Yi-VL on HF: https://huggingface.co/cmp-nct If anyone wants to try

Thanks for these! Just wondering, are you planning to release 34B with higher quants?

I'm uploading a Q5K one; you'll need CPU offloading, dual GPUs, or a >24 GB VRAM GPU for the bigger ones.
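
As a rough back-of-the-envelope (my own numbers, not from the thread): at roughly 5.5-5.7 bits per weight for Q5_K, a 34B model works out to about 34e9 × 5.7 / 8 ≈ 24 GB for the weights alone, before the KV cache and compute buffers, which is why a single 24 GB card is not enough.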

ggerganov merged commit 6db2b41 into ggerganov:master on Jan 27, 2024 (45 checks passed).
aisensiy commented

https://huggingface.co/01-ai/Yi-VL-34B/discussions/10#65b337ea321c51cd17d06135

It seems that a different evaluation dataset may give quite a different evaluation result...

TingTingin commented

Any info on how well the 34B version performs on descriptions of images (not licenses)?

cmp-nct (Contributor, Author) commented Jan 27, 2024

any info on how well the 34b version performs on descriptions of images (non licenses)?

I recommend trying it out.
You can see the cat image as an example of a normal photo.
I also tried it on a series of VQA tasks, which the 34B solved quite well, but occasional hallucinations were an issue.

One more issue I'm having with it is that it breaks out of its instruction format.
It is supposed to output a "stopword" (###) when finished, but I've had quite a few cases where it did not output the stopword; instead it wrote "Human:" followed by a made-up question and then answered it itself.
I did not dig into this behavior. In case there is no glitch in the code regarding the stopword, "Human:" can be used as an additional stopword.

Overall I recommend testing it for your use case.

mirek190 commented Jan 27, 2024

Your picture with CogVLM 17B at temp 0.1:

There are three cats in the picture. The one on the left is a calico cat with a mix of black, white, and orange fur. It is bending its head down, seemingly eating from a silver bowl placed on the ground. 
The middle cat is a brown and white cat, peeking out from a small hole in the wall, observing the other two cats. The cat on the right is a black cat, also bowing its head down, eating from a silver bowl.

chigkim commented Jan 28, 2024

I'm uploading a Q5K one, you'll need CPU offloading, dual GPU or a >24GB vram gpu for the bigger ones

Thanks so much for the higher quant!
Is mmproj-model-f16-q6_k.gguf quantized from this clip-vit?
https://huggingface.co/01-ai/Yi-VL-34B/tree/main/vit/clip-vit-H-14-laion2B-s32B-b79K-yi-vl-34B-448
There are open_clip_pytorch_model.bin and pytorch_model.bin.
@cmp-nct Do you mind sharing your command to quantize and obtain the mmproj file?

cmp-nct (Contributor, Author) commented Jan 28, 2024

I'm uploading a Q5K one, you'll need CPU offloading, dual GPU or a >24GB vram gpu for the bigger ones

Thanks so much for the higher quant! Is mmproj-model-f16-q6_k.gguf quantized from this clip-vit? https://huggingface.co/01-ai/Yi-VL-34B/tree/main/vit/clip-vit-H-14-laion2B-s32B-b79K-yi-vl-34B-448 There are open_clip_pytorch_model.bin and pytorch_model.bin. @cmp-nct Do you mind sharing your command to quantize and obtain mmproj file?

Yes, both of my uploaded mmproj projectors are from the respective correct ViT.
The open_clip file is not needed; the pytorch one is the compatible format.

I currently use a hacked-together binary, nothing I'd want to PR here. The quantization functions are in clip.cpp.

If you need something other than Q6K, let me know and I can upload it.
I only tested Q6K, Q8_0 and fp16 and found no real difference between them in output quality.
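
For anyone wanting to reproduce the mmproj quants, a minimal driver around those quantization functions might look like the sketch below. It assumes the clip_model_quantize() entry point declared in examples/llava/clip.h; check the header for the exact signature and for which itype values map to which quant formats.

```cpp
#include <cstdio>
#include <cstdlib>

#include "clip.h"

// Sketch of a standalone mmproj quantizer, assuming clip_model_quantize()
// from examples/llava/clip.h. itype is a ggml type id (e.g. the one for Q6_K).
int main(int argc, char ** argv) {
    if (argc != 4) {
        fprintf(stderr, "usage: %s <mmproj-f16.gguf> <mmproj-out.gguf> <itype>\n", argv[0]);
        return 1;
    }
    const int itype = atoi(argv[3]);
    if (!clip_model_quantize(argv[1], argv[2], itype)) {
        fprintf(stderr, "quantization failed\n");
        return 1;
    }
    return 0;
}
```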

chigkim commented Jan 29, 2024

If you need something else than Q6K let me know, I can upload it. I only tested Q6K, Q80 and fp16. Found no real difference between them in output quality

I noticed that a lot of llava-based models provide the mmproj only in f16. I wondered why.

If it's not too much trouble, I'd love to try the model in Q4_K_M with the mmproj in f16.

Thanks!

cmp-nct (Contributor, Author) commented Jan 29, 2024

If you need something else than Q6K let me know, I can upload it. I only tested Q6K, Q80 and fp16. Found no real difference between them in output quality

I noticed a lot of Llava based model provided mmproj just in f16. I wondered why.
There is no tool for it at the moment; that's the main reason, I guess.
The k-quant support PR is also rather new, and most llava models on HF are likely older.

If it's not too much trouble, I'd love to try model in Q4_K_M with mmproj in f16.

Thanks!

I'm uploading both; it will take a while for the Q4K.

barbouh commented Jan 29, 2024

Is there any way to stop the model from outputting things like "Human:"? It continues to chat by itself until it runs out of tokens.
I guess stopwords/antiprompts do not work within the web GUI?

It hallucinates an awful lot; ShareGPT-13B is way better.

chigkim commented Jan 29, 2024

Thanks @cmp-nct for the q4 model and f16 mmproj!
Have you tried running any quantized model with the llama.cpp server? I get a similar result to @barbouh: for me, it keeps printing out empty strings after it generates text.
server -v -c 0 -m yi-vl/ggml-model-Q4_K.gguf --mmproj yi-vl/mmproj-model-f16.gguf
Then fill out:
Prompt: This is a chat between an inquisitive human and an AI assistant. Assume the role of the AI assistant. Read all the images carefully, and respond to the human's questions with informative, helpful, detailed and polite answers. 这是一个好奇的人类和一个人工智能助手之间的对话。假设你扮演这个AI助手的角 色。仔细阅读所有的图像,并对人类的问题做出信息丰富、有帮助、详细的和礼貌的回答。
User name: ### Human
Bot name: ### Assistant
Prediction: 2048
Temperature: 0

cmp-nct (Contributor, Author) commented Jan 29, 2024

Thanks @cmp-nct for q4 model and f16 mmproj! Have you tried running any quantized model with Llama.cpp server? I kind of get similar result to @barbouh. For me, it keeps printing out empty strings after it generates text. server -v -c 0 -m yi-vl/ggml-model-Q4_K.gguf --mmproj yi-vl/mmproj-model-f16.gguf Then fill out: Prompt: This is a chat between an inquisitive human and an AI assistant. Assume the role of the AI assistant. Read all the images carefully, and respond to the human's questions with informative, helpful, detailed and polite answers. 这是一个好奇的人类和一个人工智能助手之间的对话。假设你扮演这个AI助手的角 色。仔细阅读所有的图像,并对人类的问题做出信息丰富、有帮助、详细的和礼貌的回答。 User name: ### Human Bot name: ### Assistant Prediction: 2048 Temperature: 0

I've not used the server example; I only made this compatible with llava-cli.

You definitely need to add the stopword support; it's probably just a line in the right place doing a .find("###") in the server example, if nothing like that is already available.
From my experience you might also need to check for "Human:", as the model sometimes appeared not to use the stopword.
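
For clarity, the check being described is little more than the sketch below (illustrative, not the actual server code); for Yi-VL the stopwords would be "###" plus "Human:" as a fallback:

```cpp
#include <string>
#include <vector>

// Truncate the generated text at the first stopword that appears and report
// whether generation should stop.
static bool apply_stopwords(std::string & generated,
                            const std::vector<std::string> & stopwords) {
    for (const auto & stop : stopwords) {
        const size_t pos = generated.find(stop);
        if (pos != std::string::npos) {
            generated.erase(pos);
            return true;
        }
    }
    return false;
}

// Example usage: apply_stopwords(text, {"###", "Human:"});
```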

chigkim commented Jan 30, 2024

Yeah, I think the server uses the bot name as a stopword, but for some reason the model keeps printing out things that come through as empty strings.

The main reason I like using the server is that llava-cli -i for interactive mode doesn't work for some reason: it just exits after one completion, so you can't have a multi-turn chat.

Also, the server keeps the model loaded, so you can switch between images quickly.

Lastly, it has a nice API over HTTP.

Could you look at it if you have a chance? I'd appreciate it.

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
* Support for Yi-VL, templating fix for mobileVLM

* ws

* Update examples/llava/clip.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* Update llava-cli.cpp

* Update clip.cpp

bugfix for new conversions

---------

Co-authored-by: Georgi Gerganov <[email protected]>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024