
[Question] How to use in-context learning in LLaVA? #1357

Open
Dinghaoxuan opened this issue Apr 1, 2024 · 5 comments

Comments

@Dinghaoxuan

Question

Hello, I want to input some in-context examples to LLaVA, but I can't find any guidance on how to insert images into the input prompt. Could you give me some templates for a multi-image input prompt? Thank you very much.
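
For reference, LLaVA builds its prompts through the conversation templates in llava.conversation, with one image placeholder token per image. Below is a minimal, untested sketch of a few-shot multi-image prompt built that way; the template name "llava_v1" and the example questions/answers are placeholders to adapt to your checkpoint and data.

import copy

from llava.constants import DEFAULT_IMAGE_TOKEN  # the "<image>" placeholder string
from llava.conversation import conv_templates

# Placeholder in-context examples: (question, answer) for each example image.
examples = [
    ("What is shown in this image?", "A radar chart comparing benchmark scores."),
    ("What is shown in this image?", "A person standing in clear water."),
]
query_question = "What is shown in this image?"

# One user/assistant turn per in-context example, each with its own <image> token,
# then a final user turn for the query image.
conv = copy.deepcopy(conv_templates["llava_v1"])  # use the template matching your checkpoint
for q, a in examples:
    conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\n" + q)
    conv.append_message(conv.roles[1], a)
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\n" + query_question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
print(prompt)

At generation time the image tensors have to be passed in the same order as the <image> tokens appear in the prompt, one tensor per token.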


iamjudit commented Apr 5, 2024

I'm also trying to send multiple images for a few-shot request via the pipeline. Thanks in advance


Deep1994 commented Apr 7, 2024

> I'm also trying to send multiple images for a few-shot request via the pipeline. Thanks in advance

Hi, have you guys found a solution?

@Dinghaoxuan
Author

> I'm also trying to send multiple images for a few-shot request via the pipeline. Thanks in advance
>
> Hi, have you guys found a solution?

I tried enclosing each in-context question and answer with separator symbols, but the in-context learning ability of LLaVA is poor: the in-context answers interfere with the answer to the query question.
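
One way to read the separator idea above, as a rough sketch (the "### Example" markers are only illustrative, not something LLaVA requires):

from llava.constants import DEFAULT_IMAGE_TOKEN

examples = [
    ("Describe the image.", "A radar chart comparing model benchmarks."),
    ("Describe the image.", "A person standing in clear blue water."),
]
query_question = "Describe the image."

# Wrap every in-context pair in explicit markers so the model can tell the
# examples apart from the final query; join everything into one user message.
blocks = []
for i, (q, a) in enumerate(examples, 1):
    blocks.append(f"### Example {i}\n{DEFAULT_IMAGE_TOKEN}\nQ: {q}\nA: {a}")
blocks.append(f"### Query\n{DEFAULT_IMAGE_TOKEN}\nQ: {query_question}\nA:")
user_message = "\n\n".join(blocks)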



ajaymin28 commented Jun 18, 2024

> Question
>
> Hello, I want to input some in-context examples to LLaVA, but I can't find any guidance on how to insert images into the input prompt. Could you give me some templates for a multi-image input prompt? Thank you very much.

OK, so I got it working by making changes like the ones below. It's limited to two images here, and I'm not sure how many images can be added this way.

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX

from llava.conversation import conv_templates, SeparatorStyle

from PIL import Image
import requests
import copy
import torch


pretrained = "lmms-lab/llama3-llava-next-8b"
model_name = "llava_llama_3"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map) # Add any other thing you want to pass in llava_model_args

model.eval()
model.tie_weights()

url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)

image_tensor = process_images([image], image_processor, model.config)
image_tensor_list = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor] # Jai: replace image_tensor by image_tensor_list that will be used to append second image

## Jai: add second image ["digitally altered image of a person standing in the water"]
image2 = Image.open("./LLaVA-NeXT/inputs/9f776e16-0d07-40d7-b2fd-45e23267f79b.jpg")
image_tensor2 = process_images([image2], image_processor, model.config)
for _image in image_tensor2:
    image_tensor_list.append(_image.to(dtype=torch.float16, device=device))


Instruction_COT = """There are two different images provided as an input, describe each of them independently""" 
conv_template = "llava_llama_3" # Make sure you use correct chat template for different models

 
question = DEFAULT_IMAGE_TOKEN + DEFAULT_IMAGE_TOKEN + f"\n{Instruction_COT}\n" # Jai: add second Image token in the question. By default there is only one.
conv = copy.deepcopy(conv_templates[conv_template]) 
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)  # move the prompt token ids to the same device as the image tensors
image_sizes = [image.size, image2.size] # Jai: add second image size here.


cont = model.generate(
    input_ids,
    images=image_tensor_list,
    image_sizes=image_sizes,
    do_sample=False,
    temperature=0,
    max_new_tokens=2024,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs)

Here's the output:

'\nThe image on the left appears to be a radar chart, also known as a spider chart or a web chart. This type of chart is used to display multivariate data in the form of a two-dimensional chart of three or more quantitative variables represented on axes starting from the same point. Each axis represents a different variable, and the values are plotted along each axis and connected to form a polygon.\n\nThe radar chart in the image is labeled with various acronyms such as "MM-Vet," "LLaVA-Bench," "SEED-Bench," "MMBench-CN," "MMBench," "TextVQA," "POPE," "BLIP-2," "InstructionBLIP," "Owen-VL-Chat," and "LLaVA-1.5." These labels likely represent different benchmarks or models used in a particular context, possibly in the field of natural language processing or a related area of artificial intelligence.\n\nThe radar chart is color-coded, with different colors representing different models or benchmarks. The chart is overlaid with a blue background that seems to be a stylized representation of water, giving the impression that the radar chart is underwater.\n\nThe image on the right shows a person standing in what appears to be a body of water, possibly a pool or a shallow sea, given the clear visibility and the presence of bubbles. The person is wearing a black shirt and dark pants, and they are looking directly at the camera with a neutral expression. The water around them is a bright blue, and there are bubbles visible, suggesting that the water is clear and possibly that the person is underwater. The image is a photograph and has a realistic style.'
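
The same pattern should extend beyond two images. Below is a hedged sketch that generalizes the snippet above to N in-context examples, reusing the tokenizer, model, image_processor, and device set up earlier; it is untested, and the file paths and few-shot questions/answers are placeholders.

import copy
import torch
from PIL import Image

from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import process_images, tokenizer_image_token

# Placeholder few-shot examples: (image path, question, answer).
shots = [
    ("./shots/example1.jpg", "What is in this image?", "A radar chart of benchmark scores."),
    ("./shots/example2.jpg", "What is in this image?", "A person standing in water."),
]
query_image_path = "./shots/query.jpg"
query_question = "What is in this image?"

# Preprocess all images (in-context examples first, query image last).
images = [Image.open(p) for p, _, _ in shots] + [Image.open(query_image_path)]
image_tensors = process_images(images, image_processor, model.config)
image_tensors = [t.to(dtype=torch.float16, device=device) for t in image_tensors]
image_sizes = [img.size for img in images]

# One <image> token per image, interleaved with the example Q/A text.
parts = [f"{DEFAULT_IMAGE_TOKEN}\n{q}\n{a}" for _, q, a in shots]
parts.append(f"{DEFAULT_IMAGE_TOKEN}\n{query_question}")
conv = copy.deepcopy(conv_templates["llava_llama_3"])
conv.append_message(conv.roles[0], "\n\n".join(parts))
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
out = model.generate(
    input_ids,
    images=image_tensors,
    image_sizes=image_sizes,
    do_sample=False,
    max_new_tokens=512,
)
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])

The number of <image> tokens in the prompt must match the number of image tensors, in the same order. As noted earlier in the thread, the in-context answers can still leak into the answer for the query image, so results may be mixed.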
