
[Feature-Request] Support for GPT Vision #624

Open
antoan opened this issue Mar 10, 2024 · 8 comments
Labels
enhancement New feature or request

Comments

@antoan

antoan commented Mar 10, 2024

I have already tried a solution suggested by @truebit for this earlier, without any luck. I documented my attempt here:

Originally posted by @antoan in #459 (comment)

@antoan antoan changed the title Support for GPT Vision [Feature-Request] Support for GPT Vision Mar 10, 2024
@arnavsinghvi11
Collaborator

Hi @antoan, I believe you would have to pass the path through an image_url argument. Feel free to follow the OpenAI docs on vision for reference on passing the proper configuration.
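For reference, the message shape the OpenAI vision docs describe looks roughly like the sketch below: instead of a plain string, "content" becomes a list of typed parts. `build_vision_message` is an illustrative helper name, not part of DSPy or the OpenAI SDK:

```python
def build_vision_message(prompt: str, image_url: str) -> dict:
    """Build one OpenAI-style chat message mixing text and an image URL."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }
```

The image_url can be a regular https URL or a base64 data URL.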

@jamesschinnerplxs

I have the same desire to try this out with images. Looking at the GPT class, one current restriction I see is that the __call__ method only takes a single string parameter, 'prompt', which gets passed as a single 'message' (in OpenAI terms) to the LM: [{"role": "user", "content": prompt}]

A way forward would be to allow GPT to accept a list of 'prompts' which are type-annotated somehow, and to dynamically construct the messages based on those types, i.e. an inline base64 image or an image_url.

Not sure if there is existing machinery in DSPy which can help with this (I've really only looked at this for an hour or so... including reading the docs!)
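The idea above could be sketched like this. `TextPart`, `ImagePart`, and `build_content` are made-up names for illustration, not existing DSPy machinery:

```python
import base64
from dataclasses import dataclass
from typing import Union


@dataclass
class TextPart:
    text: str


@dataclass
class ImagePart:
    data: bytes           # raw image bytes, inlined as base64
    mime: str = "image/jpeg"


def build_content(parts: list[Union[TextPart, ImagePart]]) -> list[dict]:
    """Dynamically build an OpenAI 'content' list from typed prompt parts."""
    content = []
    for part in parts:
        if isinstance(part, TextPart):
            content.append({"type": "text", "text": part.text})
        else:
            b64 = base64.b64encode(part.data).decode("utf-8")
            content.append({
                "type": "image_url",
                "image_url": {"url": f"data:{part.mime};base64,{b64}"},
            })
    return content
```

A GPT-like class could then accept such a list instead of a single prompt string and pass the result as the "content" of a user message.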

@jmanhype

(quotes @jamesschinnerplxs's comment above in full)

Maybe start here https://github.com/stanfordnlp/dspy/blob/main/dsp/modules/lm.py

@thomasahle thomasahle added the enhancement New feature or request label Mar 18, 2024
@jmanhype

I have made a PR: #675


@dat-boris
Contributor

I am also in need of this feature, started with this at
#1099

Tested this with both Gemini and GPT-4o. If anybody is interested, welcome to try it out!

@ZhijieXiong

I have also been looking for a solution to this problem, and saw that someone had written a GPT4Vision class (https://github.com/stanfordnlp/dspy/blob/56a0949ad285e0a3dd5649de58a6f5fb6f734a60/dsp/modules/gpt4vision.py#L106C1-L147C20), but that one is too complicated. Here is a simple solution I wrote based on the documentation; it has actually been tested and works:

import dspy
import requests
import base64
import re
from dsp import LM


def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')


class GPTVision(LM):
    def __init__(self, model, api_key):
        super().__init__(model)
        self.model = model
        self.api_key = api_key
        self.provider = "openai"

        self.history = []
        self.base_url = "https://api.openai.com/v1/chat/completions"

    def basic_request(self, prompt, **kwargs):
        # Extract the image file path from the prompt, and then remove that
        # part from the prompt before sending it to the API.
        pattern = r'^Image Path: .*'
        matches = re.findall(pattern, prompt, re.MULTILINE)

        # The second match carries the actual value; the first comes from the
        # field-description section of the DSPy prompt.
        image_path = matches[1].replace("Image Path: ", "")
        for match in matches:
            prompt = prompt.replace(f"\n{match}\n", "")

        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {self.api_key}"
        }
        base64_image = encode_image(image_path)

        data = {
            **kwargs,
            "model": self.model,
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": prompt
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:image/jpeg;base64,{base64_image}"
                            }
                        }
                    ]
                }
            ],
            "max_tokens": 300
        }
        response = requests.post(self.base_url, headers=headers, json=data)
        response = response.json()

        self.history.append({
            "prompt": prompt,
            "response": response,
            "kwargs": kwargs
        })

        return response

    def __call__(self, prompt, only_completed=True, return_sorted=False, **kwargs):
        responses = self.request(prompt, **kwargs)
        completions = [choice["message"]["content"] for choice in responses["choices"]]

        return completions


class VqaCoT(dspy.Signature):
    """Answer the questions based on the pictures."""

    image_path = dspy.InputField(desc="Path to the image file")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="Answer based on image and question")


if __name__ == "__main__":
    gpt4o = GPTVision(model='gpt-4o', api_key="your OpenAI API key")
    qa = dspy.ChainOfThought(VqaCoT)
    with dspy.context(lm=gpt4o):
        print(qa(question="What is the occupation of the man in the picture?",
                 image_path="/Users/dream/myProjects/ITS-llm/demo/curry.jpeg"))
        # Prediction(
        #     rationale='Question: What is the occupation of the man in the picture?\n\nReasoning: Let\'s think step by step in order to determine his occupation. We observe that he is wearing a basketball jersey with the team name "Golden State Warriors" and holding a basketball. These indicators suggest that he is likely employed in a profession related to basketball.',
        #     answer='The man is a basketball player.'
        # )
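As an aside, the "Image Path:" extraction in basic_request can be demonstrated in isolation. The prompt below is a made-up stand-in for what dspy.ChainOfThought actually renders, where the first "Image Path:" line belongs to the format section and the second carries the real value:

```python
import re

# Made-up prompt resembling DSPy's rendered output (not an actual trace).
prompt = (
    "Image Path: path to the image file\n"
    "Question: ...\n"
    "\n"
    "Image Path: /tmp/curry.jpeg\n"
    "Question: What is the occupation of the man in the picture?\n"
)

matches = re.findall(r'^Image Path: .*', prompt, re.MULTILINE)
image_path = matches[1].replace("Image Path: ", "")  # second match = actual value
print(image_path)  # /tmp/curry.jpeg
```

Note this makes the approach fragile: it assumes exactly two "Image Path:" lines appear in every rendered prompt.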

@aliirz

aliirz commented Jul 20, 2024

(quotes @ZhijieXiong's comment above, code included, in full)
Thank you


9 participants