
Example PR: **Support Monkey** ([#45](https://github.com/open-compass/VLMEvalKit/pull/45/files))

All existing models are implemented in `vlmeval/vlm`. For a minimal model, your model class **should implement the method** `generate(msgs, dataset=None)`. In this function, you feed a multi-modal message to your VLM and return the VLM prediction (a string). The optional argument `dataset` can be used as a flag for the model to switch among various inference strategies.

The multi-modal message `msgs` is a list of dictionaries; each dictionary has two keys, `type` and `value`:
- `type`: We currently support two types; choices are `["image", "text"]`.
- `value`: When `type=='text'`, the value is the text message (a single string); when `type=='image'`, the value can be the local path of an image file or an image URL.

Currently, a multi-modal message may contain arbitrarily interleaved images and texts. If your model does not support that, our recommended practice is to take the first image and the concatenated text messages as the input to the model.
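
Below is a minimal sketch of that fallback, assuming a hypothetical `MyVLM` class with a `self.vlm.chat(image, prompt)` backend call (neither name is part of VLMEvalKit):

```python
class MyVLM:
    def generate(self, msgs, dataset=None):
        # Split the interleaved message into image and text parts
        images = [m['value'] for m in msgs if m['type'] == 'image']
        texts = [m['value'] for m in msgs if m['type'] == 'text']
        # Fallback for models without interleaved support:
        # first image + concatenated text messages
        image = images[0] if images else None
        prompt = '\n'.join(texts)
        # Hypothetical backend call; should return a string prediction
        return self.vlm.chat(image, prompt)
```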

Here are some examples of multi-modal messages:

```python
IMAGE_PTH = 'assets/apple.jpg'
IMAGE_URL = 'https://raw.githubusercontent.com/open-compass/VLMEvalKit/main/assets/apple.jpg'
msg1 = [
    dict(type='image', value=IMAGE_PTH),
    dict(type='text', value='What is in this image?')
]
msg2 = [
    dict(type='image', value=IMAGE_URL),
    dict(type='image', value=IMAGE_URL),
    dict(type='text', value='How many apples are there in these images?')
]
response = model.generate(msg1)
```

For convenience's sake, we also support taking a list of strings as input. In that case, we check whether each string is an image path or an image URL and automatically convert it to the `list[dict]` format:

```python
IMAGE_PTH = 'assets/apple.jpg'
IMAGE_URL = 'https://raw.githubusercontent.com/open-compass/VLMEvalKit/main/assets/apple.jpg'
msg1 = [IMAGE_PTH, 'What is in this image?']
msg2 = [IMAGE_URL, IMAGE_URL, 'How many apples are there in these images?']
response = model.generate(msg1)
```

Besides, your model can support custom prompt building by implementing two optional methods: `use_custom_prompt(dataset)` and `build_prompt(line, dataset=None)`. Both functions take the dataset name as input. `use_custom_prompt` returns a boolean flag indicating whether the model should use a custom prompt building strategy for that dataset. If it returns True, `build_prompt` should return a custom-built multi-modal message for the corresponding `dataset`, given `line`, a dictionary that includes the necessary information of a data sample. If it returns False, the default prompt building strategy is used.
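
For illustration, here is a hedged sketch of the two methods; the dataset name `'MMBench'` and the sample fields `question` / `image_path` are assumptions for the example, not a prescribed schema:

```python
class MyVLM:
    def use_custom_prompt(self, dataset):
        # Use the custom strategy only for one (hypothetical) dataset
        return dataset == 'MMBench'

    def build_prompt(self, line, dataset=None):
        # `line` is a dict with the fields of one data sample;
        # the field names used here are illustrative
        prompt = line['question'] + '\nAnswer with the option letter directly.'
        return [
            dict(type='image', value=line['image_path']),
            dict(type='text', value=prompt),
        ]
```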

### Example PRs:

- VLM that doesn't support interleaved images and texts, and does not use custom prompts: [[Model] Support glm-4v-9b](https://github.com/open-compass/VLMEvalKit/pull/221)
- VLM that supports interleaved images and texts and custom prompts: [Add MiniCPM-Llama3-V-2.5](https://github.com/open-compass/VLMEvalKit/pull/205)
- VLM API: [Feature add glmv](https://github.com/open-compass/VLMEvalKit/pull/201)

## Contribute to VLMEvalKit
