
Commit

Merge pull request microsoft#56 from MingxuanXia/mingxuanxia/multimodal
Add multi modal evaluations
jindongwang committed Mar 13, 2024
2 parents 4c68eb4 + e17968d commit e856221
Showing 15 changed files with 1,949 additions and 10 deletions.
26 changes: 21 additions & 5 deletions README.md
@@ -69,6 +69,7 @@
<!-- News and Updates -->

## News and Updates
- [13/03/2024] Add support for multi-modal models and datasets.
- [05/01/2024] Add support for BigBench Hard, DROP, ARC datasets.
- [16/12/2023] Add support for Gemini, Mistral, Mixtral, Baichuan, Yi models.
- [15/12/2023] Add detailed instructions for users to add new modules (models, datasets, etc.) [examples/add_new_modules.md](examples/add_new_modules.md).
@@ -161,7 +162,7 @@ import promptbench as pb

We provide tutorials for:

1. **evaluate models on existing benchmarks:** please refer to the [examples/basic.ipynb](examples/basic.ipynb) for constructing your evaluation pipeline.
1. **evaluate models on existing benchmarks:** please refer to the [examples/basic.ipynb](examples/basic.ipynb) for constructing your evaluation pipeline. For a multi-modal evaluation pipeline, please refer to [examples/multimodal.ipynb](examples/multimodal.ipynb).
2. **test the effects of different prompting techniques:**
3. **examine the robustness for prompt attacks**, please refer to [examples/prompt_attack.ipynb](examples/prompt_attack.ipynb) to construct the attacks.
4. **use DyVal for evaluation:** please refer to [examples/dyval.ipynb](examples/dyval.ipynb) to construct DyVal datasets.
@@ -185,6 +186,13 @@ PromptBench currently supports different datasets, models, prompt engineering me
- Numersense
- QASC
- Last Letter Concatenate
- VQAv2
- NoCaps
- MMMU
- MathVista
- AI2D
- ChartQA
- ScienceQA

### Models

@@ -203,6 +211,18 @@ PromptBench currently supports different datasets, models, prompt engineering me
- GPT-4
- Gemini Pro

### Models (Multi-Modal)

- Open-source models:
- BLIP2
- LLaVA
- Qwen-VL, Qwen-VL-Chat
- InternLM-XComposer2-VL
- Proprietary models:
- GPT-4v
- GeminiProVision
- Qwen-VL-Max, Qwen-VL-Plus
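
As a quick sketch (the full walkthrough is in [examples/multimodal.ipynb](examples/multimodal.ipynb)), a multi-modal dataset and model can be loaded through the same `DatasetLoader` and `VLMModel` interfaces; the constructor arguments below follow that example and may need adjusting for other models or hardware:

```python
import promptbench as pb

# load a multi-modal dataset (downloaded automatically if not cached locally)
dataset = pb.DatasetLoader.load_dataset("mmmu")

# load an open-source VLM; arguments follow examples/multimodal.ipynb and
# may need adjusting for other models or hardware
model = pb.VLMModel(model='llava-hf/llava-1.5-7b-hf',
                    max_new_tokens=2048, temperature=0.0001, device='cuda')
```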

### Prompt Engineering

- Chain-of-thought (COT) [1]
@@ -239,10 +259,6 @@ PromptBench currently supports different datasets, models, prompt engineering me

Please refer to our [benchmark website](https://llm-eval.github.io/) for benchmark results on Prompt Attacks, Prompt Engineering and Dynamic Evaluation DyVal.

## TODO

- [ ] Add support for multi-modal models such as LlaVa and BLIP2.

## Acknowledgements

- [TextAttack](https://github.com/QData/TextAttack)
134 changes: 134 additions & 0 deletions docs/examples/multimodal.md
@@ -0,0 +1,134 @@
# Multi-Modal Models

This example walks you through the basic usage of multi-modal models in PromptBench. We hope it helps you get familiar with the APIs so that you can use them in your own projects later.

First, the entire package is available through a single import: `import promptbench as pb`.


```python
import promptbench as pb
```

/anaconda/envs/promptbench_1/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm


## Load dataset

PromptBench makes it easy to load datasets.


```python
# print all supported datasets in promptbench
print('All supported datasets: ')
print(pb.SUPPORTED_DATASETS_VLM)

# load a dataset, MMMU, for instance.
# if the dataset is not available locally, it will be downloaded automatically.
dataset = pb.DatasetLoader.load_dataset("mmmu")

# print the first 5 examples
dataset[:5]
```

All supported datasets:
['vqav2', 'nocaps', 'science_qa', 'math_vista', 'ai2d', 'mmmu', 'chart_qa']





[{'images': [<PIL.PngImagePlugin.PngImageFile image mode=RGBA size=733x237>],
'answer': 'B',
'question': '<image 1> Baxter Company has a relevant range of production between 15,000 and 30,000 units. The following cost data represents average variable costs per unit for 25,000 units of production. If 30,000 units are produced, what are the per unit manufacturing overhead costs incurred?\nA: $6\nB: $7\nC: $8\nD: $9'},
{'images': [<PIL.PngImagePlugin.PngImageFile image mode=RGBA size=342x310>],
'answer': 'C',
'question': 'Assume accounts have normal balances, solve for the one missing account balance: Dividends. Equipment was recently purchased, so there is neither depreciation expense nor accumulated depreciation. <image 1>\nA: $194,815\nB: $182,815\nC: $12,000\nD: $9,000'},
{'images': [<PIL.PngImagePlugin.PngImageFile image mode=RGBA size=336x169>],
'answer': 'B',
'question': 'Maxwell Software, Inc., has the following mutually exclusive projects.Suppose the company uses the NPV rule to rank these two projects.<image 1> Which project should be chosen if the appropriate discount rate is 15 percent?\nA: Project A\nB: Project B'},
{'images': [<PIL.PngImagePlugin.PngImageFile image mode=RGBA size=1222x237>],
'answer': 'D',
'question': "Each situation below relates to an independent company's Owners' Equity. <image 1> Calculate the missing values of company 2.\nA: $1,620\nB: $12,000\nC: $51,180\nD: $0"},
{'images': [<PIL.PngImagePlugin.PngImageFile image mode=RGBA size=1219x217>],
'answer': 'B',
'question': 'The following data show the units in beginning work in process inventory, the number of units started, the number of units transferred, and the percent completion of the ending work in process for conversion. Given that materials are added at the beginning of the process, what are the equivalent units for conversion costs for each quarter using the weighted-average method? Assume that the quarters are independent.<image 1>\nA: 132,625\nB: 134,485\nC: 135,332\nD: 132,685'}]



## Load models

Next, you can easily load VLM models via PromptBench.


```python
# print all supported models in promptbench
print('All supported models: ')
print(pb.SUPPORTED_MODELS_VLM)

# load a model, llava-1.5-7b, for instance.
model = pb.VLMModel(model='llava-hf/llava-1.5-7b-hf', max_new_tokens=2048, temperature=0.0001, device='cuda')
```

All supported models:
['Salesforce/blip2-opt-2.7b', 'Salesforce/blip2-opt-6.7b', 'Salesforce/blip2-flan-t5-xl', 'Salesforce/blip2-flan-t5-xxl', 'llava-hf/llava-1.5-7b-hf', 'llava-hf/llava-1.5-13b-hf', 'gemini-pro-vision', 'gpt-4-vision-preview', 'Qwen/Qwen-VL', 'Qwen/Qwen-VL-Chat', 'qwen-vl-plus', 'qwen-vl-max', 'internlm/internlm-xcomposer2-vl-7b']


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|██████████| 3/3 [00:04<00:00, 1.48s/it]


## Construct prompts

Prompts are the key interface for interacting with VLMs. You can easily construct a prompt by calling the Prompt API.


```python
# Prompt API supports a list, so you can pass multiple prompts at once.
prompts = pb.Prompt([
"You are a helpful assistant. Here is the question:{question}\nANSWER:",
"USER:{question}\nANSWER:",
])
```
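
If you want to preview how a template is filled before running the full evaluation, you can format a single example by hand. This is a minimal sketch assuming `pb.InputProcess.basic_format` substitutes the `{question}` placeholder with the example's `question` field, as it is used in the evaluation loop below:

```python
# preview the first filled prompt for one dataset example
data = dataset[0]
for prompt in prompts:
    # basic_format is assumed to substitute {question} with data['question']
    input_text = pb.InputProcess.basic_format(prompt, data)
    print(input_text)
    break
```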

## Perform evaluation using prompts, datasets, and models

Finally, you can perform standard evaluation using the loaded prompts, dataset, and model.


```python
from tqdm import tqdm
for prompt in prompts:
    preds = []
    labels = []
    for data in tqdm(dataset):
        # process input
        input_text = pb.InputProcess.basic_format(prompt, data)
        input_images = data['images']
        label = data['answer']
        raw_pred = model(input_images, input_text)
        # process output
        pred = pb.OutputProcess.pattern_split(raw_pred, 'ANSWER:')
        preds.append(pred)
        labels.append(label)

    # evaluate
    score = pb.Eval.compute_cls_accuracy(preds, labels)
    print(f"{score:.3f}, {repr(prompt)}")
```

0%| | 0/900 [00:00<?, ?it/s]

100%|██████████| 900/900 [17:35<00:00, 1.17s/it]


0.333, 'You are a helpful assistant. Here is the question:{question}\nANSWER:'


100%|██████████| 900/900 [17:27<00:00, 1.16s/it]

0.316, 'USER:{question}\nANSWER:'




1 change: 1 addition & 0 deletions docs/index.rst
@@ -24,6 +24,7 @@ Welcome to promptbench's documentation!
:caption: Examples

examples/basic
examples/multimodal
examples/dyval
examples/prompt_attack
examples/prompt_engineering
2 changes: 1 addition & 1 deletion docs/start/intro.md
@@ -11,7 +11,7 @@

## Where should I get started?
If you want to
1. **evaluate my model on existing benchmarks:** please refer to the `examples/basic.ipynb` for constructing your evaluation pipeline.
1. **evaluate my model on existing benchmarks:** please refer to the `examples/basic.ipynb` for constructing your evaluation pipeline. For a multi-modal evaluation pipeline, please refer to `examples/multimodal.ipynb`.
2. **test the effects of different prompting techniques:**
3. **examine the robustness against prompt attacks**, please refer to `examples/prompt_attack.ipynb` to construct the attacks.
4. **use DyVal for evaluation:** please refer to `examples/dyval.ipynb` to construct DyVal datasets.

