diff --git a/README.md b/README.md
index 9b09a664..40fa8366 100644
--- a/README.md
+++ b/README.md
@@ -25,6 +25,8 @@ English | [简体中文]
## 🆕 News
+- **[2024-06-18]** We have supported [**MMT-Bench**](https://mmt-bench.github.io), thanks to [**KainingYing**](https://github.com/KainingYing)🔥🔥🔥
+- **[2024-06-05]** We have supported [**WeMM**](https://github.com/scenarios/WeMM), thanks to [**scenarios**](https://github.com/scenarios)🔥🔥🔥
- **[2024-05-27]** We have supported [**Mini InternVL**](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-2B-V1-5), thanks to [**czczup**](https://github.com/czczup)🔥🔥🔥
- **[2024-05-25]** We have supported [**SEEDBench2_Plus**](https://arxiv.org/abs/2404.16790), thanks to [**Bohao-Lee**](https://github.com/Bohao-Lee)🔥🔥🔥
- **[2024-05-24]** We have supported [**Phi-3-Vision**](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct) and [**CogVLM2-Llama3-chat**](https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B) 🔥🔥🔥
@@ -33,8 +35,6 @@ English | [简体中文]
- **[2024-05-15]** We have supported [**PaliGemma-3B**](https://huggingface.co/google/paligemma-3b-pt-448), a versatile and lightweight vision-language model released by Google 🔥🔥🔥
- **[2024-05-14]** We have supported [**GPT-4o**](https://openai.com/index/hello-gpt-4o/) 🔥🔥🔥
- **[2024-05-07]** We have supported [**XVERSE-V-13B**](https://github.com/xverse-ai/XVERSE-V-13B/blob/main/vxverse/models/vxverse.py), thanks to [**YJY123**](https://github.com/YJY123) 🔥🔥🔥
-- **[2024-05-06]** We have launched a discord channel for VLMEvalKit users: https://discord.gg/evDT4GZmxN. Latest updates and discussion will be posted here
-- **[2024-05-06]** We have supported 2 VLMs based on Llama3 🔥🔥🔥: Bunny-llama3-8B (SigLIP, image size 384) and llava-llama-3-8b (CLIP-L, image size 336), you can now evaluate both models on dozens of datasets we supported
## 📊 Datasets, Models, and Evaluation Results
@@ -50,7 +50,7 @@ English | [简体中文]
| ------------------------------------------------------------ | ------------------------------------------------------ | --------- | --------- | --------- | --------- |
| [**MMBench Series**](https://github.com/open-compass/mmbench/): <br>MMBench, MMBench-CN, CCBench | MMBench_DEV_[EN/CN] <br>MMBench_TEST_[EN/CN] <br>MMBench_DEV_[EN/CN]_V11 <br>MMBench_TEST_[EN/CN]_V11 <br>CCBench | Multi-choice <br>Question (MCQ) | [**MMStar**](https://github.com/MMStar-Benchmark/MMStar) | MMStar | MCQ |
| [**MME**](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation) | MME | Yes or No (Y/N) | [**SEEDBench Series**](https://github.com/AILab-CVC/SEED-Bench) | SEEDBench_IMG, SEEDBench2_Plus | MCQ |
-| [**MM-Vet**](https://github.com/yuweihao/MM-Vet) | MMVet | VQA | [**MMMU**](https://mmmu-benchmark.github.io) | MMMU_DEV_VAL/MMMU_TEST | MCQ |
+| [**MM-Vet**](https://github.com/yuweihao/MM-Vet) | MMVet | VQA | [**MMMU**](https://mmmu-benchmark.github.io) | MMMU_[DEV_VAL/TEST] | MCQ |
| [**MathVista**](https://mathvista.github.io) | MathVista_MINI | VQA | [**ScienceQA_IMG**](https://scienceqa.github.io) | ScienceQA_[VAL/TEST] | MCQ |
| [**COCO Caption**](https://cocodataset.org) | COCO_VAL | Caption | [**HallusionBench**](https://github.com/tianyi-lab/HallusionBench) | HallusionBench | Y/N |
| [**OCRVQA**](https://ocr-vqa.github.io)* | OCRVQA_[TESTCORE/TEST] | VQA | [**TextVQA**](https://textvqa.org)* | TextVQA_VAL | VQA |
@@ -59,6 +59,7 @@ English | [简体中文]
| [**InfoVQA**](https://www.docvqa.org/datasets/infographicvqa)+ | InfoVQA_[VAL/TEST] | VQA | [**OCRBench**](https://github.com/Yuliang-Liu/MultimodalOCR) | OCRBench | VQA |
| [**RealWorldQA**](https://x.ai/blog/grok-1.5v) | RealWorldQA | MCQ | [**POPE**](https://github.com/AoiDragon/POPE) | POPE | Y/N |
| [**Core-MM**](https://github.com/core-mm/core-mm)- | CORE_MM | VQA | [**SEEDBench2_Plus**](https://arxiv.org/abs/2404.16790) | SEEDBench2_Plus | MCQ |
+| [**MMT-Bench**](https://mmt-bench.github.io) | MMT-Bench_[VAL/VAL_MI/ALL/ALL_MI] | MCQ | | | |
**\*** We only provide a subset of the evaluation results, since some VLMs do not yield reasonable results under the zero-shot setting
@@ -80,11 +81,11 @@ VLMEvalKit will use a **judge LLM** to extract answer from the output if you se
| ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| [**mPLUG-Owl2**](https://github.com/X-PLUG/mPLUG-Owl/tree/main/mPLUG-Owl2)🎞️ | [**OpenFlamingo-v2**](https://github.com/mlfoundations/open_flamingo)🎞️ | [**PandaGPT-13B**](https://github.com/yxuansu/PandaGPT) | [**Qwen-VL**](https://huggingface.co/Qwen/Qwen-VL)🎞️🚅, [**Qwen-VL-Chat**](https://huggingface.co/Qwen/Qwen-VL-Chat)🎞️**🚅** |
| [**VisualGLM-6B**](https://huggingface.co/THUDM/visualglm-6b)🚅 | [**InternLM-XComposer-7B**](https://huggingface.co/internlm/internlm-xcomposer-7b)🚅🎞️ | [**ShareGPT4V-[7B/13B]**](https://sharegpt4v.github.io)🚅 | [**TransCore-M**](https://github.com/PCIResearch/TransCore-M) |
-| [**LLaVA (XTuner)**](https://huggingface.co/xtuner/llava-internlm-7b)🚅 | [**CogVLM-[Chat/Llama3]**](https://huggingface.co/THUDM/cogvlm-chat-hf)🚅 | [**SharedCaptioner**](https://huggingface.co/spaces/Lin-Chen/Share-Captioner)🚅 | [**CogVLM-Grounding-Generalist**](https://huggingface.co/THUDM/cogvlm-grounding-generalist-hf)🚅 |
+| [**LLaVA (XTuner)**](https://huggingface.co/xtuner/llava-internlm-7b)🚅 | [**CogVLM-[Chat/Llama3]**](https://huggingface.co/THUDM/cogvlm-chat-hf)🚅 | [**ShareCaptioner**](https://huggingface.co/spaces/Lin-Chen/Share-Captioner)🚅 | [**CogVLM-Grounding-Generalist**](https://huggingface.co/THUDM/cogvlm-grounding-generalist-hf)🚅 |
| [**Monkey**](https://github.com/Yuliang-Liu/Monkey)🚅, [**Monkey-Chat**](https://github.com/Yuliang-Liu/Monkey)🚅 | [**EMU2-Chat**](https://github.com/baaivision/Emu)🚅🎞️ | [**Yi-VL-[6B/34B]**](https://huggingface.co/01-ai/Yi-VL-6B) | [**MMAlaya**](https://huggingface.co/DataCanvas/MMAlaya)🚅 |
-| [**InternLM-XComposer2-[1.8B/7B]**](https://huggingface.co/internlm/internlm-xcomposer2-vl-7b)🚅🎞️ | [**MiniCPM-[V1/V2/V2.5]**](https://huggingface.co/openbmb/MiniCPM-V)🚅 | [**OmniLMM-12B**](https://huggingface.co/openbmb/OmniLMM-12B) | [**InternVL-Chat-[V1-1/V1-2/V1-2-Plus/V1-5]**](https://github.com/OpenGVLab/InternVL)🚅, [**Mini-InternVL-Chat-2B-V1-5**](https://github.com/OpenGVLab/InternVL)🚅 |
+| [**InternLM-XComposer2-[1.8B/7B]**](https://huggingface.co/internlm/internlm-xcomposer2-vl-7b)🚅🎞️ | [**MiniCPM-[V1/V2/V2.5]**](https://huggingface.co/openbmb/MiniCPM-V)🚅 | [**OmniLMM-12B**](https://huggingface.co/openbmb/OmniLMM-12B) | [**InternVL-Chat-[V1-1/V1-2/V1-2-Plus/V1-5]**](https://github.com/OpenGVLab/InternVL)🚅, <br>[**Mini-InternVL-Chat-2B-V1-5**](https://github.com/OpenGVLab/InternVL)🚅 |
| [**DeepSeek-VL**](https://github.com/deepseek-ai/DeepSeek-VL/tree/main)🎞️ | [**LLaVA-NeXT**](https://llava-vl.github.io/blog/2024-01-30-llava-next/)🚅 | [**Bunny-Llama3**](https://huggingface.co/BAAI/Bunny-Llama-3-8B-V)🚅 | [**XVERSE-V-13B**](https://github.com/xverse-ai/XVERSE-V-13B/blob/main/vxverse/models/vxverse.py) |
-| [**PaliGemma-3B**](https://huggingface.co/google/paligemma-3b-pt-448) 🚅 | [**360VL-70B**](https://huggingface.co/qihoo360/360VL-70B) 🚅 | [**Phi-3-Vision**](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct)🚅 | |
+| [**PaliGemma-3B**](https://huggingface.co/google/paligemma-3b-pt-448) 🚅 | [**360VL-70B**](https://huggingface.co/qihoo360/360VL-70B) 🚅 | [**Phi-3-Vision**](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct)🚅 | [**WeMM**](https://github.com/scenarios/WeMM)🚅 |
🎞️: Support multiple images as inputs.
@@ -94,9 +95,10 @@ VLMEvalKit will use a **judge LLM** to extract answer from the output if you se
Note that some VLMs may not be able to run under certain transformers versions; we recommend the following settings to evaluate each VLM:
-- **Please use** `transformers==4.33.0` **for**: `Qwen series`, `Monkey series`, `InternLM-XComposer Series`, `mPLUG-Owl2`, `OpenFlamingo v2`, `IDEFICS series`, `VisualGLM`, `MMAlaya`, `SharedCaptioner`, `MiniGPT-4 series`, `InstructBLIP series`, `PandaGPT`, `VXVERSE`.
-- **Please use** `transformers==4.37.0` **for**: `LLaVA series`, `ShareGPT4V series`, `TransCore-M`, `LLaVA (XTuner)`, `CogVLM Series`, `EMU2 Series`, `Yi-VL Series`, `MiniCPM-V (v1, v2)`, `OmniLMM-12B`, `DeepSeek-VL series`, `InternVL series`.
-- **Please use** `transformers==4.40.0` **for**: `IDEFICS2`, `Bunny-Llama3`, `MiniCPM-Llama3-V2.5`, `LLaVA-Next series`, `360VL-70B`, `Phi-3-Vision`.
+- **Please use** `transformers==4.33.0` **for**: `Qwen series`, `Monkey series`, `InternLM-XComposer Series`, `mPLUG-Owl2`, `OpenFlamingo v2`, `IDEFICS series`, `VisualGLM`, `MMAlaya`, `ShareCaptioner`, `MiniGPT-4 series`, `InstructBLIP series`, `PandaGPT`, `VXVERSE`.
+- **Please use** `transformers==4.37.0` **for**: `LLaVA series`, `ShareGPT4V series`, `TransCore-M`, `LLaVA (XTuner)`, `CogVLM Series`, `EMU2 Series`, `Yi-VL Series`, `MiniCPM-[V1/V2]`, `OmniLMM-12B`, `DeepSeek-VL series`, `InternVL series`.
+- **Please use** `transformers==4.40.0` **for**: `IDEFICS2`, `Bunny-Llama3`, `MiniCPM-Llama3-V2.5`, `LLaVA-Next series`, `360VL-70B`, `Phi-3-Vision`, `WeMM`.
+- **Please use** `transformers==latest` **for**: `PaliGemma-3B`.
```python
# Demo
diff --git a/README_zh-CN.md b/README_zh-CN.md
index 7bb1d15a..cc808b5e 100644
--- a/README_zh-CN.md
+++ b/README_zh-CN.md
@@ -23,6 +23,8 @@
## 🆕 News
+- **[2024-06-18]** We have supported [**MMT-Bench**](https://mmt-bench.github.io), thanks to [**KainingYing**](https://github.com/KainingYing)🔥🔥🔥
+- **[2024-06-05]** We have supported [**WeMM**](https://github.com/scenarios/WeMM), thanks to [**scenarios**](https://github.com/scenarios)🔥🔥🔥
- **[2024-05-27]** We have supported [**Mini InternVL**](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-2B-V1-5), thanks to [**czczup**](https://github.com/czczup)🔥🔥🔥
- **[2024-05-25]** We have supported [**SEEDBench2_Plus**](https://arxiv.org/abs/2404.16790), thanks to [**Bohao-Lee**](https://github.com/Bohao-Lee)🔥🔥🔥
- **[2024-05-24]** We have supported [**Phi-3-Vision**](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct) and [**CogVLM2-Llama3-chat**](https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B) 🔥🔥🔥
@@ -31,8 +33,6 @@
- **[2024-05-15]** We have supported [**PaliGemma-3B**](https://huggingface.co/google/paligemma-3b-pt-448), a 3B multimodal model open-sourced by Google 🔥🔥🔥
- **[2024-05-14]** We have supported [**GPT-4o**](https://openai.com/index/hello-gpt-4o/) 🔥🔥🔥
- **[2024-05-07]** We have supported [**XVERSE-V-13B**](https://github.com/xverse-ai/XVERSE-V-13B/blob/main/vxverse/models/vxverse.py), thanks to [**YJY123**](https://github.com/YJY123) 🔥🔥🔥
-- **[2024-05-06]** We have launched a Discord channel for VLMEvalKit users: https://discord.gg/evDT4GZmxN. Updates about VLMEvalKit will be shared and discussed there
-- **[2024-05-06]** We have supported two Llama3-based VLMs 🔥🔥🔥: Bunny-llama3-8B (SigLIP, input image size 384) and llava-llama-3-8b (CLIP-L, input image size 336); both can be evaluated on the dozens of benchmarks we support
## 📊 Evaluation Results, Supported Datasets and Models
### Evaluation Results
@@ -56,6 +56,7 @@
| [**InfoVQA**](https://www.docvqa.org/datasets/infographicvqa)+ | InfoVQA_[VAL/TEST] | VQA | [**OCRBench**](https://github.com/Yuliang-Liu/MultimodalOCR) | OCRBench | VQA |
| [**RealWorldQA**](https://x.ai/blog/grok-1.5v) | RealWorldQA | MCQ | [**POPE**](https://github.com/AoiDragon/POPE) | POPE | Y/N |
| [**Core-MM**](https://github.com/core-mm/core-mm)- | CORE_MM | VQA | [**SEEDBench2_Plus**](https://arxiv.org/abs/2404.16790) | SEEDBench2_Plus | MCQ |
+| [**MMT-Bench**](https://mmt-bench.github.io) | MMT-Bench_[VAL/VAL_MI/ALL/ALL_MI] | MCQ | | | |
**\*** We only provide test results for part of the models; the remaining models cannot yield reasonable accuracy under the zero-shot setting
@@ -82,7 +83,7 @@
| [**Monkey**](https://github.com/Yuliang-Liu/Monkey)🚅, [**Monkey-Chat**](https://github.com/Yuliang-Liu/Monkey)🚅 | [**EMU2-Chat**](https://github.com/baaivision/Emu)🚅🎞️ | [**Yi-VL-[6B/34B]**](https://huggingface.co/01-ai/Yi-VL-6B) | [**MMAlaya**](https://huggingface.co/DataCanvas/MMAlaya)🚅 |
| [**InternLM-XComposer2-[1.8B/7B]**](https://huggingface.co/internlm/internlm-xcomposer2-vl-7b)🚅🎞️ | [**MiniCPM-[V1/V2/V2.5]**](https://huggingface.co/openbmb/MiniCPM-V)🚅 | [**OmniLMM-12B**](https://huggingface.co/openbmb/OmniLMM-12B) | [**InternVL-Chat-[V1-1/V1-2/V1-2-Plus/V1-5]**](https://github.com/OpenGVLab/InternVL)🚅, [**Mini-InternVL-Chat-2B-V1-5**](https://github.com/OpenGVLab/InternVL)🚅 |
| [**DeepSeek-VL**](https://github.com/deepseek-ai/DeepSeek-VL/tree/main)🎞️ | [**LLaVA-NeXT**](https://llava-vl.github.io/blog/2024-01-30-llava-next/)🚅 | [**Bunny-Llama3**](https://huggingface.co/BAAI/Bunny-Llama-3-8B-V)🚅 | [**XVERSE-V-13B**](https://github.com/xverse-ai/XVERSE-V-13B/blob/main/vxverse/models/vxverse.py) |
-| [**PaliGemma-3B**](https://huggingface.co/google/paligemma-3b-pt-448) 🚅 | [**360VL-70B**](https://huggingface.co/qihoo360/360VL-70B)🚅 | [**Phi-3-Vision**](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct) 🚅 | |
+| [**PaliGemma-3B**](https://huggingface.co/google/paligemma-3b-pt-448) 🚅 | [**360VL-70B**](https://huggingface.co/qihoo360/360VL-70B)🚅 | [**Phi-3-Vision**](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct) 🚅 | [**WeMM**](https://github.com/scenarios/WeMM)🚅 |
🎞️ indicates that the model supports multiple images as inputs.
@@ -96,7 +97,8 @@
- **Please use** `transformers==4.33.0` **to run**: `Qwen series`, `Monkey series`, `InternLM-XComposer Series`, `mPLUG-Owl2`, `OpenFlamingo v2`, `IDEFICS series`, `VisualGLM`, `MMAlaya`, `SharedCaptioner`, `MiniGPT-4 series`, `InstructBLIP series`, `PandaGPT`, `VXVERSE`.
- **Please use** `transformers==4.37.0` **to run**: `LLaVA series`, `ShareGPT4V series`, `TransCore-M`, `LLaVA (XTuner)`, `CogVLM Series`, `EMU2 Series`, `Yi-VL Series`, `MiniCPM-V (v1, v2)`, `OmniLMM-12B`, `DeepSeek-VL series`, `InternVL series`.
-- **Please use** `transformers==4.40.0` **to run**: `IDEFICS2`, `Bunny-Llama3`, `MiniCPM-Llama3-V2.5`, `LLaVA-Next series`, `360VL-70B`, `Phi-3-Vision`.
+- **Please use** `transformers==4.40.0` **to run**: `IDEFICS2`, `Bunny-Llama3`, `MiniCPM-Llama3-V2.5`, `LLaVA-Next series`, `360VL-70B`, `Phi-3-Vision`, `WeMM`.
+- **Please use** `transformers==latest` **to run**: `PaliGemma-3B`.
**How to test whether a VLM can run normally:**
diff --git a/run.py b/run.py
index ba3153a4..35a1da3b 100644
--- a/run.py
+++ b/run.py
@@ -4,7 +4,7 @@
from vlmeval.evaluate import *
from vlmeval.inference import infer_data_job
from vlmeval.config import supported_VLM
-from vlmeval.utils import dataset_URLs, DATASET_TYPE, abbr2full, MMMU_result_transfer
+from vlmeval.utils import dataset_URLs, DATASET_TYPE, abbr2full, MMMU_result_transfer, MMTBench_result_transfer
def parse_args():
@@ -85,23 +85,7 @@ def main():
api_nproc=args.nproc,
ignore_failed=args.ignore)
- if rank == 0:
- if dataset_name in ['MMMU_TEST']:
- result_json = MMMU_result_transfer(result_file)
- logger.info(f'Transfer MMMU_TEST result to json for official evaluation, json file saved in {result_json}') # noqa: E501
- continue
-
- if dataset_name in [
- 'MMBench_TEST_CN', 'MMBench_TEST_EN', 'MMBench', 'MMBench_CN'
- 'MMBench_TEST_CN_V11', 'MMBench_TEST_EN_V11', 'MMBench_V11', 'MMBench_CN_V11'
- ]:
- if not MMBenchOfficialServer(dataset_name):
- logger.error(
- f'Can not evaluate {dataset_name} on non-official servers, '
- 'will skip the evaluation. '
- )
- continue
-
+ # Set the judge kwargs first before evaluation or dumping
judge_kwargs = {
'nproc': args.nproc,
'verbose': args.verbose,
@@ -120,6 +104,27 @@ def main():
if 'OPENAI_API_BASE_JUDGE' in os.environ and len(os.environ['OPENAI_API_BASE_JUDGE']):
judge_kwargs['api_base'] = os.environ['OPENAI_API_BASE_JUDGE']
+ if rank == 0:
+ if dataset_name in ['MMMU_TEST']:
+ result_json = MMMU_result_transfer(result_file)
+ logger.info(f'Transfer MMMU_TEST result to json for official evaluation, json file saved in {result_json}') # noqa: E501
+ continue
+ elif 'MMT-Bench_ALL' in dataset_name:
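+            # The ALL split is graded on the official server; here we only dump a submission file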
+ submission_file = MMTBench_result_transfer(result_file, **judge_kwargs)
+ logger.info(f'Extract options from prediction of MMT-Bench FULL split for official evaluation (https://eval.ai/web/challenges/challenge-page/2328/overview), submission file saved in {submission_file}') # noqa: E501
+ continue
+
+        if dataset_name in [
+            'MMBench_TEST_CN', 'MMBench_TEST_EN', 'MMBench', 'MMBench_CN',
+            'MMBench_TEST_CN_V11', 'MMBench_TEST_EN_V11', 'MMBench_V11', 'MMBench_CN_V11'
+        ]:
+ if not MMBenchOfficialServer(dataset_name):
+ logger.error(
+ f'Can not evaluate {dataset_name} on non-official servers, '
+ 'will skip the evaluation. '
+ )
+ continue
+
if rank == 0 and args.mode == 'all':
if DATASET_TYPE(dataset_name) == 'multi-choice':
dataset_name = 'default' if custom_flag else dataset_name
diff --git a/vlmeval/evaluate/multiple_choice.py b/vlmeval/evaluate/multiple_choice.py
index 50d52bf2..23104d5f 100644
--- a/vlmeval/evaluate/multiple_choice.py
+++ b/vlmeval/evaluate/multiple_choice.py
@@ -8,7 +8,7 @@
INTERNAL = os.environ.get('INTERNAL', 0)
-abbrs = {
+MMB_abbrs = {
'coarse_perception': 'CP',
'finegrained_perception (instance-level)': 'FP-S',
'finegrained_perception (cross-instance)': 'FP-C',
@@ -17,6 +17,41 @@
'attribute_reasoning': 'AR'
}
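+# Abbreviations for the MMT-Bench ability dimensions, used in accuracy reports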
+MMT_abbrs = {
+ 'visual_recognition': 'VR',
+ 'localization': 'Loc',
+ 'ocr': 'OCR',
+ 'counting': 'Count',
+ 'hallucination': 'HLN',
+ 'image_retrieval': 'IR',
+ 'threed': '3D',
+ 'visual_captioning': 'VC',
+ 'visual_grounding': 'VG',
+ 'doc_understanding': 'DU',
+ 'action_recognition': 'AR',
+ 'pixel_level_perception': 'PLP',
+ 'image-to-image_translation': 'I2IT',
+ 'relation_reasoning': 'RR',
+ 'intelligence_quotient_test': 'IQT',
+ 'emotion': 'Emo',
+ 'visual_illusion': 'VI',
+ 'meme_understanding': 'MemU',
+ 'visual_prompt_understanding': 'VPU',
+ 'anomaly_detection': 'AND',
+ 'keypoint_detection': 'KD',
+ 'visual_commonsense_reasoning': 'VCR',
+ 'image_evaluation_judgement': 'IEJ',
+ 'multiple_image_analysis': 'MIA',
+ 'cross_image_matching': 'CIM',
+ 'temporal_understanding': 'TU',
+ 'visual_code': 'VP',
+ 'medical_understanding': 'MedU',
+ 'autonomous_driving': 'AUD',
+ 'discipline_knowledge_reasoning': 'DKR',
+ 'embodied_ai': 'EA',
+ 'gui_navigation': 'GN'
+}
+
def MMMU_preproc(data):
logger = get_logger('Evaluation')
@@ -54,9 +89,69 @@ def report_acc(df):
abilities = list(set(df[group]))
abilities.sort()
for ab in abilities:
- ab_name = abbrs[ab] if ab in abbrs else ab
+ ab_name = MMB_abbrs[ab] if ab in MMB_abbrs else ab
+ sub_df = df[df[group] == ab]
+ res[ab_name] = [np.mean(sub_df[sub_df['split'] == sp]['hit']) for sp in res['split']]
+ return pd.DataFrame(res)
+
+
+def report_acc_MMT(df):
+    # Report MMT-Bench accuracy as a table: one row per split (plus 'ALL'), one column per ability
+ res = defaultdict(list)
+ res['split'] = list()
+ res['Overall'] = list()
+ for _, name in MMT_abbrs.items():
+ res[name] = list()
+
+ if 'split' in df:
+ splits = list(set(df['split']))
+ res['split'] = splits
+
+ else:
+ df['split'] = ['none'] * len(df)
+ res['split'] = ['none']
+
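+    # Report accuracy at three granularities: Overall, per category, and per l2-category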
+ for group in [None, 'category', 'l2-category']:
+ if group is None:
+ res['Overall'] = [np.mean(df[df['split'] == sp]['hit']) for sp in res['split']]
+            res['Overall'].append(np.mean(df['hit']))
+ elif group not in df:
+ continue
+ elif group == 'category':
+ abilities = list(set(df[group]))
+ abilities.sort()
+ for ab in abilities:
+ ab_name = ab
sub_df = df[df[group] == ab]
res[ab_name] = [np.mean(sub_df[sub_df['split'] == sp]['hit']) for sp in res['split']]
+                res[ab_name].append(np.mean(sub_df['hit']))
+        else:
+            # l2-category accuracy is the macro average over the sub-tasks (categories) it contains
+            abilities = sorted(set(df[group]))
+            for ab in abilities:
+                ab_name = MMT_abbrs[ab] if ab in MMT_abbrs else ab
+                sub_task_names = df[df['l2-category'] == ab]['category'].unique()
+                # Per-split accuracy of each sub-task, macro-averaged per split
+                sub_task_acc = []
+                for sub_task_name in sub_task_names:
+                    sub_df = df[df['category'] == sub_task_name]
+                    sub_task_acc.append([np.mean(sub_df[sub_df['split'] == sp]['hit']) for sp in res['split']])
+                res[ab_name] = [
+                    sum(accs[i] for accs in sub_task_acc) / len(sub_task_acc)
+                    for i in range(len(res['split']))
+                ]
+                # Aggregate accuracy over all splits, macro-averaged the same way
+                overall_acc = [np.mean(df[df['category'] == name]['hit']) for name in sub_task_names]
+                res[ab_name].append(sum(overall_acc) / len(overall_acc))
+
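+    # Each metric above carries one extra aggregate entry, so append a matching 'ALL' split label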
+ res['split'].append('ALL')
return pd.DataFrame(res)
@@ -142,6 +237,7 @@ def extract_answer_from_item(model, item):
return dict(opt=rd.choice(options), log='Failed to predict, thus randomly generate one. ')
+# For Circular Evaluation
def prefetch_sub_data(sub_data, answer_map, verbose=False):
lt = len(sub_data)
GT, PRED = [], []
@@ -165,6 +261,7 @@ def prefetch_sub_data(sub_data, answer_map, verbose=False):
return ret if len(ret) > 1 else ret[0]
+# For Circular Evaluation
def eval_sub_data(model, sub_data, answer_map):
res, GT, PRED = prefetch_sub_data(sub_data, answer_map, verbose=True)
if res is not None:
@@ -194,6 +291,7 @@ def eval_sub_data(model, sub_data, answer_map):
return dict(hit=1, log=log)
+# For Circular Evaluation
def eval_data_groups(model, data_groups, answer_map, result, result_file, nproc=16):
prefetched = [prefetch_sub_data(g, answer_map, verbose=False) for g in data_groups]
remain = []
@@ -269,6 +367,7 @@ def multiple_choice_eval(eval_file, dataset='default', **judge_kwargs):
logger.error('OPENAI_API_KEY is not set properly, will use exact matching for evaluation')
model = None
+ # Load finished evaluation results
logger.info(f'Evaluating {eval_file}')
result_file = eval_file.replace(f'.{suffix}', f'_{name_str}_result.pkl')
result = {}
@@ -278,9 +377,11 @@ def multiple_choice_eval(eval_file, dataset='default', **judge_kwargs):
data = load(eval_file)
data = data.sort_values(by='index')
data['prediction'] = [str(x) for x in data['prediction']]
+    # Lower-case all column names except single-letter choice labels (A, B, C, ...)
for k in data.keys():
data[k.lower() if k not in list(string.ascii_uppercase) else k] = data.pop(k)
+    # Load meta data: when dataset is `default`, the eval_file itself serves as the meta data
if dataset != 'default':
meta = TSVDataset(dataset).data
else:
@@ -288,6 +389,7 @@ def multiple_choice_eval(eval_file, dataset='default', **judge_kwargs):
meta = load(eval_file)
assert 'index' in meta and 'answer' in meta, 'Essentail columns missing in the eval_file.'
+ # Build Answer / Category / L2-Category / Split Map
answer_map = {i: c for i, c in zip(meta['index'], meta['answer'])}
cate_map = {i: c for i, c in zip(meta['index'], meta['category'])} if 'category' in meta else None
l2_cate_map = {i: c for i, c in zip(meta['index'], meta['l2-category'])} if 'l2-category' in meta else None
@@ -300,10 +402,12 @@ def multiple_choice_eval(eval_file, dataset='default', **judge_kwargs):
if split_map is not None and np.all([pd.isna(x) for x in split_map.values()]):
split_map = None
+ # Change MMMU open-ended questions to multiple-choice ones for evaluation
if listinstr(['MMMU'], dataset):
data = MMMU_preproc(data)
answer_map = {k: (v if v in list(string.ascii_uppercase) else 'A') for k, v in answer_map.items()}
+    # Only keep lines whose index appears in the meta data
data = data[data['index'].isin(answer_map)]
data_main = data[data['index'] < int(1e6)]
meta_idx_set = set(meta['index'])
@@ -359,7 +463,12 @@ def multiple_choice_eval(eval_file, dataset='default', **judge_kwargs):
dump(data_main, eval_file.replace(f'.{suffix}', f'_{name_str}_result.{suffix}'))
data_main = load(eval_file.replace(f'.{suffix}', f'_{name_str}_result.{suffix}'))
- acc = report_acc(data_main)
+    # Different datasets may use different accuracy-reporting functions
+ if 'MMT' in dataset:
+ acc = report_acc_MMT(data_main)
+ else:
+ acc = report_acc(data_main)
+
score_file = eval_file.replace(f'.{suffix}', '_acc.csv')
dump(acc, score_file)
logger.info(f'multiple_choice_eval successfully finished evaluating {eval_file}, results saved in {score_file}')
diff --git a/vlmeval/utils/__init__.py b/vlmeval/utils/__init__.py
index 93036789..32fdd38b 100644
--- a/vlmeval/utils/__init__.py
+++ b/vlmeval/utils/__init__.py
@@ -2,11 +2,12 @@
from .mp_util import track_progress_rich
from .custom_prompt import CustomPrompt
from .dataset_config import dataset_URLs, img_root_map, DATASET_TYPE, abbr2full
-from .dataset import TSVDataset, split_MMMU, MMMU_result_transfer
+from .dataset import TSVDataset, split_MMMU
+from .result_transfer import MMMU_result_transfer, MMTBench_result_transfer
__all__ = [
'can_infer', 'can_infer_option', 'can_infer_text', 'track_progress_rich',
'TSVDataset', 'dataset_URLs', 'img_root_map', 'DATASET_TYPE', 'CustomPrompt',
- 'split_MMMU', 'abbr2full'
+ 'split_MMMU', 'abbr2full', 'MMMU_result_transfer', 'MMTBench_result_transfer'
]
diff --git a/vlmeval/utils/dataset.py b/vlmeval/utils/dataset.py
index bf4909fa..6b903602 100644
--- a/vlmeval/utils/dataset.py
+++ b/vlmeval/utils/dataset.py
@@ -3,7 +3,6 @@
from ..smp import *
from .dataset_config import dataset_URLs, dataset_md5_dict, DATASET_TYPE
from .custom_prompt import CustomPrompt
-from .matching_util import can_infer
def isliststr(s):
@@ -46,29 +45,6 @@ def split_MMMU(msgs):
return segs
-def MMMU_result_transfer(result_path):
- res = {}
- result_data = load(result_path)
- mcq = result_data['A'].notna()
- lt = len(result_data)
- for i in range(lt):
- line = result_data.iloc[i]
- if mcq[i]:
- options = {
- cand: line[cand]
- for cand in string.ascii_uppercase
- if cand in line and not pd.isna(line[cand])
- }
- prediction = line['prediction']
- infer_prediction = can_infer(prediction, options)
- res[line['id']] = infer_prediction
- else:
- res[line['id']] = line['prediction']
- result_json = result_path.replace('.xlsx', '.json')
- dump(res, result_json)
- return result_json
-
-
class TSVDataset(CustomPrompt):
def __init__(self, dataset='MMBench', skip_noimg=True):
diff --git a/vlmeval/utils/dataset_config.py b/vlmeval/utils/dataset_config.py
index 8b13eab5..813c0380 100644
--- a/vlmeval/utils/dataset_config.py
+++ b/vlmeval/utils/dataset_config.py
@@ -44,6 +44,11 @@
'MMStar': 'https://opencompass.openxlab.space/utils/VLMEval/MMStar.tsv',
'RealWorldQA': 'https://opencompass.openxlab.space/utils/VLMEval/RealWorldQA.tsv',
'POPE': 'https://opencompass.openxlab.space/utils/VLMEval/POPE.tsv',
+ # MMT-Bench
+ 'MMT-Bench_ALL_MI': 'https://opencompass.openxlab.space/utils/VLMEval/MMT-Bench_ALL_MI.tsv',
+ 'MMT-Bench_ALL': 'https://opencompass.openxlab.space/utils/VLMEval/MMT-Bench_ALL.tsv',
+ 'MMT-Bench_VAL_MI': 'https://opencompass.openxlab.space/utils/VLMEval/MMT-Bench_VAL_MI.tsv',
+ 'MMT-Bench_VAL': 'https://opencompass.openxlab.space/utils/VLMEval/MMT-Bench_VAL.tsv',
}
dataset_md5_dict = {
@@ -89,6 +94,11 @@
'MMStar': 'e1ecd2140806c1b1bbf54b43372efb9e',
'RealWorldQA': '92321028d2bc29040284b6674721e48f',
'POPE': 'c12f5acb142f2ef1f85a26ba2fbe41d5',
+ # MMT-Bench
+ 'MMT-Bench_ALL_MI': '5272157097e19cdd7cb41e412ab3b7c7',
+ 'MMT-Bench_ALL': 'b273a2f4c596fe4f2605de0494cd632f',
+ 'MMT-Bench_VAL_MI': 'c7d7b998eb5cd9aa36c7d4f721472462',
+ 'MMT-Bench_VAL': '8dd4b730f53dbf9c3aed90ca31c928e0'
}
img_root_map = {k: k for k in dataset_URLs}
@@ -116,6 +126,11 @@
'MathVista_MINI': 'MathVista',
'HallusionBench': 'Hallusion',
'DocVQA_VAL': 'DocVQA',
+ # MMT-Bench
+ 'MMT-Bench_ALL_MI': 'MMT-Bench',
+ 'MMT-Bench_ALL': 'MMT-Bench',
+ 'MMT-Bench_VAL_MI': 'MMT-Bench',
+ 'MMT-Bench_VAL': 'MMT-Bench'
})
assert set(dataset_URLs) == set(img_root_map)
@@ -124,7 +139,10 @@
def DATASET_TYPE(dataset):
# Dealing with Custom Dataset
dataset = dataset.lower()
- if listinstr(['mmbench', 'seedbench', 'ccbench', 'mmmu', 'scienceqa', 'ai2d', 'mmstar', 'realworldqa'], dataset):
+ if listinstr([
+ 'mmbench', 'seedbench', 'ccbench', 'mmmu', 'scienceqa', 'ai2d',
+ 'mmstar', 'realworldqa', 'mmt-bench'
+ ], dataset):
return 'multi-choice'
elif listinstr(['mme', 'hallusion', 'pope'], dataset):
return 'Y/N'
diff --git a/vlmeval/utils/result_transfer.py b/vlmeval/utils/result_transfer.py
new file mode 100644
index 00000000..7de633eb
--- /dev/null
+++ b/vlmeval/utils/result_transfer.py
@@ -0,0 +1,97 @@
+from ..evaluate.misc import build_judge
+from ..evaluate.multiple_choice import extract_answer_from_item
+
+from ..smp import *
+from .matching_util import can_infer
+from .mp_util import track_progress_rich
+
+
+def MMMU_result_transfer(result_path):
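+    # Convert MMMU predictions into the JSON format expected by the official MMMU evaluation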
+ res = {}
+ result_data = load(result_path)
+ mcq = result_data['A'].notna()
+ lt = len(result_data)
+ for i in range(lt):
+ line = result_data.iloc[i]
+ if mcq[i]:
+ options = {
+ cand: line[cand]
+ for cand in string.ascii_uppercase
+ if cand in line and not pd.isna(line[cand])
+ }
+ prediction = line['prediction']
+ infer_prediction = can_infer(prediction, options)
+ res[line['id']] = infer_prediction
+ else:
+ res[line['id']] = line['prediction']
+ result_json = result_path.replace('.xlsx', '.json')
+ dump(res, result_json)
+ return result_json
+
+
+def MMTBench_result_transfer(eval_file, dataset='default', **judge_kwargs):
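+    # Extract the chosen option from each free-form prediction (via the judge LLM, or exact
+    # matching as a fallback) and dump a TSV submission file for the MMT-Bench server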
+ logger = get_logger('Evaluation')
+ INTERNAL = os.environ.get('INTERNAL', 0)
+ nproc = judge_kwargs.pop('nproc', 4)
+
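+    # Fix the seed: extract_answer_from_item falls back to a random option when extraction fails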
+ rd.seed(2680)
+ suffix = eval_file.split('.')[-1]
+ model = judge_kwargs['model']
+ assert model in ['chatgpt-0613', 'exact_matching', 'gpt-4-0125']
+ name_str_map = {
+ 'chatgpt-0613': 'openai',
+ 'gpt-4-0125': 'gpt4'
+ }
+ name_str = name_str_map[model] if model in name_str_map else model
+
+ if model == 'exact_matching':
+ model = None
+ else:
+ if INTERNAL or gpt_key_set():
+ model = build_judge(**judge_kwargs)
+ else:
+ logger.error('OPENAI_API_KEY is not set properly, will use exact matching for evaluation')
+ model = None
+
+ logger.info(f'Evaluating {eval_file}')
+ result_file = eval_file.replace(f'.{suffix}', f'_{name_str}_option.pkl')
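+    # Reload cached per-index extraction results so that interrupted runs can resume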
+ result = {}
+ if osp.exists(result_file):
+ result = load(result_file)
+
+ data = load(eval_file)
+    assert 'index' in data, 'Essential columns missing in the eval_file.'
+
+ data = data.sort_values(by='index')
+ data['prediction'] = [str(x) for x in data['prediction']]
+ for k in data.keys():
+ data[k.lower() if k not in list(string.ascii_uppercase) else k] = data.pop(k)
+
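+    # Only run option extraction for indices that are not cached yet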
+ idx2lines = {data.iloc[i]['index']: data.iloc[i] for i in range(len(data))}
+ idx2lines = {k: v for k, v in idx2lines.items() if k not in result}
+
+ indices = list(idx2lines.keys())
+ lines = [idx2lines[i] for i in indices]
+ tups = [(model, line) for line in lines]
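+    # Extract options in parallel, checkpointing intermediate results to result_file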
+ res = track_progress_rich(
+ extract_answer_from_item,
+ tups,
+ nproc=nproc,
+ chunksize=nproc,
+ save=result_file,
+ keys=indices)
+
+ for i, r in zip(indices, res):
+ if i in result:
+ assert result[i]['opt'] == r['opt'] and result[i]['log'] == r['log']
+ else:
+ result[i] = r
+
+    data['opt'] = [result[i]['opt'] for i in data['index']]
+    data['log'] = [result[i]['log'] for i in data['index']]
+
+    # Dump the table with the extracted options as the submission file
+    output_path = eval_file.replace(f'.{suffix}', f'_{name_str}_submission.tsv')
+    dump(data, output_path)
+    return output_path