MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark


Updates & News

Due to upcoming NeurIPS and EMNLP deadlines, our newly added dataset and code will be released in a few weeks. Thank you for your patience.

  • [01/05/2024] 🌔 Our paper was accepted by ICML 2024!
  • [20/03/2024] ⭐ We released our complete dataset with guidelines and scripts for benchmarking current VLMs!
  • [14/02/2024] 📄 We released our paper on arXiv!

Contents

Benchmark: MLLM-as-a-Judge

This benchmark is structured into three main components: images, the main dataset, and sub-datasets. The arrangement is as follows:

/MLLM-Judge
├── Figures (images for github repository)
├── Datasets
│   └── MLLM-as-a-Judge
│       ├── step1
│       ├── step2
│       ├── step3
│       └── Human Annotation
└── Hard & HQ
    ├── Hard
    └── HQ
  1. Figures: Contains images for the GitHub repository. These images are used to illustrate and explain the contents of the repository, aiding users in better understanding the project.

  2. Datasets/MLLM-as-a-Judge: This part of the dataset is developed in three steps, mirroring the structure outlined in our paper. It includes MLLM outputs under three different settings: Scoring Evaluation, Pair Comparison, and Batch Ranking. It also includes human annotation results and agreement data. For Scoring Evaluation, we additionally include response data generated in a verbose setting for our ablation study.

    • Benchmark: The final dataset with human annotations, used as a benchmark to assess model performance. These annotations provide a reliable reference for verifying whether a model's judgments align with human evaluations.
    • step1: Contains original image-instruction pairs selected from 10 datasets. This is the starting point for data processing and model training, containing the initial input data.
    • step2: Contains response data generated by four different MLLMs. This step aims to enrich the dataset and increase its diversity by generating data through multiple models.
    • step3: Divides the data from step2 into three parts, each under different settings, containing responses from various MLLM Judges. This helps analyze and compare the performance differences across models under the same tasks.
  3. Hard & HQ: Contains two specially curated datasets for specific data analysis and model training purposes:

    • Hard: Includes samples considered difficult under three different settings. This data is used to test and improve MLLM capabilities in dealing with complex scenarios.
    • HQ (High Quality): Contains samples where the MLLM-as-a-Judge performed well. These high-quality samples help understand under what conditions the model performs best.
  4. Images: All images used in our study. You can download them from Google Drive and extract them to the specified path (<your path>/MLLM-Judge/). A minimal loading sketch follows this list.
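
To sanity-check a local setup, here is a minimal Python sketch (not part of the released scripts) that iterates over one of the benchmark JSONL files and resolves each record's image path. The field name "image" is an assumption; check the released files for the exact keys.

# Minimal sketch (not from the repository): load a benchmark JSONL file and
# verify that each record's image exists under the extracted image root.
# The "image" field name is an assumption.
import json
from pathlib import Path

image_root = Path("<your_path>/MLLM-Judge/images")
benchmark = Path("<your_path>/MLLM-Judge/Dataset/Benchmark/Score.jsonl")

with benchmark.open() as f:
    for line in f:
        record = json.loads(line)
        image_path = image_root / record["image"]  # assumed field name
        if not image_path.exists():
            print(f"missing image: {image_path}")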

Our comprehensive dataset and benchmarks are designed to support the development of stronger and more reliable MLLM-as-a-Judge systems.

Benchmark mainstream MLLMs

Collect Judgments from MLLMs

GPT-4V(ision)

You can run the following commands in a shell to collect GPT-4V's judgments:

# Batch evaluation in No COT settings
python gpt4_judge.py --input_json '<your_path>/MLLM-Judge/Dataset/Benchmark/Batch.jsonl' --output_json './Batch.jsonl'  --image_root '<your_path>/MLLM-Judge/images' --evaluate 'Batch' --setting 'No COT'
# Score evaluation in No COT settings
python gpt4_judge.py --input_json '<your_path>/MLLM-Judge/Dataset/Benchmark/Score.jsonl' --output_json './Score.jsonl'  --image_root '<your_path>/MLLM-Judge/images' --evaluate 'Score' --setting 'No COT'
# Pair Comparison in No COT settings
python gpt4_judge.py --input_json '<your_path>/MLLM-Judge/Dataset/Benchmark/Pair.jsonl' --output_json './Pair.jsonl'  --image_root '<your_path>/MLLM-Judge/images' --evaluate 'Pair' --setting 'No COT'
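
The script above handles prompt construction and API calls for you. As a rough sketch of the kind of request such a judge issues with the OpenAI Python SDK, see below; the prompt text, image file name, and model name are illustrative placeholders and may differ from what gpt4_judge.py actually does.

# Rough illustration only: a single GPT-4V judging request via the OpenAI
# Python SDK. The prompt, image file, and model name are placeholders;
# gpt4_judge.py defines the real evaluation templates.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("<your_path>/MLLM-Judge/images/example.jpg", "rb") as f:  # placeholder image
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Score the following response from 1 to 10 ..."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)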

LLaVA

You should first follow the instructions in LLaVA's repository to download llava-v1.5-13b and create a new Python environment called LLaVA. Then, move scripts/llava_inference.py to <your path>/LLaVA and change into <your path>/LLaVA. You can then run the following commands to produce the MLLM's judging results:

# Batch evaluation in No COT settings
python llava_judge.py --input_json '<your_path>/MLLM-Judge/Dataset/Benchmark/Batch.jsonl' --output_json './Batch.jsonl'  --image_root '<your_path>/MLLM-Judge/images' --evaluate 'Batch' --setting 'No COT'
# Score evaluation in No COT settings
python llava_judge.py --input_json '<your_path>/MLLM-Judge/Dataset/Benchmark/Score.jsonl' --output_json './Score.jsonl'  --image_root '<your_path>/MLLM-Judge/images' --evaluate 'Score' --setting 'No COT'
# Pair Comparison in No COT settings
python llava_judge.py --input_json '<your_path>/MLLM-Judge/Dataset/Benchmark/Pair.jsonl' --output_json './Pair.jsonl'  --image_root '<your_path>/MLLM-Judge/images' --evaluate 'Pair' --setting 'No COT'

Gemini

You should first register a Google account to obtain a Gemini-Pro API key, or run the script in Colab Pro. To collect judging results from Gemini-Pro-Vision, run the following commands in a shell:

# Batch evaluation in No COT settings
python gemini_judge.py --api 'your_api' --input_json '<your_path>/MLLM-Judge/Dataset/Benchmark/Batch.jsonl' --output_json './Batch.jsonl'  --image_root '<your_path>/MLLM-Judge/images' --evaluate 'Batch' --setting 'No COT'
# Score evaluation in No COT settings
python gemini_judge.py --api 'your_api' --input_json '<your_path>/MLLM-Judge/Dataset/Benchmark/Score.jsonl' --output_json './Score.jsonl'  --image_root '<your_path>/MLLM-Judge/images' --evaluate 'Score' --setting 'No COT'
# Pair Comparison in No COT settings
python gemini_judge.py --api 'your_api' --input_json '<your_path>/MLLM-Judge/Dataset/Benchmark/Pair.jsonl' --output_json './Pair.jsonl'  --image_root '<your_path>/MLLM-Judge/images' --evaluate 'Pair' --setting 'No COT'

Notice: If you run Gemini in your local environment, the rate limit is quite restrictive: only 60 queries per minute (QPM).
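
One simple way to stay under that limit is a client-side throttle. The sketch below is illustrative and not part of gemini_judge.py; it spaces consecutive API calls at least one second apart.

# Illustrative client-side throttle for the 60 QPM limit; the wrapped call
# (e.g. a hypothetical call_gemini function) is not part of gemini_judge.py.
import time

MIN_INTERVAL = 60.0 / 60  # 60 queries per minute -> at least 1 second between calls
_last_call = 0.0

def throttled(fn):
    """Wrap an API call so consecutive calls are spaced by MIN_INTERVAL seconds."""
    def wrapper(*args, **kwargs):
        global _last_call
        wait = MIN_INTERVAL - (time.monotonic() - _last_call)
        if wait > 0:
            time.sleep(wait)
        _last_call = time.monotonic()
        return fn(*args, **kwargs)
    return wrapper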

CogVLM

You should follow the instructions in CogVLM's repository to download the CogVLM checkpoint. Then, move scripts/cogvlm_judge.py to <your_path>/CogVLM and collect judgments from CogVLM with the following commands in a shell:

# Score evaluation in No COT settings
python cogvlm_judge.py --api 'your_api' --input_json '<your_path>/MLLM-Judge/Dataset/Benchmark/Score.jsonl' --output_json './Score.jsonl'  --image_root '<your_path>/MLLM-Judge/images' --evaluate 'Score' --setting 'No COT'
# Pair Comparison in No COT settings
python cogvlm_judge.py --api 'your_api' --input_json '<your_path>/MLLM-Judge/Dataset/Benchmark/Pair.jsonl' --output_json './Pair.jsonl'  --image_root '<your_path>/MLLM-Judge/images' --evaluate 'Pair' --setting 'No COT'

Notice: CogVLM cannot follow our output template in the Batch Ranking setting.
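
If you want to quantify how often a judge's replies follow a structured output, a rough check like the one below can help. It is not from the repository: it assumes the reply contains a bracketed ranking such as [3, 1, 4, 2] and that each output record stores the reply under an "output" key, both of which may differ from the actual template.

# Rough template-compliance check (not from the repo): fraction of judge
# replies containing a parsable bracketed ranking like "[3, 1, 4, 2]".
# The "output" key and the ranking format are assumptions.
import json
import re

RANKING_RE = re.compile(r"\[\s*\d+(?:\s*,\s*\d+)*\s*\]")

def ranking_parse_rate(output_jsonl, answer_key="output"):
    total = parsed = 0
    with open(output_jsonl) as f:
        for line in f:
            total += 1
            reply = str(json.loads(line).get(answer_key, ""))
            if RANKING_RE.search(reply):
                parsed += 1
    return parsed / max(total, 1)

print(ranking_parse_rate("./Batch.jsonl"))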

Other MLLMs

We also benchmarked MLLM-as-a-Judge on GLM-4V and MiniCPM-V. However, they either cannot follow the output template or exhibit a strong bias in judging, so they are excluded from our experiments.

Contributing

Contributions to this project are welcome. Please consider the following ways to contribute:

  • Reporting issues
  • Proposing new features or improvements
  • Benchmarking other mainstream MLLMs

Acknowledgments

This project builds on the findings and methodologies presented in the papers LLM-as-a-Judge and HallusionBench.

Citation

@misc{chen2024mllmasajudge,
      title={MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark}, 
      author={Dongping Chen and Ruoxi Chen and Shilin Zhang and Yinuo Liu and Yaochen Wang and Huichi Zhou and Qihui Zhang and Pan Zhou and Yao Wan and Lichao Sun},
      year={2024},
      eprint={2402.04788},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
