- [2024-10] 📰 We have released both the LIME dataset and the data curation pipeline!
- [2024-09] 🍋 We have open-sourced the evaluation data and corresponding evaluation code for LIME. The data curation pipeline for LIME will be open-sourced within two weeks.
We use a general data curation pipeline to build LIME, which contains 9,403 samples refined across 10 tasks within 6 domains. We select six major task domains in the multimodal field and use 9 MLLMs to refine the 10 benchmarks within those domains.
To get started with LIME quickly, we recommend following the lmms-eval tutorial to deploy the evaluation environment.
Alternatively, you can install it with the following steps:
git clone https://anonymous.4open.science/r/LIME-49CD
cd lmms-eval
pip install -e .
Then download all the datasets from here.
You can run all the subtasks included in LIME-M using the following command:
accelerate launch --num_processes=8 -m lmms_eval --model internvl2 --model_args pretrained="OpenGVLab/InternVL2-8B" --tasks textcaps_suit,ok_vqa_suit,coco_cap_suit,textvqa_suit,chartqa_suit,pope_suit,infovqa_suit,ai2d_suit,ocrbench_suit,scienceqa_img_suit --batch_size 1 --log_samples --log_samples_suffix internvl2_suits --summary True --output_path output_path
We utilize vLLM for text-only evaluation:
python lmms_eval/__main__.py --model llama --model_args pretrained="meta-llama/Meta-Llama-3-8B-Instruct" --tasks ai2d_suit,scienceqa_img_suit --batch_size 1 --log_samples --log_samples_suffix llama3_8b_text_only --summary True --output_path output_path
Here, pretrained refers to the model name or its local storage path, and output_path refers to the location where the final logs are stored.
The data curation pipeline consists of three parts: (1) using open-source models as judges, (2) a semi-automated screening process, and (3) eliminating answer leakage.
You can reproduce the process through the following steps.
Running this step collects the results of all judge models:
python data_curation_pipeline/Models_Judges.py
Next, we classify the difficulty level of each sample. We define N as the number of models that answer the sample correctly. If N ≥ 6, the question is classified into the easy set; if 3 ≤ N ≤ 5, into the middle set; and if N ≤ 2, into the hard set.
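The sketch below illustrates this difficulty split. It is not the official script; we assume each sample carries a per-model correctness list produced by the previous step, and the field name `model_correct` is hypothetical.

```python
def classify_difficulty(sample: dict) -> str:
    """Assign a difficulty set based on N, the number of judge models answering correctly."""
    n_correct = sum(sample["model_correct"])  # e.g. [True, False, ...] over the 9 judge models
    if n_correct >= 6:
        return "easy"
    elif n_correct >= 3:   # 3 <= N <= 5
        return "middle"
    else:                  # N <= 2
        return "hard"

# Example usage: a sample answered correctly by 7 of 9 models lands in the easy set.
sample = {"id": 0, "model_correct": [True] * 7 + [False] * 2}
sample["difficulty"] = classify_difficulty(sample)
```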
To mitigate these potential errors and filter out entirely incorrect questions, we apply a GPT-based and a human double check by running data_curation_pipeline/gpt_double_check.py and data_curation_pipeline/Human_double_check.ipynb.
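As a rough illustration of the GPT double-check idea (the actual prompt and logic live in data_curation_pipeline/gpt_double_check.py), a check could ask GPT whether a flagged question is well-posed. The model name, prompt, and text-only setup below are assumptions, and any image input is omitted for brevity.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def gpt_double_check(question: str, answer: str) -> bool:
    """Ask GPT whether a flagged question is well-posed and its reference answer correct."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed model choice, not necessarily the one used in the paper
        messages=[{
            "role": "user",
            "content": (
                "Given the question and its reference answer, reply 'valid' if the question "
                "is well-posed and the answer is correct, otherwise reply 'invalid'.\n"
                f"Question: {question}\nAnswer: {answer}"
            ),
        }],
    )
    return "valid" in resp.choices[0].message.content.lower()
```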
To eliminate answer leakage, we use pure-text models for evaluation; the remaining steps are similar to those described above.
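A minimal sketch of the leakage filter is shown below, assuming text-only results (e.g. from the llama command above) are stored per sample; the field names `text_only_correct` and `id` are assumptions, not the official schema.

```python
def remove_answer_leakage(samples: list[dict]) -> list[dict]:
    """Drop samples that a pure-text model answers correctly without seeing the image."""
    return [s for s in samples if not s.get("text_only_correct", False)]

clean_samples = remove_answer_leakage([
    {"id": 1, "text_only_correct": True},   # answerable from text alone -> leaked, removed
    {"id": 2, "text_only_correct": False},  # requires the image -> kept
])
```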