- [2024-10] 📰 We have released both the LIME dataset and the data curation pipeline!
- [2024-09] 🍋 We have open-sourced the evaluation data and corresponding evaluation code for LIME. The data curation pipeline for LIME will be open-sourced within two weeks.
We use a general data curation pipeline to build LIME, which contains 9,403 samples refined across 10 tasks within 6 domains. We select six major task domains in the multimodal field and use 9 MLLMs to refine the 10 benchmarks within those domains.
To get started with LIME quickly, we recommend following the lmms-eval tutorial to deploy the evaluation environment.
Alternatively, you can install it with the following steps:
git clone https://anonymous.4open.science/r/LIME-49CD
cd lmms-eval
pip install -e .
Then download all the datasets from here.
You can run all the subtasks included in LIME-M using the following command:
accelerate launch --num_processes=8 -m lmms_eval --model internvl2 --model_args pretrained="OpenGVLab/InternVL2-8B" --tasks textcaps_suit,ok_vqa_suit,coco_cap_suit,textvqa_suit,chartqa_suit,pope_suit,infovqa_suit,ai2d_suit,ocrbench_suit,scienceqa_img_suit --batch_size 1 --log_samples --log_samples_suffix internvl2_suits --summary True --output_path output_path
We utilize vLLM for text-only evaluation:
python lmms_eval/__main__.py --model llama --model_args pretrained="meta-llama/Meta-Llama-3-8B-Instruct" --tasks ai2d_suit,scienceqa_img_suit --batch_size 1 --log_samples --log_samples_suffix llama3_8b_text_only --summary True --output_path output_path
Here, pretrained refers to the model name or its local storage path, and output_path refers to the location where the final logs are stored.
The data curation pipeline consists of three parts: (1) using open-source models as judges, (2) a semi-automated screening process, and (3) eliminating answer leakage.
You can reproduce the process through the following steps.
Running this step collects the results of all judge models:
python data_curation_pipeline/Models_Judges.py
Next, we classify the difficulty level of each sample. We define N as the number of models that answer the sample correctly. If N ≥ 6, the question is classified into the easy set; if 3 ≤ N ≤ 5, into the middle set; and if N ≤ 2, into the hard set.
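The sketch below illustrates this difficulty split. It is not the official script; we assume each sample carries a per-model correctness list produced by the previous step, and the field name `model_correct` is hypothetical.

```python
def classify_difficulty(sample: dict) -> str:
    """Assign a difficulty set based on N, the number of judge models answering correctly."""
    n_correct = sum(sample["model_correct"])  # e.g. [True, False, ...] over the 9 judge models
    if n_correct >= 6:
        return "easy"
    elif n_correct >= 3:   # 3 <= N <= 5
        return "middle"
    else:                  # N <= 2
        return "hard"

# Example usage: a sample answered correctly by 7 of 9 models lands in the easy set.
sample = {"id": 0, "model_correct": [True] * 7 + [False] * 2}
sample["difficulty"] = classify_difficulty(sample)
```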
To mitigate these potential errors and filter out entirely incorrect questions, we apply a GPT-based and a human double check by running data_curation_pipeline/gpt_double_check.py and data_curation_pipeline/Human_double_check.ipynb.
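As a rough illustration of the GPT double-check idea (the actual prompt and logic live in data_curation_pipeline/gpt_double_check.py), a check could ask GPT whether a flagged question is well-posed. The model name, prompt, and text-only setup below are assumptions, and any image input is omitted for brevity.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def gpt_double_check(question: str, answer: str) -> bool:
    """Ask GPT whether a flagged question is well-posed and its reference answer correct."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed model choice, not necessarily the one used in the paper
        messages=[{
            "role": "user",
            "content": (
                "Given the question and its reference answer, reply 'valid' if the question "
                "is well-posed and the answer is correct, otherwise reply 'invalid'.\n"
                f"Question: {question}\nAnswer: {answer}"
            ),
        }],
    )
    return "valid" in resp.choices[0].message.content.lower()
```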
To eliminate answer leakage, we use pure-text models for evaluation; the remaining steps are similar to those described above.
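A minimal sketch of the leakage filter is shown below, assuming text-only results (e.g. from the llama command above) are stored per sample; the field names `text_only_correct` and `id` are assumptions, not the official schema.

```python
def remove_answer_leakage(samples: list[dict]) -> list[dict]:
    """Drop samples that a pure-text model answers correctly without seeing the image."""
    return [s for s in samples if not s.get("text_only_correct", False)]

clean_samples = remove_answer_leakage([
    {"id": 1, "text_only_correct": True},   # answerable from text alone -> leaked, removed
    {"id": 2, "text_only_correct": False},  # requires the image -> kept
])
```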