[Improvement] Add Pre-Commit Check #119

Merged
merged 8 commits into from
Mar 19, 2024
Changes from 1 commit
update
kennymckormick committed Mar 19, 2024
commit db460d1e9ac65e8e986c4a057c7a4c86ee9cf541
16 changes: 16 additions & 0 deletions Custom_Benchmark_and_Model.md → Development.md
@@ -2,6 +2,8 @@

## Implement a new benchmark

Example PR: **Add OCRBench** ([#91](https://github.com/open-compass/VLMEvalKit/pull/91/files))

Currently, we organize each benchmark as a single TSV file. During inference, the data file will be automatically downloaded to `$LMUData` (if the variable is not set explicitly, the default path is `$HOME/LMUData`). All existing benchmark TSV files are handled by `TSVDataset`, implemented in `vlmeval/utils/dataset_config.py`.
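
If you want to see what an existing benchmark file looks like, a minimal sketch for inspecting one of the downloaded TSVs is given below; the file name is assumed to match the dataset name, and the expected fields are listed in the table that follows:

```python
# Minimal sketch: inspect an existing benchmark TSV downloaded to $LMUData.
# The file name `MMBench_DEV_EN.tsv` is an assumption based on the dataset name.
import os
import pandas as pd

lmu_data_root = os.environ.get('LMUData', os.path.expanduser('~/LMUData'))
df = pd.read_csv(os.path.join(lmu_data_root, 'MMBench_DEV_EN.tsv'), sep='\t')
print(df.columns.tolist())   # fields such as index, image, question, answer, ...
print(f'{len(df)} samples')
```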

| Dataset Name \ Fields | index | image | image_path | question | hint | multi-choice<br>options | answer | category | l2-category | split |
@@ -31,6 +33,20 @@ Besides, your dataset class **should implement the method `build_prompt(self, li

## Implement a new model

Example PR: **Support Monkey** ([#45](https://github.com/open-compass/VLMEvalKit/pull/45/files))

All existing models are implemented in `vlmeval/vlm`. For a minimal model, your model class **should implement the method** `generate(image_path, prompt, dataset=None)`. In this function, you feed the image and prompt to your VLM and return the VLM prediction (which is a string). The optional argument `dataset` can be used as a flag for the model to switch among various inference strategies.

Besides, your model can support custom prompt building by implementing an optional method `build_prompt(line, dataset=None)`. In this function, `line` is a dictionary that contains the necessary information of a data sample, while `dataset` can be used as a flag for the model to switch among various prompt building strategies.
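
A minimal sketch of such a model class is shown below. Only the `generate(image_path, prompt, dataset=None)` and `build_prompt(line, dataset=None)` signatures come from the description above; the class name, constructor arguments, and the exact return value of `build_prompt` are illustrative assumptions:

```python
# Minimal sketch of a custom VLM wrapper (illustrative only; adapt it to your model's API).
from PIL import Image


class MyVLM:
    def __init__(self, model_path='my-org/my-vlm'):
        # Load your model / tokenizer / processor here (e.g. via transformers).
        self.model_path = model_path

    def build_prompt(self, line, dataset=None):
        # `line` is a dict holding one data sample; `dataset` lets you switch prompt styles.
        question = line['question']
        hint = line.get('hint')
        if isinstance(hint, str) and hint:
            return f'{hint}\n{question}'
        return question

    def generate(self, image_path, prompt, dataset=None):
        # Feed the image and the prompt to the underlying VLM and return a string prediction.
        image = Image.open(image_path).convert('RGB')
        # prediction = self.model.chat(image, prompt)  # replace with your model's actual call
        prediction = 'a placeholder answer'
        return prediction
```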

## Contribute to VLMEvalKit

If you want to contribute code to **VLMEvalKit**, please run the pre-commit check before you submit a PR. That helps keep the codebase tidy.

```bash
# Under the directory of VLMEvalKit, install the pre-commit hook:
pip install pre-commit
pre-commit install
pre-commit run --all-files
# Then you can commit your code.
```
58 changes: 58 additions & 0 deletions Quickstart.md
@@ -0,0 +1,58 @@
# Quickstart

Before running the evaluation script, you need to **configure** the VLMs and set the model paths properly.

After that, you can use a single script, `run.py`, to run inference and evaluation for multiple VLMs and benchmarks at the same time.

## Step0. Installation

```bash
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .
```

## Step1. Configuration

**VLM Configuration**: All VLMs are configured in `vlmeval/config.py`. For some VLMs, you need to configure the code root (MiniGPT-4, PandaGPT, etc.) or the model weight root (LLaVA-v1-7B, etc.) before conducting the evaluation. During evaluation, you should use the model name specified in `supported_VLM` in `vlmeval/config.py` to select the VLM. For MiniGPT-4 and InstructBLIP, you also need to modify the config files in `vlmeval/vlm/misc` to configure the LLM path and ckpt path.
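
As an illustration only, this usually means pointing the entry of your model in `supported_VLM` to a local path. The snippet below is a hypothetical sketch, not a verbatim excerpt of `vlmeval/config.py`; check the actual file for the real class and argument names before editing it:

```python
# Hypothetical sketch of a supported_VLM entry; the real class names and argument
# names in vlmeval/config.py may differ, so check the file before editing it.
from functools import partial


class MyVLM:  # stand-in for a model class from vlmeval/vlm
    def __init__(self, model_path):
        self.model_path = model_path


supported_VLM = {
    # Point the path to your local checkpoint before running the evaluation.
    'my_vlm': partial(MyVLM, model_path='/path/to/local/checkpoint'),
}
```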

The following VLMs require the configuration step:

**Code Preparation & Installation**: InstructBLIP ([LAVIS](https://github.com/salesforce/LAVIS)), LLaVA ([LLaVA](https://github.com/haotian-liu/LLaVA)), MiniGPT-4 ([MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4)), mPLUG-Owl2 ([mPLUG-Owl2](https://github.com/X-PLUG/mPLUG-Owl/tree/main/mPLUG-Owl2)), OpenFlamingo-v2 ([OpenFlamingo](https://github.com/mlfoundations/open_flamingo)), PandaGPT-13B ([PandaGPT](https://github.com/yxuansu/PandaGPT)), TransCore-M ([TransCore-M](https://github.com/PCIResearch/TransCore-M)).

**Manual Weight Preparation & Configuration**: InstructBLIP, LLaVA-v1-7B, MiniGPT-4, PandaGPT-13B

## Step2. Evaluation

We use `run.py` for evaluation. You can either call `$VLMEvalKit/run.py` directly or create a soft link to the script so that you can run it from anywhere:

**Arguments**

- `--data (list[str])`: Set the dataset names that are supported in VLMEvalKit (defined in `vlmeval/utils/dataset_config.py`).
- `--model (list[str])`: Set the VLM names that are supported in VLMEvalKit (defined in `supported_VLM` in `vlmeval/config.py`).
- `--mode (str, default to 'all', choices are ['all', 'infer'])`: When `mode` is set to "all", both inference and evaluation will be performed; when set to "infer", only the inference will be performed.
- `--nproc (int, default to 4)`: The number of threads for OpenAI API calling.

**Command**

You can run the script with `python` or `torchrun`:

```bash
# When running with `python`, only one VLM instance is instantiated, and it might use multiple GPUs (depending on its default behavior).
# That is recommended for evaluating very large VLMs (like IDEFICS-80B-Instruct).

# IDEFICS-80B-Instruct on MMBench_DEV_EN, MME, and SEEDBench_IMG, Inference and Evaluation
python run.py --data MMBench_DEV_EN MME SEEDBench_IMG --model idefics_80b_instruct --verbose
# IDEFICS-80B-Instruct on MMBench_DEV_EN, MME, and SEEDBench_IMG, Inference only
python run.py --data MMBench_DEV_EN MME SEEDBench_IMG --model idefics_80b_instruct --verbose --mode infer

# When running with `torchrun`, one VLM instance is instantiated on each GPU. It can speed up the inference.
# However, that is only suitable for VLMs that consume small amounts of GPU memory.

# IDEFICS-9B-Instruct, Qwen-VL-Chat, mPLUG-Owl2 on MMBench_DEV_EN, MME, and SEEDBench_IMG. On a node with 8 GPUs. Inference and Evaluation.
torchrun --nproc-per-node=8 run.py --data MMBench_DEV_EN MME SEEDBench_IMG --model idefics_9b_instruct qwen_chat mPLUG-Owl2 --verbose
# Qwen-VL-Chat on MME. On a node with 2 GPUs. Inference and Evaluation.
torchrun --nproc-per-node=2 run.py --data MME --model qwen_chat --verbose
```

The evaluation results will be printed as logs. Besides, **Result Files** will also be generated in the directory `$YOUR_WORKING_DIRECTORY/{model_name}`. Files ending with `.csv` contain the evaluated metrics.
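
To collect the generated metric files afterwards, a small sketch like the following can be used; the exact CSV file names depend on the model and datasets you evaluated and are not specified here:

```python
# Minimal sketch: list and preview every metrics CSV produced for one model.
import glob

import pandas as pd

model_name = 'qwen_chat'  # assumption: the same name passed to --model
for csv_path in sorted(glob.glob(f'{model_name}/*.csv')):
    print(csv_path)
    print(pd.read_csv(csv_path).head())
```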
66 changes: 4 additions & 62 deletions README.md
@@ -5,7 +5,7 @@
<a href="https://rank.opencompass.org.cn/leaderboard-multimodal">🏆 Leaderboard </a> •
<a href="#-datasets-models-and-evaluation-results">📊Datasets & Models </a> •
<a href="#%EF%B8%8F-quickstart">🏗️Quickstart </a> •
<a href="#%EF%B8%8F-custom-benchmark-or-vlm">🛠️Support New </a> •
<a href="#%EF%B8%8F-development-guide">🛠️Development </a> •
<a href="#-the-goal-of-vlmevalkit">🎯Goal </a> •
<a href="#%EF%B8%8F-citation">🖊️Citation </a>
</div>
@@ -100,69 +100,11 @@ print(ret) # There are two apples in the provided images.

## 🏗️ QuickStart

Before running the evaluation script, you need to **configure** the VLMs and set the model paths properly.
See [QuickStart](/Quickstart.md) for a quick start guide.

After that, you can use a single script, `run.py`, to run inference and evaluation for multiple VLMs and benchmarks at the same time.
## 🛠️ Development Guide

### Step0. Installation

```bash
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .
```

### Step1. Configuration

**VLM Configuration**: All VLMs are configured in `vlmeval/config.py`. For some VLMs, you need to configure the code root (MiniGPT-4, PandaGPT, etc.) or the model weight root (LLaVA-v1-7B, etc.) before conducting the evaluation. During evaluation, you should use the model name specified in `supported_VLM` in `vlmeval/config.py` to select the VLM. For MiniGPT-4 and InstructBLIP, you also need to modify the config files in `vlmeval/vlm/misc` to configure the LLM path and ckpt path.

The following VLMs require the configuration step:

**Code Preparation & Installation**: InstructBLIP ([LAVIS](https://github.com/salesforce/LAVIS)), LLaVA ([LLaVA](https://github.com/haotian-liu/LLaVA)), MiniGPT-4 ([MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4)), mPLUG-Owl2 ([mPLUG-Owl2](https://github.com/X-PLUG/mPLUG-Owl/tree/main/mPLUG-Owl2)), OpenFlamingo-v2 ([OpenFlamingo](https://github.com/mlfoundations/open_flamingo)), PandaGPT-13B ([PandaGPT](https://github.com/yxuansu/PandaGPT)), TransCore-M ([TransCore-M](https://github.com/PCIResearch/TransCore-M)).

**Manual Weight Preparation & Configuration**: InstructBLIP, LLaVA-v1-7B, MiniGPT-4, PandaGPT-13B

### Step2. Evaluation

We use `run.py` for evaluation. You can either call `$VLMEvalKit/run.py` directly or create a soft link to the script so that you can run it from anywhere:

**Arguments**

- `--data (list[str])`: Set the dataset names that are supported in VLMEvalKit (defined in `vlmeval/utils/dataset_config.py`).
- `--model (list[str])`: Set the VLM names that are supported in VLMEvalKit (defined in `supported_VLM` in `vlmeval/config.py`).
- `--mode (str, default to 'all', choices are ['all', 'infer'])`: When `mode` is set to "all", both inference and evaluation will be performed; when set to "infer", only the inference will be performed.
- `--nproc (int, default to 4)`: The number of threads for OpenAI API calling.

**Command**

You can run the script with `python` or `torchrun`:

```bash
# When running with `python`, only one VLM instance is instantiated, and it might use multiple GPUs (depending on its default behavior).
# That is recommended for evaluating very large VLMs (like IDEFICS-80B-Instruct).

# IDEFICS-80B-Instruct on MMBench_DEV_EN, MME, and SEEDBench_IMG, Inference and Evaluation
python run.py --data MMBench_DEV_EN MME SEEDBench_IMG --model idefics_80b_instruct --verbose
# IDEFICS-80B-Instruct on MMBench_DEV_EN, MME, and SEEDBench_IMG, Inference only
python run.py --data MMBench_DEV_EN MME SEEDBench_IMG --model idefics_80b_instruct --verbose --mode infer

# When running with `torchrun`, one VLM instance is instantiated on each GPU. It can speed up the inference.
# However, that is only suitable for VLMs that consume small amounts of GPU memory.

# IDEFICS-9B-Instruct, Qwen-VL-Chat, mPLUG-Owl2 on MMBench_DEV_EN, MME, and SEEDBench_IMG. On a node with 8 GPUs. Inference and Evaluation.
torchrun --nproc-per-node=8 run.py --data MMBench_DEV_EN MME SEEDBench_IMG --model idefics_9b_instruct qwen_chat mPLUG-Owl2 --verbose
# Qwen-VL-Chat on MME. On a node with 2 GPUs. Inference and Evaluation.
torchrun --nproc-per-node=2 run.py --data MME --model qwen_chat --verbose
```

The evaluation results will be printed as logs. Besides, **Result Files** will also be generated in the directory `$YOUR_WORKING_DIRECTORY/{model_name}`. Files ending with `.csv` contain the evaluated metrics.

## 🛠️ Custom Benchmark or VLM

To implement a custom benchmark or VLM in **VLMEvalKit**, please refer to [Custom_Benchmark_and_Model](/Custom_Benchmark_and_Model.md). Example PRs to follow:

- [**New Model**] Support Monkey ([#45](https://github.com/open-compass/VLMEvalKit/pull/45/files))
- [**New Benchmark**] Support AI2D ([#51](https://github.com/open-compass/VLMEvalKit/pull/51/files))
To develop custom benchmarks or VLMs, or simply to contribute other code to **VLMEvalKit**, please refer to the [Development Guide](/Development.md).

## 🎯 The Goal of VLMEvalKit
