Evaluation for MLVU

We provide detailed evaluation methods for MLVU, covering both multiple-choice tasks and generation tasks.

Benchmark MLVU on your Model

First, if you want to benchmark MLVU on your own model, you can refer to our template test code as follows:

Multiple-Choice testing

python multiple_choice_evaluation/choice_bench.py 

You need to load your own model into this template; it then runs inference and evaluates multiple-choice accuracy online.
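The exact hook depends on your model's API. Below is a minimal sketch of the kind of loop the template runs; the `model.generate` call and the annotation field names (`question`, `candidates`, `answer`, `video`) are hypothetical placeholders, so check choice_bench.py for the actual names before adapting it.

```python
# Hypothetical sketch of the multiple-choice evaluation loop.
# Field names, paths, and the model API are placeholders -- adapt them
# to your own model and to the actual choice_bench.py.
import json

def build_prompt(question, candidates):
    # Present the question with lettered options, e.g. (A) ... (B) ...
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(candidates))
    return f"{question}\n{options}\nAnswer with the option's letter only."

def evaluate(model, annotation_file):
    with open(annotation_file) as f:
        samples = json.load(f)
    correct = 0
    for sample in samples:
        prompt = build_prompt(sample["question"], sample["candidates"])
        # Replace with your model's video-conditioned generation call.
        prediction = model.generate(video=sample["video"], prompt=prompt)
        # Assumes the ground-truth answer is stored as an option letter.
        correct += int(prediction.strip().startswith(sample["answer"]))
    print(f"Accuracy: {correct / len(samples):.4f}")
```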

Generation testing

  • Step 1: Get the inference results for Sub-Scene Captioning and Video Summarization.
python generation_evaluation/open_bench.py 
  • Step 2: Run the evaluation for the generation tasks. For Sub-Scene Captioning, set pred_path to the prediction file produced in Step 1, choose an output_dir, and run
python evaluate_ssc.py --pred_path /your_path/subplot_all.json --output_dir /eval_subplot  --output_json /eval_subplot.json
python calculate.py --path /eval_subplot

For Video Summarization, likewise set pred_path to the prediction file produced in Step 1, choose an output_dir, and run

python evaluate_summary.py --pred_path /your_path/summary_all.json --output_dir /eval_summary  --output_json /eval_summary.json

Then run the following, with --path set to your output_dir:

python calculate_sum.py --path /eval_summary
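If you prefer to run the whole generation evaluation in one go, the four commands above can be chained from a small driver script. This is only a convenience sketch; the prediction paths and output directories are placeholders that should match the ones you used above.

```python
# Optional driver that chains the generation-evaluation commands above.
# All paths are placeholders; point them at your own prediction files
# and output directories.
import subprocess

ssc_pred = "/your_path/subplot_all.json"
sum_pred = "/your_path/summary_all.json"

commands = [
    ["python", "evaluate_ssc.py", "--pred_path", ssc_pred,
     "--output_dir", "/eval_subplot", "--output_json", "/eval_subplot.json"],
    ["python", "calculate.py", "--path", "/eval_subplot"],
    ["python", "evaluate_summary.py", "--pred_path", sum_pred,
     "--output_dir", "/eval_summary", "--output_json", "/eval_summary.json"],
    ["python", "calculate_sum.py", "--path", "/eval_summary"],
]

for cmd in commands:
    # Stop early if any stage fails so partial results are easy to spot.
    subprocess.run(cmd, check=True)
```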

Benchmark MLVU on existing models

(Take VideoChat2 as an example; a hypothetical sketch of the edits involved follows the steps below.)

  • Step 1: Download the original model and weights from VideoChat2
  • Step 2: Put choice_bench.py and open_bench.py into the same folder as demo.py
  • Step 3: Modify the MLVU data path in choice_bench.py and open_bench.py
  • Step 4: Run the inference and online evaluation for the multiple-choice tasks.
  • Step 5: Run the inference and evaluation for the generation tasks.
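In practice, Step 3 amounts to editing a couple of path constants near the top of the two scripts, and Steps 4–5 to making their inference call reuse the model loaded in demo.py. The sketch below only illustrates that shape of change; the variable names and the `model.chat` call are placeholders, not VideoChat2's actual API.

```python
# Hypothetical edits inside choice_bench.py / open_bench.py.
# Names below are placeholders; mirror whatever demo.py actually uses.

# Step 3: point the scripts at your local copy of the MLVU data.
MLVU_VIDEO_DIR = "/path/to/MLVU/video"
MLVU_ANNOTATION_DIR = "/path/to/MLVU/json"

# Steps 4-5: reuse the model setup from demo.py so the benchmark scripts
# call the same generation routine the demo does.
def run_inference(model, video_path, prompt):
    # Replace this body with the video-chat call demo.py performs.
    return model.chat(video_path, prompt)
```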