Evaluation for MLVU

We provide detailed evaluation methods for MLVU, covering both multiple-choice tasks and generation tasks.

Benchmark MLVU on your Model

First, if you want to benchmark MLVU on your own model, you can refer to our template test code as follows:

Multiple-Choice testing

python multiple_choice_evaluation/choice_bench.py 

You need to load your own model into this template; it then runs inference and evaluates multiple-choice accuracy online.
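The exact hook depends on your model's API. Below is a minimal sketch of the kind of loop the template runs; the `model.generate` call and the annotation field names (`question`, `candidates`, `answer`, `video`) are hypothetical placeholders, so check choice_bench.py for the actual names before adapting it.

```python
# Hypothetical sketch of the multiple-choice evaluation loop.
# Field names, paths, and the model API are placeholders -- adapt them
# to your own model and to the actual choice_bench.py.
import json

def build_prompt(question, candidates):
    # Present the question with lettered options, e.g. (A) ... (B) ...
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(candidates))
    return f"{question}\n{options}\nAnswer with the option's letter only."

def evaluate(model, annotation_file):
    with open(annotation_file) as f:
        samples = json.load(f)
    correct = 0
    for sample in samples:
        prompt = build_prompt(sample["question"], sample["candidates"])
        # Replace with your model's video-conditioned generation call.
        prediction = model.generate(video=sample["video"], prompt=prompt)
        # Assumes the ground-truth answer is stored as an option letter.
        correct += int(prediction.strip().startswith(sample["answer"]))
    print(f"Accuracy: {correct / len(samples):.4f}")
```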

Generation testing

  • Step 1: Get the inference results for Sub-Scene Captioning and Video Summarization.
python generation_evaluation/open_bench.py 
  • Step 2: Run the evaluation for the generation tasks. For Sub-Scene Captioning, set pred_path to the prediction file produced in Step 1, choose an output_dir, and run
python evaluate_ssc.py --pred_path /your_path/subplot_all.json --output_dir /eval_subplot  --output_json /eval_subplot.json
python calculate.py --path /eval_subplot

For Video Summarization, likewise set pred_path to the prediction file produced in Step 1, choose an output_dir, and run

python evaluate_summary.py --pred_path /your_path/summary_all.json --output_dir /eval_summary  --output_json /eval_summary.json

Then run the following, with --path set to your output_dir:

python calculate_sum.py --path /eval_summary
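If you prefer to run the whole generation evaluation in one go, the four commands above can be chained from a small driver script. This is only a convenience sketch; the prediction paths and output directories are placeholders that should match the ones you used above.

```python
# Optional driver that chains the generation-evaluation commands above.
# All paths are placeholders; point them at your own prediction files
# and output directories.
import subprocess

ssc_pred = "/your_path/subplot_all.json"
sum_pred = "/your_path/summary_all.json"

commands = [
    ["python", "evaluate_ssc.py", "--pred_path", ssc_pred,
     "--output_dir", "/eval_subplot", "--output_json", "/eval_subplot.json"],
    ["python", "calculate.py", "--path", "/eval_subplot"],
    ["python", "evaluate_summary.py", "--pred_path", sum_pred,
     "--output_dir", "/eval_summary", "--output_json", "/eval_summary.json"],
    ["python", "calculate_sum.py", "--path", "/eval_summary"],
]

for cmd in commands:
    # Stop early if any stage fails so partial results are easy to spot.
    subprocess.run(cmd, check=True)
```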

Benchmark MLVU on existing models

(Take VideoChat2 as an example; a hypothetical sketch of the edits involved follows the steps below.)

  • Step 1: Download the original model and weights from VideoChat2
  • Step 2: Put choice_bench.py and open_bench.py into the same folder as demo.py
  • Step 3: Modify the MLVU data path in choice_bench.py and open_bench.py
  • Step 4: Run the inference and online evaluation for the multiple-choice tasks.
  • Step 5: Run the inference and evaluation for the generation tasks.
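In practice, Step 3 amounts to editing a couple of path constants near the top of the two scripts, and Steps 4–5 to making their inference call reuse the model loaded in demo.py. The sketch below only illustrates that shape of change; the variable names and the `model.chat` call are placeholders, not VideoChat2's actual API.

```python
# Hypothetical edits inside choice_bench.py / open_bench.py.
# Names below are placeholders; mirror whatever demo.py actually uses.

# Step 3: point the scripts at your local copy of the MLVU data.
MLVU_VIDEO_DIR = "/path/to/MLVU/video"
MLVU_ANNOTATION_DIR = "/path/to/MLVU/json"

# Steps 4-5: reuse the model setup from demo.py so the benchmark scripts
# call the same generation routine the demo does.
def run_inference(model, video_path, prompt):
    # Replace this body with the video-chat call demo.py performs.
    return model.chat(video_path, prompt)
```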