A repository for evaluating AudioLLMs in various tasks
AudioBench: A Universal Benchmark for Audio Large Language Models
- Sep 2024: Added the MuChoMusic dataset for music evaluation (multiple-choice questions).
- Aug 2024: Added support for several speech translation datasets. Updated the evaluation script for several MCQ evaluations.
- Aug 2024: Leaderboard is live. Check it out here.
- July 2024: We are working hard on the leaderboard and speech translation dataset. Stay tuned!
- July 2024: Supports all 26 datasets listed in the AudioBench manuscript.
Installation with pip:
pip install -r requirements.txt
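# Optional sketch: install into an isolated environment first. The Python version
# below is an assumption, not a requirement stated by this repo.
# conda create -n audiobench python=3.10 -y
# conda activate audiobench
# pip install -r requirements.txt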
For model-as-judge evaluation, we serve the judgement model as a service via vLLM on port 5000. The example below hosts a Llama-3-70B-Instruct model as the judge and runs the cascade Whisper + Llama-3 model under evaluation.
# Step 1:
# Serve the model as judge.
# It will auto-download the model and may require verification from Hugging Face.
# In the demo, we use 2 H100 80G GPUs to host the model.
# For smaller VRAM, you may need to reduce the model size.
# bash host_model_judge_llama_3_70b_instruct.sh
# Another option (recommended) is to use the quantized model, which can be hosted on 2*40G GPUs.
bash host_model_judge_llama_3_70b_instruct_awq.sh
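# For reference, a rough, hypothetical sketch of what such a hosting script may wrap:
# a vLLM OpenAI-compatible server listening on port 5000. The model path, tensor
# parallelism, and quantization flags below are assumptions; check the script itself
# for the exact command.
# python -m vllm.entrypoints.openai.api_server \
#     --model casperhansen/llama-3-70b-instruct-awq \
#     --quantization awq \
#     --tensor-parallel-size 2 \
#     --port 5000
# Quick check that the judge service is reachable:
# curl http://localhost:5000/v1/models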
# Step 2:
# The example is run with 3 H100 80G GPUs.
# The AudioLLM inference runs on GPU 2, since GPUs 0 and 1 are used to host the model-as-judge service.
# This setting evaluates only 50 samples.
MODEL_NAME=whisper_large_v3_with_llama_3_8b_instruct
GPU=2
BATCH_SIZE=1
METRICS=llama3_70b_judge_binary
OVERWRITE=True
NUMBER_OF_SAMPLES=50
DATASET=cn_college_listen_mcq_test
bash eval.sh $DATASET $MODEL_NAME $GPU $BATCH_SIZE $OVERWRITE $METRICS $NUMBER_OF_SAMPLES
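# Optional sketch: eval.sh takes positional arguments, so additional datasets can be
# evaluated by changing $DATASET. The second name below is a placeholder, not a
# verified dataset identifier from this repo.
# for DATASET in cn_college_listen_mcq_test another_dataset_name; do
#     bash eval.sh $DATASET $MODEL_NAME $GPU $BATCH_SIZE $OVERWRITE $METRICS $NUMBER_OF_SAMPLES
# done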
# Step 3:
# The results will look like:
# {
#     "llama3_70b_judge_binary": {
#         "judge_score": 90.0,
#         "success_rate": 1.0
#     }
# }
# This indicates that the cascade model achieves 90% accuracy on the MCQ task for the English listening test.
The example above shows how to get started. To evaluate on the full datasets, please refer to Examples.
# After the model weights are downloaded, run the evaluation script on all datasets
bash examples/eval_salmonn_7b.sh
- ASR: Automatic Speech Recognition
- SQA: Speech Question Answering
- SI: Speech Instruction
- ST: Speech Translation
- ASR-CN: Automatic Speech Recognition for Chinese
- AR: Accent Recognition
- GR: Gender Recognition
- ER: Emotion Recognition
- AC: Audio Captioning
- ASQA: Audio Scene Question Answering
Dataset | Metrics | Status |
---|---|---|
LibriSpeech-Clean | Word-Error-Rate | ✅ |
LibriSpeech-Other | Word-Error-Rate | ✅ |
CommonVoice-15-EN | Word-Error-Rate | ✅ |
Peoples-Speech | Word-Error-Rate | ✅ |
GigaSpeech | Word-Error-Rate | ✅ |
Earning21 | Word-Error-Rate | ✅ |
Earning22 | Word-Error-Rate | ✅ |
Tedlium3 | Word-Error-Rate | ✅ |
Tedlium3-Longform | Word-Error-Rate | ✅ |
export MODEL_NAME=whisper_large_v3_with_llama_3_8b_instruct
export GPU=3
export BATCH_SIZE=1
export OVERWRITE=False
export NUMBER_OF_SAMPLES=-1
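# NUMBER_OF_SAMPLES=-1 presumably evaluates the full dataset;
# OVERWRITE=False presumably reuses existing results rather than re-running inference.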
bash examples/eval_asr.sh
Dataset | Metrics | Status |
---|---|---|
CN-College-Listen | Model-as-Judge (binary) | ✅ |
SLUE-P2-SQA5 | Model-as-Judge | ✅ |
DREAM-TTS | Model-as-Judge (binary) | ✅ |
Public-SG-SpeechQA | Model-as-Judge | ✅ |
Spoken-SQuAD | Model-as-Judge | ✅ |
bash examples/eval_sqa.sh
Dataset | Metrics | Status |
---|---|---|
OpenHermes-Audio | Model-as-Judge | ✅ |
ALPACA-Audio | Model-as-Judge | ✅ |
bash examples/eval_si.sh
Dataset | Metrics | Status |
---|---|---|
CoVost2-English-Indonesian | BLEU | ✅ |
CoVost2-English-Chinese | BLEU | ✅ |
CoVost2-English-Tamil | BLEU | ✅ |
CoVost2-Indonesian-English | BLEU | ✅ |
CoVost2-Chinese-English | BLEU | ✅ |
CoVost2-Tamil-English | BLEU | ✅ |
bash examples/eval_st.sh
Dataset | Metrics | Status |
---|---|---|
AISHELL-ASR-ZH | Word-Error-Rate | ✅ |
bash examples/eval_asr_cn.sh
Dataset | Metrics | Status |
---|---|---|
AudioCaps | Model-as-Judge / METEOR | ✅ |
WavCaps | Model-as-Judge / METEOR | ✅ |
bash examples/eval_ac.sh
Dataset | Metrics | Status |
---|---|---|
Clotho-AQA | Model-as-Judge | ✅ |
AudioCaps-QA | Model-as-Judge | ✅ |
WavCaps-QA | Model-as-Judge | ✅ |
bash examples/eval_asqa.sh
Dataset | Metrics | Status |
---|---|---|
VoxCeleb-Accent | Model-as-Judge | ✅ |
bash examples/eval_ar.sh
Dataset | Metrics | Status |
---|---|---|
VoxCeleb-Gender | Model-as-Judge (binary) | ✅ |
IEMOCAP-Gender | Model-as-Judge (binary) | ✅ |
bash examples/eval_gr.sh
Dataset | Metrics | Status |
---|---|---|
IEMOCAP-Emotion | Model-as-Judge (binary) | ✅ |
MELD-Sentiment | Model-as-Judge (binary) | ✅ |
MELD-Emotion | Model-as-Judge (binary) | ✅ |
bash examples/eval_er.sh
Dataset | Metrics | Status |
---|---|---|
MuChoMusic | Model-as-Judge (binary) | ✅ |
bash examples/eval_music.sh
Name | Size | Notes | Status |
---|---|---|---|
Whisper-Large+Llama-3-8B-Instruct | ~8B | Cascade Models | ✅ |
SALMONN | ~7B | End2End | ✅ |
Qwen-Audio | ~8B | End2End | TODO |
WavLLM | ~7B | End2End | TODO |
Qwen2-Audio | ~8B | End2End | TODO |
More models are accessible in this survey. To add a new model, please refer to Adding a New Model.
If you find our work useful, please consider citing our paper!
@article{wang2024audiobench,
  title={AudioBench: A Universal Benchmark for Audio Large Language Models},
  author={Wang, Bin and Zou, Xunlong and Lin, Geyu and Sun, Shuo and Liu, Zhuohan and Zhang, Wenyu and Liu, Zhengyuan and Aw, AiTi and Chen, Nancy F},
  journal={arXiv preprint arXiv:2406.16020},
  year={2024}
}
- Llama3-S: When Llama Learns to Listen
- More to come...