GitHub - rikeilong/Bay-CAT: [ECCV’24] Official Implementation for CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios

Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios

Qilang Ye¹, Zitong Yu*¹, Rui Shao², Xinyu Xie¹, Philip Torr³, Xiaochun Cao⁴

¹ Great Bay University
² Harbin Institute of Technology, Shenzhen
³ University of Oxford
⁴ Shenzhen Campus of Sun Yat-sen University

*Corresponding author

News 📢

[07/2024] We have released the collected AVinstruct dataset.
[07/2024] Our work has been accepted by ECCV 2024!
[03/2024] arXiv paper released.
[03/2024] Project page released.

Introduction 💡

We introduce CAT, which enhances multimodal large language models (MLLMs) in three ways:
1) We design a clue aggregator that gathers question-related clues in dynamic audio-visual scenarios, enriching the detailed knowledge available to the large language model.
2) CAT is trained on a mixed multimodal dataset, allowing direct application in audio-visual scenarios. Notably, we collect an audio-visual joint instruction dataset, named AVinstruct, to further enhance CAT's capacity to model cross-semantic correlations.
3) We propose AI-assisted ambiguity-aware direct preference optimization, a strategy that retrains the model to favor non-ambiguous responses and improves its ability to localize specific audio-visual objects.
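
For orientation, the preference-optimization idea in point 3 can be illustrated with the generic direct preference optimization (DPO) loss for a single preference pair. This is a minimal sketch of standard DPO only, not the ambiguity-aware variant used in CAT; the function name `dpo_loss`, its argument names, and the default `beta` are assumptions made for illustration. In this reading, the "chosen" response would be the non-ambiguous answer and the "rejected" response the ambiguous one.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Generic DPO loss for one (chosen, rejected) response pair.

    Inputs are summed token log-probabilities of each response under the
    trainable policy and under a frozen reference model. All names here
    are hypothetical and chosen for this sketch.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)

    # loss = -log(sigmoid(logits)) = log(1 + exp(-logits)), computed stably.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Example: the policy has moved toward the chosen (non-ambiguous) response,
# so the loss falls below log(2), its value when policy and reference agree.
loss = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss decreases as the policy assigns relatively more probability to the chosen response than the reference model does, which is the mechanism by which preference pairs steer the model away from ambiguous answers.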

Demo 🤗

Training & Validation

We have collected an audio-visual joint instruction dataset named AVinstruct; see Data.md for details.

Citation ✏️

If you find this work useful for your research, please kindly cite our paper and star our repo.

@misc{ye2024cat,
      title={CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios}, 
      author={Qilang Ye and Zitong Yu and Rui Shao and Xinyu Xie and Philip Torr and Xiaochun Cao},
      year={2024},
      eprint={2403.04640},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

