MTVQA

MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering

A comprehensive evaluation of the multilingual text perception and comprehension capabilities of multimodal large language models across nine widely used yet low-resource languages.

Text-Centric Visual Question Answering (TEC-VQA), in its proper format, not only facilitates human-machine interaction in text-centric visual environments but also serves as a de facto gold proxy for evaluating AI models in the domain of text-centric scene understanding. Nonetheless, most existing TEC-VQA benchmarks focus on high-resource languages such as English and Chinese. Despite pioneering works that expand multilingual QA pairs in non-text-centric VQA datasets through translation engines, the translation-based protocol encounters a substantial "visual-textual misalignment" problem when applied to TEC-VQA: it prioritizes the text in question-answer pairs while disregarding the visual text present in images, and it fails to address complexities related to nuanced meaning, contextual distortion, language bias, and question-type diversity. In this work, we tackle multilingual TEC-VQA by introducing MTVQA, the first benchmark featuring high-quality human expert annotations across 9 diverse languages. By comprehensively evaluating numerous state-of-the-art Multimodal Large Language Models (MLLMs), including GPT-4o, GPT-4V, Claude3, and Gemini, on MTVQA, we show that there is still substantial room for performance improvement, underscoring the value of the dataset. Additionally, we supply multilingual training data within the MTVQA dataset and demonstrate that straightforward fine-tuning on this data can substantially enhance multilingual TEC-VQA performance. We hope that MTVQA will offer the research community fresh insights and stimulate further exploration of multilingual visual text comprehension.

| 🍎 Project Page | 📖 Paper | 📊 Dataset | 🏆 Leaderboard |

🔥 News

  • 2024.08.30 🌟 Qwen2VL 72B is released, outperforming GPT-4o and achieving the best performance overall, congratulations!
  • 2024.07.23 🌟 MTVQA is now supported in VLMEvalKit.
  • 2024.07.23 🌟 MTVQA is now supported in OpenCompass.
  • 2024.06.04 🌟 We are excited to launch MTVQA, the first multilingual visual text comprehension evaluation benchmark for MLLMs! MTVQA includes 9 widely-used but low-resource languages, i.e., AR, DE, FR, IT, JA, KO, RU, TH, and VI.
  • 2024.06.04 🌟 GPT-4o achieves the best performance overall, MiniCPM-V2.5 achieves the best performance among open-source models!

👀 Data

| Raw Data (Google Drive) | Hugging Face Dataset |
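
If you use the Hugging Face release, the dataset can be loaded with the `datasets` library. The sketch below assumes a Hub id of `ByteDance/MTVQA` and a `test` split; check the dataset card for the exact id, split names, and fields.

```python
# A minimal loading sketch, assuming the dataset is published on the
# Hugging Face Hub under the id "ByteDance/MTVQA" with a "test" split.
from datasets import load_dataset

dataset = load_dataset("ByteDance/MTVQA", split="test")

sample = dataset[0]
print(sample.keys())  # inspect the available fields (image, question, answer, ...)
```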

🔮 Evaluation

The test code used to evaluate the models in the paper can be found in the `scripts` directory.
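
For orientation, below is a minimal scoring sketch, not the official scorer in `scripts`: it treats a prediction as correct when the normalized ground-truth answer appears as a substring of the normalized model output, and averages per language. The record field names (`language`, `answer`, `prediction`) are illustrative assumptions; refer to `scripts` for the exact protocol used to produce the leaderboard numbers.

```python
# Hypothetical per-language accuracy scorer; substring matching is one
# common lenient criterion for text-centric VQA, used here as a sketch.
from collections import defaultdict

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a lenient string match."""
    return " ".join(text.lower().split())

def score(records):
    """records: iterable of dicts with 'language', 'answer', 'prediction' keys."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["language"]] += 1
        if normalize(r["answer"]) in normalize(r["prediction"]):
            hits[r["language"]] += 1
    return {lang: 100.0 * hits[lang] / totals[lang] for lang in totals}

demo = [
    {"language": "DE", "answer": "Berlin", "prediction": "The sign says Berlin."},
    {"language": "DE", "answer": "8 Uhr", "prediction": "It opens at 9."},
]
print(score(demo))  # {'DE': 50.0}
```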

If you want to add your results to the MTVQA leaderboard, feel free to email us directly at [email protected] or [email protected].

🏆 Leaderboard

| Models | AR | DE | FR | IT | JA | KO | RU | TH | VI | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2VL 72B 🥇 | - | - | - | - | - | - | - | - | - | 32.6 |
| GPT-4o 🥈 | 20.2 | 34.2 | 41.2 | 32.7 | 20.0 | 33.9 | 11.5 | 22.5 | 34.2 | 27.8 |
| Claude3 Opus 🥉 | 15.1 | 33.4 | 40.6 | 34.4 | 19.4 | 27.2 | 13.0 | 19.5 | 29.1 | 25.7 |
| Gemini Ultra | 14.7 | 32.3 | 40.0 | 31.8 | 12.3 | 17.2 | 11.8 | 20.3 | 28.6 | 23.2 |
| GPT-4V | 11.5 | 31.5 | 40.4 | 32.3 | 11.5 | 16.7 | 10.3 | 15.0 | 28.9 | 22.0 |
| QwenVL Max | 7.7 | 31.4 | 37.6 | 30.2 | 18.6 | 25.4 | 10.4 | 4.8 | 23.5 | 21.1 |
| Claude3 Sonnet | 10.5 | 28.9 | 35.6 | 31.8 | 13.9 | 22.2 | 11.0 | 15.2 | 20.8 | 21.1 |
| QwenVL Plus | 4.8 | 28.8 | 33.7 | 27.1 | 12.8 | 19.9 | 9.4 | 5.6 | 18.1 | 17.8 |
| MiniCPM-V2.5 | 6.1 | 29.6 | 35.7 | 26.0 | 12.1 | 13.1 | 5.7 | 12.6 | 15.3 | 17.3 |
| InternVL-V1.5 | 3.4 | 27.1 | 31.4 | 27.1 | 9.9 | 9.0 | 4.9 | 8.7 | 12.4 | 14.9 |
| GLM4V | 0.3 | 30.0 | 34.1 | 30.1 | 3.4 | 5.7 | 3.0 | 3.5 | 12.3 | 13.6 |
| TextSquare | 3.7 | 27.0 | 30.8 | 26.7 | 3.2 | 7.2 | 6.7 | 5.2 | 12.4 | 13.6 |
| Mini-Gemini-HD-34B | 2.2 | 25.0 | 29.2 | 25.5 | 6.1 | 8.6 | 4.1 | 4.3 | 11.8 | 13.0 |
| InternLM-Xcomposer2-4KHD | 2.0 | 20.6 | 23.2 | 21.6 | 5.6 | 7.7 | 4.1 | 6.1 | 10.1 | 11.2 |
| Llava-Next-34B | 3.3 | 24.0 | 28.0 | 22.3 | 3.6 | 6.1 | 2.6 | 0.4 | 9.8 | 11.1 |
| TextMonkey | 2.0 | 18.1 | 19.9 | 22.1 | 4.6 | 7.2 | 3.2 | 0.9 | 11.1 | 9.9 |
| MiniCPM-V2.0 | 1.3 | 12.7 | 14.9 | 17.0 | 3.7 | 5.6 | 2.2 | 2.2 | 6.8 | 7.4 |
| mPLUG-DocOwl 1.5 | 1.0 | 13.9 | 14.9 | 18.2 | 2.9 | 5.0 | 2.0 | 0.9 | 6.4 | 7.2 |
| YI-VL-34B | 1.7 | 13.5 | 15.7 | 12.1 | 4.8 | 5.2 | 0.8 | 3.5 | 4.1 | 6.8 |
| DeepSeek-VL | 0.6 | 14.2 | 15.3 | 15.2 | 2.9 | 3.8 | 1.6 | 0.9 | 5.2 | 6.6 |

✒️ Citation

If you wish to refer to the baseline results published here, please use the following BibTeX entry:

```bibtex
@misc{tang2024mtvqa,
      title={MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering},
      author={Jingqun Tang and Qi Liu and Yongjie Ye and Jinghui Lu and Shu Wei and Chunhui Lin and Wanqing Li and Mohamad Fitri Faiz Bin Mahmood and Hao Feng and Zhen Zhao and Yanjie Wang and Yuliang Liu and Hao Liu and Xiang Bai and Can Huang},
      year={2024},
      eprint={2405.11985},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

Licence

CC BY-NC 4.0

Bias, Risks, and Limitations

Your access to and use of this dataset are at your own risk. We do not guarantee the accuracy of this dataset. The dataset is provided “as is” and we make no warranty or representation to you with respect to it and we expressly disclaim, and hereby expressly waive, all warranties, express, implied, statutory or otherwise. This includes, without limitation, warranties of quality, performance, merchantability or fitness for a particular purpose, non-infringement, absence of latent or other defects, accuracy, or the presence or absence of errors, whether or not known or discoverable. In no event will we be liable to you on any legal theory (including, without limitation, negligence) or otherwise for any direct, special, indirect, incidental, consequential, punitive, exemplary, or other losses, costs, expenses, or damages arising out of this public license or use of the licensed material. The disclaimer of warranties and limitation of liability provided above shall be interpreted in a manner that, to the extent possible, most closely approximates an absolute disclaimer and waiver of all liability.
