MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Yue, Xiang; Ni, Yuansheng; Zhang, Kai; Zheng, Tianyu; Liu, Ruoqi; Zhang, Ge; Stevens, Samuel; Jiang, Dongfu; Ren, Weiming; Sun, Yuxuan; Wei, Cong; Yu, Botao; Yuan, Ruibin; Sun, Renliang; Yin, Ming; Zheng, Boyuan; Yang, Zhenzhu; Liu, Yibo; Huang, Wenhao; Sun, Huan; Su, Yu; Chen, Wenhu

arXiv:2311.16502v1 (cs)

[Submitted on 27 Nov 2023 (this version), latest version 13 Jun 2024 (v4)]

Abstract: We introduce MMMU, a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks that demand college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields and comprise 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. Unlike existing benchmarks, MMMU focuses on advanced perception and reasoning with domain-specific knowledge, challenging models to perform tasks akin to those faced by experts. Our evaluation of 14 open-source large multimodal models (LMMs) and the proprietary GPT-4V(ision) highlights the substantial challenges posed by MMMU. Even the advanced GPT-4V achieves only 56% accuracy, indicating significant room for improvement. We believe MMMU will stimulate the community to build next-generation multimodal foundation models towards expert artificial general intelligence.

Comments:	115 pages, 99 figures
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2311.16502 [cs.CL]
	(or arXiv:2311.16502v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2311.16502

Submission history

From: Xiang Yue
[v1] Mon, 27 Nov 2023 17:33:21 UTC (16,490 KB)
[v2] Mon, 18 Dec 2023 03:47:39 UTC (17,526 KB)
[v3] Thu, 21 Dec 2023 04:06:49 UTC (17,526 KB)
[v4] Thu, 13 Jun 2024 15:02:39 UTC (17,534 KB)
