Multimodal Needle in a Haystack (MMNeedle)

Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal LLMs
H. Wang, H. Shi, S. Tan, W. Qin, W. Wang, T. Zhang, A. Nambi, T. Ganu, H. Wang
[Paper] [MMNeedle Dataset]

To use this benchmark, please download the MMNeedle dataset at this link. Alternatively, you can construct your own version of MMNeedle by following the instructions.

News

(2024-06-24) We released the leaderboard for Multimodal Long Context Understanding on Papers With Code!

(2024-06-17) We released the paper, code, and data for Multimodal Needle in a Haystack (MMNeedle) benchmark!

Overview


MMNeedle Evaluation Overview. Correct answers are marked with a checkmark ($\checkmark$), while incorrect answers are marked with a cross ($\times$). Our evaluation setup involves the following key components: (a) Needle Sub-Image: the needle sub-image to be retrieved based on the given caption. (b) Haystack Image Inputs: the long-context visual inputs consist of M images, each stitched from N $\times$ N sub-images. (c) Text Inputs (Instructions and Caption): detailed instructions to the MLLMs, followed by a caption describing the needle, i.e., sub-image 20. (d) LLM Outputs: the answers from different MLLMs, indicating their ability to accurately locate the needle in the haystack based on the given caption. The expected output comprises the model's identification of the index, row, and column of the matching sub-image. The results showcase the comparative performance of various models: GPT-4o correctly predicts the exact location of the needle; Gemini Pro 1.5 correctly predicts only the image index of the needle; the other API models predict incorrect locations; and the open-source models often produce outputs in the wrong format.
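The expected (index, row, column) answer format above implies a simple mapping between a flat sub-image index and its grid position. A minimal sketch of that mapping, assuming a 1-based, row-major indexing convention (the actual convention used by MMNeedle may differ):

```python
# Hedged sketch: map a flat sub-image index to (row, column) within an
# N x N stitched image, and back. The 1-based, row-major convention here
# is an assumption for illustration, not necessarily MMNeedle's.

def index_to_position(index: int, n: int) -> tuple[int, int]:
    """Convert a 1-based flat sub-image index to a 1-based (row, column)."""
    if not 1 <= index <= n * n:
        raise ValueError(f"index must be in [1, {n * n}]")
    row, col = divmod(index - 1, n)
    return row + 1, col + 1


def position_to_index(row: int, col: int, n: int) -> int:
    """Convert a 1-based (row, column) back to the flat sub-image index."""
    return (row - 1) * n + col


if __name__ == "__main__":
    # E.g., sub-image 20 in an 8 x 8 stitched image:
    print(index_to_position(20, 8))    # (3, 4)
    print(position_to_index(3, 4, 8))  # 20
```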


MMNeedle Evaluation Performance Comparison (Claude-3 refers to Claude 3 Opus, and Gemini-1.0/1.5 refers to Gemini Pro 1.0/1.5). The x-axis shows the different models, and the y-axis shows the results for various numbers of input images M and stitching sizes N. For each row, i.e., each setting (M, N), we show the average accuracy (%) of each model. For each stitched image, the color of row r, column c indicates the accuracy of predicting the exact position for samples whose "needle" sub-image is at position (r, c) of the stitched image. For the M=10 setting, we show the average accuracy of each location (r, c) over the 10 images. Redder cells indicate lower accuracy, and greener cells indicate higher accuracy. The best result in each row is underlined.
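The per-cell accuracy described above can be sketched as a small aggregation: for each needle position (r, c), average exact-match correctness over all samples whose needle sits at that cell. The sample record fields (`row`, `col`, `correct`) are illustrative assumptions, not MMNeedle's actual data schema:

```python
# Hedged sketch of the per-cell accuracy heatmap computation: group samples
# by the needle's (row, col) position and average the exact-match results.
from collections import defaultdict


def cell_accuracy(samples, n):
    """samples: iterable of dicts with 1-based 'row', 'col', and a bool
    'correct' flag (exact-position match). Returns an n x n grid of
    accuracies, with None for cells that received no samples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for s in samples:
        key = (s["row"], s["col"])
        totals[key] += 1
        hits[key] += int(s["correct"])
    return [
        [hits[(r, c)] / totals[(r, c)] if totals[(r, c)] else None
         for c in range(1, n + 1)]
        for r in range(1, n + 1)
    ]


if __name__ == "__main__":
    samples = [
        {"row": 1, "col": 1, "correct": True},
        {"row": 1, "col": 1, "correct": False},
        {"row": 2, "col": 2, "correct": True},
    ]
    print(cell_accuracy(samples, 2))  # [[0.5, None], [None, 1.0]]
```

Averaging each row of this grid (and, for M=10, averaging cell-wise across the 10 stitched images) would recover the aggregate numbers shown in the figure.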
