
From Text to Pixel: Advancing Long-Context Understanding in MLLMs

📃 Paper

We introduce SEEKER, a multimodal large language model designed for efficient long-context understanding. SEEKER compactly encodes long text by compressing the text sequence into the visual pixel space, i.e., rendering it as images, enabling the model to handle long text efficiently within a fixed token-length budget.

Overview

Left: Performance on the First-Sentence-Retrieval task, revealing the compact nature of image tokens in representing long content. Right: Radar chart showing the superior performance of SEEKER (ours) across both short- and long-context multimodal tasks.

SEEKER surpasses OCR-based models on long multimodal-context tasks: 1) it processes multiple text-rich images naturally; 2) its image tokens are more compact, so long content fits easily within the fixed context length of an LLM.
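The token-budget argument above can be sketched with rough arithmetic. Note the constants below are illustrative assumptions for the sketch, not figures from the paper: ~4 characters per BPE text token, and 576 patch tokens per image as with a ViT-L/14 encoder at 336x336 input ((336/14)^2 = 576).

```python
def text_token_estimate(n_chars: int, chars_per_token: int = 4) -> int:
    """Rough heuristic: a BPE tokenizer yields ~1 token per 4 characters."""
    return n_chars // chars_per_token

def image_token_cost(n_pages: int, tokens_per_image: int = 576) -> int:
    """Fixed cost of feeding rendered page images to a vision encoder.
    576 corresponds to a ViT-L/14 encoder at 336x336: (336 // 14) ** 2 patches."""
    return n_pages * tokens_per_image

# A ~50k-character document costs ~12,500 tokens as raw text,
# but only 5,760 image tokens if rendered onto 10 page images.
print(text_token_estimate(50_000))  # 12500
print(image_token_cost(10))         # 5760
```

Under these assumptions, rendering text as images cuts the token cost by roughly 2x, and the per-image cost stays constant regardless of how densely the page is filled.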

Main quantitative results - Long Image and Text Context.

Main quantitative results - Short Image and Text Context.

Compact Context Length and Inference Efficiency

Training and Inference

To Be Released
