
From Text to Pixel: Advancing Long-Context Understanding in MLLMs

📃 Paper

We introduce SEEKER, a multimodal large language model designed for efficient long-context understanding. SEEKER compactly encodes long text by compressing the text sequence into the visual pixel space, i.e., rendering it as images, enabling the model to handle long text efficiently within a fixed token-length budget.

Overview

Left: Performance on the First-Sentence-Retrieval task, revealing the compact nature of image tokens in representing long content. Right: Radar chart showing the superior performance of SEEKER (ours) across both short- and long-context multimodal tasks.

SEEKER surpasses OCR-based models on long multimodal-context tasks: 1) it processes multiple text-rich images naturally; 2) its image tokens are more compact, so long content fits easily within the fixed context length of an LLM.
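The token-budget argument above can be sketched with rough arithmetic. Note the constants below are illustrative assumptions for the sketch, not figures from the paper: ~4 characters per BPE text token, and 576 patch tokens per image as with a ViT-L/14 encoder at 336x336 input ((336/14)^2 = 576).

```python
def text_token_estimate(n_chars: int, chars_per_token: int = 4) -> int:
    """Rough heuristic: a BPE tokenizer yields ~1 token per 4 characters."""
    return n_chars // chars_per_token

def image_token_cost(n_pages: int, tokens_per_image: int = 576) -> int:
    """Fixed cost of feeding rendered page images to a vision encoder.
    576 corresponds to a ViT-L/14 encoder at 336x336: (336 // 14) ** 2 patches."""
    return n_pages * tokens_per_image

# A ~50k-character document costs ~12,500 tokens as raw text,
# but only 5,760 image tokens if rendered onto 10 page images.
print(text_token_estimate(50_000))  # 12500
print(image_token_cost(10))         # 5760
```

Under these assumptions, rendering text as images cuts the token cost by roughly 2x, and the per-image cost stays constant regardless of how densely the page is filled.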

Main quantitative results - Long Image and Text Context.

Main quantitative results - Short Image and Text Context.

Compact Context Length and Inference Efficiency

Training and Inference

To Be Released
