An end-to-end MLLM that can accept any-form referring and ground anything in response.*
Haoxuan You*, Haotian Zhang*, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, Yinfei Yang [*: equal contribution]
Key Contributions:
- Ferret Model - Hybrid Region Representation + Spatial-aware Visual Sampler enable fine-grained and open-vocabulary referring and grounding in MLLM.
- GRIT Dataset (~1.1M) - A Large-scale, Hierarchical, Robust ground-and-refer instruction tuning dataset.
- Ferret-Bench - A multimodal evaluation benchmark that jointly requires Referring/Grounding, Semantics, Knowledge, and Reasoning.