ImageBind: One Embedding Space To Bind Them All

Girdhar, Rohit; El-Nouby, Alaaeldin; Liu, Zhuang; Singh, Mannat; Alwala, Kalyan Vasudev; Joulin, Armand; Misra, Ishan

arXiv:2305.05665 (cs)

[Submitted on 9 May 2023 (v1), last revised 31 May 2023 (this version, v2)]

Abstract: We present ImageBind, an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. We show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together. ImageBind can leverage recent large-scale vision-language models, and extends their zero-shot capabilities to new modalities just by using their natural pairing with images. It enables novel emergent applications 'out-of-the-box' including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection and generation. The emergent capabilities improve with the strength of the image encoder and we set a new state-of-the-art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Finally, we show strong few-shot recognition results outperforming prior work, and that ImageBind serves as a new way to evaluate vision models for visual and non-visual tasks.
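
Illustration (not from the paper): the abstract mentions cross-modal retrieval and composing modalities with arithmetic. The sketch below shows what those two operations can look like once every modality lives in one shared embedding space. It is a minimal toy under stated assumptions: the three-dimensional vectors, the gallery names, and the helper functions `normalize`, `cosine`, `retrieve`, and `compose` are all hypothetical stand-ins, and no ImageBind model or API is used.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def retrieve(query, gallery):
    """Return gallery keys ranked by decreasing similarity to the query."""
    return sorted(gallery, key=lambda k: cosine(query, gallery[k]), reverse=True)


def compose(a, b):
    """Combine two embeddings by adding their unit-normalized vectors."""
    summed = [x + y for x, y in zip(normalize(a), normalize(b))]
    return normalize(summed)


# Hypothetical embeddings in a shared space (assumed axes: bird, sea, engine).
# In a real system these would come from per-modality encoders.
image_gallery = {
    "bird_in_forest": [0.9, 0.0, 0.1],
    "bird_on_beach": [0.7, 0.7, 0.0],
    "empty_beach": [0.0, 1.0, 0.0],
    "motorbike": [0.0, 0.1, 0.9],
}
audio_waves = [0.05, 0.95, 0.0]  # stand-in for an audio clip of waves
image_bird = [1.0, 0.0, 0.0]     # stand-in for a photo of a bird

# Cross-modal retrieval: an audio query ranks images directly.
audio_to_image = retrieve(audio_waves, image_gallery)

# Embedding arithmetic: image + audio query favors images matching both.
composed = retrieve(compose(image_bird, audio_waves), image_gallery)

print(audio_to_image[0])
print(composed[0])
```

With these toy vectors the audio query ranks the beach image first, while the combined bird-plus-waves query ranks the bird-on-a-beach image first. The point is only that a single shared space lets one similarity function serve any pair of modalities.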

Comments:	CVPR 2023 (Highlighted Paper)
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Cite as:	arXiv:2305.05665 [cs.CV]
	(or arXiv:2305.05665v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2305.05665

Submission history

From: Rohit Girdhar
[v1] Tue, 9 May 2023 17:59:07 UTC (4,635 KB)
[v2] Wed, 31 May 2023 04:57:12 UTC (4,636 KB)
