Can CLIP Help Sound Source Localization?

Park, Sooyoung; Senocak, Arda; Chung, Joon Son

arXiv:2311.04066 (cs)

[Submitted on 7 Nov 2023]

Abstract: Large-scale pre-trained image-text models demonstrate remarkable versatility across diverse tasks, benefiting from their robust representational capabilities and effective multimodal alignment. We extend the application of these models, specifically CLIP, to the domain of sound source localization. Unlike conventional approaches, we employ the pre-trained CLIP model without explicit text input, relying solely on the audio-visual correspondence. To this end, we introduce a framework that translates audio signals into tokens compatible with CLIP's text encoder, yielding audio-driven embeddings. By directly using these embeddings, our method generates audio-grounded masks for the provided audio, extracts audio-grounded image features from the highlighted regions, and aligns them with the audio-driven embeddings using the audio-visual correspondence objective. Our findings suggest that utilizing pre-trained image-text models enables our model to generate more complete and compact localization maps for the sounding objects. Extensive experiments show that our method outperforms state-of-the-art approaches by a significant margin.
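The pipeline described in the abstract (audio-driven embedding, audio-grounded mask, masked image feature, correspondence objective) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the sigmoid soft mask, the temperature of 0.07, and the symmetric contrastive loss are all assumptions chosen for illustration, and random arrays stand in for the outputs of CLIP's image encoder and of the audio-to-token module.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def audio_grounded_masks(patch_feats, audio_emb, temperature=0.07):
    """Soft mask per image patch from cosine similarity to the audio embedding.

    patch_feats: (B, N, D) patch features (stand-in for image encoder output).
    audio_emb:   (B, D) audio-driven embeddings (stand-in for text encoder output).
    The sigmoid and temperature are illustrative assumptions.
    """
    p = l2_normalize(patch_feats)
    a = l2_normalize(audio_emb)
    sim = np.einsum("bnd,bd->bn", p, a)          # cosine similarity per patch
    return 1.0 / (1.0 + np.exp(-sim / temperature))

def masked_pool(patch_feats, masks, eps=1e-8):
    """Mask-weighted average of patch features: one vector per image."""
    weights = masks / (masks.sum(axis=1, keepdims=True) + eps)
    return np.einsum("bn,bnd->bd", weights, patch_feats)

def correspondence_loss(image_feats, audio_emb, temperature=0.07):
    """Symmetric contrastive loss; matching image/audio pairs lie on the diagonal."""
    v = l2_normalize(image_feats)
    a = l2_normalize(audio_emb)
    logits = v @ a.T / temperature               # (B, B) pairwise similarities

    def diagonal_cross_entropy(scores):
        scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

# Hypothetical sizes: 4 image/audio pairs, a 7x7 patch grid, 32-dim features.
rng = np.random.default_rng(0)
B, N, D = 4, 49, 32
patch_feats = rng.normal(size=(B, N, D))
audio_emb = rng.normal(size=(B, D))

masks = audio_grounded_masks(patch_feats, audio_emb)   # (B, N) localization maps
grounded = masked_pool(patch_feats, masks)             # (B, D) audio-grounded features
loss = correspondence_loss(grounded, audio_emb)        # scalar training signal
```

Reshaping each row of `masks` to the patch grid (7x7 here) gives a coarse localization map. In a trained system the mask and the loss would be computed on learned features and optimized end to end, which this sketch does not attempt.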

Comments:	WACV 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2311.04066 [cs.CV]
	(or arXiv:2311.04066v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2311.04066

Submission history

From: Arda Senocak
[v1] Tue, 7 Nov 2023 15:26:57 UTC (8,046 KB)