Video Moment Retrieval from Text Queries via Single Frame Annotation

Cui, Ran; Qian, Tianwen; Peng, Pai; Daskalaki, Elena; Chen, Jingjing; Guo, Xiaowei; Sun, Huyang; Jiang, Yu-Gang

doi:10.1145/3477495.3532078

arXiv:2204.09409 (cs)

[Submitted on 20 Apr 2022 (v1), last revised 18 Jun 2022 (this version, v3)]

Abstract: Video moment retrieval aims to find the start and end timestamps of a moment (part of a video) described by a given natural language query. Fully supervised methods need complete temporal boundary annotations to achieve promising results, which is costly because the annotator must watch the whole moment. Weakly supervised methods rely only on paired videos and queries, but their performance is relatively poor. In this paper, we look closer into the annotation process and propose a new paradigm called "glance annotation". This paradigm requires the timestamp of only a single random frame, which we refer to as a "glance", within the temporal boundary of the fully supervised counterpart. We argue this is beneficial because, compared with weak supervision, it adds trivial annotation cost yet offers more potential in performance. Under the glance annotation setting, we propose a method named Video moment retrieval via Glance Annotation (ViGA), based on contrastive learning. ViGA cuts the input video into clips and contrasts clips with queries, assigning glance-guided Gaussian-distributed weights to all clips. Our extensive experiments indicate that ViGA outperforms state-of-the-art weakly supervised methods by a large margin and is even comparable to fully supervised methods in some cases.
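
To make the idea of glance-guided Gaussian weights concrete, the following is a minimal sketch and not the authors' implementation. It assumes equal-length clips, an unnormalized Gaussian centered on the clip that contains the glance, and a hand-picked width `sigma`; the abstract specifies none of these details, and the function names `clip_index_for_timestamp` and `glance_gaussian_weights` are hypothetical.

```python
import math


def clip_index_for_timestamp(timestamp, video_duration, num_clips):
    """Map a glance timestamp (seconds) to the index of the clip containing it.

    Assumes the video is cut into `num_clips` equal-length clips (an assumption).
    """
    if video_duration <= 0 or num_clips <= 0:
        raise ValueError("video_duration and num_clips must be positive")
    if not 0 <= timestamp <= video_duration:
        raise ValueError("timestamp must lie inside the video")
    index = int(timestamp / video_duration * num_clips)
    # A timestamp exactly at the end of the video falls into the last clip.
    return min(index, num_clips - 1)


def glance_gaussian_weights(num_clips, glance_index, sigma):
    """Return one weight per clip, peaking at the clip that holds the glance.

    The weight decays with the squared distance (in clips) from the glance.
    `sigma` is a hypothetical width parameter chosen for illustration.
    """
    if num_clips <= 0:
        raise ValueError("num_clips must be positive")
    if not 0 <= glance_index < num_clips:
        raise ValueError("glance_index must be a valid clip index")
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return [
        math.exp(-((i - glance_index) ** 2) / (2.0 * sigma ** 2))
        for i in range(num_clips)
    ]


# Example: a 32-second video cut into 8 clips, with a glance at 12 seconds.
glance_clip = clip_index_for_timestamp(12.0, 32.0, 8)
weights = glance_gaussian_weights(8, glance_clip, sigma=1.5)
```

In a contrastive setup, weights like these could scale each clip-query pair's contribution to the loss, so clips near the glance count as stronger positives than distant ones. How ViGA actually applies the weights is described in the paper, not in this abstract.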

Comments:	Accepted as full paper in SIGIR 2022
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2204.09409 [cs.CV]
	(or arXiv:2204.09409v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2204.09409
Related DOI:	https://doi.org/10.1145/3477495.3532078

Submission history

From: Ran Cui
[v1] Wed, 20 Apr 2022 11:59:17 UTC (8,191 KB)
[v2] Tue, 26 Apr 2022 12:14:41 UTC (8,191 KB)
[v3] Sat, 18 Jun 2022 12:56:41 UTC (8,191 KB)