CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding

Hou, Zhijian; Zhong, Wanjun; Ji, Lei; Gao, Difei; Yan, Kun; Chan, Wing-Kwong; Ngo, Chong-Wah; Shou, Zheng; Duan, Nan

arXiv:2209.10918 (cs)

[Submitted on 22 Sep 2022 (v1), last revised 30 May 2023 (this version, v2)]

Abstract: This paper tackles an emerging and challenging problem, long video temporal grounding (VTG), which localizes video moments related to a natural language (NL) query. Compared with short videos, long videos are in equally high demand but less explored, and they bring new challenges: higher inference computation cost and weaker multi-modal alignment. To address these challenges, we propose CONE, an efficient COarse-to-fiNE alignment framework. CONE is a plug-and-play framework on top of existing VTG models that handles long videos through a sliding window mechanism. Specifically, CONE (1) introduces a query-guided window selection strategy to speed up inference, and (2) proposes a coarse-to-fine mechanism via a novel incorporation of contrastive learning to enhance multi-modal alignment for long videos. Extensive experiments on two large-scale long VTG benchmarks consistently show both substantial performance gains (e.g., from 3.13% to 6.87% on MAD) and state-of-the-art results. Analyses also reveal higher efficiency, as the query-guided window selection mechanism speeds up inference by 2x on Ego4D-NLQ and 15x on MAD while keeping SOTA results. Code has been released at this https URL.
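
The sliding window mechanism and query-guided window selection mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how windows are scored, so the function names (`sliding_windows`, `select_windows`), the cosine-similarity scoring, and the max-over-frames pooling are all assumptions chosen for illustration. The idea shown is only that a long video is cut into overlapping windows, each window gets a cheap coarse score against the query, and only the top-scoring windows are passed on to a more expensive fine-grained model.

```python
import math
from typing import List, Sequence, Tuple

Window = Tuple[int, int]  # [start, end) frame indices


def sliding_windows(num_frames: int, window: int, stride: int) -> List[Window]:
    """Cut a video of num_frames frames into overlapping [start, end) windows."""
    if num_frames <= 0 or window <= 0 or stride <= 0:
        raise ValueError("num_frames, window and stride must be positive")
    starts = list(range(0, max(num_frames - window, 0) + 1, stride))
    # Add one extra window if the regular strides leave the tail uncovered.
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    return [(s, min(s + window, num_frames)) for s in starts]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_windows(
    query_vec: Sequence[float],
    frame_vecs: Sequence[Sequence[float]],
    window: int,
    stride: int,
    top_k: int,
) -> List[Window]:
    """Keep the top_k windows by a coarse query-window score.

    Assumption: a window's score is the best frame-query cosine similarity
    inside it. The paper may score windows differently.
    """
    scored = []
    for start, end in sliding_windows(len(frame_vecs), window, stride):
        score = max(cosine(query_vec, f) for f in frame_vecs[start:end])
        scored.append((score, (start, end)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [win for _, win in scored[:top_k]]
```

With precomputed query and frame embeddings, a call such as `select_windows(query_vec, frame_vecs, window=90, stride=45, top_k=10)` (hypothetical values) would return the ten candidate windows to hand to a fine-grained grounding model. Skipping the remaining windows is where an inference speed-up of the kind the abstract reports would come from.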

Comments:	ACL 2023 Camera Ready. 14 pages, 7 figures, 4 tables
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Cite as:	arXiv:2209.10918 [cs.CV]
	(or arXiv:2209.10918v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2209.10918

Submission history

From: Hou Zhijian
[v1] Thu, 22 Sep 2022 10:58:42 UTC (646 KB)
[v2] Tue, 30 May 2023 02:03:34 UTC (9,120 KB)
