MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

Song, Enxin; Chai, Wenhao; Wang, Guanhong; Zhang, Yucheng; Zhou, Haoyang; Wu, Feiyang; Chi, Haozhe; Guo, Xun; Ye, Tian; Zhang, Yanting; Lu, Yan; Hwang, Jenq-Neng; Wang, Gaoang

arXiv:2307.16449 (cs)

[Submitted on 31 Jul 2023 (v1), last revised 9 Mar 2024 (this version, v4)]

Abstract: Recently, integrating video foundation models and large language models to build a video understanding system has made it possible to overcome the limitations of specific pre-defined vision tasks. Yet existing systems can only handle videos with very few frames. For long videos, computational complexity, memory cost, and long-term temporal connection impose additional challenges. Drawing on the Atkinson-Shiffrin memory model, we propose MovieChat, which uses tokens in Transformers as the carriers of memory in combination with our specially designed memory mechanism, to overcome these challenges. MovieChat achieves state-of-the-art performance in long video understanding. We also release the MovieChat-1K benchmark, with 1K long videos and 14K manual annotations, to validate the effectiveness of our method.

Comments:	CVPR 2024. First three authors contribute equally to this work. Project Website this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2307.16449 [cs.CV]
	(or arXiv:2307.16449v4 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2307.16449

Submission history

From: Wenhao Chai [view email]
[v1] Mon, 31 Jul 2023 07:15:45 UTC (1,088 KB)
[v2] Fri, 24 Nov 2023 02:43:18 UTC (4,453 KB)
[v3] Sun, 3 Dec 2023 00:51:13 UTC (4,454 KB)
[v4] Sat, 9 Mar 2024 06:43:37 UTC (4,828 KB)
