MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation

Voleti, Vikram; Jolicoeur-Martineau, Alexia; Pal, Christopher

arXiv:2205.09853 (cs)

[Submitted on 19 May 2022 (v1), last revised 12 Oct 2022 (this version, v4)]

Abstract: Video prediction is a challenging task. The quality of video frames from current state-of-the-art (SOTA) generative models tends to be poor and generalization beyond the training data is difficult. Furthermore, existing prediction frameworks are typically not capable of simultaneously handling other video-related tasks such as unconditional generation or interpolation. In this work, we devise a general-purpose framework called Masked Conditional Video Diffusion (MCVD) for all of these video synthesis tasks using a probabilistic conditional score-based denoising diffusion model, conditioned on past and/or future frames. We train the model in a manner where we randomly and independently mask all the past frames or all the future frames. This novel but straightforward setup allows us to train a single model that is capable of executing a broad range of video tasks, specifically: future/past prediction -- when only future/past frames are masked; unconditional generation -- when both past and future frames are masked; and interpolation -- when neither past nor future frames are masked. Our experiments show that this approach can generate high-quality frames for diverse types of videos. Our MCVD models are built from simple non-recurrent 2D-convolutional architectures, conditioning on blocks of frames and generating blocks of frames. We generate videos of arbitrary lengths autoregressively in a block-wise manner. Our approach yields SOTA results across standard video prediction and interpolation benchmarks, with computation times for training models measured in 1-12 days using $\le$ 4 GPUs. Project page: this https URL; Code: this https URL
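The masking scheme and the block-wise rollout described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the choice to represent a masked block by zeros, the masking probability of 0.5, and the `sample_block` callable (a stand-in for a trained diffusion sampler) are all assumptions made for illustration.

```python
import numpy as np

def mask_conditioning(past, future, rng, p_mask=0.5):
    """Independently mask the whole past block and/or the whole future block.

    Assumption: a masked block is replaced by zeros and each block is
    masked with probability p_mask; the abstract only states that the
    masking is random and independent.
    """
    mask_past = bool(rng.random() < p_mask)
    mask_future = bool(rng.random() < p_mask)
    past_in = np.zeros_like(past) if mask_past else past
    future_in = np.zeros_like(future) if mask_future else future
    return past_in, future_in, mask_past, mask_future

def task_name(mask_past, mask_future):
    """Map a masking pattern to the task named in the abstract."""
    if mask_past and mask_future:
        return "unconditional generation"
    if mask_future:
        return "future prediction"   # only future frames are masked
    if mask_past:
        return "past prediction"     # only past frames are masked
    return "interpolation"           # neither block is masked

def rollout(initial_past, n_frames, block_size, sample_block):
    """Generate n_frames autoregressively, one block at a time.

    sample_block(context, block_size) is a hypothetical sampler that
    returns block_size new frames given the most recent context frames.
    """
    k = len(initial_past)
    frames = list(initial_past)
    generated = []
    while len(generated) < n_frames:
        context = np.stack(frames[-k:])      # condition on the latest k frames
        block = list(sample_block(context, block_size))
        generated.extend(block)
        frames.extend(block)                 # new frames become future context
    return np.stack(generated[:n_frames])
```

At inference time the masking pattern would be chosen to match the desired task rather than sampled, for example masking only the future block for future prediction, and `rollout` would be called with the real sampler to extend a video beyond a single block.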

Comments:	NeurIPS 2022 ; 10 pages, 4 figures, 7 tables
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2205.09853 [cs.CV]
	(or arXiv:2205.09853v4 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2205.09853

Submission history

From: Vikram Voleti
[v1] Thu, 19 May 2022 20:58:05 UTC (9,618 KB)
[v2] Mon, 23 May 2022 21:55:27 UTC (9,618 KB)
[v3] Mon, 30 May 2022 15:59:55 UTC (5,974 KB)
[v4] Wed, 12 Oct 2022 19:33:40 UTC (3,948 KB)
