Phenaki is a text-to-video model that closely resembles standard text-to-image models trained in a quantized, compressed latent space. Phenaki introduces a first stage that compresses the input videos spatially and temporally (e.g. a video of shape 100 x 3 x 256 x 256 -> 20 x 32 x 32), using spatial and temporal transformers. Notably, the temporal transformer is autoregressive, which makes it possible to generate videos of variable length by shifting the context. Once the first stage can encode (compress) and decode (decompress) videos well, the video-generation model is trained in the latent space. The paper uses MaskGIT for that.
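To make the example shapes above concrete, here is a minimal sketch of the compression arithmetic. The helper name `compression_factors` and the (T, C, H, W) axis order are assumptions made for illustration; they are not taken from the paper or from this repository.

```python
from math import prod


def compression_factors(video_shape, latent_shape):
    """Compare a raw video shape (T, C, H, W) with a latent token grid (t, h, w).

    Hypothetical helper: the axis order is an assumption for this sketch.
    """
    frames, _channels, height, width = video_shape
    lat_t, lat_h, lat_w = latent_shape
    return {
        "temporal": frames / lat_t,      # frames per latent time step
        "spatial_h": height / lat_h,     # pixels per latent row
        "spatial_w": width / lat_w,      # pixels per latent column
        # raw input values represented by one latent token
        "values_per_token": prod(video_shape) / prod(latent_shape),
    }


# The example from the text: 100 x 3 x 256 x 256 -> 20 x 32 x 32
factors = compression_factors((100, 3, 256, 256), (20, 32, 32))
print(factors)
```

For the example shapes this gives a five-fold temporal and eight-fold spatial reduction, so each latent token stands in for 960 raw input values.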
We trained a convolutional 3D VQGAN with a spatial compression of f8 and a temporal compression of f2. Videos of (10+1)x128x128 are encoded to a latent size of (5+1)x16x16. cViViT proposes using a separate stem to encode the first frame. In our early experiments this stem did not receive much gradient and therefore evolved very slowly, while the rest of the frames looked much better. As a result, we use a single stem for all frames at once. To still enable image-only training in the second stage, we learn an additional frame and prepend it to the sequence, so that when downsampling temporally by 2, the learned frame and the first frame are encoded into one latent frame; the model can then learn to ignore the learned embedding and encode only the information from the first frame. We trained the model (43M parameters) for 100k steps with a batch size of 64 on 8 A100 GPUs for 1 day. In the following video, the right side is the original and the left side is the reconstruction; in the table, the top rows show the original frames and the bottom rows show the reconstructions.
1.mp4
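The frame bookkeeping behind the learned-frame setup described above can be sketched as follows. This is one reading of the description, not the repository's code: the names `temporal_groups` and `latent_shape` are hypothetical, and the sketch assumes the temporal downsampling merges consecutive, non-overlapping groups of frames.

```python
def temporal_groups(num_frames, temporal_factor=2):
    """List which input frames end up in each latent frame.

    Assumption: one learned frame is prepended, then consecutive
    non-overlapping groups of `temporal_factor` frames are merged.
    """
    frames = ["learned"] + [f"frame_{i}" for i in range(num_frames)]
    if len(frames) % temporal_factor != 0:
        raise ValueError("frame count (plus learned frame) must divide evenly")
    return [
        tuple(frames[i:i + temporal_factor])
        for i in range(0, len(frames), temporal_factor)
    ]


def latent_shape(num_frames, height, width, spatial_factor=8, temporal_factor=2):
    """Latent grid size for f8 spatial and f2 temporal compression."""
    latent_t = len(temporal_groups(num_frames, temporal_factor))
    return (latent_t, height // spatial_factor, width // spatial_factor)


# A (10+1)-frame clip: the learned frame pairs up with the first real frame.
print(temporal_groups(11)[0])
print(latent_shape(11, 128, 128))

# A single image: learned frame + image collapse into one latent frame.
print(latent_shape(1, 128, 128))
```

Under these assumptions an 11-frame clip maps to 6 latent frames, i.e. (5+1)x16x16, and a single image maps to one latent frame, which is what makes image-only training possible with a single stem.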
- Implement cViViT
- Implement convolutional baseline for first stage
- Implement Loss (without video perceptual loss)
- Implement Training code
- Download dataset
- Data pipeline
- Efficient data pipeline
- Test first stage
- Small training first stage
- Full training first stage
- Implement MaskGIT
- Adjust data pipeline for second stage training (include image-only training data)
- Test second stage
- Small training second stage
- Full training second stage
- ....
- Activate KMeans in VQGAN training
- Move dataset to s3
- Does the first stage use a pretrained ViT, as proposed in ViViT?
- How to do positional encoding? current approach
- What is the best way to construct a dataloader for videos?
- The dataset 'Moments in Time' has no captions, only labels. How should captions be generated? "A video of {label}"?
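If the template idea from the last question is adopted, it could look like the sketch below. The function name `label_to_caption` and the label clean-up (replacing underscores and plus signs with spaces) are assumptions for illustration; the actual label format of the dataset should be checked first.

```python
def label_to_caption(label, template="A video of {label}"):
    """Turn a class label into a simple caption using a fixed template.

    Hypothetical helper: the clean-up rules are assumptions, not a
    description of how Moments in Time labels are actually formatted.
    """
    text = label.replace("_", " ").replace("+", " ").strip()
    return template.format(label=text)


print(label_to_caption("running"))
print(label_to_caption("playing_guitar"))
```

A fixed template yields very little caption diversity, so this only serves as a baseline for the open question above.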