Stars
[ICML 2024] 3D-VLA: A 3D Vision-Language-Action Generative World Model
Implementation of Phenaki Video, which uses Mask GIT to produce text guided videos of up to 2 minutes in length, in Pytorch
This repository compiles a list of papers related to the application of video technology in the field of robotics! Star⭐ the repo and follow me if you like what you see🤩.
Training code for the videocrafter.
RoyZry98 / MMTrail-Pytorch
Forked from litwellchi/MMTrail
[Arxiv 2024] Official code for MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions
A curated list of Diffusion Model in RL resources (continually updated)
world modeling challenge for humanoid robots
Pointcept: a codebase for point cloud perception research. Latest works: PTv3 (CVPR'24 Oral), PPT (CVPR'24), OA-CNNs (CVPR'24), MSC (CVPR'23)
Open-Sora: Democratizing Efficient Video Production for All
A curated list of awesome papers on Embodied AI and related research/industry-driven resources.
AlignProp uses direct reward backpropagation for the alignment of large-scale text-to-image diffusion models. Our method is 25x more sample and compute efficient than reinforcement learning methods…
[Arxiv 2024] Official code for MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions
This repo contains the code for our paper An Image is Worth 32 Tokens for Reconstruction and Generation
Official repo of the paper "MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos"
CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
An open-source framework for training large multimodal models.
Pandora: Towards General World Model with Natural Language Actions and Video States
data pipeline code of large video generation model
[CVPR 2024] Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners
Stable Video Diffusion Training Code and Extensions.
Latte: Latent Diffusion Transformer for Video Generation.
LaVIT: Empower the Large Language Model to Understand and Generate Visual Content
The official site of paper MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation
Official JAX implementation of MAGVIT: Masked Generative Video Transformer
Implementation of MagViT2 Tokenizer in Pytorch
🔥🔥🔥 A curated list of papers on LLMs-based multimodal generation (image, video, 3D and audio).