Stars
A paper list of recent works on Token Compression for ViT and VLM
A suite of image and video neural tokenizers
Official PyTorch Implementation of Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
🔥ImageFolder: Autoregressive Image Generation with Folded Tokens
O1 Replication Journey: A Strategic Progress Report – Part I
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Repository for Show-o, One Single Transformer to Unify Multimodal Understanding and Generation.
[Official Implementation] Acoustic Autoregressive Modeling 🔥
Text- and image-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Official Implementation of "Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining"
A framework for few-shot evaluation of language models.
PyTorch implementation of MAR+DiffLoss https://arxiv.org/abs/2406.11838
This is the official implementation for ControlVAR.
Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation
An unofficial implementation of both ViT-VQGAN and RQ-VAE in PyTorch
Anole: An Open, Autoregressive and Native Multimodal Model for Interleaved Image-Text Generation
This is a repo to track the latest autoregressive visual generation papers.
Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR.
Official repo for the paper "Scaling Synthetic Data Creation with 1,000,000,000 Personas"
OpenMMLab Detection Toolbox and Benchmark
[NeurIPS 2024] Depth Anything V2. A More Capable Foundation Model for Monocular Depth Estimation
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
Open-MAGVIT2: Democratizing Autoregressive Visual Generation
TextGrad: Automatic "Differentiation" via Text -- using large language models to backpropagate textual gradients.