- Carnegie Mellon University
- Pittsburgh, PA
- https://lxa9867.github.io/
Stars
Official code for "ControlAR: Controllable Image Generation with Autoregressive Models"
A suite of image and video neural tokenizers
[NeurIPS 2024] Classification Done Right for Vision-Language Pre-Training
An efficient, flexible and full-featured toolkit for fine-tuning LLMs (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)
MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers
RobustSAM: Segment Anything Robustly on Degraded Images (CVPR 2024 Highlight)
Code for the paper: MACE: Leveraging Audio for Evaluating Audio Captioning Systems
The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM".
[ECCV 2024] AdaNAT: Exploring Adaptive Policy for Token-Based Image Generation
Efficient vision foundation models for high-resolution generation and perception.
Official inference repo for FLUX.1 models
Code accompanying the paper "Toward Guidance-Free AR Visual Generation via Condition Contrastive Alignment"
A collection of papers on autoregressive models in vision.
Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
Official PyTorch Implementation of Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
HART: Efficient Visual Generation with Hybrid Autoregressive Transformer
CAR: Controllable AutoRegressive Modeling for Visual Generation
🔥ImageFolder: Autoregressive Image Generation with Folded Tokens
🔥🔥🔥 A curated list of papers on LLM-based multimodal generation (image, video, 3D and audio).
AL-Ref-SAM 2: Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation
A paper list of recent work on token compression for ViT and VLM
A Simple Yet Unified Self-supervised Pre-training Strategy for LiDAR-Camera 3D Perception
📖 This is a repository for organizing papers, code and other resources related to unified multimodal models.
Repository for Show-o, One Single Transformer to Unify Multimodal Understanding and Generation.
Official PyTorch Implementation of "Scalable Autoregressive Image Generation with Mamba"
[Official Implementation] Acoustic Autoregressive Modeling 🔥
Text- and image-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Official Implementation of "Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining"