Accelerating the development of large multimodal models (LMMs) with lmms-eval
Depth Anything V2: A More Capable Foundation Model for Monocular Depth Estimation
Generate interleaved text and image content in a structured format you can directly pass to downstream APIs.
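As a rough illustration of what "interleaved text and image content in a structured format" might look like, here is a minimal Python sketch. The class and field names (`TextPart`, `ImagePart`, `to_api_payload`) are hypothetical, not taken from any of the listed repositories; the serialized shape mirrors the content-parts convention used by several multimodal chat APIs.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class TextPart:
    text: str

@dataclass
class ImagePart:
    url: str  # hypothetical: an image reference; could equally be base64 data

@dataclass
class InterleavedMessage:
    # Ordered mix of text and image segments, preserving interleaving.
    parts: List[Union[TextPart, ImagePart]]

    def to_api_payload(self) -> list:
        """Serialize to a generic list-of-dicts format resembling the
        content-parts structure many multimodal APIs accept."""
        payload = []
        for p in self.parts:
            if isinstance(p, TextPart):
                payload.append({"type": "text", "text": p.text})
            else:
                payload.append({"type": "image_url",
                                "image_url": {"url": p.url}})
        return payload

msg = InterleavedMessage(parts=[
    TextPart("Here is the chart:"),
    ImagePart("https://example.com/chart.png"),
    TextPart("Note the upward trend."),
])
print(msg.to_api_payload())
```

Because the interleaving order is preserved in `parts`, the payload can be passed to a downstream API without re-assembling text around images.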
MINT-1T: A one trillion token multimodal interleaved dataset.
[ECCV 2024] UMBRAE: Unified Multimodal Brain Decoding | Unveiling the 'Dark Side' of Brain Modality
Official implementation of MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis
This repo contains the code for our paper An Image is Worth 32 Tokens for Reconstruction and Generation
Public code repo for paper "A Single Transformer for Scalable Vision-Language Modeling"
Code for Fast Training of Diffusion Models with Masked Transformers
Code for the paper DisCo-Diff: Enhancing Continuous Diffusion Models with Discrete Latents, ICML 2024
Official Implementation of ICLR'24: Kosmos-G: Generating Images in Context with Multimodal Large Language Models
Anole: An Open, Autoregressive, and Native Multimodal Model for Interleaved Image-Text Generation
The official PyTorch implementation of Google's Gemma models
The open-source tool for building high-quality datasets and computer vision models
Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR.
A PyTorch implementation of the paper "Revisiting Non-Autoregressive Transformers for Efficient Image Synthesis"
Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation
🔥🔥🔥 Latest papers, code, and datasets on Vid-LLMs.
Latte: Latent Diffusion Transformer for Video Generation.
Official implementation of FIFO-Diffusion: Generating Infinite Videos from Text without Training
A massively parallel, high-level programming language
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding
PyTorch implementation of "Brain Decodes Deep Nets"