Apple AI/ML
- Cupertino, CA
- haotian-zhang.github.io/
- @HaotianZhang4AI
Stars
[ECCV 2024] Official Repository for DiffiT: Diffusion Vision Transformers for Image Generation
🔥 stable, simple, state-of-the-art VQVAE toolkit & cookbook
Code & Data for Grounded 3D-LLM with Referent Tokens
[CVPR 2024] "LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning"; an interactive Large Language 3D Assistant.
Open-source evaluation toolkit of large vision-language models (LVLMs), support GPT-4v, Gemini, QwenVLPlus, 50+ HF models, 20+ benchmarks
Code for 3D-LLM: Injecting the 3D World into Large Language Models
MiniCPM-2B: An end-side LLM outperforming Llama2-13B.
Implementation of Infini-Transformer in Pytorch
Lumina-T2X is a unified framework for Text to Any Modality Generation
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Vector (and Scalar) Quantization, in Pytorch
[CVPR 2024] 🎬💭 chat with over 10K frames of video!
Code for "AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling"
Code for V-IRL: Grounding Virtual Intelligence in Real Life
Taming Transformers for High-Resolution Image Synthesis
[GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ultra-simple, user-friendly …
Emu Series: Generative Multimodal Models from BAAI
Official implementation of SEED-LLaMA (ICLR 2024).
[CVPR 2024 Highlight] Monkey (LMM): Image Resolution and Text Label Are Important Things for Large Multi-modal Models
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"
[CVPR 2024 Highlight] GLEE: General Object Foundation Model for Images and Videos at Scale
When do we not need larger vision models?