Stars
DSPy: The framework for programming—not prompting—foundation models
A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.
The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM".
Official repo for paper "MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions"
Code for "Uni-MoE: Scaling Unified Multimodal Models with Mixture of Experts"
The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use th…
✨✨Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Accelerating the development of large multimodal models (LMMs) with lmms-eval
Long Context Transfer from Language to Vision
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
GPT4V-level open-source multi-modal model based on Llama3-8B
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
MathVista: data, code, and evaluation for Mathematical Reasoning in Visual Contexts
Open-source evaluation toolkit for large vision-language models (LVLMs), supporting ~100 VLMs and 40+ benchmarks
[ECCV 2024] ShareGPT4V: Improving Large Multi-modal Models with Better Captions
An open-source implementation for training LLaVA-NeXT.
LLaVA-NeXT-Image-Llama3-Lora, modified from https://github.com/arielnlee/LLaVA-1.6-ft
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
🔥🔥🔥Latest Papers, Codes and Datasets on Vid-LLMs.
🔎 Monitor deep learning model training and hardware usage from your mobile phone 📱
Project Page for "LISA: Reasoning Segmentation via Large Language Model"
Official repo for "AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability"
[NeurIPS 2023] T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation
⚡ Dynamically generated stats for your GitHub READMEs
PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"