Stars
[NeurIPS 2024] Matryoshka Query Transformer for Large Vision-Language Models
The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM".
A paper list of recent works on token compression for ViT and VLM
Official code implementation of General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
Official implementation of "LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference"
Effortless data labeling with AI support from Segment Anything and other awesome models.
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
✨✨ MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
[CVPR 2024 Highlight] Monkey (LMM): Image Resolution and Text Label Are Important Things for Large Multi-modal Models
[ECCV2024 Oral🔥] Official Implementation of "GiT: Towards Generalist Vision Transformer through Universal Language Interface"
LAVIS - A One-stop Library for Language-Vision Intelligence
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception
Public code repo for paper "A Single Transformer for Scalable Vision-Language Modeling"
[NeurIPS'24 Spotlight] EVE: Encoder-Free Vision-Language Models
Exploration of Adept's multimodal fuyu-8b model. 🤓 🔍
🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
OMG-LLaVA and OMG-Seg codebase [CVPR-24 and NeurIPS-24]
A curated list for Efficient Large Language Models
(ECCV 2024) Empowering Multimodal Large Language Model as a Powerful Data Generator
[ECCV 2024 Oral] Code for paper: An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning
[ECCV 2024] official code for "Long-CLIP: Unlocking the Long-Text Capability of CLIP"
An official implementation for "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"