Stars
A PyTorch implementation of MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models
A bibliography and survey of the papers surrounding o1
Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision
O1 Replication Journey: A Strategic Progress Report – Part I
A paper list of recent works on token compression for ViT and VLM
[NeurIPS 2024] Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
✨✨ MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
Textureless Underwater Real Time Localization and Mapping
An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
[CVPR 2024] Situational Awareness Matters in 3D Vision Language Reasoning
We propose CRKD to bridge the performance gap between LC and CR detectors with a novel cross-modality knowledge distillation (KD) framework.
Open-MAGVIT2: Democratizing Autoregressive Visual Generation
Official repository of the CVPR 2024 paper RMem: Restricted Memory Banks Improve Video Object Segmentation
Taming Transformers for High-Resolution Image Synthesis
Official repository for "AM-RADIO: Reduce All Domains Into One"
[ICCV 2023] Multi3DRefer: Grounding Text Description to Multiple 3D Objects
Accelerating the development of large multimodal models (LMMs) with lmms-eval
[COLM-2024] List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
Reaching LLaMA2 Performance with 0.1M Dollars
✨✨Latest Advances on Multimodal Large Language Models
[ECCV2024] API code for T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy
[NeurIPS 2024 Oral][GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ultra-sim…
Efficient vision foundation models for high-resolution generation and perception.
[NeurIPS 2024] SWE-agent takes a GitHub issue and tries to automatically fix it, using GPT-4, or your LM of choice. It can also be employed for offensive cybersecurity or competitive coding challen…