Stars
The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM".
EVE: Encoder-Free Vision-Language Models from BAAI
[ECCV2024] API code for T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy
Odyssey: Empowering Agents with Open-World Skills
🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
A state-of-the-art open visual language model | multimodal pre-trained model
LoRA-Composer: Leveraging Low-Rank Adaptation for Multi-Concept Customization in Training-Free Diffusion Models
PixelLM is an effective and efficient LMM for pixel-level reasoning and understanding. PixelLM was accepted to CVPR 2024.
Layout analysis for Chinese and English documents
Dataset and Code for our ACL 2024 paper: "Multimodal Table Understanding". We propose the first large-scale Multimodal IFT and Pre-Train Dataset for table understanding and develop a generalist tab…
Align Anything: Training Any Modality Model with Feedback
A family of compressed models obtained via pruning and knowledge distillation
AI agent using GPT-4V(ision) capable of using a mouse/keyboard to interact with web UI
Official repository for the paper "MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning" (https://arxiv.org/abs/2406.17770).
Awesome OVD-OVS - A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future
A minimal codebase for finetuning large multimodal models, supporting llava-1.5, qwen-vl, llava-interleave, llava-next-video, phi3-v etc.
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
An Open-source Toolkit for LLM Development
Official code for Paper "Mantis: Multi-Image Instruction Tuning"
Text-Guided Generation of Full-Body Image with Preserved Reference Face for Customized Animation
[CVPR 2024] Code release for "Unsupervised Universal Image Segmentation"
DeepSeek-VL: Towards Real-World Vision-Language Understanding
The official repo of Qwen-VL (通义千问-VL), a chat and pretrained large vision-language model proposed by Alibaba Cloud.
GPT4V-level open-source multi-modal model based on Llama3-8B