linzhenyuyuchen's starred repositories
Accelerating the development of large multimodal models (LMMs) with lmms-eval
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4V. A commercially usable open-source multimodal dialogue model approaching GPT-4V performance.
Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"
[ICLR 2024] Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Recent LLM-based CV and related works. Welcome to comment/contribute!
A general representation model across vision, audio, language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
🔥🔥🔥Latest Papers, Codes and Datasets on Vid-LLMs.
Mora: More like Sora for Generalist Video Generation
Implementation of paper - YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
OpenMMLab YOLO series toolbox and benchmark. Implemented RTMDet, RTMDet-Rotated, YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLOX, PPYOLOE, etc.
YOLOX is a high-performance anchor-free YOLO, exceeding YOLOv3~v5, with MegEngine, ONNX, TensorRT, ncnn, and OpenVINO supported. Documentation: https://yolox.readthedocs.io/
[CVPR 2024] Real-Time Open-Vocabulary Object Detection
[CVPR 2024 Highlight] Monkey (LMM): Image Resolution and Text Label Are Important Things for Large Multi-modal Models
[CVPR 2024] Generative Region-Language Pretraining for Open-Ended Object Detection
[ECCV 2024] Code for paper: An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
Official GitHub repository for the paper "LingoQA: Video Question Answering for Autonomous Driving"
[ECCV 2024] Embodied Understanding of Driving Scenarios
CLIP+MLP Aesthetic Score Predictor
[CVPR 2024] Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
[CVPR 2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Official repository of NEFTune: Noisy Embeddings Improve Instruction Finetuning
Gradio demo used in our "Osprey: Pixel Understanding with Visual Instruction Tuning"
[CVPR 2024] The code for "Osprey: Pixel Understanding with Visual Instruction Tuning"