Starred repositories
When do we not need larger vision models?
Official repository for the paper "MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning" (https://arxiv.org/abs/2406.17770).
i-SRT: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective Judgement
This repo contains the code for supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) designed for vision LLMs.
IVGSZ / Flash-VStream
Forked from IVG-SZ/Flash-VStream. This is the official implementation of "Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams"
📖 A curated list of resources dedicated to hallucination in multimodal large language models (MLLMs).
[arXiv] Aligning Modalities in Vision Large Language Models via Preference Fine-tuning
Enhancing Large Vision Language Models with Self-Training on Image Comprehension.
This repository contains demos I made with the Transformers library by HuggingFace.
LLaVA-NeXT-Image-Llama3-Lora, modified from https://github.com/arielnlee/LLaVA-1.6-ft
An open-source implementation of LLaVA-NeXT.
Official implementation of "LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference"
(CVPR 2024) MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
A lightweight, flexible Video-MLLM developed by the Tencent QQ Multimedia Research Team.
Scenic: A JAX Library for Computer Vision Research and Beyond
Long Context Transfer from Language to Vision
Implementation of the paper "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention" from Google, in PyTorch.
Code for "Uni-MoE: Scaling Unified Multimodal Models with Mixture of Experts"
This is the official repository of our paper "What If We Recaption Billions of Web Images with LLaMA-3?"
Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
Official Dataloader and Evaluation Scripts for LongVideoBench.
Official Implementation of "HMT: Hierarchical Memory Transformer for Long Context Language Processing"
An official implementation of ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
VILA - a multi-image visual language model with training, inference, and evaluation recipes, deployable from cloud to edge (Jetson Orin and laptops)
[ICLR 2024 Spotlight] Code for the paper "Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy"