Stars
A paper list on multimodal and large language models, used only to record papers I read in the daily arXiv feed for personal reference.
Official code for Interpret Your Decision: Logical Reasoning Regularization for Generalization in Visual Classification (NeurIPS 2024 Spotlight)
Official repo for paper "MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions"
Baichuan-Omni: Towards Capable Open-source Omni-modal LLM 🌊
Codebase for Aria - an Open Multimodal Native MoE
[Preprint] TRACE: Temporal Grounding Video LLM via Causal Event Modeling
Official Repository of paper VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models
Implementation of the paper: "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"
👾 E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding (NeurIPS 2024)
Reading notes about Multimodal Large Language Models, Large Language Models, and Diffusion Models
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
A family of open-sourced Mixture-of-Experts (MoE) Large Language Models
📰 Must-read papers on KV Cache Compression (constantly updating 🤗).
A collection of LLM papers, blogs, and projects, with a focus on OpenAI o1 and reasoning techniques.
Inference Code for Paper "Harder Tasks Need More Experts: Dynamic Routing in MoE Models"
Writing AI Conference Papers: A Handbook for Beginners
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
A VideoQA dataset based on the videos from ActivityNet
Qwen2-VL is the multimodal large language model series developed by the Qwen team at Alibaba Cloud.
Video Question Answering via Gradually Refined Attention over Appearance and Motion
✨✨VITA: Towards Open-Source Interactive Omni Multimodal LLM
Triton-based implementation of Sparse Mixture of Experts.
VILA - a multi-image visual language model with training, inference, and evaluation recipes, deployable from cloud to edge (Jetson Orin and laptops)
mPLUG-Owl: The Powerful Multi-modal Large Language Model Family
Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in PyTorch
Long Context Transfer from Language to Vision