Starred repositories
Code and data for the paper "Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation"
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
[CVPR 2024 Highlight][VideoChatGPT] ChatGPT with video understanding! Also supports many more LMs such as miniGPT4, StableLM, and MOSS.
Qwen2-VL is the multimodal large language model series developed by the Qwen team at Alibaba Cloud.
Code for AUTOPLAN (Acta Automatica Sinica): using large language models for task planning and task execution on complex tasks.
Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision
DAMO-ConvAI: the official repository containing the codebase for Alibaba DAMO Conversational AI.
PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models
A repository maintained by the MLNLP community to help authors avoid small mistakes in paper submissions. Paper Writing Tips
[NAACL 2024] Official implementation of the paper "Self-Adaptive Sampling for Efficient Video Question Answering on Image-Text Models"
Awesome papers & datasets specifically focused on long-term videos.
Code repository supporting the paper "Atlas: Few-shot Learning with Retrieval Augmented Language Models" (https://arxiv.org/abs/2208.03299)
Materials for the Hugging Face Diffusion Models Course
《动手做科研》(Hands-On Research) is aimed at research beginners, showing step by step how to get started in AI research.
Long Context Transfer from Language to Vision
This is the official code of VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding (ECCV 2024)
✨✨Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Code for paper "VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos"
(CVPR 2024) MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
Public release for "Explore until Confident: Efficient Exploration for Embodied Question Answering"
Language Repository for Long Video Understanding
A collection of awesome text-to-image generation studies.