Stars
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
Official repo of the paper "MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos"
📖 A curated list of resources dedicated to hallucination of multimodal large language models (MLLM).
Large Language Models are Temporal and Causal Reasoners for Video Question Answering (EMNLP 2023)
NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions (CVPR'21)
Official repository of NEFTune: Noisy Embeddings Improve Instruction Finetuning
This includes the original implementation of SELF-RAG: Learning to Retrieve, Generate and Critique through self-reflection by Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi.
Official code for GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation
Awesome-LLM-RAG: a curated list of advanced retrieval augmented generation (RAG) in Large Language Models
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
[NeurIPS 2023] Exploring Diverse In-Context Configurations for Image Captioning
[ICLR 2024, Spotlight] Sentence-level Prompts Benefit Composed Image Retrieval
[CVPR 2024] Official PyTorch implementation of the paper "One For All: Video Conversation is Feasible Without Video Instruction Tuning"
[CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
Official Implementation for LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
torch_quantizer is an out-of-the-box quantization tool for PyTorch models on the CUDA backend, specially optimized for diffusion models.
A collection of parameter-efficient transfer learning papers focusing on computer vision and multimodal domains.
FastGPT is a knowledge-based platform built on LLMs that offers a comprehensive suite of out-of-the-box capabilities such as data processing, RAG retrieval, and visual AI workflow orchestration, le…
Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models".
Cross-modal few-shot adaptation with CLIP
ImageBind One Embedding Space to Bind Them All
Code for visualizing the loss landscape of neural nets
[CVPR 2024] 🎬💭 chat with over 10K frames of video!
[CVPR 2024] A framework to fine-tune LLaMAs on instruction-following tasks and get many Stitched LLaMAs with a customized number of parameters, e.g., Stitched LLaMA 8B, 9B, and 10B...
arXiv LaTeX Cleaner: Easily clean the LaTeX code of your paper to submit to arXiv
A general representation model across vision, audio, and language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities