Learn the colorful world (Vision/Audio/Robotic) from LLM

rese1f/Awesome-Colorful-LLM


Awesome-Colorful Large Language Model

A curated list of Large Language Models ➕ Vision/Audio/Robotics, and Augmented Language Models (action, reasoning).

CONTENTS

VISION

Benchmarks

| Benchmark | Task | Data | Paper | Preprint | Publication | Affiliation |
| --- | --- | --- | --- | --- | --- | --- |
| INFOSEEK | VQA | OVEN (open-domain images) + human annotations | Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? | 2302.11713 | - | Google |

Image Language Model

Reading List

| Paper | Base Language Model | Code | Publication | Preprint | Affiliation |
| --- | --- | --- | --- | --- | --- |
| Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models | LLaMA | LaVIN | - | 2305.15023 | Xiamen Univ. |
| VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks | Alpaca | VisionLLM | - | 2305.11175 | Shanghai AI Lab. |
| Otter: A Multi-Modal Model with In-Context Instruction Tuning | Flamingo | Otter | - | 2305.03726 | NTU |
| X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | ChatGPT | X-LLM | - | 2305.04160 | CAS |
| Multimodal Procedural Planning via Dual Text-Image Prompting | OFA, BLIP, GPT3 | TIP | - | 2305.01795 | UCSB |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | LLaMA | LLaMA-Adapter | - | 2304.15010 | Shanghai AI Lab. |
| mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | LLaMA | mPLUG-Owl, mPLUG, mPLUG-2 | - | 2304.14178 | DAMO Academy |
| MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models | Vicuna | MiniGPT-4 | - | 2304.github | KAUST |
| Visual Instruction Tuning | LLaMA | LLaVA | - | 2304.08485 | Microsoft |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | ChatGPT | MM-REACT | - | 2303.11381 | Microsoft |
| ViperGPT: Visual Inference via Python Execution for Reasoning | Codex | ViperGPT | - | 2303.08128 | Columbia |
| Scaling Vision-Language Models with Sparse Mixture of Experts (MoE + scaling) | - | - | - | 2303.07226 | Microsoft |
| ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions | ChatGPT, Flan-T5 (BLIP2) | ChatCaptioner | - | 2303.06594 | KAUST |
| Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models | ChatGPT | Visual ChatGPT | - | 2303.04671 | Microsoft |
| PaLM-E: An Embodied Multimodal Language Model | PaLM | - | - | 2303.03378 | Google |
| Prismer: A Vision-Language Model with An Ensemble of Experts | RoBERTa, OPT, BLOOM | Prismer | - | 2303.02506 | Nvidia |
| Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering | GPT3 | Prophet | CVPR 2023 | 2303.01903 | HDU |
| Language Is Not All You Need: Aligning Perception with Language Models | Magneto | KOSMOS-1 | - | 2302.14045 | Microsoft |
| Scaling Vision Transformers to 22 Billion Parameters (CLIP + scaling) | - | - | - | 2302.05442 | Google |
| Multimodal Chain-of-Thought Reasoning in Language Models | T5 | MM-COT | - | 2302.00923 | Amazon |
| Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning | RETRO | - | - | 2302.04858 | Nvidia |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Flan-T5 | BLIP2 | ICML 2023 | 2301.12597 | Salesforce |
| See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning | OPT | - | - | 2301.05226 | MIT-IBM |
| Generalized Decoding for Pixel, Image, and Language | GPT3 | X-GPT | - | 2212.11270 | Microsoft |
| From Images to Textual Prompts: Zero-shot Visual Question Answering with Frozen Large Language Models | OPT | Img2LLM | CVPR 2023 | 2212.10846 | Salesforce |
| Language Models are General-Purpose Interfaces | DeepNorm | METALM | - | 2206.06336 | Microsoft |
| Language Models Can See: Plugging Visual Controls in Text Generation | GPT2 | MAGIC | - | 2205.02655 | Tencent |
| Flamingo: a Visual Language Model for Few-Shot Learning | Chinchilla | Flamingo | NIPS 2022 | 2204.14198 | DeepMind |
| An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | GPT3 | PICa | AAAI 2022 | 2109.05014 | Microsoft |
| Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | GPT3, RoBERTa | Socratic Models | ICLR 2023 | 2204.00598 | Google |
| Learning Transferable Visual Models From Natural Language Supervision | Bert | CLIP | ICML 2021 | 2103.00020 | OpenAI |

Dataset

Video Language Model

Reading List

| Paper | Base Language Model | Code | Publication | Preprint | Affiliation |
| --- | --- | --- | --- | --- | --- |
| Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration | LLaMA | Macaw-LLM | - | 2305.github | Tencent |
| Video-LLaMA: An Instruction-Finetuned Visual Language Model for Video Understanding | LLaMA | Video-LLaMA | - | 2305.github | Alibaba |
| Self-Chained Image-Language Model for Video Localization and Question Answering | BLIP2 | SeViLA | - | 2305.06988 | UNC |
| VideoChat: Chat-Centric Video Understanding | BLIP2 | VideoChat | - | 2305.06355 | Shanghai AI Lab |
| X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | ChatGPT | X-LLM | - | 2305.04160 | CAS |
| VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | Bert | VALOR | - | 2304.08345 | UCAS |
| Verbs in Action: Improving verb understanding in video-language models | PaLM | - | - | 2304.06708 | Google |
| Video ChatCaptioner: Towards the Enriched Spatiotemporal Descriptions | ChatGPT, Flan-T5 (BLIP2) | ChatCaptioner | - | 2304.04227 | KAUST |
| Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering | GPT2, GPT-Neo, GPT3 | - | CVPR 2023 workshop | 2304.03754 | Columbia Univ. |
| Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | T5 | Vid2Seq | - | 2302.14115 | Google |
| HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | Bert | - | - | 2212.14546 | Alibaba |
| VindLU: A Recipe for Effective Video-and-Language Pretraining | Bert | VindLU | - | 2212.05051 | UNC |
| SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training | Bert | - | - | 2211.11446 | UW |
| CLOP: Video-and-Language Pre-Training with Knowledge Regularizations | Roberta | - | MM 2022 | 2211.03314 | Baidu |
| Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning | Bert | - | NIPS 2022 | 2210.06031 | Microsoft |
| OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | Bert | - | NIPS 2022 | 2209.07526 | Microsoft |
| Clover: Towards A Unified Video-Language Alignment and Fusion Model | Bert | Clover | - | 2207.07885 | Bytedance |
| LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling | Bert-like | LAVENDER | CVPR 2023 | 2206.07160 | Microsoft |
| Revealing Single Frame Bias for Video-and-Language Learning | Bert | Singularity | - | 2206.03428 | UNC |
| Flamingo: a Visual Language Model for Few-Shot Learning | Chinchilla | Flamingo | NIPS 2022 | 2204.14198 | DeepMind |
| All in One: Exploring Unified Video-Language Pre-training | Bert-like | All-In-One | CVPR 2023 | 2203.07303 | NUS |
| End-to-end Generative Pretraining for Multimodal Video Captioning | Bert+GPT2 | - | CVPR 2022 | 2201.08264 | Google |
| Align and Prompt: Video-and-Language Pre-training with Entity Prompts | Bert-like | ALPRO | CVPR 2022 | 2112.09583 | Salesforce |
| VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling (and V2) | Bert | VIOLET | - | 2111.12681 | Microsoft |
| VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding | Bert | VideoCLIP | EMNLP 2021 | 2109.14084 | Facebook |
| MERLOT: Multimodal Neural Script Knowledge Models (and V2) | Roberta | MERLOT | NIPS 2021 | 2106.02636 | AI2 |
| VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding | Bert | VLP | ACL Findings 2021 | 2105.09996 | Facebook |
| VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | Bert-like | - | NIPS 2021 | 2104.11178 | Google |
| CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | Bert-like | CLIP4Clip | Neurocomputing 2022 | 2104.08860 | Microsoft |
| Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | Bert | Frozen-in-Time | ICCV 2021 | 2104.00650 | Oxford |
| Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling | Bert | ClipBert | CVPR 2021 | 2102.06183 | Microsoft |
| ActBERT: Learning Global-Local Video-Text Representations | Bert | ActBert | CVPR 2020 | 2011.07231 | Baidu |
| Video Understanding as Machine Translation | T5 | - | - | 2006.07203 | Facebook |
| HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training | Bert | HERO | EMNLP 2020 | 2005.00200 | Microsoft |
| UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation | Bert | UniVL | - | 2002.06353 | Microsoft |
| Learning Video Representations using Contrastive Bidirectional Transformer | Bert | - | - | 1906.05743 | Google |
| VideoBERT: A Joint Model for Video and Language Representation Learning | Bert | VideoBert (non-official) | ICCV 2019 | 1904.01766 | Google |

Pretraining Tasks

Commonly Used Pretraining Tasks
  • Masked Language Modeling (MLM)
  • Causal Language Modeling (LM)
  • Masked Vision Modeling (MVM)
    • Vision = Frame
    • Vision = Patch
    • Vision = Object
  • Video-Language Matching (VLM)
  • Video-Language Contrastive learning (VLC)
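As an illustration of the first task above, here is a minimal sketch of BERT-style MLM input corruption. The function name, the 15% masking ratio, and the `[MASK]` token are the usual conventions, not specifics of any paper in this list, and real implementations also substitute random tokens and operate on token IDs rather than strings:

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """MLM-style corruption: hide ~15% of tokens and record the
    originals as (position, token) prediction targets."""
    rng = random.Random(seed)
    masked, targets = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)      # model must reconstruct this token
            targets.append((i, tok))       # ground truth for the loss
        else:
            masked.append(tok)             # left visible as context
    return masked, targets

masked, targets = mask_tokens("a person slices a tomato on the board".split())
```

The model is then trained to predict each target token from the surrounding (and, in video-language pretraining, visual) context; MVM applies the same recipe to frame, patch, or object features instead of words.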

Datasets

Commonly Used Video Corpora for Pretraining

| Paper | Video Clips | Clip Duration | Sentences | Domain | Download Link |
| --- | --- | --- | --- | --- | --- |
| Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | 2.5M | 18s | 2.5M | open | WebVid-2M |
| HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | 136M | 4s | 136M | instruction | HowTo100M |
| MERLOT: Multimodal Neural Script Knowledge Models | 6M | ~20m | ~720M | open | YT-Temporal-180M |
| Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | 100M | 13.4s | 100M | open | HD-VILA-100M |
| CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos | 18M | 60s | - | open | YTD-18M |
Commonly Used Downstream Tasks

| Task | Paper | Download Link | Publication |
| --- | --- | --- | --- |
| Retrieval | Collecting Highly Parallel Data for Paraphrase Evaluation | MSVD | ACL 2011 |
| Retrieval | A Dataset for Movie Description | LSMDC | CVPR 2015 |
| Retrieval | MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | MSR-VTT | CVPR 2016 |
| Retrieval | Localizing Moments in Video with Natural Language | DiDeMo | ICCV 2017 |
| Retrieval | Dense-Captioning Events in Videos | ActivityNet Caption | ICCV 2017 |
| Retrieval | Towards Automatic Learning of Procedures from Web Instructional Videos | YouCook2 | AAAI 2018 |
| OE QA | TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | TGIF-Frame | CVPR 2017 |
| OE QA | A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering | LSMDC-FiB | CVPR 2017 |
| OE QA | Video Question Answering via Gradually Refined Attention over Appearance and Motion | MSRVTT-QA, MSVD-QA | MM 2017 |
| OE QA | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | ActivityNet-QA | AAAI 2019 |
| MC QA | Learning Language-Visual Embedding for Movie Understanding with Natural-Language | LSMDC-MC | - |
| MC QA | TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | TGIF-Action, TGIF-Transition | CVPR 2017 |
| MC QA | A Joint Sequence Fusion Model for Video Question Answering and Retrieval | MSRVTT-MC | ECCV 2018 |
| Caption | Collecting Highly Parallel Data for Paraphrase Evaluation | MSVD | ACL 2011 |
| Caption | MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | MSR-VTT | CVPR 2016 |
| Dense Caption | Dense-Captioning Events in Videos | ActivityNet Caption | ICCV 2017 |
| Dense Caption | Towards Automatic Learning of Procedures from Web Instructional Videos | YouCook2 | AAAI 2018 |
| Dense Caption | Multimodal Pretraining for Dense Video Captioning | ViTT | AACL 2020 |

(OE QA = open-ended question answering; MC QA = multiple-choice question answering.)
Advanced Video Language Tasks
| Paper | Task | Duration | Domain | Link | Publication |
| --- | --- | --- | --- | --- | --- |
| From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering | Video QA | 9s | open | Causal-VidQA | CVPR 2022 |
| VIOLIN: A Large-Scale Dataset for Video-and-Language Inference | Video-Language Inference | 35.2s | movie | VIOLIN | CVPR 2020 |
| TVQA: Localized, Compositional Video Question Answering | Video QA | 60-90s | movie | TVQA | EMNLP 2018 |
| AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning | Video QA | 30s | open | AGQA | CVPR 2021 |
| NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions | Video QA | 44s | open | NExT-QA-MC, NExT-QA-OE | CVPR 2021 |
| STAR: A Benchmark for Situated Reasoning in Real-World Videos | Video QA | 12s | open | Star | NIPS 2021 |
| Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments | Video QA | 20s | virtual env. | Env-QA | ICCV 2021 |
| Social-IQ: A Question Answering Benchmark for Artificial Social Intelligence | Video QA | 60s | open | Social-IQ | CVPR 2019 |

Image Generation

Reading List

Tutorials

Other Curated Lists

Model:

Dataset:

Audio

Other Curated Lists

Robotic

Reading List

| Paper | Base Language Model | Code | Publication | Preprint | Affiliation |
| --- | --- | --- | --- | --- | --- |
| Chat with the Environment: Interactive Multimodal Perception using Large Language Models | GPT3 | - | - | 2303.08268 | Universität Hamburg |

Other Curated Lists

Augmented Language Model

Reading List

Survey

  • (2023-04) Tool Learning with Foundation Models paper
  • (2023-02) Augmented Language Models: a Survey paper

Reading List

| Paper | LLM | Code | Publication | Preprint | Affiliation |
| --- | --- | --- | --- | --- | --- |
| LLM+P: Empowering Large Language Models with Optimal Planning Proficiency | GPT4 | LLM-PDDL | - | 2304.11477 | UTEXAS |
| Can GPT-4 Perform Neural Architecture Search? | GPT4 | GENIUS | - | 2304.10970 | Cambridge |
| Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | GPT4 | Chameleon | - | 2304.09842 | Microsoft |
| OpenAGI: When LLM Meets Domain Experts | ChatGPT | OpenAGI | - | 2304.04370 | Rutgers Univ. |
| HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace | ChatGPT | JARVIS | - | 2303.17580 | Microsoft |
| Language Models can Solve Computer Tasks | ChatGPT, GPT3, etc. | RCI Agent | - | 2303.17491 | CMU |
| TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs | ChatGPT | TaskMatrix | - | 2303.16434 | Microsoft |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | ChatGPT | MM-REACT | - | 2303.11381 | Microsoft |
| ART: Automatic multi-step reasoning and tool-use for large language models | GPT3, Codex | Language-Programmes | - | 2303.09014 | Microsoft |
| Foundation Models for Decision Making: Problems, Methods, and Opportunities | - | - | - | 2303.04129 | Google |
| Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback | ChatGPT | LLM-Augmenter | - | 2302.12813 | Microsoft |
| Toolformer: Language Models Can Teach Themselves to Use Tools | GPT-J, OPT, GPT3 | Toolformer (unofficial) | - | 2302.04761 | Meta |
| Visual Programming: Compositional visual reasoning without training | GPT3 | VisProg | CVPR 2023 | 2211.11559 | AI2 |

Projects

Other Curated Lists

Related

Contributing

Please feel free to open a pull request or drop me an email.
