Multimodal
ImageBind: One Embedding Space to Bind Them All
Learning audio concepts from natural language supervision
An open source implementation of CLIP.
Implementation of PALI3 from the paper "PaLI-3 Vision Language Models: Smaller, Faster, Stronger"
Prompt Learning for Vision-Language Models (IJCV'22, CVPR'22)
CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image
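The CLIP entry above describes matching an image to the most relevant text snippet. A minimal sketch of that scoring step, using random NumPy vectors as stand-ins for real encoder outputs (the function name, embeddings, and temperature value here are illustrative assumptions, not CLIP's actual API):

```python
import numpy as np

def clip_zero_shot_scores(image_emb, text_embs, temperature=0.07):
    """CLIP-style zero-shot scoring: cosine similarity between one image
    embedding and several text embeddings, softmaxed over the texts.
    Real CLIP produces these embeddings with separate image/text encoders."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = txt @ img / temperature      # scaled cosine similarities
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

# Toy embeddings (random stand-ins, not real encoder outputs).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=8)
text_embs = rng.normal(size=(3, 8))
probs = clip_zero_shot_scores(image_emb, text_embs)
```

The text with the highest probability would be the predicted caption; the low temperature sharpens the distribution, as in CLIP's learned logit scale.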
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
Open-sourced codes for MiniGPT-4 and MiniGPT-v2 (https://minigpt-4.github.io, https://minigpt-v2.github.io/)
LAVIS - A One-stop Library for Language-Vision Intelligence
✨✨Latest Advances on Multimodal Large Language Models
SpeechGPT Series: Speech Large Language Models
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
The official repo of Qwen-VL (通义千问-VL), the chat & pretrained large vision-language model proposed by Alibaba Cloud.