Contributors: todo
- Foundation Models in Robotics: Applications, Challenges, and the Future [paper]
- Foundation Models for Decision Making: Problems, Methods, and Opportunities [paper]
- Awesome-LLM [project]
- GPT-3: Language Models are Few-Shot Learners [paper]
- GPT-4 Technical Report [paper]
- LLaMA: Open and Efficient Foundation Language Models [paper]
- Llama 2: Open Foundation and Fine-Tuned Chat Models [paper]
- Mistral 7B [paper]
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation [paper]
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models [paper]
- Learning Transferable Visual Models From Natural Language Supervision [paper]
- Visual Instruction Tuning [paper]
- Improved Baselines with Visual Instruction Tuning [paper]
- Flamingo: a Visual Language Model for Few-Shot Learning [paper]
- LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention [paper]
- PandaGPT: One Model To Instruction-Follow Them All [paper]
- OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models [paper]
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning [paper]
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [paper]
- MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models [paper]
- ShareGPT4V: Improving Large Multi-Modal Models with Better Captions [paper]
- Learning Video Representations from Large Language Models [paper]
- VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset [paper]
- Otter: A Multi-Modal Model with In-Context Instruction Tuning [paper]
- Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks [paper]
- Valley: Video Assistant with Large Language model Enhanced abilitY [paper]
- Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration [paper]
- World Model on Million-Length Video And Language With Blockwise RingAttention [paper]
- Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding [paper]
- LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models [paper]
- VideoChat: Chat-Centric Video Understanding [paper]
- Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding [paper]
- Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models [paper]
- Video-LLaVA: Learning United Visual Representation by Alignment Before Projection [paper]
- PG-Video-LLaVA: Pixel Grounding Large Video-Language Models [paper]
- GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation [paper]
- VirtualHome: Simulating Household Activities via Programs [paper]
- Gibson Env: Real-World Perception for Embodied Agents [paper]
- iGibson 2.0: Object-Centric Simulation for Robot Learning of Everyday Household Tasks [paper]
- Habitat: A Platform for Embodied AI Research [paper]
- Habitat 2.0: Training Home Assistants to Rearrange their Habitat [paper]
- Habitat 3.0: A Co-Habitat for Humans, Avatars and Robots [paper]
- AI2-THOR: An Interactive 3D Environment for Visual AI [paper]
- RoboTHOR: An Open Simulation-to-Real Embodied AI Platform [paper]
- BEHAVIOR-1K: A Benchmark for Embodied AI with 1,000 Everyday Activities and Realistic Simulation [paper]
- ThreeDWorld: A High-Fidelity, Multi-Modal Platform for Interactive Physical Simulation [paper]
- LIBERO: Benchmarking Knowledge Transfer in Lifelong Robot Learning [paper]
- ProcTHOR: Large-Scale Embodied AI Using Procedural Generation [paper]
- PyBullet: Physics Simulation for Games, Visual Effects, Robotics and Reinforcement Learning [project]
- Scaling Egocentric Vision: The EPIC-KITCHENS Dataset [paper]
- Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100 [paper]
- Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives [paper]
- BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation [paper]
- Ego4D: Around the World in 3,000 Hours of Egocentric Video [paper]
- Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos [paper]
- Delving into Egocentric Actions [paper]
- Ego-Topo: Environment Affordances From Egocentric Video [paper]
- OtterHD: A High-Resolution Multi-modality Model [paper]
- 3D-LLM: Injecting the 3D World into Large Language Models [paper]
- Reward Design with Language Models [paper]
- Do As I Can, Not As I Say: Grounding Language in Robotic Affordances [paper]
- Inner Monologue: Embodied Reasoning through Planning with Language Models [paper]
- Text2Motion: From Natural Language Instructions to Feasible Plans [paper]
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [paper]
- ProgPrompt: Generating Situated Robot Task Plans using Large Language Models [paper]
- Code as Policies: Language Model Programs for Embodied Control [paper]
- ChatGPT for Robotics: Design Principles and Model Abilities [paper]
- LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action [paper]
- Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments [paper]
- L3MVN: Leveraging Large Language Models for Visual Target Navigation [paper]
- HomeRobot: Open-Vocabulary Mobile Manipulation [paper]
- RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation [paper]
- Statler: State-Maintaining Language Models for Embodied Reasoning [paper]
- Collaborating with Language Models for Embodied Reasoning [paper]
- EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought [paper]
- MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge [paper]
- Voyager: An Open-Ended Embodied Agent with Large Language Models [paper]
- Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory [paper]
- Guiding Pretraining in Reinforcement Learning with Large Language Models [paper]
- Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [paper]
- Language-Driven Representation Learning for Robotics [paper]
- R3M: A Universal Visual Representation for Robot Manipulation [paper]
- VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training [paper]
- LIV: Language-Image Representations and Rewards for Robotic Control [paper]
- Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation [paper]
- DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning [paper]
- Masked Visual Pre-training for Motor Control [paper]
- Real-World Robot Learning with Masked Visual Pre-training [paper]
- RT-1: Robotics Transformer for Real-World Control at Scale [paper]
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [paper]
- Open X-Embodiment: Robotic Learning Datasets and RT-X Models [paper]
- PaLM-E: An Embodied Multimodal Language Model [paper]
- PaLI-X: On Scaling up a Multilingual Vision and Language Model [paper]
- A Generalist Agent [paper]