Contributors: todo
- Foundation Models in Robotics: Applications, Challenges, and the Future [paper]
- Foundation Models for Decision Making: Problems, Methods, and Opportunities [paper]
- Awesome-LLM [project]
- GPT-3: Language Models are Few-Shot Learners [paper]
- GPT-4 Technical Report [paper]
- LLaMA: Open and Efficient Foundation Language Models [paper]
- Llama 2: Open Foundation and Fine-Tuned Chat Models [paper]
- Mistral 7B [paper]
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation [paper]
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models [paper]
- Learning Transferable Visual Models From Natural Language Supervision [paper]
- Visual Instruction Tuning [paper]
- Improved Baselines with Visual Instruction Tuning [paper]
- Flamingo: a Visual Language Model for Few-Shot Learning [paper]
- LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention [paper]
- PandaGPT: One Model To Instruction-Follow Them All [paper]
- OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models [paper]
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning [paper]
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [paper]
- MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models [paper]
- ShareGPT4V: Improving Large Multi-Modal Models with Better Captions [paper]
- Learning Video Representations from Large Language Models [paper]
- VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset [paper]
- Otter: A Multi-Modal Model with In-Context Instruction Tuning [paper]
- Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks [paper]
- Valley: Video Assistant with Large Language model Enhanced abilitY [paper]
- Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration [paper]
- World Model on Million-Length Video And Language With Blockwise RingAttention [paper]
- Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding [paper]
- LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models [paper]
- VideoChat: Chat-Centric Video Understanding [paper]
- Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding [paper]
- Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models [paper]
- Video-LLaVA: Learning United Visual Representation by Alignment Before Projection [paper]
- PG-Video-LLaVA: Pixel Grounding Large Video-Language Models [paper]
- GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation [paper]
- VirtualHome: Simulating Household Activities via Programs [paper]
- Gibson Env: Real-World Perception for Embodied Agents [paper]
- iGibson 2.0: Object-Centric Simulation for Robot Learning of Everyday Household Tasks [paper]
- Habitat: A Platform for Embodied AI Research [paper]
- Habitat 2.0: Training Home Assistants to Rearrange their Habitat [paper]
- Habitat 3.0: A Co-Habitat for Humans, Avatars and Robots [paper]
- AI2-THOR: An Interactive 3D Environment for Visual AI [paper]
- RoboTHOR: An Open Simulation-to-Real Embodied AI Platform [paper]
- BEHAVIOR-1K: A Benchmark for Embodied AI with 1,000 Everyday Activities and Realistic Simulation [paper]
- ThreeDWorld: A High-Fidelity, Multi-Modal Platform for Interactive Physical Simulation [paper]
- LIBERO: Benchmarking Knowledge Transfer in Lifelong Robot Learning [paper]
- ProcTHOR: Large-Scale Embodied AI Using Procedural Generation [paper]
- PyBullet: Physics Simulation for Games, Visual Effects, Robotics and Reinforcement Learning [project]
- Scaling Egocentric Vision: The EPIC-KITCHENS Dataset [paper]
- Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100 [paper]
- Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives [paper]
- BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation [paper]
- Ego4D: Around the World in 3,000 Hours of Egocentric Video [paper]
- Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos [paper]
- Delving into Egocentric Actions [paper]
- Ego-Topo: Environment Affordances From Egocentric Video [paper]
- OtterHD: A High-Resolution Multi-modality Model [paper]
- 3D-LLM: Injecting the 3D World into Large Language Models [paper]
- Reward Design with Language Models [paper]
- Do As I Can, Not As I Say: Grounding Language in Robotic Affordances [paper]
- Inner Monologue: Embodied Reasoning through Planning with Language Models [paper]
- Text2Motion: From Natural Language Instructions to Feasible Plans [paper]
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [paper]
- ProgPrompt: Generating Situated Robot Task Plans using Large Language Models [paper]
- Code as Policies: Language Model Programs for Embodied Control [paper]
- ChatGPT for Robotics: Design Principles and Model Abilities [paper]
- LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action [paper]
- Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments [paper]
- L3MVN: Leveraging Large Language Models for Visual Target Navigation [paper]
- HomeRobot: Open-Vocabulary Mobile Manipulation [paper]
- RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation [paper]
- Statler: State-Maintaining Language Models for Embodied Reasoning [paper]
- Collaborating with Language Models for Embodied Reasoning [paper]
- EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought [paper]
- MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge [paper]
- Voyager: An Open-Ended Embodied Agent with Large Language Models [paper]
- Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory [paper]
- Guiding Pretraining in Reinforcement Learning with Large Language Models [paper]
- Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [paper]
- Language-Driven Representation Learning for Robotics [paper]
- R3M: A Universal Visual Representation for Robot Manipulation [paper]
- VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training [paper]
- LIV: Language-Image Representations and Rewards for Robotic Control [paper]
- Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation [paper]
- DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning [paper]
- Masked Visual Pre-training for Motor Control [paper]
- Real-World Robot Learning with Masked Visual Pre-training [paper]
- RT-1: Robotics Transformer for Real-World Control at Scale [paper]
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [paper]
- Open X-Embodiment: Robotic Learning Datasets and RT-X Models [paper]
- PaLM-E: An Embodied Multimodal Language Model [paper]
- PaLI-X: On Scaling up a Multilingual Vision and Language Model [paper]
- A Generalist Agent [paper]