READING LIST

πŸ”₯ πŸ’₯ πŸ’’ ❗ πŸŒ€ 🌊 πŸ’¦ πŸ’§ 🐒 🐸 🐍 πŸ› 🐏 πŸ‘ πŸ‡ 🐐 🐳 πŸ‹ 🐟 🐬 🌹 🍁 πŸ‚ πŸ„ 🍺 🍻 ✈️ β›² πŸš… πŸš€ 🚣 🚀

🌾 πŸ“Ÿ πŸ“„ πŸ“ƒ πŸ“– πŸ“˜ πŸ”– πŸ“‘ πŸ“š πŸ“‹ πŸ“Š πŸ“‰ πŸ“ˆ πŸ—ƒοΈ πŸ“‡ πŸ—‚οΈ 🎴 πŸ“° πŸ—žοΈ πŸ“• πŸ“™ πŸ““ πŸ“” πŸ“’ πŸ—’οΈ πŸ“— πŸ“œ πŸ“Œ πŸ–‹οΈ πŸ–ŠοΈ ✏️ πŸŽ…

Recommendation: πŸ‘ πŸ”₯ πŸŒ‹ πŸ’₯

To-Do (Reading) List: πŸ’§ πŸ’¦ [the water emojis are not meant to imply filler content πŸ˜πŸ™]

TOC

Emp. & ASP

β›… πŸŒ• 🐚 🌱 🌲 πŸ„ πŸƒ 🌻 🌡 πŸ’ 🌿 🌴 πŸ€ 🌳 🌹 🌰 🎍 🌾 🌴 🍁 🐾 πŸŒ‘ πŸŒ“ πŸŒ– 🌘 πŸŒ• 🌌 🌐 🌎 πŸŒ‹ πŸŽƒ

Meta-RL

🐸 🐯 🐌 🐍 🐫 🐯 🐒 🐦 🐜 🐨 🐢 πŸͺ² πŸ” πŸ€ πŸ‚ πŸ„ 🐳 🐟 🐬 πŸ‹ 🐈 🐑 πŸ‰ 🐲 🐐 πŸ™ 🐜 🐒 🐊 🐀

  • A Meta-Transfer Objective for Learning to Disentangle Causal Mechanisms 2020 https://arxiv.org/pdf/1901.10912.pdf Yoshua Bengio 🌌 πŸ”₯ πŸ”₯ πŸ‘ 🍁 πŸ’¦ [contrastive loss on causal mechanisms?]

    We show that under this assumption, the correct causal structural choices lead to faster adaptation to modified distributions because the changes are concentrated in one or just a few mechanisms when the learned knowledge is modularized appropriately.

  • Causal Reasoning from Meta-reinforcement Learning 2019 πŸ˜‰ 😢

  • Discovering Reinforcement Learning Algorithms https://arxiv.org/pdf/2007.08794.pdf πŸ‘

    This paper introduces a new meta-learning approach that discovers an entire update rule which includes both β€˜what to predict’ (e.g. value functions) and β€˜how to learn from it’ (e.g. bootstrapping) by interacting with a set of environments.

  • Meta

    πŸ”Ή Discovering Reinforcement Learning Algorithms Attempts to discover the full update rule πŸ‘

    πŸ”Ή What Can Learned Intrinsic Rewards Capture? How/What value function/policy network πŸ‘

    Lifetime return: a finite sequence of agent-environment interactions until the end of training, defined by an agent designer, which can consist of multiple episodes.

    πŸ”Ή Discovery of Useful Questions as Auxiliary Tasks πŸ˜•

    ​ Related work is good! (Prior work on auxiliary tasks in RL + GVF) πŸ”₯ πŸ‘

    πŸ”Ή Meta-Gradient Reinforcement Learning discount factor + bootstrapping factor πŸ’¦

    πŸ”Ή BEYOND EXPONENTIALLY DISCOUNTED SUM: AUTOMATIC LEARNING OF RETURN FUNCTION 😢

    We research how to modify the form of the return function to enhance the learning towards the optimal policy. We propose to use a general mathematical form for return function, and employ meta-learning to learn the optimal return function in an end-to-end manner.

    πŸ”Ή Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks πŸ”₯ πŸŒ‹ πŸ’₯

    MAML: In our approach, the parameters of the model are explicitly trained such that a small number of gradient steps with a small amount of training data from a new task will produce good generalization performance on that task.
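
The core idea fits in a few lines. A minimal sketch of the MAML inner/outer loop is below (illustrative only, not the authors' code; it assumes a functional, differentiable `loss_fn(params, batch)` and a `sample_task()` that yields (support, query) batches):

```python
# Minimal MAML meta-update sketch (illustrative).
import torch

def maml_meta_step(model, meta_opt, sample_task, loss_fn, inner_lr=0.01, n_tasks=4):
    params = list(model.parameters())
    meta_loss = 0.0
    for _ in range(n_tasks):
        support, query = sample_task()
        # Inner loop: one gradient step on the support set; keep the graph so the
        # meta-gradient can flow back through the adaptation step.
        grads = torch.autograd.grad(loss_fn(params, support), params, create_graph=True)
        adapted = [p - inner_lr * g for p, g in zip(params, grads)]
        # Outer objective: post-adaptation loss on the query set.
        meta_loss = meta_loss + loss_fn(adapted, query)
    meta_opt.zero_grad()
    meta_loss.backward()   # gradients w.r.t. the original (meta) parameters
    meta_opt.step()
```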

    πŸ”Ή BERT Learns to Teach: Knowledge Distillation with Meta Learning πŸŒ‹

    MetaDistill: We show the teacher network can learn to better transfer knowledge to the student network (i.e., learning to teach) with the feedback from the performance of the distilled student network in a meta learning framework.

    πŸ”Ή Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables πŸ”₯ πŸ”₯

    PEARL: Current methods rely heavily on on-policy experience, limiting their sample efficiency. They also lack mechanisms to reason about task uncertainty when adapting to new tasks, limiting their effectiveness on sparse reward problems. We address these challenges by developing an off-policy meta-RL algorithm that disentangles task inference and control.

    πŸ”Ή Guided Meta-Policy Search πŸ‘ πŸ”₯ πŸŒ‹

    GMPS: We propose to learn an RL procedure in a federated way, where individual off-policy learners can solve the individual meta-training tasks, and then consolidate these solutions into a single meta-learner. Since the central meta-learner learns by imitating the solutions to the individual tasks, it can accommodate either the standard meta-RL problem setting, or a hybrid setting where some or all tasks are provided with example demonstrations.

    πŸ”Ή CoMPS: Continual Meta Policy Search πŸ”₯

    CoMPS continuously repeats two subroutines: learning a new task using RL and using the experience from RL to perform completely offline meta-learning to prepare for subsequent task learning.

    πŸ”Ή Bootstrapped Meta-Learning πŸ”₯ πŸŒ‹

    We propose an algorithm that tackles these issues by letting the meta-learner teach itself. The algorithm first bootstraps a target from the meta-learner, then optimises the meta-learner by minimising the distance to that target under a chosen (pseudo-)metric.

    πŸ”Ή Taming MAML: Efficient Unbiased Meta-Reinforcement Learning πŸ‘ πŸ”₯

    TMAML adds control variates into gradient estimation via automatic differentiation, improving the quality of gradient estimation by reducing variance without introducing bias.
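
For intuition, the generic control-variate construction behind such estimators (standard textbook form, not TMAML's exact estimator): subtracting a correlated term with known mean leaves the estimate unbiased while reducing its variance.

```latex
% Generic control variate: \hat g is the raw gradient estimate, h a correlated
% statistic with known mean; the corrected estimator stays unbiased.
\hat g_{\mathrm{cv}} = \hat g - c\,\big(h - \mathbb{E}[h]\big), \qquad
\mathbb{E}[\hat g_{\mathrm{cv}}] = \mathbb{E}[\hat g], \qquad
c^{*} = \frac{\operatorname{Cov}(\hat g, h)}{\operatorname{Var}(h)}
```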

    πŸ”Ή NoRML: No-Reward Meta Learning πŸ‘

    NoRML: The key insight underlying NoRML is that we can simultaneously learn the meta-policy and the advantage function used for adapting the meta-policy, optimizing for the ability to effectively adapt to varying dynamics.

    πŸ”Ή SKILL-BASED META-REINFORCEMENT LEARNING πŸ‘ πŸ”₯

    we propose to (1) extract reusable skills and a skill prior from offline datasets, (2) meta-train a high-level policy that learns to efficiently compose learned skills into long-horizon behaviors, and (3) rapidly adapt the meta-trained policy to solve an unseen target task.

    πŸ”Ή Offline Meta Learning of Exploration πŸ’§

    We take a Bayesian RL (BRL) view, and seek to learn a Bayes-optimal policy from the offline data. Building on the recent VariBAD BRL approach, we develop an off-policy BRL method that learns to plan an exploration strategy based on an adaptive neural belief estimate.

  • Unsupervised Meta-Learning for Reinforcement Learning https://arxiv.org/pdf/1806.04640.pdf [Abhishek Gupta, Benjamin Eysenbach, Chelsea Finn, Sergey Levine] πŸ˜• πŸ˜‰

    Meta-RL shifts the human burden from algorithm to task design. In contrast, our work deals with the RL setting, where the environment dynamics provides a rich inductive bias that our meta-learner can exploit.

    πŸ”Ή UNSUPERVISED LEARNING VIA META-LEARNING πŸ˜‰ ​We construct tasks from unlabeled data in an automatic way and run meta-learning over the constructed tasks.

    πŸ”Ή Unsupervised Curricula for Visual Meta-Reinforcement Learning [Allan Jabri; Kyle Hsu] πŸ‘ πŸ’§ πŸŒ‹ πŸ”₯

    Yet, the aforementioned relation between skill acquisition and meta-learning suggests that they should not be treated separately.

    However, relying solely on discriminability becomes problematic in environments with high-dimensional (image-based) observation spaces as it results in an issue akin to mode-collapse in the task space. This problem is further complicated in the setting we propose to study, wherein the policy data distribution is that of a meta-learner rather than a contextual policy. We will see that this can be ameliorated by specifying a hybrid discriminative-generative model for parameterizing the task distribution.

    We, rather, will tolerate lossy representations as long as they capture discriminative features useful for stimulus-reward association.

    πŸ”Ή On the Effectiveness of Fine-tuning Versus Meta-reinforcement Learning 😢

    Conclusion: multi-task pretraining with fine-tuning on new tasks performs equally as well, or better, than meta-RL.

    πŸ”Ή Enhanced Meta Reinforcement Learning using Demonstrations in Sparse Reward Environments πŸ”₯

    EMRLD combines RL-based policy improvement and behavior cloning from demonstrations for task-specific adaptation.

  • Asymmetric Distribution Measure for Few-shot Learning https://arxiv.org/pdf/2002.00153.pdf πŸ‘

    feature representations and relation measure.

  • latent models

    πŸ”Ή MELD: Meta-Reinforcement Learning from Images via Latent State Models πŸ‘ πŸ‘ ​ ​

    we leverage the perspective of meta-learning as task inference to show that latent state models can also perform meta-learning given an appropriately defined observation space.

    πŸ”Ή Explore then Execute: Adapting without Rewards via Factorized Meta-Reinforcement Learning πŸ‘ πŸ‘

    based on identifying key information in the environment, independent of how this information will exactly be used to solve the task. By decoupling exploration from task execution, DREAM explores and consequently adapts to new environments, requiring no reward signal when the task is specified via an instruction.

  • model identification and experience relabeling (MIER)

    πŸ”Ή Meta-Reinforcement Learning Robust to Distributional Shift via Model Identification and Experience Relabeling πŸ‘ πŸ”₯ ​ ​

    Our method is based on a simple insight: we recognize that dynamics models can be adapted efficiently and consistently with off-policy data, more easily than policies and value functions. These dynamics models can then be used to continue training policies and value functions for out-of-distribution tasks without using meta-reinforcement learning at all, by generating synthetic experience for the new task.

    πŸ”Ή Distributionally Adaptive Meta Reinforcement Learning πŸ”₯

    DiAMetR: Our framework centers on an adaptive approach to distributional robustness that trains a population of meta-policies to be robust to varying levels of distribution shift. When evaluated on a potentially shifted test-time distribution of tasks, this allows us to choose the meta-policy with the most appropriate level of robustness, and use it to perform fast adaptation.

πŸ”Ή PaCo: Parameter-Compositional Multi-Task Reinforcement Learning πŸ”₯

A policy subspace represented by a set of parameters is learned. Policies for all the single tasks lie in this subspace and can be composed by interpolating with the learned set.
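
In other words, each task's policy parameters are an interpolation over a shared, learned parameter set (a schematic of the idea; ΞΈ_i are the shared base parameters and w the task-specific weights):

```latex
% Parameter-compositional policy: task parameters interpolate a learned set {theta_1, ..., theta_K}.
\theta_{\text{task}} \;=\; \sum_{i=1}^{K} w_{\text{task},i}\,\theta_i
```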

HRL

SKILLS

  • Latent Space Policies for Hierarchical Reinforcement Learning 2018

  • EPISODIC CURIOSITY THROUGH REACHABILITY [reward design]

    In particular, inspired by curious behaviour in animals, observing something novel could be rewarded with a bonus. Such bonus is summed up with the real task reward β€” making it possible for RL algorithms to learn from the combined reward. We propose a new curiosity method which uses episodic memory to form the novelty bonus. πŸ’§ To determine the bonus, the current observation is compared with the observations in memory. Crucially, the comparison is done based on how many environment steps it takes to reach the current observation from those in memory β€” which incorporates rich information about environment dynamics. This allows us to overcome the known β€œcouch-potato” issues of prior work β€” when the agent finds a way to instantly gratify itself by exploiting actions which lead to hardly predictable consequences.

πŸ”Ή Where To Start? Transferring Simple Skills to Complex Environments πŸ”₯

we introduce an affordance model based on a graph representation of an environment, which is optimised during deployment to find suitable robot configurations to start a skill from, such that the skill can be executed without any collisions.

Control as Inference

πŸ”Ή Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review πŸ’₯ πŸ’₯ πŸŒ‹ πŸŒ‹ ​

Graphical model for control as inference (Decision-Making Problem and Terminology; The Graphical Model; Policy Search as Probabilistic Inference; Which Objective Does This Inference Procedure Optimize; Alternative Model Formulations);

Variational Inference and Stochastic Dynamics (Maximum Entropy RL with Fixed Dynamics; Connection to Structured VI);

Approximate Inference with Function Approximation (Maximum Entropy PG; Maximum Entropy AC Algorithms)

πŸ”Ή On Stochastic Optimal Control and Reinforcement Learning by Approximate Inference πŸ’¦ ​

emphasizes that MaxEnt RL can be viewed as minimizing a KL divergence.
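
The standard identity behind this view (with fixed dynamics shared by both trajectory distributions): minimizing the KL between the controlled trajectory distribution and the exponentiated-reward "optimal" one recovers the maximum-entropy RL objective, up to an additive constant.

```latex
% q_\pi(\tau) = p(s_1)\prod_t p(s_{t+1}\mid s_t,a_t)\,\pi(a_t\mid s_t),
% p(\tau) \propto p(s_1)\prod_t p(s_{t+1}\mid s_t,a_t)\,\exp\!\big(r(s_t,a_t)\big).
\min_\pi \; \mathrm{KL}\!\big(q_\pi(\tau)\,\|\,p(\tau)\big)
\;\Longleftrightarrow\;
\max_\pi \; \mathbb{E}_{q_\pi}\!\Big[\sum_t r(s_t,a_t) + \mathcal{H}\big(\pi(\cdot\mid s_t)\big)\Big]
```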

πŸ”Ή Iterative Inference Models Iterative Amortized Inference πŸ‘ πŸ‘ ​ ​

Latent Variable Models & Variational Inference & Variational Expectation Maximization (EM) & Inference Models

πŸ”Ή MAKING SENSE OF REINFORCEMENT LEARNING AND PROBABILISTIC INFERENCE ❓ πŸ’¦ ​

πŸ”Ή Stochastic Latent Actor-Critic: Deep Reinforcement Learning with a Latent Variable Model πŸŒ‹ πŸ’₯ ​ ​

The main contribution of this work is a novel and principled approach that integrates learning stochastic sequential models and RL into a single method, performing RL in the model’s learned latent space. By formalizing the problem as a control as inference problem within a POMDP, we show that variational inference leads to the objective of our SLAC algorithm.

πŸ”Ή On the Design of Variational RL Algorithms πŸ‘ πŸ‘ πŸ”₯ ​ Good design choices. πŸ’¦ πŸ‘» ​

Identify several settings that have not yet been fully explored, and we discuss general directions for improving these algorithms: VI details; (non-)Parametric; Uniform/Learned Prior.

πŸ”Ή VIREL: A Variational Inference Framework for Reinforcement Learning πŸ˜• ​

existing inference frameworks and their algorithms pose significant challenges for learning optimal policies, for example, the lack of mode capturing behaviour in pseudo-likelihood methods, difficulties learning deterministic policies in maximum entropy RL based approaches, and a lack of analysis when function approximators are used.

πŸ”Ή MAXIMUM A POSTERIORI POLICY OPTIMISATION πŸ”₯ πŸ‘ πŸ’₯

MPO is based on coordinate ascent on a relative-entropy objective. We show that several existing methods can be directly related to our derivation.

πŸ”Ή V-MPO: ON-POLICY MAXIMUM A POSTERIORI POLICY OPTIMIZATION FOR DISCRETE AND CONTINUOUS CONTROL πŸ‘ πŸ”₯ πŸŒ‹ πŸ’₯ ​ ​ ​ ​

adapts Maximum a Posteriori Policy Optimization to the on-policy setting.

πŸ”Ή SOFT Q-LEARNING WITH MUTUAL-INFORMATION REGULARIZATION πŸ‘ πŸ”₯ πŸŒ‹

In this paper, we propose a theoretically motivated framework that dynamically weights the importance of actions by using the mutual information. In particular, we express the RL problem as an inference problem where the prior probability distribution over actions is subject to optimization.


State Abstraction, Representation Learning

Representation learning for control based on bisimulation does not depend on reconstruction, but aims to group states based on their behavioral similarity in MDP. lil-log πŸ’¦

πŸ”Ή Equivalence Notions and Model Minimization in Markov Decision Processes https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.61.2493&rep=rep1&type=pdf : refers to an equivalence relation between two states with similar long-term behavior. πŸ˜•

πŸ”Ή BISIMULATION METRICS FOR CONTINUOUS MARKOV DECISION PROCESSES
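
For reference, the Ferns-style bisimulation metric these works build on is the fixed point of the operator below, where W_1 is the 1-Wasserstein distance computed under d itself and c_R, c_T weight the reward and transition terms (standard definition, stated here for context):

```latex
% Bisimulation metric as a fixed point; W_1 is the Kantorovich / 1-Wasserstein distance under d.
d(s_i, s_j) \;=\; \max_{a \in \mathcal{A}} \Big(
    c_R\,\big| r(s_i,a) - r(s_j,a) \big|
  + c_T\, W_1\!\big( P(\cdot \mid s_i,a),\, P(\cdot \mid s_j,a);\, d \big) \Big)
```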

πŸ”Ή DeepMDP: Learning Continuous Latent Space Models for Representation Learning https://arxiv.org/pdf/1906.02736.pdf simplifies high-dimensional observations in RL tasks and learns a latent space model via minimizing two losses: prediction of rewards and prediction of the distribution over next latent states. πŸ“£ 🌌 πŸ˜• πŸ’₯ πŸ’£ πŸ’₯

πŸ”Ή DARLA: Improving Zero-Shot Transfer in Reinforcement Learning πŸ”₯

We propose a new multi-stage RL agent, DARLA (DisentAngled Representation Learning Agent), which learns to see before learning to act. DARLA’s vision is based on learning a disentangled representation of the observed environment. Once DARLA can see, it is able to acquire source policies that are robust to many domain shifts - even with no access to the target domain.

πŸ”Ή DBC: Learning Invariant Representations for Reinforcement Learning without Reconstruction πŸ’₯ πŸ’₯ πŸ’₯

Our method trains encoders such that distances in latent space equal bisimulation distances in state space. (Compare PSM, which replaces the reward term r(s,a) with the policy Ο€(a|s).)
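
A minimal sketch of a DBC-style encoder objective (illustrative, not the authors' code; it assumes diagonal-Gaussian latent dynamics so the Wasserstein-2 term has a closed form, and a permutation-paired minibatch):

```python
# Sketch of a bisimulation-style encoder loss. z_* are encoder outputs phi(s);
# (mu_*, sigma_*) come from a learned Gaussian latent dynamics model at (z_*, a_*).
import torch
import torch.nn.functional as F

def bisim_loss(z_i, z_j, r_i, r_j, mu_i, sigma_i, mu_j, sigma_j, gamma=0.99):
    with torch.no_grad():  # the bisimulation target is held fixed
        r_dist = (r_i - r_j).abs()
        # Closed-form W2 distance between diagonal Gaussians.
        w2 = torch.sqrt((mu_i - mu_j).pow(2).sum(-1) + (sigma_i - sigma_j).pow(2).sum(-1))
        target = r_dist + gamma * w2
    z_dist = (z_i - z_j).abs().sum(-1)  # L1 distance between latent codes
    return F.mse_loss(z_dist, target)
```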

πŸ”Ή Towards Robust Bisimulation Metric Learning πŸŒ‹ πŸ’₯

we generalize value function approximation bounds for on-policy bisimulation metrics to non-optimal policies and approximate environment dynamics. Our theoretical results help us identify embedding pathologies that may occur in practical use. In particular, we find that these issues stem from an underconstrained dynamics model and an unstable dependence of the embedding norm on the reward signal in environments with sparse rewards.

πŸ”Ή TASK-INDUCED REPRESENTATION LEARNING πŸŒ‹

We formalize the problem of task-induced representation learning (TARP), which aims to leverage such task information in offline experience from prior tasks for learning compact representations that focus on modelling only task-relevant aspects.

πŸ”Ή LEARNING GENERALIZABLE REPRESENTATIONS FOR REINFORCEMENT LEARNING VIA ADAPTIVE METALEARNER OF BEHAVIORAL SIMILARITIES πŸ‘ πŸ”₯

Adaptive Meta-learner of Behavioral Similarities (AMBS): A pair of meta-learners is developed, one of which quantifies the reward similarity and the other of which quantifies dynamics similarity over the correspondingly decomposed embeddings. The meta-learners are self-learned to update the state embeddings by approximating two disjoint terms in the on-policy bisimulation metric.

πŸ”Ή LEARNING INVARIANT FEATURE SPACES TO TRANSFER SKILLS WITH REINFORCEMENT LEARNING https://arxiv.org/pdf/1703.02949.pdf πŸ”₯ πŸ‘

differ in state-space, action-space, and dynamics.

Our method uses the skills that were learned by both agents to train invariant feature spaces that can then be used to transfer other skills from one agent to another.

πŸ”Ή UIUC: CS 598 Statistical Reinforcement Learning (S19) NanJiang πŸ‘πŸ¦… πŸ¦…

πŸ”Ή CONTRASTIVE BEHAVIORAL SIMILARITY EMBEDDINGS FOR GENERALIZATION IN REINFORCEMENT LEARNING πŸ’₯ πŸ’¦ ​

βŒ› πŸ’  πŸ’  βŒ› Representation learning. βŒ› πŸ’  πŸ’  βŒ›

πŸ”Ή Does Self-supervised Learning Really Improve Reinforcement Learning from Pixels? πŸ”₯

we fail to find a single self-supervised loss or a combination of multiple SSL methods that consistently improve RL under the existing joint learning framework with image augmentation.

πŸ”Ή CURL: Contrastive Unsupervised Representations for Reinforcement Learning πŸ”₯ πŸŒ‹ πŸ’§ ​ ​

πŸ”Ή Denoised MDPs: Learning World Models Better Than the World Itself πŸ”₯ πŸŒ‹

This framework clarifies the kinds of information (controllable and reward-relevant) removed by various prior work on representation learning in reinforcement learning (RL), and leads to our proposed approach of learning a Denoised MDP that explicitly factors out certain noise distractors.

πŸ”Ή MASTERING VISUAL CONTINUOUS CONTROL: IMPROVED DATA-AUGMENTED REINFORCEMENT LEARNING πŸ”₯

DrQ-v2 builds on DrQ, an off-policy actor-critic approach that uses data augmentation to learn directly from pixels.

πŸ”Ή Stabilizing Deep Q-Learning with ConvNets and Vision Transformers under Data Augmentation πŸ‘

By only applying augmentation in Q-value estimation of the current state, without augmenting Q-targets used for bootstrapping, SVEA circumvents erroneous bootstrapping caused by data augmentation.
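
A minimal sketch of that asymmetry, written for a discrete-action Q-network for brevity (illustrative, not the authors' code; `aug` is any image augmentation):

```python
# SVEA-style augmentation: augment only the current observation used for Q(s, a);
# the bootstrap target is computed from the un-augmented next observation.
import torch
import torch.nn.functional as F

def svea_critic_loss(critic, critic_target, aug, obs, action, reward, next_obs, done, gamma=0.99):
    with torch.no_grad():
        next_q = critic_target(next_obs).max(dim=-1, keepdim=True).values
        target = reward + gamma * (1.0 - done) * next_q   # no augmentation in the target
    q_clean = critic(obs).gather(-1, action)              # Q on the clean observation
    q_aug = critic(aug(obs)).gather(-1, action)           # Q on the augmented observation
    return F.mse_loss(q_clean, target) + F.mse_loss(q_aug, target)
```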

πŸ”Ή Sim-to-Real via Sim-to-Sim: Data-efficient Robotic Grasping via Randomized-to-Canonical Adaptation Networks πŸ‘ πŸ”₯

Our method learns to translate randomized rendered images into their equivalent non-randomized, canonical versions. This in turn allows for real images to also be translated into canonical sim images.

πŸ”Ή Time-contrastive networks: Self-supervised learning from video πŸ”₯ ​

πŸ”Ή Data-Efficient Reinforcement Learning with Self-Predictive Representations πŸ”₯ ​ ​ ​

SPR: trains the agent to predict its own latent state representations multiple steps into the future, using a target encoder and augmented views.

πŸ”Ή Value-Consistent Representation Learning for Data-Efficient Reinforcement Learning πŸ”₯ πŸ”₯

VCR: Instead of aligning this imagined state with a real state returned by the environment, VCR applies a Q-value head on both states and obtains two distributions of action values. Then a distance is computed and minimized to force the imagined state to produce a similar action value prediction as that by the real state.

πŸ”Ή Intrinsically Motivated Self-supervised Learning in Reinforcement Learning πŸ‘ πŸ”₯ ​

employ self-supervised loss as an intrinsic reward, called Intrinsically Motivated Self-Supervised learning in Reinforcement learning (IM-SSR). Decomposition and Interpretation of Contrastive Loss.

πŸ”Ή INFORMATION PRIORITIZATION THROUGH EMPOWERMENT IN VISUAL MODEL-BASED RL πŸ‘ πŸ”₯

InfoPower: We propose a modified objective for model-based RL that, in combination with mutual information maximization, allows us to learn representations and dynamics for visual model-based RL without reconstruction in a way that explicitly prioritizes functionally relevant factors.

πŸ”Ή PlayVirtual: Augmenting Cycle-Consistent Virtual Trajectories for Reinforcement Learning πŸ‘

PlayVirtual predicts future states in a latent space based on the current state and action by a dynamics model and then predicts the previous states by a backward dynamics model, which forms a trajectory cycle. Based on this, we augment the actions to generate a large amount of virtual state-action trajectories.

πŸ”Ή Masked World Models for Visual Control 😢

We train an autoencoder with convolutional layers and vision transformers (ViT) to reconstruct pixels given masked convolutional features, and learn a latent dynamics model that operates on the representations from the autoencoder. Moreover, to encode task-relevant information, we introduce an auxiliary reward prediction objective for the autoencoder.

πŸ”Ή EMI: Exploration with Mutual Information πŸ‘

We propose EMI, which is an exploration method that constructs embedding representation of states and actions that does not rely on generative decoding of the full observation but extracts predictive signals that can be used to guide exploration based on forward prediction in the representation space.

πŸ”Ή Bootstrap Latent-Predictive Representations for Multitask Reinforcement Learning πŸ‘ πŸ”₯ πŸŒ‹

The forward prediction encourages the agent state to move away from collapsing in order to accurately predict future random projections of observations. Similarly, the reverse prediction encourages the latent observation away from collapsing in order to accurately predict the random projection of a full history. As we continue to train forward and reverse predictions, this seems to result in a virtuous cycle that continuously enriches both representations.

πŸ”Ή Unsupervised Domain Adaptation with Shared Latent Dynamics for Reinforcement Learning πŸ‘

The model achieves the alignment between the latent codes via learning shared dynamics for different environments and matching marginal distributions of latent codes.

πŸ”Ή RETURN-BASED CONTRASTIVE REPRESENTATION LEARNING FOR REINFORCEMENT LEARNING πŸ‘ πŸ”₯

Our auxiliary loss is theoretically justified to learn representations that capture the structure of a new form of state-action abstraction, under which state-action pairs with similar return distributions are aggregated together. Related work: AUXILIARY TASK + ABSTRACTION.

πŸ”Ή Representation Matters: Offline Pretraining for Sequential Decision Making πŸ‘

πŸ”Ή SELF-SUPERVISED POLICY ADAPTATION DURING DEPLOYMENT πŸ‘ πŸ”₯ ​

Test-time training (TTT): our work explores the use of self-supervision to allow the policy to continue training after deployment without using any rewards.

πŸ”Ή Test-Time Training with Masked Autoencoders πŸ”₯

Test-time training adapts to a new test distribution on the fly by optimizing a model for each test input using self-supervision. In this paper, we use masked autoencoders for this one-sample learning problem.

πŸ”Ή MEMO: Test Time Robustness via Adaptation and Augmentation πŸ”₯

MEMO: when presented with a test example, perform different data augmentations on the data point, and then adapt (all of) the model parameters by minimizing the entropy of the model’s average, or marginal, output distribution across the augmentations. Intuitively, this objective encourages the model to make the same prediction across different augmentations, thus enforcing the invariances encoded in these augmentations, while also maintaining confidence in its predictions.
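
A minimal sketch of that procedure for a single test input (illustrative, not the authors' code; `augment(x)` returns one augmented view per call and `model` returns logits):

```python
# MEMO-style test-time adaptation: minimize the entropy of the prediction
# averaged (marginalized) over several augmentations of one input, then predict.
import torch
import torch.nn.functional as F

def memo_adapt_and_predict(model, x, augment, lr=1e-4, n_views=8):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    views = torch.stack([augment(x) for _ in range(n_views)])  # (n_views, C, H, W)
    marginal = F.softmax(model(views), dim=-1).mean(dim=0)     # average over augmentations
    entropy = -(marginal * marginal.clamp_min(1e-8).log()).sum()
    opt.zero_grad()
    entropy.backward()                                         # adapt all parameters
    opt.step()
    with torch.no_grad():
        return model(x.unsqueeze(0)).argmax(dim=-1)            # predict after adaptation
```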

πŸ”Ή What Makes for Good Views for Contrastive Learning? πŸ‘ πŸ”₯ πŸ’₯ πŸŒ‹

we should reduce the mutual information (MI) between views while keeping task-relevant information intact.

πŸ”Ή SELF-SUPERVISED LEARNING FROM A MULTI-VIEW PERSPECTIVE πŸ‘ πŸ”₯ ​ ​

Demystifying Self-Supervised Learning: An Information-Theoretical Framework.

πŸ”Ή CONTRASTIVE BEHAVIORAL SIMILARITY EMBEDDINGS FOR GENERALIZATION IN REINFORCEMENT LEARNING πŸ‘ πŸ‘

policy similarity metric (PSM) for measuring behavioral similarity between states. PSM assigns high similarity to states for which the optimal policies in those states as well as in future states are similar.

πŸ”Ή Improving Zero-shot Generalization in Offline Reinforcement Learning using Generalized Similarity Functions πŸ”₯

We propose a new theoretically-motivated framework called Generalized Similarity Functions (GSF), which uses contrastive learning to train an offline RL agent to aggregate observations based on the similarity of their expected future behavior, where we quantify this similarity using generalized value functions.

πŸ”Ή Invariant Causal Prediction for Block MDPs πŸ’¦

State Abstractions and Bisimulation; Causal Inference Using Invariant Prediction;

πŸ”Ή Learning Domain Invariant Representations in Goal-conditioned Block MDPs

πŸ”Ή CAUSAL INFERENCE Q-NETWORK: TOWARD RESILIENT REINFORCEMENT LEARNING πŸ‘

In this paper, we consider a resilient DRL framework with observational interferences.

πŸ”Ή Decoupling Value and Policy for Generalization in Reinforcement Learning πŸ‘ πŸ”₯ ​

Invariant Decoupled Advantage Actor-Critic (IDAAC). First, IDAAC decouples the optimization of the policy and value function, using separate networks to model them. Second, it introduces an auxiliary loss which encourages the representation to be invariant to task-irrelevant properties of the environment.

πŸ”Ή Robust Deep Reinforcement Learning against Adversarial Perturbations on State Obs πŸ”₯ πŸŒ‹ πŸ’§ ​ ​

We propose the state-adversarial Markov decision process (SA-MDP) to study the fundamental properties of this problem, and develop a theoretically principled policy regularization which can be applied to a large family of DRL algorithms.

πŸ”Ή Understanding Adversarial Attacks on Observations in Deep Reinforcement Learning πŸ’¦ ​

πŸ”Ή ROBUST REINFORCEMENT LEARNING ON STATE OBSERVATIONS WITH LEARNED OPTIMAL ADVERSARY

πŸ”Ή Loss is its own Reward: Self-Supervision for Reinforcement Learning πŸ‘ πŸ’₯ ​

To augment reward, we consider a range of self-supervised tasks that incorporate states, actions, and successors to provide auxiliary losses.

πŸ”Ή Unsupervised Learning of Visual 3D Keypoints for Control πŸ‘ πŸ’₯

motivation: most of these representations, whether structured or unstructured are learned in a 2D space even though the control tasks are usually performed in a 3D environment.

πŸ”Ή Which Mutual-Information Representation Learning Objectives are Sufficient for Control? πŸ‘ πŸ’₯ πŸŒ‹

we formalize the sufficiency of a state representation for learning and representing the optimal policy, and study several popular mutual-information based objectives through this lens. ​

πŸ”Ή Towards a Unified Theory of State Abstraction for MDPs πŸ‘ πŸ”₯πŸŒ‹ πŸ’§ ​ ​ ​

We provide a unified treatment of state abstraction for Markov decision processes. We study five particular abstraction schemes.

πŸ”Ή Learning State Abstractions for Transfer in Continuous Control πŸ”₯ ​

Our main contribution is a learning algorithm that abstracts a continuous state-space into a discrete one. We transfer this learned representation to unseen problems to enable effective learning.

πŸ”Ή Multi-Modal Mutual Information (MuMMI) Training for Robust Self-Supervised Deep Reinforcement Learning πŸ‘ πŸ”₯

we contribute a new multi-modal deep latent state-space model, trained using a mutual information lower-bound.

πŸ”Ή LEARNING ACTIONABLE REPRESENTATIONS WITH GOAL-CONDITIONED POLICIES πŸ‘ πŸŒ‹ ​ ​

Aim to capture those factors of variation that are important for decision making – that are β€œactionable.” These representations are aware of the dynamics of the environment, and capture only the elements of the observation that are necessary for decision making rather than all factors of variation.

πŸ”Ή Adaptive Auxiliary Task Weighting for Reinforcement Learning πŸ‘

Dynamically combines different auxiliary tasks to speed up training for reinforcement learning: Our method is based on the idea that auxiliary tasks should provide gradient directions that, in the long term, help to decrease the loss of the main task.

πŸ”Ή Scalable methods for computing state similarity in deterministic Markov Decision Processes

Computing and approximating bisimulation metrics in large deterministic MDPs.

πŸ”Ή Value Preserving State-Action Abstractions πŸ˜•

We proved which state-action abstractions are guaranteed to preserve representation of high-value policies. To do so, we introduced Ο†-relative options, a simple but expressive formalism for combining state abstractions with options.

πŸ”Ή Learning Markov State Abstractions for Deep Reinforcement Learning πŸŒ‹ πŸ˜• πŸ’§

We introduce a novel set of conditions and prove that they are sufficient for learning a Markov abstract state representation. We then describe a practical training procedure that combines inverse model estimation and temporal contrastive learning to learn an abstraction that approximately satisfies these conditions.

πŸ”Ή Cross-Trajectory Representation Learning for Zero-Shot Generalization in RL πŸ˜• πŸ’§

CTRL: We posit that a superior encoder for zero-shot generalization in RL can be trained by using solely an auxiliary SSL objective if the training process encourages the encoder to map behaviorally similar observations to similar representations.

πŸ”Ή Jointly-Learned State-Action Embedding for Efficient Reinforcement Learning 😢

We establish the theoretical foundations for the validity of training an RL agent using embedded states and actions. We then propose a new approach for jointly learning embeddings for states and actions that combines model-free and model-based RL.

πŸ”Ή Metrics and continuity in reinforcement learning πŸŒ‹

We introduce a unified formalism for defining these topologies through the lens of metrics. We establish a hierarchy amongst these metrics and demonstrate their theoretical implications on the Markov Decision Process specifying the RL problem.

πŸ”Ή Environment Shaping in Reinforcement Learning using State Abstraction πŸ’¦ πŸŒ‹

Our key idea is to compress the environment’s large state space with noisy signals to an abstracted space, and to use this abstraction in creating smoother and more effective feedback signals for the agent. We study the theoretical underpinnings of our abstraction-based environment shaping, and show that the agent’s policy learnt in the shaped environment preserves near-optimal behavior in the original environment.

πŸ”Ή A RELATIONAL INTERVENTION APPROACH FOR UNSUPERVISED DYNAMICS GENERALIZATION IN MODEL-BASED REINFORCEMENT LEARNING πŸ”₯ πŸŒ‹

Because environments are not labelled, the extracted information inevitably contains redundant information unrelated to the dynamics in transition segments and thus fails to maintain a crucial property of Z: Z should be similar in the same environment and dissimilar in different ones. We introduce an interventional prediction module to estimate the probability of two estimated z_i, z_j belonging to the same environment.

πŸ”Ή Cross-Trajectory Representation Learning for Zero-Shot Generalization in RL πŸ”₯

We propose Cross Trajectory Representation Learning (CTRL), a method that runs within an RL agent and conditions its encoder to recognize behavioral similarity in observations by applying a novel SSL objective to pairs of trajectories from the agent’s policies.

πŸ”Ή Bayesian Imitation Learning for End-to-End Mobile Manipulation πŸ‘

We show that using the Variational Information Bottleneck to regularize convolutional neural networks improves generalization to held-out domains and reduces the sim-to-real gap in a sensor-agnostic manner. As a side effect, the learned embeddings also provide useful estimates of model uncertainty for each sensor.

πŸ”Ή Control-Aware Representations for Model-based Reinforcement Learning πŸ‘ πŸŒ‹ πŸ’₯ πŸ’₯

CARL: How to learn a representation that is amenable to the control problem at hand, and how to achieve an end-to-end framework for representation learning and control: We first formulate a learning controllable embedding (LCE) model to learn representations that are suitable to be used by a policy iteration style algorithm in the latent space. We call this model control-aware representation learning (CARL). We derive a loss function for CARL that has close connection to the prediction, consistency, and curvature (PCC) principle for representation learning.

πŸ”Ή Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images πŸ”₯

E2C: Embed to Control (E2C) consists of a deep generative model, belonging to the family of variational autoencoders, that learns to generate image trajectories from a latent space in which the dynamics is constrained to be locally linear.

πŸ”Ή Robust Locally-Linear Controllable Embedding πŸ”₯

RCE: propose a principled variational approximation of the embedding posterior that takes the future observation into account, and thus, makes the variational approximation more robust against the noise.

πŸ”Ή SOLAR: Deep Structured Representations for Model-Based Reinforcement Learning 😢

SOLAR: we present a method for learning representations that are suitable for iterative model-based policy improvement.

πŸ”Ή DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION πŸ‘

Dreamer: (Learning long-horizon behaviors by latent imagination) predicting both actions and state values.

πŸ”Ή Learning Task Informed Abstractions 😢

Task Informed Abstractions (TIA) explicitly separates reward-correlated visual features from distractors.

πŸ”Ή PREDICTION, CONSISTENCY, CURVATURE: REPRESENTATION LEARNING FOR LOCALLY-LINEAR CONTROL πŸ‘ πŸ”₯ πŸŒ‹

PCC: We propose the Prediction, Consistency, and Curvature (PCC) framework for learning a latent space that is amenable to locally-linear control (LLC) algorithms and show that the elements of PCC arise systematically from bounding the suboptimality of the solution of the LLC algorithm in the latent space.

πŸ”Ή Predictive Coding for Locally-Linear Control πŸ”₯ πŸŒ‹

PC3: we propose a novel information-theoretic LCE approach and show theoretically that explicit next-observation prediction can be replaced with predictive coding. We then use predictive coding to develop a decoder-free LCE model whose latent dynamics are amenable to locally-linear control.

πŸ”Ή Robust Predictable Control πŸ”₯ πŸŒ‹ πŸ’§

RPC: Our objective differs from prior work by compressing sequences of observations, resulting in a method that jointly trains a policy and a model to be self-consistent.

πŸ”Ή Representation Gap in Deep Reinforcement Learning πŸŒ‹

We propose Policy Optimization from Preventing Representation Overlaps (POPRO), which regularizes the policy evaluation phase by keeping the representation of the action-value function distinct from that of its target.

πŸ”Ή TRANSFER RL ACROSS OBSERVATION FEATURE SPACES VIA MODEL-BASED REGULARIZATION πŸ”₯ πŸ”₯

We propose to learn a latent dynamics model in the source task and transfer the model to the target task to facilitate representation learning (+ theoretical analysis).

πŸ”Ή Sample-Efficient Reinforcement Learning in the Presence of Exogenous Information πŸ”₯

ExoMDP: the state space admits an (unknown) factorization into a small controllable (or, endogenous) component and a large irrelevant (or, exogenous) component; the exogenous component is independent of the learner’s actions, but evolves in an arbitrary, temporally correlated fashion.

πŸ”Ή Stabilizing Off-Policy Deep Reinforcement Learning from Pixels πŸŒ‹

A-LIX: [poster]

πŸ”Ή Temporal Disentanglement of Representations for Improved Generalisation in Reinforcement Learning πŸ‘

we introduce TEmporal Disentanglement (TED), a self-supervised auxiliary task that leads to disentangled representations using the sequential nature of RL observations.

πŸ”Ή R3M: A Universal Visual Representation for Robot Manipulation

We study how visual representations pre-trained on diverse human video data can enable data-efficient learning of downstream robotic manipulation tasks.

πŸ”Ή PsiPhi-Learning: Reinforcement Learning with Demonstrations using Successor Features and Inverse Temporal Difference Learning πŸ”₯ πŸŒ‹

We propose a multi-task inverse reinforcement learning (IRL) algorithm, called inverse temporal difference learning (ITD), that learns shared state features, alongside per-agent successor features and preference vectors, purely from demonstrations without reward labels. We further ...

πŸ”Ή LOOK WHERE YOU LOOK! SALIENCY-GUIDED Q-NETWORKS FOR VISUAL RL TASKS πŸ”₯ πŸŒ‹

SGQN: a good visual policy should be able to identify which pixels are important for its decision, and preserve this identification of important sources of information across images.

πŸ”Ή Improving Deep Learning Interpretability by Saliency Guided Training πŸ”₯ πŸŒ‹

Saliency Guided Training: Our saliency guided training procedure iteratively masks features with small and potentially noisy gradients while maximizing the similarity of model outputs for both masked and unmasked inputs.

πŸ”Ή Saliency Guided Adversarial Training for Learning Generalizable Features with Applications to Medical Imaging Classification System πŸ”₯

We hypothesize that adversarial training can eliminate shortcut features whereas saliency guided training can filter out non-relevant features; both are nuisance features accounting for the performance degradation on OOD test sets.

πŸ”Ή VISFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives πŸ”₯ πŸŒ‹

VISFIS: (1) accurate predictions given limited but sufficient information (Sufficiency); (2) max-entropy predictions given no important information (Uncertainty); (3) invariance of predictions to changes in unimportant features (Invariance); and (4) alignment between model FI explanations and human FI explanations (Plausibility).

πŸ”Ή Concept Embedding Models πŸ‘

CEM: we propose Concept Embedding Models, a novel family of concept bottleneck models which goes beyond the current accuracy-vs-interpretability trade-off by learning interpretable high-dimensional concept representations.

πŸ”Ή Invariance Through Latent Alignment πŸ”₯

ILA performs unsupervised adaptation at deployment-time by matching the distribution of latent features on the target domain to the agent’s prior experience, without relying on paired data.

πŸ”Ή Does Zero-Shot Reinforcement Learning Exist? πŸ”₯ πŸŒ‹ πŸ’§

Strategies for approximate zero-shot RL have been suggested using successor features (SFs) or forward-backward (FB) representations, but testing has been limited.

πŸ”Ή Learning Successor States and Goal-Dependent Values: A Mathematical Viewpoint πŸ’§

πŸ”Ή Learning One Representation to Optimize All Rewards πŸ”₯ πŸŒ‹ πŸ’₯

We introduce the forward-backward (FB) representation of the dynamics of a reward-free Markov decision process. It provides explicit near-optimal policies for any reward specified a posteriori.

πŸ”Ή TOWARDS UNIVERSAL VISUAL REWARD AND REPRESENTATION VIA VALUE-IMPLICIT PRE-TRAINING πŸ‘ πŸ”₯ πŸŒ‹ πŸ’₯ πŸ’§

Value-Implicit Pre-training (VIP), a self-supervised pre-trained visual representation capable of generating dense and smooth reward functions for unseen robotic tasks.

πŸ”Ή VRL3: A Data-Driven Framework for Visual Deep Reinforcement Learning πŸ’§

πŸ”Ή CLOUD: Contrastive Learning of Unsupervised Dynamics πŸ‘

CLOUD: we train a forward dynamics model and an inverse dynamics model in the feature space of states and actions with data collected from random exploration.

πŸ”Ή Object-Category Aware Reinforcement Learning πŸ”₯ πŸ‘

OCARL consists of three parts: (1) a category-aware unsupervised object discovery (category-aware UOD) module, (2) object-category aware perception (OCAP), and (3) an object-centric modular reasoning (OCMR) module.

Mutual Information

πŸ”Ή MINE: Mutual Information Neural Estimation πŸ‘πŸ’§ πŸ”₯ f-gan & mine πŸ’¦
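
For reference, MINE maximizes the Donsker-Varadhan lower bound on mutual information with a neural critic T_ΞΈ (standard form of the bound):

```latex
% Donsker-Varadhan bound estimated by MINE; the second expectation is over the
% product of marginals (e.g. obtained by shuffling one variable within the batch).
I(X;Z) \;\ge\; \sup_{\theta}\;
  \mathbb{E}_{p(x,z)}\big[T_\theta(x,z)\big]
  - \log \mathbb{E}_{p(x)p(z)}\big[e^{T_\theta(x,z)}\big]
```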

πŸ”Ή IMPROVING MUTUAL INFORMATION ESTIMATION WITH ANNEALED AND ENERGY-BASED BOUNDS

Multi-Sample Annealed Importance Sampling (AIS):

πŸ”Ή C-MI-GAN : Estimation of Conditional Mutual Information Using MinMax Formulation πŸ‘ πŸ”₯ ​ ​

πŸ”Ή Deep InfoMax: LEARNING DEEP REPRESENTATIONS BY MUTUAL INFORMATION ESTIMATION AND MAXIMIZATION πŸ‘πŸ’§ ​

πŸ”Ή ON MUTUAL INFORMATION MAXIMIZATION FOR REPRESENTATION LEARNING πŸ’¦ πŸ‘ ​

πŸ”Ή Deep Reinforcement and InfoMax Learning πŸ’¦ πŸ‘ πŸ˜• πŸ’§

Our work is based on the hypothesis that a model-free agent whose representations are predictive of properties of future states (beyond expected rewards) will be more capable of solving and adapting to new RL problems, and in a way, incorporate aspects of model-based learning.

πŸ”Ή Unpacking Information Bottlenecks: Unifying Information-Theoretic Objectives in Deep Learning πŸŒ‹

New: Unpacking Information Bottlenecks: Surrogate Objectives for Deep Learning

πŸ”Ή Opening the black box of Deep Neural Networks via Information

β­• β­• UYANG:

ε°ηŽ‹ηˆ±θΏη§»,

πŸ”Ή Self-Supervised Representation Learning From Multi-Domain Data, πŸ”₯ πŸ‘ πŸ‘

The proposed mutual information constraints encourage neural network to extract common invariant information across domains and to preserve peculiar information of each domain simultaneously. We adopt tractable upper and lower bounds of mutual information to make the proposed constraints solvable.

πŸ”Ή ​Unsupervised Domain Adaptation via Regularized Conditional Alignment, πŸ”₯ πŸ‘

Joint alignment ensures that not only the marginal distributions of the domains are aligned, but the labels as well.

πŸ”Ή ​Domain Adaptation with Conditional Distribution Matching and Generalized Label Shift, πŸ”₯ πŸ’₯ πŸ’₯ πŸ’¦

In this paper, we extend a recent upper-bound on the performance of adversarial domain adaptation to multi-class classification and more general discriminators. We then propose generalized label shift (GLS) as a way to improve robustness against mismatched label distributions. GLS states that, conditioned on the label, there exists a representation of the input that is invariant between the source and target domains.

πŸ”Ή ​Learning to Learn with Variational Information Bottleneck for Domain Generalization,

Through episodic training, MetaVIB learns to gradually narrow domain gaps to establish domain-invariant representations, while simultaneously maximizing prediction accuracy.

πŸ”Ή ​Deep Domain Generalization via Conditional Invariant Adversarial Networks, πŸ‘

πŸ”Ή ​On Learning Invariant Representation for Domain Adaptation πŸ”₯ πŸ’₯ πŸ’¦

πŸ”Ή ​GENERALIZING ACROSS DOMAINS VIA CROSS-GRADIENT TRAINING πŸ”₯ πŸŒ‹ πŸŒ‹

In contrast, in our setting, we wish to avoid any such explicit domain representation, appealing instead to the power of deep networks to discover implicit features. We also argue that even if such overfitting could be avoided, we do not necessarily want to wipe out domain signals, if it helps in-domain test instances.

πŸ”Ή ​In Search of Lost Domain Generalization 😢

πŸ”Ή DIRL: Domain-Invariant Representation Learning for Sim-to-Real Transfer πŸ’¦ ​

β­• β­• self-supervised learning

πŸ”Ή Bootstrap Your Own Latent A New Approach to Self-Supervised Learning πŸ”₯ πŸŒ‹

Related work is good! ​ ​

πŸ”Ή Model-Based Relative Entropy Stochastic Search

MORE:

πŸ”Ή Efficient Gradient-Free Variational Inference using Policy Search

VIPS: Our method establishes information-geometric trust regions to ensure efficient exploration of the sampling space and stability of the GMM updates, allowing for efficient estimation of multi-variate Gaussian variational distributions.

πŸ”Ή EXPECTED INFORMATION MAXIMIZATION USING THE I-PROJECTION FOR MIXTURE DENSITY ESTIMATION πŸ”₯

EIM: we present a new algorithm called Expected Information Maximization (EIM) for computing the I-projection solely based on samples for general latent variable models.

πŸ”Ή An Information-theoretic Approach to Distribution Shifts πŸ’§

πŸ”Ή An Asymmetric Contrastive Loss for Handling Imbalanced Datasets πŸ”₯ πŸŒ‹

we propose the asymmetric focal contrastive loss (AFCL) as a further generalization of both ACL and focal contrastive loss (FCL).

DR (Domain Randomization) & sim2real

πŸ”Ή Active Domain Randomization https://proceedings.mlr.press/v100/mehta20a/mehta20a.pdf πŸ”₯ πŸ’₯ πŸ”₯

Our method looks for the most informative environment variations within the given randomization ranges by leveraging the discrepancies of policy rollouts in randomized and reference environment instances. We find that training more frequently on these instances leads to better overall agent generalization.

Domain Randomization; Stein Variational Policy Gradient;

Bhairav Mehta On Learning and Generalization in Unstructured Task Spaces πŸ’¦ πŸ’¦

πŸ”Ή VADRA: Visual Adversarial Domain Randomization and Augmentation πŸ”₯ πŸ‘ generative + learner

πŸ”Ή Which Training Methods for GANs do actually Converge? πŸ‘ πŸ’§ ODE: GAN

πŸ”Ή Robust Adversarial Reinforcement Learning 😢 ​

Robust Adversarial Reinforcement Learning (RARL), jointly trains a pair of agents, a protagonist and an adversary, where the protagonist learns to fulfil the original task goals while being robust to the disruptions generated by its adversary.

πŸ”Ή Closing the Sim-to-Real Loop: Adapting Simulation Randomization with Real World Experience 😢 ​

πŸ”Ή POLICY TRANSFER WITH STRATEGY OPTIMIZATION 😢 ​

The key idea is that, instead of learning a single policy in the simulation, we simultaneously learn a family of policies that exhibit different behaviors. When tested in the target environment, we directly search for the best policy in the family based on the task performance, without the need to identify the dynamic parameters.

πŸ”Ή https://lilianweng.github.io/lil-log/2019/05/05/domain-randomization.html πŸ’¦

πŸ”Ή THE INGREDIENTS OF REAL-WORLD ROBOTIC REINFORCEMENT LEARNING 😢 ​

πŸ”Ή ROBUST REINFORCEMENT LEARNING ON STATE OBSERVATIONS WITH LEARNED OPTIMAL ADVERSARY πŸ‘

To enhance the robustness of an agent, we propose a framework of alternating training with learned adversaries (ATLA), which trains an adversary online together with the agent using policy gradient following the optimal adversarial attack framework.

πŸ”Ή SELF-SUPERVISED POLICY ADAPTATION DURING DEPLOYMENT πŸ‘ πŸ”₯

Our work explores the use of self-supervision to allow the policy to continue training after deployment without using any rewards.

πŸ”Ή SimGAN: Hybrid Simulator Identification for Domain Adaptation via Adversarial Reinforcement Learning πŸ‘

identifying a hybrid physics simulator to match the simulated trajectories to the ones from the target domain, using a learned discriminative loss to address the limitations associated with manual loss design. Our hybrid simulator combines NNs and traditional physics simulation to balance expressiveness and generalizability, and alleviates the need for a carefully selected parameter set in System ID.

πŸ”Ή Generalization of Reinforcement Learning with Policy-Aware Adversarial Data Augmentation 😢

our proposed method adversarially generates new trajectory data based on the policy gradient objective and aims to more effectively increase the RL agent’s generalization ability with the policy-aware data augmentation.

πŸ”Ή Understanding Domain Randomization for Sim-to-real Transfer πŸŒ‹ πŸ’§

We provide sharp bounds on the sim-to-real gapβ€”the difference between the value of policy returned by domain randomization and the value of an optimal policy for the real world.

🏳️ see more on robustness in the model-based setting

πŸ”Ή EPOPT: LEARNING ROBUST NEURAL NETWORK POLICIES USING MODEL ENSEMBLES 😢

Our method provides for training of robust policies, and supports an adversarial training regime designed to provide good direct-transfer performance. We also describe how our approach can be combined with Bayesian model adaptation to adapt the source domain ensemble to a target domain using a small amount of target domain experience.

πŸ”Ή Action Robust Reinforcement Learning and Applications in Continuous Control πŸ”₯ πŸ’§

We have presented two new criteria for robustness, the Probabilistic and Noisy action Robust MDP, each related to real-world scenarios of uncertainty, and discussed the theoretical differences between both approaches.

πŸ”Ή Robust Policy Learning over Multiple Uncertainty Sets πŸ’§

System Identification and Risk-Sensitive Adaptation (SIRSA):

πŸ”Ή βˆ‡Sim: DIFFERENTIABLE SIMULATION FOR SYSTEM IDENTIFICATION AND VISUOMOTOR CONTROL πŸ‘ πŸ’§

πŸ”Ή RISP: RENDERING-INVARIANT STATE PREDICTOR WITH DIFFERENTIABLE SIMULATION AND RENDERING FOR CROSS-DOMAIN PARAMETER ESTIMATION πŸ”₯ πŸŒ‹ πŸ‘

This work considers identifying parameters characterizing a physical system’s dynamic motion directly from a video whose rendering configurations are inaccessible. Our core idea is to train a rendering-invariant state-prediction (RISP) network that transforms image differences into state differences independent of rendering configurations.

πŸ”Ή Sim and Real: Better Together πŸ”₯ πŸ’§

By separating the rate of collecting samples from each environment and the rate of choosing samples for the optimization process, we were able to achieve a significant reduction in the amount of real-environment samples, compared to the common strategy of using the same rate for both collection and optimization phases.

πŸ”Ή Online Robust Reinforcement Learning with Model Uncertainty πŸŒ‹

We develop a sample-based approach to estimate the unknown uncertainty set, and design robust Q-learning algorithm (tabular case) and robust TDC algorithm (function approximation setting).

πŸ”Ή Robust Deep Reinforcement Learning through Adversarial Loss πŸ”₯ πŸ”₯

RADIAL-RL: Construct a strict upper bound of the perturbed standard loss; design a regularizer to minimize overlap between output bounds of actions with large difference in outcome.

πŸ”Ή Robust Deep Reinforcement Learning through Bootstrapped Opportunistic Curriculum πŸ’§

πŸ”Ή Robust Reinforcement Learning using Offline Data πŸ”₯ πŸ‘

This poses challenges in offline data collection, optimization over the models, and unbiased estimation. In this work, we propose a systematic approach to overcome these challenges, resulting in our RFQI algorithm.


Transfer: Generalization & Adaption (Dynamics)

πŸ”Ή Automatic Data Augmentation for Generalization in Deep Reinforcement Learning πŸ‘Š πŸ‘ ​ ​ ​

Across different visual inputs (with the same semantics), dynamics, or other environment structures

πŸ”Ή Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels πŸ‘

πŸ”Ή Fast Adaptation to New Environments via Policy-Dynamics Value Functions πŸ”₯ πŸ’₯ πŸ‘ ​

PD-VF explicitly estimates the cumulative reward in a space of policies and environments.

πŸ”Ή Off-Dynamics Reinforcement Learning: Training for Transfer with Domain Classifiers πŸ”₯ πŸ’₯ πŸŒ‹ πŸ’§

DARC: The main contribution of this work is an algorithm for domain adaptation to dynamics changes in RL, based on the idea of compensating for differences in dynamics by modifying the reward function. This algorithm does not need to estimate transition probabilities, but rather modifies the reward function using a pair of classifiers.
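
Concretely, the reward correction implied by that pair of classifiers (one conditioned on (s, a, s'), one on (s, a)) takes the form below; this is the DARC-style importance term written out from the description above.

```latex
% Reward correction from two domain classifiers q(domain | ...), added to the source-domain reward.
\Delta r(s,a,s') \;=\;
  \log \frac{q(\mathrm{target}\mid s,a,s')}{q(\mathrm{source}\mid s,a,s')}
  \;-\;
  \log \frac{q(\mathrm{target}\mid s,a)}{q(\mathrm{source}\mid s,a)}
```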

πŸ”Ή DARA: DYNAMICS-AWARE REWARD AUGMENTATION IN OFFLINE REINFORCEMENT LEARNING πŸ”₯

πŸ”Ή When to Trust Your Simulator: Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning πŸ”₯

H2O introduces a dynamics-aware policy evaluation scheme, which adaptively penalizes the Q function learning on simulated state-action pairs with large dynamics gaps, while also simultaneously allowing learning from a fixed real-world dataset.

πŸ”Ή DOMAIN TRANSFER WITH LARGE DYNAMICS SHIFT IN OFFLINE REINFORCEMENT LEARNING πŸ‘

the source data will play two roles. One is to serve as augmentation data by compensating for the difference in dynamics with modified reward. Another is to form prior knowledge for the behaviour policy to collect a small amount of new data in the target domain safely and efficiently.

πŸ”Ή TARGETED ENVIRONMENT DESIGN FROM OFFLINE DATA πŸ‘ πŸ”₯

OTED: which automatically learns a distribution over simulator parameters to match a provided offline dataset, and then uses the learned simulator to train an RL agent in standard online fashion.

πŸ”Ή Learning MDPs from Features: Predict-Then-Optimize for Sequential Decision Problems by Reinforcement Learning πŸ”₯

This paper considers learning a predictive model to address the missing parameters in sequential decision problems.

  • general domain adaption (DA) = importance weighting + domain-agnostic features

  • DA in RL = system identification + domain randomization + observation adaptation πŸ‘

    • formulates control as a problem of probabilistic inference πŸ’§

πŸ”Ή Unsupervised Domain Adaptation with Dynamics Aware Rewards in Reinforcement Learning πŸ‘ πŸ”₯ πŸŒ‹

DARS: we introduce a KL regularized objective to encourage emergence of skills, rewarding the agent for both discovering skills and aligning its behaviors respecting dynamics shifts.

πŸ”Ή Mutual Alignment Transfer Learning πŸ‘ πŸ”₯ ​ ​

The developed approach harnesses auxiliary rewards to guide the exploration for the real world agent based on the proficiency of the agent in simulation and vice versa.

πŸ”Ή SimGAN: Hybrid Simulator Identification for Domain Adaptation via Adversarial Reinforcement Learning [real dog] πŸ‘ πŸ”₯ ​ ​

a framework to tackle domain adaptation by identifying a hybrid physics simulator to match the simulated trajectories to the ones from the target domain, using a learned discriminative loss to address the limitations associated with manual loss design.

πŸ”Ή Disentangled Skill Embeddings for Reinforcement Learning πŸ”₯ πŸŒ‹ πŸ’₯ πŸ’₯ ​ ​ ​ ​

We have developed a multi-task framework from a variational inference perspective that is able to learn latent spaces that generalize to unseen tasks where the dynamics and reward can change independently.

πŸ”Ή Transfer Learning in Deep Reinforcement Learning: A Survey πŸ’¦ ​

Evaluation metrics: Mastery and Generalization.

TRANSFER LEARNING APPROACHES: Reward Shaping; Learning from Demonstrations; Policy Transfer (Transfer Learning via Policy Distillation, Transfer Learning via Policy Reuse); Inter-Task Mapping; Representation Transfer(Reusing Representations, Disentangling Representations);

πŸ”Ή Provably Efficient Model-based Policy Adaptation πŸ‘ πŸ”₯ πŸŒ‹ πŸ’§ πŸ‘ ​

We prove that the approach learns policies in the target environment that can recover trajectories from the source environment, and establish the rate of convergence in general settings.

β­• reward shaping

πŸ”Ή Generalization to New Actions in Reinforcement Learning πŸ‘

We propose a two-stage framework where the agent first infers action representations from action information acquired separately from the task. A policy flexible to varying action sets is then trained with generalization objectives.

πŸ”Ή Policy Transfer across Visual and Dynamics Domain Gaps via Iterative Grounding πŸ‘ πŸ”₯

alternates between (1) directly minimizing both visual and dynamics domain gaps by grounding the source env in the target env domains, and (2) training a policy on the grounded source env.

πŸ”Ή Learning Agile Robotic Locomotion Skills by Imitating Animals πŸ‘ πŸ”₯ πŸŒ‹ ​

We show that by leveraging reference motion data, a single learning-based approach is able to automatically synthesize controllers for a diverse repertoire of behaviors for legged robots. By incorporating sample efficient domain adaptation techniques into the training process, our system is able to learn adaptive policies in simulation that can then be quickly adapted for real-world deployment.

πŸ”Ή RMA: Rapid Motor Adaptation for Legged Robots πŸ‘

The robot achieves this high success rate despite never having seen unstable or sinking ground, obstructive vegetation or stairs during training. All deployment results are with the same policy without any simulation calibration, or real-world fine-tuning. ​ ​

πŸ”Ή A System for General In-Hand Object Re-Orientation πŸ‘

We present a simple model-free framework (teacher-student distillation) that can learn to reorient objects with both the hand facing upwards and downwards (+ DAgger).

πŸ”Ή LEARNING VISION-GUIDED QUADRUPEDAL LOCOMOTION END-TO-END WITH CROSS-MODAL TRANSFORMERS πŸ”₯

LocoTransformer: We propose to address quadrupedal locomotion tasks using Reinforcement Learning (RL) with a Transformer-based model that learns to combine proprioceptive information and high-dimensional depth sensor inputs.

πŸ”Ή Why Generalization in RL is Difficult: Epistemic POMDPs and Implicit Partial Observability πŸ”₯

we recast the problem of generalization in RL as solving the induced partially observed Markov decision process, which we call the epistemic POMDP.

πŸ”Ή Learning quadrupedal locomotion over challenging terrain πŸ”₯ πŸŒ‹

We present a novel solution to incorporating proprioceptive feedback in locomotion control and demonstrate remarkable zero-shot generalization from simulation to natural environments.

πŸ”Ή RMA: Rapid Motor Adaptation for Legged Robots πŸ”₯ πŸŒ‹

RMA consists of two components: a base policy and an adaptation module. The combination of these components enables the robot to adapt to novel situations in fractions of a second. RMA is trained completely in simulation without using any domain knowledge like reference trajectories or predefined foot trajectory generators and is deployed on the A1 robot without any fine-tuning.

πŸ”Ή Fast Adaptation to New Environments via Policy-Dynamics Value Functions πŸ”₯

PD-VF: explicitly estimates the cumulative reward in a space of policies and environments.

πŸ”Ή PAnDR: Fast Adaptation to New Environments from Offline Experiences via Decoupling Policy and Environment Representations πŸ”₯ πŸ‘

In offline phase, the environment representation and policy representation are learned through contrastive learning and policy recovery, respectively. The representations are further refined by mutual information optimization to make them more decoupled and complete.

πŸ”Ή Learning Robust Policy against Disturbance in Transition Dynamics via State-Conservative Policy Optimization πŸŒ‹

State-Conservative Policy Optimization (SCPO) reduces the disturbance in transition dynamics to that in state space and then approximates it by a simple gradient-based regularizer.

πŸ”Ή LEARNING A SUBSPACE OF POLICIES FOR ONLINE ADAPTATION IN REINFORCEMENT LEARNING 😢

LoP does not need any particular tuning or definition of additional architectures to handle diversity, which is a critical aspect in the online adaptation setting where hyper-parameter tuning is impossible or at least very difficult.

πŸ”Ή ADAPT-TO-LEARN: POLICY TRANSFER IN REINFORCEMENT LEARNING πŸ‘ πŸ‘ ​

New: Adaptive Policy Transfer in Reinforcement Learning

adapt the source policy to learn to solve a target task with significant transition differences and uncertainties.

πŸ”Ή Unsupervised Domain Adaptation with Dynamics Aware Rewards in Reinforcement Learning πŸ”₯ πŸŒ‹

DARS: We propose an unsupervised domain adaptation method to identify and acquire skills across dynamics. We introduce a KL regularized objective to encourage emergence of skills, rewarding the agent for both discovering skills and aligning its behaviors respecting dynamics shifts.

πŸ”Ή SINGLE EPISODE POLICY TRANSFER IN REINFORCEMENT LEARNING πŸ”₯ πŸ‘ ​ ​

Our key idea of optimized probing for accelerated latent variable inference is to train a dedicated probe policy πϕ(a|s) to generate a dataset D of short trajectories at the beginning of all training episodes, such that the VAE’s performance on D is optimized.

πŸ”Ή VARIBAD: A VERY GOOD METHOD FOR BAYES-ADAPTIVE DEEP RL VIA META-LEARNING πŸ”₯ πŸŒ‹

we introduce variational Bayes-Adaptive Deep RL (variBAD), a way to meta-learn to perform approximate inference in an unknown environment, and incorporate task uncertainty directly during action selection.

πŸ”Ή Dynamical Variational Autoencoders: A Comprehensive Review πŸ’¦ πŸ’¦ ​ ​

πŸ”Ή Dynamics Generalization via Information Bottleneck in Deep Reinforcement Learning​ πŸ”₯ ​ ​

In particular, we show that the poor generalization in unseen tasks is due to the DNNs memorizing environment observations, rather than extracting the relevant information for a task. To prevent this, we impose communication constraints as an information bottleneck between the agent and the environment.

πŸ”Ή UNIVERSAL AGENT FOR DISENTANGLING ENVIRONMENTS AND TASKS πŸ”₯ πŸŒ‹

The environment-specific unit handles how to move from one state to the target state; and the task-specific unit plans for the next target state given a specific task.

πŸ”Ή Decoupling Dynamics and Reward for Transfer Learning πŸ‘

We separate learning the task representation, the forward dynamics, the inverse dynamics and the reward function of the domain.

πŸ”Ή Neural Dynamic Policies for End-to-End Sensorimotor Learning πŸ”₯ πŸŒ‹

We propose Neural Dynamic Policies (NDPs) that make predictions in trajectory distribution space as opposed to raw control spaces. [see Abstract!] Similar in spirit to UNIVERSAL AGENT.

πŸ”Ή Accelerating Reinforcement Learning with Learned Skill Priors πŸ‘ πŸ”₯

We propose a deep latent variable model that jointly learns an embedding space of skills and the skill prior from offline agent experience. We then extend common maximumentropy RL approaches to use skill priors to guide downstream learning.

πŸ”Ή LEARNING CROSS-DOMAIN CORRESPONDENCE FOR CONTROL WITH DYNAMICS CYCLE-CONSISTENCY πŸ‘ πŸ‘ πŸ”₯

In this paper, we propose to learn correspondence across such domains emphasizing on differing modalities (vision and internal state), physics parameters (mass and friction), and morphologies (number of limbs). Importantly, correspondences are learned using unpaired and randomly collected data from the two domains. We propose dynamics cycles that align dynamic robotic behavior across two domains using a cycle consistency constraint.

πŸ”Ή Hierarchically Decoupled Imitation for Morphological Transfer 😢

incentivizing a complex agent’s low-level policy to imitate a simpler agent’s low-level policy significantly improves zero-shot high-level transfer; KL-regularized training of the high level stabilizes learning and prevents mode collapse.

πŸ”Ή Improving Generalization in Reinforcement Learning with Mixture Regularization πŸ‘

these approaches only locally perturb the observations regardless of the training environments, showing limited effectiveness on enhancing the data diversity and the generalization performance.

πŸ”Ή AdaRL: What, Where, and How to Adapt in Transfer Reinforcement Learning πŸ”₯ πŸ’§

we characterize a minimal set of representations, including both domain-specific factors and domain-shared state representations, that suffice for reliable and low-cost transfer.

πŸ”Ή A GENERAL THEORY OF RELATIVITY IN REINFORCEMENT LEARNING πŸ”₯ πŸŒ‹

The proposed theory deeply investigates the connection between any two cumulative expected returns defined on different policies and environment dynamics: Relative Policy Optimization (RPO) updates the policy using the relative policy gradient to transfer the policy evaluated in one environment to maximize the return in another, while Relative Transition Optimization (RTO) updates the parameterized dynamics model (if there exists) using the relative transition gradient to reduce the gap between the dynamics of the two environments.

πŸ”Ή COPA: CERTIFYING ROBUST POLICIES FOR OFFLINE REINFORCEMENT LEARNING AGAINST POISONING ATTACKS πŸ‘

We focus on certifying the robustness of offline RL in the presence of poisoning attacks, where a subset of training trajectories could be arbitrarily manipulated. We propose the first certification framework, COPA to certify the number of poisoning trajectories that can be tolerated regarding different certification criteria.

πŸ”Ή CROP: CERTIFYING ROBUST POLICIES FOR REINFORCEMENT LEARNING THROUGH FUNCTIONAL SMOOTHING πŸ”₯

We propose two particular types of robustness certification criteria: robustness of per-state actions and lower bound of cumulative rewards.

πŸ”Ή Learning Action Translator for Meta Reinforcement Learning on Sparse-Reward Tasks πŸ”₯ πŸ”₯

MCAT: we propose to learn an action translator among multiple training tasks. The objective function forces the translated action to behave on the target task similarly to the source action on the source task. We consider the policy transfer for any pair of source and target tasks in the training task distribution.

πŸ”Ή AACC: Asymmetric Actor-Critic in Contextual Reinforcement Learning πŸ”₯

the critic is trained with environmental factors and observation while the actor only gets the observation as input.

πŸ”Ή Max-Min Off-Policy Actor-Critic Method Focusing on Worst-Case Robustness to Model Misspecification πŸ”₯

Max-Min Twin Delayed Deep Deterministic Policy Gradient algorithm (M2TD3), which solves a max-min optimization problem using a simultaneous gradient ascent descent approach.

β­• Multi-task

πŸ”Ή Multi-Task Reinforcement Learning without Interference πŸ”₯

We develop a general approach that changes the multi-task optimization landscape to alleviate conflicting gradients across tasks, with two instantiations, one architectural and one algorithmic, that prevent gradients for different tasks from interfering with one another.

πŸ”Ή Multi-Task Reinforcement Learning with Soft Modularization 😢

Given a base policy network, we design a routing network which estimates different routing strategies to reconfigure the base network for each task.

πŸ”Ή Multi-task Batch Reinforcement Learning with Metric Learning πŸ”₯

MBML: Because the different datasets may have state-action distributions with large divergence, the task inference module can learn to ignore the rewards and spuriously correlate only state-action pairs to the task identity, leading to poor test time performance. To robustify task inference, we propose a novel application of the triplet loss.

πŸ”Ή MULTI-BATCH REINFORCEMENT LEARNING VIA SAMPLE TRANSFER AND IMITATION LEARNING 😢

BAIL+ and MBAIL

πŸ”Ή Knowledge Transfer in Multi-Task Deep Reinforcement Learning for Continuous Control 😢

KTM-DRL enables a single multi-task agent to leverage the offline knowledge transfer, the online learning, and the hierarchical experience replay for achieving expert-level performance in multiple different continuous control tasks.

πŸ”Ή Multi-Task Reinforcement Learning with Context-based Representations πŸ”₯

CARE: We posit that an efficient approach to knowledge transfer is through the use of multiple context-dependent, composable representations shared across a family of tasks. Metadata can help to learn interpretable representations and provide the context to inform which representations to compose and how to compose them.

πŸ”Ή CARL: A Benchmark for Contextual and Adaptive Reinforcement Learning πŸ‘

We propose CARL, a collection of well-known RL environments extended to contextual RL problems to study generalization.

πŸ”Ή Switch Trajectory Transformer with Distributional Value Approximation for Multi-Task Reinforcement Learning 😢

We propose SwitchTT, a multi-task extension to Trajectory Transformer but enhanced with two striking features: (i) exploiting a sparsely activated model to reduce computation cost in multitask offline model learning and (ii) adopting a distributional trajectory value estimator that improves policy performance, especially in sparse reward settings.

πŸ”Ή Efficient Planning in a Compact Latent Action Space 😢

TAP avoids planning step-by-step in a high-dimensional continuous action space but instead looks for the optimal latent code sequences by beam search.

πŸ”Ή MULTI-CRITIC ACTOR LEARNING: TEACHING RL POLICIES TO ACT WITH STYLE πŸ‘

Multi-Critic Actor Learning (MultiCriticAL) proposes instead maintaining separate critics for each task being trained while training a single multi-task actor.

πŸ”Ή Investigating Generalisation in Continuous Deep Reinforcement Learning

πŸ”Ή Evolution Gym: A Large-Scale Benchmark for Evolving Soft Robots πŸ”₯

πŸ”Ή Beyond Tabula Rasa: Reincarnating Reinforcement Learning πŸŒ‹

As a step towards enabling reincarnating RL from any agent to any other agent, we focus on the specific setting of efficiently transferring an existing sub-optimal policy to a standalone value-based RL agent.

πŸ”Ή A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning πŸ‘ πŸ”₯

DAgger (Dataset Aggregation): trains a deterministic policy that achieves good performance guarantees under its induced distribution of states.

πŸ”Ή Multifidelity Reinforcement Learning with Control Variates πŸ”₯

MFMCRL: a multifidelity estimator that exploits the cross-correlations between the low- and high-fidelity returns is proposed to reduce the variance in the estimation of the state-action value function.

πŸ”Ή Robust Trajectory Prediction against Adversarial Attacks πŸ‘

we propose an adversarial training framework with three main components, including (1) a deterministic attack for the inner maximization process of the adversarial training, (2) additional regularization terms for stable outer minimization of adversarial training, and (3) a domain-specific augmentation strategy to achieve a better performance trade-off on clean and adversarial data.

πŸ”Ή Model-based Trajectory Stitching for Improved Offline Reinforcement Learning πŸ”₯

TS: A stitching event consists of a transition between a pair of observed states through a synthetic and highly probable action.

πŸ”Ή BATS: Best Action Trajectory Stitching πŸ”₯

BATS: we narrow the pool of candidate stitches to those that are both feasible and impactful.

πŸ”Ή TRANSFER RL VIA THE UNDO MAPS FORMALISM πŸ”₯

TvD: characterizing the discrepancy in environments by means of (potentially complex) transformation between their state spaces, and thus posing the problem of transfer as learning to undo this transformation.

πŸ”Ή Provably Sample-Efficient RL with Side Information about Latent Dynamics

TASID:

β­• β­• β­• Out-of-Distribution (OOD) Generalization: Modularity ---> Generalization

πŸ”Ή Invariant Risk Minimization πŸ‘ πŸ”₯ πŸ’₯ (Introduction is good!) [slide] [information theoretic view]

To learn invariances across environments, find a data representation such that the optimal classifier on top of that representation matches for all environments.
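
A minimal sketch of the IRMv1 penalty in PyTorch, assuming a binary-classification setup with a shared featurizer; the names `logits_per_env` / `labels_per_env` are illustrative, not from any official code:

```python
import torch
import torch.nn.functional as F

def irm_penalty(logits, labels):
    # Gradient of the per-environment risk w.r.t. a fixed dummy classifier
    # scale w = 1.0; it vanishes when the classifier on top of the shared
    # representation is simultaneously optimal for this environment.
    w = torch.tensor(1.0, requires_grad=True)
    risk = F.binary_cross_entropy_with_logits(logits * w, labels)
    grad, = torch.autograd.grad(risk, [w], create_graph=True)
    return grad.pow(2)

def irmv1_objective(logits_per_env, labels_per_env, lam=1.0):
    erm = sum(F.binary_cross_entropy_with_logits(l, y)
              for l, y in zip(logits_per_env, labels_per_env))
    penalty = sum(irm_penalty(l, y)
                  for l, y in zip(logits_per_env, labels_per_env))
    return erm + lam * penalty
```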

πŸ”Ή Out-of-Distribution Generalization via Risk Extrapolation πŸ‘ πŸ”₯

REx can be viewed as encouraging robustness over affine combinations of training risks, by encouraging strict equality between training risks.
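
A sketch of the V-REx variant, which implements that equality pressure as a variance penalty over per-environment risks (a simplification; `beta` is an assumed hyperparameter):

```python
import torch

def vrex_objective(env_risks, beta=10.0):
    # env_risks: list of scalar training risks, one per environment.
    # The variance term pushes the risks toward strict equality, which the
    # paper motivates as robustness over affine combinations of training risks.
    risks = torch.stack(list(env_risks))
    return risks.mean() + beta * risks.var()
```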

πŸ”Ή OUT-OF-DISTRIBUTION GENERALIZATION ANALYSIS VIA INFLUENCE FUNCTION πŸ”₯

if a learnt model f_ΞΈΜ‚ manages to simultaneously achieve a small V_Ξ³Μ‚|ΞΈΜ‚ and high accuracy over E_test, it should have good OOD accuracy.

πŸ”Ή EMPIRICAL OR INVARIANT RISK MINIMIZATION? A SAMPLE COMPLEXITY PERSPECTIVE πŸ’§ ​

πŸ”Ή Invariant Rationalization πŸ‘ πŸ”₯ πŸ’₯

MMI can be problematic because it picks up spurious correlations between the input features and the output. Instead, we introduce a game-theoretic invariant rationalization criterion where the rationales are constrained to enable the same predictor to be optimal across different environments.

πŸ”Ή Invariant Risk Minimization Games πŸ’§ ​

​ ​

β­• influence function

  • πŸ”Ή Understanding Black-box Predictions via Influence Functions πŸ”₯ πŸ’§

    Upweighting a training point; Perturbing a training input; Efficiently calculating influence πŸ’§ ; (a toy sketch of the upweighting influence estimate follows after this list)

    • πŸ”Ή INFLUENCE FUNCTIONS IN DEEP LEARNING ARE FRAGILE πŸ‘ πŸ”₯ πŸ‘

      non-convexity of the loss function --- different initializations; parameters might be very large --- substantial Taylor’s approximation error of the loss function; computationally very expensive --- approximate inverse-Hessian Vector product techniques which might be erroneous; different architectures can have different loss landscape geometries near the optimal model parameters, leading to varying influence estimates.

    • πŸ”Ή On the Accuracy of Influence Functions for Measuring Group Effects πŸ‘ ​

      when measuring the change in test prediction or test loss, influence is additive.

    β­• do-calculus ---> causal inference (Interventions) ---> counterfactuals

    see inFERENCe's blog πŸ‘ πŸ”₯ πŸ’₯ the intervention conditional p(y|do(X=xΜ‚)) is the average of counterfactuals over the observable population.
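
A toy sketch of the classic upweighting-influence estimate, I(z, z_test) = -βˆ‡L(z_test, ΞΈΜ‚)α΅€ H⁻¹ βˆ‡L(z, ΞΈΜ‚), using an explicit Hessian; this is only viable for tiny parameter vectors (the paper relies on implicit Hessian-vector products), and the `train_loss` / `per_example_loss` names are illustrative:

```python
import torch

def upweighting_influence(train_loss, per_example_loss, theta, z, z_test):
    # theta: 1-D parameter tensor with requires_grad=True.
    # train_loss(theta) -> empirical risk over the whole training set.
    # per_example_loss(theta, example) -> loss on a single example.
    g_test, = torch.autograd.grad(per_example_loss(theta, z_test), theta)
    g_z, = torch.autograd.grad(per_example_loss(theta, z), theta)
    H = torch.autograd.functional.hessian(train_loss, theta)   # (d, d)
    return -(g_test @ torch.linalg.solve(H, g_z))
```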

πŸ”Ή Soft-Robust Actor-Critic Policy-Gradient πŸ˜• ​

Robust RL has shown that by considering the worst case scenario, robust policies can be overly conservative. Soft-Robust Actor Critic (SR-AC) learns an optimal policy with respect to a distribution over an uncertainty set and stays robust to model uncertainty but avoids the conservativeness of robust strategies.

πŸ”Ή A Game-Theoretic Perspective of Generalization in Reinforcement Learning πŸ”₯ πŸ”₯

We propose a game-theoretic framework for generalization in reinforcement learning, named GiRL, in which an RL agent is trained against an adversary over a set of tasks, and the adversary can manipulate the distributions over tasks within a given threshold.

πŸ”Ή UNSUPERVISED TASK CLUSTERING FOR MULTI-TASK REINFORCEMENT LEARNING πŸ‘ πŸ”₯

EM-Task-Clustering: We propose a general approach to automatically cluster together similar tasks during training. Our method, inspired by the expectation-maximization algorithm, succeeds at finding clusters of related tasks and uses these to improve sample complexity.

πŸ”Ή Learning Dynamics and Generalization in Reinforcement Learning πŸ”₯

TD learning dynamics discourage interference, and that while this may have a beneficial effect on stability during training, it can reduce the ability of the network to generalize to new observations.

IL (IRL)

πŸ”Ή Reinforced Imitation Learning by Free Energy Principle πŸ’§

πŸ”Ή Error Bounds of Imitating Policies and Environments πŸŒ‹ πŸ’¦

πŸ”Ή What Matters for Adversarial Imitation Learning?

πŸ”Ή Distributionally Robust Imitation Learning πŸ‘ πŸ”₯ πŸ’§

This paper studies Distributionally Robust Imitation Learning (DROIL) and establishes a close connection between DROIL and Maximum Entropy Inverse Reinforcement Learning.

πŸ”Ή Provable Representation Learning for Imitation Learning via Bi-level Optimization

πŸ”Ή Provable Representation Learning for Imitation with Contrastive Fourier Features πŸ”₯ πŸ’₯

We derive a representation learning objective that provides an upper bound on the performance difference between the target policy and a lowdimensional policy trained with max-likelihood, and this bound is tight regardless of whether the target policy itself exhibits low-dimensional structure.

πŸ”Ή TRAIL: NEAR-OPTIMAL IMITATION LEARNING WITH SUBOPTIMAL DATA πŸ”₯ πŸŒ‹

TRAIL (Transition-Reparametrized Actions for Imitation Learning): We present training objectives that use offline datasets to learn a factored transition model whose structure enables the extraction of a latent action space. Our theoretical analysis shows that the learned latent action space can boost the sample-efficiency of downstream imitation learning, effectively reducing the need for large near-optimal expert datasets through the use of auxiliary non-expert data.

πŸ”Ή Imitation Learning via Differentiable Physics πŸ”₯

ILD: incorporates the differentiable physics simulator as a physics prior into its computational graph for policy learning.

πŸ”Ή Of Moments and Matching: A Game-Theoretic Framework for Closing the Imitation Gap πŸŒ‹

AdVIL, AdRIL, and DAeQuIL:

πŸ”Ή Generalizable Imitation Learning from Observation via Inferring Goal Proximity πŸ‘ πŸ”₯

we learn a goal proximity function (task progress) and utilize it as a dense reward for policy learning.

πŸ”Ή Show me the Way: Intrinsic Motivation from Demonstrations 😢

extracting an intrinsic bonus from the demonstrations.

πŸ”Ή Out-of-Dynamics Imitation Learning from Multimodal Demonstrations πŸ”₯

OOD-IL enables imitation learning to utilize demonstrations from a wide range of demonstrators, but introduces a new challenge: some demonstrations cannot be achieved by the imitator due to the different dynamics. The paper develops a better transferability measurement to handle this.

πŸ”Ή Imitating Latent Policies from Observation πŸ”₯ πŸŒ‹ πŸ’₯

ILPO: We introduce a method that characterizes the causal effects of latent actions on observations while simultaneously predicting their likelihood. We then outline an action alignment procedure that leverages a small amount of environment interactions to determine a mapping between the latent and real-world actions.

πŸ”Ή Latent Policies for Adversarial Imitation Learning πŸ‘

We use an action encoder-decoder model to obtain a low-dimensional latent action space and train a LAtent Policy using Adversarial imitation Learning (LAPAL).

πŸ”Ή Recent Advances in Imitation Learning from Observation

πŸ”Ή PREFERENCES IMPLICIT IN THE STATE OF THE WORLD πŸ”₯ πŸ’§

RLSP: First, we identify the state of the world at initialization as a source of information about human preferences. Second, we leverage this insight to derive an algorithm, Reward Learning by Simulating the Past (RLSP), which infers the reward from the initial state based on Maximum Causal Entropy.

πŸ”Ή Population-Guided Imitation Learning πŸ‘

πŸ”Ή Towards Learning to Imitate from a Single Video Demonstration πŸ‘ πŸ”₯

using contrastive training to learn a reward function comparing an agent’s behaviour with a single demonstration.

πŸ”Ή Concurrent Training Improves the Performance of Behavioral Cloning from Observation πŸ”₯

BCO* (behavioral cloning from observation)

πŸ”Ή Identifiability and Generalizability from Multiple Experts in Inverse Reinforcement Learning

Reward Identifiability

πŸ”Ή Improving Policy Learning via Language Dynamics Distillation

LDD: pretrains a model to predict environment dynamics given demonstrations with language descriptions, and then fine-tunes these language-aware pretrained representations via reinforcement learning.

πŸ”Ή LEARNING CONTROL BY ITERATIVE INVERSION πŸ”₯

Iterative Inversion (IT-IN): Our input is a set of demonstrations of desired behavior, given as video embeddings of trajectories (without actions), and our method iteratively learns to imitate trajectories generated by the current policy, perturbed by random exploration noise.

πŸ”Ή CEIP: Combining Explicit and Implicit Priors for Reinforcement Learning with Demonstrations πŸ‘

CEIP exploits multiple implicit priors in the form of normalizing flows in parallel to form a single complex prior. Moreover, CEIP uses an effective explicit retrieval and push-forward mechanism to condition the implicit priors.

πŸ”Ή Planning for Sample Efficient Imitation Learning πŸ”₯ πŸ”₯

EfficientImitate (EI): we show the seemingly incompatible two classes of imitation algorithms (BC and AIL) can be naturally unified under our framework, enjoying the benefits of both.

πŸ”Ή Robust Imitation via Mirror Descent Inverse Reinforcement Learning

MD-AIRL:

πŸ”Ή Learning and Retrieval from Prior Data for Skill-based Imitation Learning πŸ”₯

Skill-Augmented Imitation Learning with prior Retrieval (SAILOR)

πŸ”Ή LS-IQ: IMPLICIT REWARD REGULARIZATION FOR INVERSE REINFORCEMENT LEARNING

  • Adding Noise

    πŸ”Ή Learning from Suboptimal Demonstration via Self-Supervised Reward Regression πŸ‘ πŸ”₯

    Recent attempts to learn from sub-optimal demonstration leverage pairwise rankings following the Luce-Shepard rule. However, we show these approaches make incorrect assumptions and thus suffer from brittle, degraded performance. We overcome these limitations by developing a novel approach that bootstraps off suboptimal demonstrations to synthesize optimality-parameterized data to train an idealized reward function.

    πŸ”Ή Robust Imitation Learning from Noisy Demonstrations πŸ”₯ πŸŒ‹

    In this paper, we first theoretically show that robust imitation learning can be achieved by optimizing a classification risk with a symmetric loss. Based on this theoretical finding, we then propose a new imitation learning method that optimizes the classification risk by effectively combining pseudo-labeling with co-training.

    πŸ”Ή Imitation Learning from Imperfect Demonstration πŸ”₯ πŸŒ‹ πŸ‘ ​

    A novel approach that utilizes confidence scores, which describe the quality of demonstrations: two-step importance weighting imitation learning (2IWIL) and generative adversarial imitation learning with imperfect demonstration and confidence (IC-GAIL), both based on the idea of reweighting.

    πŸ”Ή Variational Imitation Learning with Diverse-quality Demonstrations πŸ”₯ πŸ’§

    VILD: We show that simple quality-estimation approaches might fail due to compounding error, and fix this issue by jointly estimating both the quality and reward using a variational approach.

    πŸ”Ή BEHAVIORAL CLONING FROM NOISY DEMONSTRATIONS πŸŒ‹ πŸ’¦

    we propose an imitation learning algorithm to address the problem without any environment interactions and annotations associated with the non-optimal demonstrations.

    πŸ”Ή Robust Imitation Learning from Corrupted Demonstrations πŸ”₯ πŸŒ‹

    We propose a novel robust algorithm by minimizing a Median-of-Means (MOM) objective which guarantees the accurate estimation of policy, even in the presence of constant fraction of outliers.

    πŸ”Ή Confidence-Aware Imitation Learning from Demonstrations with Varying Optimality πŸ‘ πŸ”₯ πŸŒ‹

    CAIL: learns a well-performing policy from confidence-reweighted demonstrations, while using an outer loss to track the performance of our model and to learn the confidence.

    πŸ”Ή Imitation Learning by Estimating Expertise of Demonstrators πŸ”₯ πŸŒ‹

    ILEED: We develop and optimize a joint model over a learned policy and expertise levels of the demonstrators. This enables our model to learn from the optimal behavior and filter out the suboptimal behavior of each demonstrator.

    πŸ”Ή Learning to Weight Imperfect Demonstrations πŸŒ‹

    We provide a rigorous mathematical analysis, presenting that the weights of demonstrations can be exactly determined by combining the discriminator and agent policy in GAIL.

    πŸ”Ή Robust Adversarial Imitation Learning via Adaptively-Selected Demonstrations πŸ”₯

    SAIL: good demonstrations can be adaptively selected for training while bad demonstrations are abandoned.

    πŸ”Ή Policy Learning Using Weak Supervision πŸŒ‹ πŸ”₯

    PeerRL: We treat the β€œweak supervision” as imperfect information coming from a peer agent, and evaluate the learning agent’s policy based on a β€œcorrelated agreement” with the peer agent’s policy (instead of simple agreements).

πŸ”Ή Rethinking Importance Weighting for Transfer Learning πŸŒ‹

We review recent advances based on joint and dynamic importance predictor estimation. Furthermore, we introduce a method of causal mechanism transfer that incorporates causal structure in TL.

πŸ”Ή Inverse Decision Modeling: Learning Interpretable Representations of Behavior πŸ”₯ πŸŒ‹

We develop an expressive, unifying perspective on inverse decision modeling: a framework for learning parameterized representations of sequential decision behavior.

πŸ”Ή DISCRIMINATOR-ACTOR-CRITIC: ADDRESSING SAMPLE INEFFICIENCY AND REWARD BIAS IN ADVERSARIAL IMITATION LEARNING 😢

DAC: To address reward bias, we propose a simple mechanism whereby the rewards for absorbing states are also learned; To improve sample efficiency, we perform off-policy training.

πŸ”Ή Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations πŸ‘

T-REX: a reward learning technique for high-dimensional tasks that can learn to extrapolate intent from suboptimal ranked demonstrations.
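
A sketch of the ranking (Bradley-Terry style) loss T-REX trains its reward network with, assuming `reward_net` maps a trajectory's observations to per-step rewards (the names are illustrative):

```python
import torch
import torch.nn.functional as F

def trex_ranking_loss(reward_net, traj_low, traj_high):
    # traj_low is ranked worse than traj_high. The predicted return of each
    # trajectory is the sum of predicted per-step rewards, and the ranking is
    # treated as a two-way softmax classification problem.
    returns = torch.stack([reward_net(traj_low).sum(),
                           reward_net(traj_high).sum()]).unsqueeze(0)  # (1, 2)
    target = torch.tensor([1])  # index of the preferred trajectory
    return F.cross_entropy(returns, target)
```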

πŸ”Ή Better-than-Demonstrator Imitation Learning via Automatically-Ranked Demonstrations πŸ‘

D-REX: a ranking-based reward learning algorithm that does not require ranked demonstrations, which injects noise into a policy learned through behavioral cloning to automatically generate ranked demonstrations.

πŸ”Ή DART: Noise Injection for Robust Imitation Learning πŸ”₯

We propose an off-policy approach that injects noise into the supervisor’s policy while demonstrating. This forces the supervisor to demonstrate how to recover from errors. We propose a new algorithm, DART (Disturbances for Augmenting Robot Trajectories), that collects demonstrations with injected noise, and optimizes the noise level to approximate the error of the robot’s trained policy during data collection.

πŸ”Ή Bayesian Inverse Reinforcement Learning

πŸ”Ή Deep Bayesian Reward Learning from Preferences

B-REX: Our approach uses successor feature representations and preferences over demonstrations to efficiently generate samples from the posterior distribution over the demonstrator’s reward function without requiring an MDP solver.

πŸ”Ή Safe Imitation Learning via Fast Bayesian Reward Inference from Preferences πŸ”₯

Bayesian REX (B-REX)

πŸ”Ή Asking Easy Questions: A User-Friendly Approach to Active Reward Learning πŸ”₯ πŸ‘

we explore an information gain formulation for optimally selecting questions that naturally account for the human’s ability to answer. Our approach identifies questions that optimize the trade-off between robot and human uncertainty, and determines when these questions become redundant or costly. + Volume Removal Solution

πŸ”Ή Few-Shot Preference Learning for Human-in-the-Loop RL πŸ”₯ πŸ”₯

We pre-train preference models on prior task data and quickly adapt them for new tasks using only a handful of queries.

πŸ”Ή A Ranking Game for Imitation Learning πŸ”₯ πŸ”₯

The rankinggame additionally affords a broader perspective of imitation, going beyond using only expert demonstrations, and utilizing rankings/preferences over suboptimal behaviors.

πŸ”Ή Learning Multimodal Rewards from Rankings πŸ”₯

We formulate the multimodal reward learning as a mixture learning problem and develop a novel ranking-based learning approach, where the experts are only required to rank a given set of trajectories.

πŸ”Ή Semi-Supervised Imitation Learning of Team Policies from Suboptimal Demonstrations

BTIL:

πŸ”Ή Learning Reward Functions from Scale Feedback πŸ‘

Instead of a strict question on which of the two proposed trajectories the user prefers, we allow for more nuanced feedback using a slider bar.

πŸ”Ή Interactive Learning from Policy-Dependent Human Feedback

COACH:

πŸ”Ή Towards Sample-efficient Apprenticeship Learning from Suboptimal Demonstration 😢

SSRR, S3RR: noise-performance curve fitting --> regresses a reward function of trajectory states and actions.

πŸ”Ή BASIS FOR INTENTIONS: EFFICIENT INVERSE REINFORCEMENT LEARNING USING PAST EXPERIENCE πŸ”₯ πŸŒ‹

BASIS, which leverages multi-task RL pre-training and successor features to allow an agent to build a strong basis for intentions that spans the space of possible goals in a given domain.

πŸ”Ή POSITIVE-UNLABELED REWARD LEARNING πŸ”₯ πŸŒ‹

PURL: we connect these two classes of reward learning methods (GAIL, SL) to positive-unlabeled (PU) learning, and we show that by applying a large-scale PU learning algorithm to the reward learning problem, we can address both the reward under- and over-estimation problems simultaneously.

πŸ”Ή Combating False Negatives in Adversarial Imitation Learning πŸ‘

Fake Conditioning

πŸ”Ή Task-Relevant Adversarial Imitation Learning πŸ”₯ πŸ”₯

TRAIL proposes to constrain the GAIL discriminator such that it is not able to distinguish between certain, preselected expert and agent observations which do not contain task behavior.

πŸ”Ή Environment Design for Inverse Reinforcement Learning πŸ”₯ πŸ”₯

We formalise a framework for this environment design process in which learner and expert repeatedly interact, and construct algorithms that actively seek information about the rewards by carefully curating environments for the human to demonstrate the task in.

πŸ”Ή Reward Identification in Inverse Reinforcement Learning

πŸ”Ή Identifiability in inverse reinforcement learning

Offline RL

πŸ”Ή Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems​ πŸ’₯ πŸ’₯ ​ πŸ’§ ​

Offline RL with dynamic programming: distributional shift; policy constraints; uncertainty estimation; conservative Q-learning and Pessimistic Value-function;

https://danieltakeshi.github.io/2020/06/28/offline-rl/ πŸ’¦

https://ai.googleblog.com/2020/08/tackling-open-challenges-in-offline.html πŸ’¦

https://sites.google.com/view/offlinerltutorial-neurips2020/home πŸ’¦

πŸ”Ή D4RL: DATASETS FOR DEEP DATA-DRIVEN REINFORCEMENT LEARNING πŸŒ‹ πŸŒ‹

examples of such properties include: datasets generated via hand-designed controllers and human demonstrators, multitask datasets where an agent performs different tasks in the same environment, and datasets collected with mixtures of policies.

πŸ”Ή d3rlpy: An Offline Deep Reinforcement Learning Library πŸŒ‹

πŸ”Ή A Survey on Offline Reinforcement Learning: Taxonomy, Review, and Open Problems πŸ’§

πŸ”Ή An Optimistic Perspective on Offline Reinforcement Learning πŸ‘ ​

To enhance generalization in the offline setting, we present Random Ensemble Mixture (REM), a robust Q-learning algorithm that enforces optimal Bellman consistency on random convex combinations of multiple Q-value estimates.

πŸ”Ή OPAL: OFFLINE PRIMITIVE DISCOVERY FOR ACCELERATING OFFLINE REINFORCEMENT LEARNING πŸ’₯

When presented with offline data composed of a variety of behaviors, an effective way to leverage this data is to extract a continuous space of recurring and temporally extended primitive behaviors before using these primitives for downstream task learning. OFFLINE unsupervised RL.

πŸ”Ή Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization πŸ‘ πŸŒ‹ πŸ’₯

πŸ”Ή TOWARDS DEPLOYMENT-EFFICIENT REINFORCEMENT LEARNING: LOWER BOUND AND OPTIMALITY πŸ˜•

We propose such a formulation for deployment-efficient RL (DE-RL) from an β€œoptimization with constraints” perspective: we are interested in exploring an MDP and obtaining a near-optimal policy within minimal deployment complexity, whereas in each deployment the policy can sample a large batch of data.

πŸ”Ή MUSBO: Model-based Uncertainty Regularized and Sample Efficient Batch Optimization for Deployment Constrained Reinforcement Learning πŸ”₯ πŸ‘

Our framework discovers novel and high quality samples for each deployment to enable efficient data collection. During each offline training session, we bootstrap the policy update by quantifying the amount of uncertainty within our collected data.

πŸ”Ή BENCHMARKS FOR DEEP OFF-POLICY EVALUATION πŸ‘

πŸ”Ή KEEP DOING WHAT WORKED: BEHAVIOR MODELLING PRIORS FOR OFFLINE REINFORCEMENT LEARNING πŸ‘ πŸ”₯ πŸ’₯

It admits the use of data generated by arbitrary behavior policies and uses a learned prior – the advantage-weighted behavior model (ABM) – to bias the RL policy towards actions that have previously been executed and are likely to be successful on the new task.

extrapolation or bootstrapping errors: (Fujimoto et al., 2018; Kumar et al., 2019)

πŸ”Ή Off-Policy Deep Reinforcement Learning without Exploration πŸ”₯ πŸ’₯ πŸŒ‹ ​ ​ ​

BCQ: We introduce a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space in order to force the agent towards behaving close to on-policy with respect to a subset of the given data.

πŸ”Ή Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction πŸ”₯ πŸ’₯ πŸ’§ ​

BEAR: We identify bootstrapping error as a key source of instability in current methods. Bootstrapping error is due to bootstrapping from actions that lie outside of the training data distribution, and it accumulates via the Bellman backup operator.

πŸ”Ή Conservative Q-Learning for Offline Reinforcement Learning πŸ‘ πŸ”₯ πŸŒ‹ πŸ’¦ πŸ’₯

conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value.
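
A sketch of the conservative penalty for a discrete-action critic (the CQL(H) flavor); the full critic loss adds this term, scaled by some `alpha`, to the usual Bellman error:

```python
import torch

def cql_penalty(q_all_actions, q_data_actions):
    # q_all_actions: (batch, num_actions) Q-values at dataset states.
    # q_data_actions: (batch,) Q-values of the actions actually taken in the data.
    # logsumexp pushes Q down on out-of-distribution actions while the second
    # term keeps Q up on in-distribution actions, lower-bounding the policy value.
    return (torch.logsumexp(q_all_actions, dim=1) - q_data_actions).mean()
```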

πŸ”Ή Mildly Conservative Q-Learning for Offline Reinforcement Learning πŸ‘ πŸ”₯ πŸŒ‹

We propose Mildly Conservative Q-learning (MCQ), where OOD actions are actively trained by assigning them proper pseudo Q values.

πŸ”Ή Constraints Penalized Q-Learning for Safe Offline Reinforcement Learning πŸ‘ πŸ”₯ πŸŒ‹

We show that naΓ―ve approaches that combine techniques from safe RL and offline RL can only learn sub-optimal solutions. We thus develop a simple yet effective algorithm, Constraints Penalized Q-Learning (CPQ), to solve the problem.

πŸ”Ή Conservative Offline Distributional Reinforcement Learning πŸ’¦

CODAC:

πŸ”Ή OFFLINE REINFORCEMENT LEARNING HANDS-ON

πŸ”Ή Supervised Off-Policy Ranking πŸ‘

SOPR: aims to rank a set of target policies based on supervised learning by leveraging off-policy data and policies with known performance. [poster]

πŸ”Ή Conservative Data Sharing for Multi-Task Offline Reinforcement Learning πŸ”₯ πŸŒ‹ πŸ’₯

Conservative data sharing (CDS): We develop a simple technique for data-sharing in multi-task offline RL that routes data based on the improvement over the task-specific data.

πŸ”Ή UNCERTAINTY-BASED MULTI-TASK DATA SHARING FOR OFFLINE REINFORCEMENT LEARNING πŸ”₯ πŸ”₯ πŸŒ‹

UTDS: suboptimality gap of UTDS is related to the expected uncertainty of the shared dataset. (CDS)

πŸ”Ή Data Sharing without Rewards in Multi-Task Offline Reinforcement Learning

Conservative unsupervised data sharing (CUDS): under a binary-reward assumption, simply utilizing data from other tasks with constant reward labels can not only provide substantial improvement over only using the single-task data and previously proposed success classifiers, but it can also reach comparable performance to baselines that take advantage of the oracle multi-task reward information.

πŸ”Ή How to Leverage Unlabeled Data in Offline Reinforcement Learning πŸ”₯ πŸŒ‹

We provide extensive theoretical and empirical analysis that illustrates how it trades off reward bias, sample complexity and distributional shift, often leading to good results. We characterize conditions under which this simple strategy is effective, and further show that extending it with a simple reweighting approach can further alleviate the bias introduced by using incorrect reward labels.

πŸ”Ή PROVABLE UNSUPERVISED DATA SHARING FOR OFFLINE REINFORCEMENT LEARNING πŸ”₯ πŸŒ‹

PDS utilizes additional penalties upon the reward function learned from labeled data to avoid potential overestimation of the reward.

πŸ”Ή Is Pessimism Provably Efficient for Offline RL? πŸ”₯ πŸŒ‹ πŸ”₯

Pessimistic value iteration algorithm (PEVI): incorporates a penalty function (pessimism) into the value iteration algorithm. The penalty function simply flips the sign of the bonus function (optimism) for promoting exploration in online RL. We decompose the suboptimality of any policy into three sources: the spurious correlation, intrinsic uncertainty, and optimization error.
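
A tabular, finite-horizon sketch of the pessimism principle: the penalty Ξ“(s, a) is simply subtracted inside the model-based Bellman backup (a sign-flipped exploration bonus). The matrix shapes are assumptions for the sketch, and the step-dependent bookkeeping of the paper is simplified:

```python
import numpy as np

def pessimistic_value_iteration(P_hat, r_hat, penalty, horizon):
    # P_hat: (S, A, S) estimated transitions; r_hat: (S, A) estimated rewards;
    # penalty: (S, A) uncertainty quantifier Gamma(s, a).
    S, A, _ = P_hat.shape
    V = np.zeros(S)
    greedy = []
    for _ in range(horizon):
        Q = r_hat + P_hat @ V - penalty      # pessimism: subtract the bonus
        Q = np.clip(Q, 0.0, horizon)         # keep values in a valid range
        greedy.append(Q.argmax(axis=1))
        V = Q.max(axis=1)
    greedy.reverse()                          # step-indexed greedy policy
    return greedy, V
```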

πŸ”Ή PESSIMISTIC MODEL-BASED OFFLINE REINFORCEMENT LEARNING UNDER PARTIAL COVERAGE πŸŒ‹

Constrained Pessimistic Policy Optimization (CPPO): We study model-based offline RL with function approximation under partial coverage. We show that for the model-based setting, realizability in function class and partial coverage together are enough to learn a policy that is comparable to any policies covered by the offline distribution.

πŸ”Ή Corruption-Robust Offline Reinforcement Learning πŸ˜•

πŸ”Ή Bellman-consistent Pessimism for Offline Reinforcement Learning πŸ”₯ πŸ˜•

We introduce the notion of Bellman-consistent pessimism for general function approximation: instead of calculating a point-wise lower bound for the value function, we implement pessimism at the initial state over the set of functions consistent with the Bellman equations.

πŸ”Ή Provably Good Batch Reinforcement Learning Without Great Exploration πŸ”₯ πŸ”₯

We show that a small modification to Bellman optimality and evaluation back-up to take a more conservative update can have much stronger guarantees. In certain settings, they can find the approximately best policy within the state-action space explored by the batch data, without requiring a priori assumptions of concentrability.

πŸ”Ή Pessimistic Q-Learning for Offline Reinforcement Learning: Towards Optimal Sample Complexity πŸ”₯ πŸ’§

LCB-Q (value iteration with lower confidence bounds): We study a pessimistic variant of Q-learning in the context of finite-horizon Markov decision processes, and characterize its sample complexity under the single policy concentrability assumption which does not require the full coverage of the state-action space.

πŸ”Ή Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning πŸ”₯ πŸ’§

This paper initiates the theoretical study of policy finetuning, that is, online RL where the learner has additional access to a β€œreference policy” ΞΌ close to the optimal policy π⋆ in a certain sense.

πŸ”Ή Towards Instance-Optimal Offline Reinforcement Learning with Pessimism

πŸ”Ή Provable Benefits of Actor-Critic Methods for Offline Reinforcement Learning

Pessimistic Actor Critic for Learning without Exploration (PACLE)

πŸ”Ή WHEN SHOULD OFFLINE REINFORCEMENT LEARNING BE PREFERRED OVER BEHAVIORAL CLONING? πŸŒ‹ πŸŒ‹

under what environment and dataset conditions can an offline RL method outperform BC with an equal amount of expert data, even when BC is a natural choice? [Should I Run Offline Reinforcement Learning or Behavioral Cloning?]

πŸ”Ή PESSIMISTIC BOOTSTRAPPING FOR UNCERTAINTY-DRIVEN OFFLINE REINFORCEMENT LEARNING πŸ‘ πŸ”₯

PBRL: We propose Pessimistic Bootstrapping for offline RL (PBRL), a purely uncertainty-driven offline algorithm without explicit policy constraints. Specifically, PBRL conducts uncertainty quantification via the disagreement of bootstrapped Q-functions, and performs pessimistic updates by penalizing the value function based on the estimated uncertainty.
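
A rough sketch of the uncertainty-penalized target for in-distribution data; PBRL additionally penalizes OOD actions sampled from the current policy, which is omitted here, and the names are illustrative:

```python
import torch

def pessimistic_td_target(target_critics, rewards, next_obs, next_actions,
                          beta=1.0, gamma=0.99):
    # Uncertainty = disagreement (std) of the bootstrapped Q-ensemble.
    next_q = torch.stack([q(next_obs, next_actions) for q in target_critics])  # (K, B)
    uncertainty = next_q.std(dim=0)
    return rewards + gamma * (next_q.mean(dim=0) - beta * uncertainty)
```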

πŸ”Ή UNCERTAINTY REGULARIZED POLICY LEARNING FOR OFFLINE REINFORCEMENT LEARNING πŸ‘

Uncertainty Regularized Policy Learning (URPL): URPL adds an uncertainty regularization term in the policy learning objective to enforce to learn a more stable policy under the offline setting. Moreover, we further use the uncertainty regularization term as a surrogate metric indicating the potential performance of a policy.

πŸ”Ή Model-Based Offline Meta-Reinforcement Learning with Regularization πŸ‘ πŸ”₯ πŸŒ‹

We explore model-based offline Meta-RL with regularized Policy Optimization (MerPO), which learns a meta-model for efficient task structure inference and an informative meta-policy for safe exploration of out-of-distribution state-actions.

πŸ”Ή BATCH REINFORCEMENT LEARNING THROUGH CONTINUATION METHOD πŸ”₯

We propose a simple yet effective approach, soft policy iteration algorithm through continuation method to alleviate two challenges in policy optimization under batch reinforcement learning: (1) highly non-smooth objective function which is difficult to optimize (2) high variance in value estimates.

πŸ”Ή SCORE: SPURIOUS CORRELATION REDUCTION FOR OFFLINE REINFORCEMENT LEARNING πŸ‘ πŸ”₯ πŸ”₯

We propose a practical and theoretically guaranteed algorithm, SCORE, that reduces spurious correlations by incorporating an uncertainty penalty into policy evaluation. We show that this is consistent with the pessimism principle studied in theory, and the proposed algorithm converges to the optimal policy with a sublinear rate under mild assumptions.

πŸ”Ή Why so pessimistic? Estimating uncertainties for offline rl through ensembles, and why their independence matters πŸ‘

our proposed MSG algorithm advocates for using independently learned ensembles, without sharing of target values, and this important design decision is supported by empirical evidence.

πŸ”Ή S4RL: Surprisingly Simple Self-Supervision for Offline Reinforcement Learning 😢

utilizes data augmentations from states to learn value functions that are better at generalizing and extrapolating when deployed in the environment.

πŸ”Ή Actionable Models: Unsupervised Offline Reinforcement Learning of Robotic Skills πŸ‘ πŸ”₯ πŸ’₯

learning a functional understanding of the environment by learning to reach any goal state in a given dataset. We employ goal-conditioned Q-learning with hindsight relabeling and develop several techniques that enable training in a particularly challenging offline setting.

πŸ”Ή Behavior Regularized Offline Reinforcement Learning πŸ”₯ πŸ’₯ πŸ‘

we introduce a general framework, behavior regularized actor critic (BRAC), to empirically evaluate recently proposed methods as well as a number of simple baselines across a variety of offline continuous control tasks.

πŸ”Ή BRAC+: Improved Behavior Regularized Actor Critic for Offline Reinforcement Learning πŸ”₯

We improved the behavior regularized offline RL by proposing a low-variance upper bound of the KL divergence estimator to reduce variance and gradient penalized policy evaluation such that the learned Q functions are guaranteed to converge.

πŸ”Ή Offline-to-Online Reinforcement Learning via Balanced Replay and Pessimistic Q-Ensemble 😢

we observe that state-action distribution shift may lead to severe bootstrap error during fine-tuning, which destroys the good initial policy obtained via offline RL.

πŸ”Ή Experience Replay with Likelihood-free Importance Weights πŸ”₯ πŸŒ‹

To balance bias (from off-policy experiences) and variance (from on-policy experiences), we use a likelihood-free density ratio estimator between on-policy and off-policy experiences, and use the learned ratios as the prioritization weights.

πŸ”Ή MOORe: Model-based Offline-to-Online Reinforcement Learning πŸ”₯

employs a prioritized sampling scheme that can dynamically adjust the offline and online data for smooth and efficient online adaptation of the policy.

πŸ”Ή Offline Meta-Reinforcement Learning with Online Self-Supervision πŸ‘ πŸ”₯

Unlike the online setting, the adaptation and exploration strategies cannot effectively adapt to each other, resulting in poor performance. we propose a hybrid offline meta-RL algorithm, which uses offline data with rewards to meta-train an adaptive policy, and then collects additional unsupervised online data, without any ground truth reward labels, to bridge this distribution shift problem.

πŸ”Ή Offline Meta-Reinforcement Learning with Advantage Weighting πŸ”₯ ​

Targeting the offline meta-RL setting, we propose Meta-Actor Critic with Advantage Weighting (MACAW), an optimization-based meta-learning algorithm that uses simple, supervised regression objectives for both the inner and outer loop of meta-training.

πŸ”Ή Robust Task Representations for Offline Meta-Reinforcement Learning via Contrastive Learning πŸ”₯ πŸ”₯

CORRO: which decreases the influence of behavior policies on task representations while supporting tasks that differ in reward function and transition dynamics.

πŸ”Ή AWAC: Accelerating Online Reinforcement Learning with Offline Datasets πŸ‘ πŸ”₯ πŸŒ‹

we systematically analyze why this problem (offline + online) is so challenging, and propose an algorithm that combines sample efficient dynamic programming with maximum likelihood policy updates, providing a simple and effective framework that is able to leverage large amounts of offline data and then quickly perform online fine-tuning of RL policies.

πŸ”Ή Guiding Online Reinforcement Learning with Action-Free Offline Pretraining 😢

AF-Guide consists of an Action-Free Decision Transformer (AFDT) implementing a variant of Upside-Down Reinforcement Learning. It learns to plan the next states from the offline dataset, and a Guided Soft Actor-Critic (Guided SAC) that learns online with guidance from AFDT.

πŸ”Ή Critic Regularized Regression πŸ‘ πŸ”₯ πŸŒ‹ ​ ​

CRR: Our algorithm can be seen as a form of filtered behavioral cloning where data is selected based on information contained in the policy’s Q-function. We do not rely on observed returns for advantage estimation.

πŸ”Ή Exponentially Weighted Imitation Learning for Batched Historical Data πŸ‘ πŸ”₯ πŸŒ‹ ​ ​

MARWIL: we propose a monotonic advantage reweighted imitation learning strategy that is applicable to problems with complex nonlinear function approximation and works well with hybrid (discrete and continuous) action space.
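
Both CRR and MARWIL reduce to an advantage-weighted behavioral cloning loss. A minimal sketch of the exponential weighting (MARWIL / CRR-exp; CRR-binary instead uses the indicator 1[A > 0]); `policy.log_prob` is an assumed API:

```python
import torch

def advantage_weighted_bc_loss(policy, obs, actions, advantages,
                               beta=1.0, max_weight=20.0):
    # exp(A / beta) up-weights dataset actions the critic judges better than
    # the current policy; clipping keeps the weights numerically sane.
    weights = torch.clamp(torch.exp(advantages / beta), max=max_weight).detach()
    return -(weights * policy.log_prob(obs, actions)).mean()
```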

πŸ”Ή BAIL: Best-Action Imitation Learning for Batch Deep Reinforcement Learning πŸ‘ πŸŒ‹ ​

BAIL learns a V function, uses the V function to select actions it believes to be high-performing, and then uses those actions to train a policy network using imitation learning.

πŸ”Ή Offline RL Without Off-Policy Evaluation πŸ’§ πŸ”₯

a unified algorithmic template for offline RL algorithms as offline approximate modified policy iteration.

πŸ”Ή MODEL-BASED OFFLINE PLANNING πŸ”₯

MBOP: Learning dynamics, action priors, and values; MBOP-Policy; MBOP-Trajopt.

πŸ”Ή Model-Based Offline Planning with Trajectory Pruning πŸ‘ πŸ”₯

MOPP: MOPP avoids over-restrictive planning while enabling offline learning by encouraging more aggressive trajectory rollout guided by the learned behavior policy, and prunes out problematic trajectories by evaluating the uncertainty of the dynamics model.

πŸ”Ή Model-based Offline Policy Optimization with Distribution Correcting Regularization πŸ‘ πŸŒ‹

DROP (density ratio regularized offline policy learning) estimates the density ratio between the model-rollout distribution and the offline data distribution via the DICE framework, and then regularizes the model-predicted rewards with the ratio for pessimistic policy learning.

πŸ”Ή A Minimalist Approach to Offline Reinforcement Learning πŸ‘ πŸ”₯ πŸŒ‹ ​ ​ ​

We find that we can match the performance of state-of-the-art offline RL algorithms by simply adding a behavior cloning term to the policy update of an online RL algorithm and normalizing the data.
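
A sketch of the resulting actor loss, roughly the TD3+BC recipe: a Q term normalized by the batch's mean |Q| plus an MSE behavior-cloning term, with `alpha` controlling the trade-off:

```python
import torch
import torch.nn.functional as F

def td3_bc_actor_loss(critic, policy, obs, data_actions, alpha=2.5):
    pi = policy(obs)
    q = critic(obs, pi)
    lam = alpha / q.abs().mean().detach()    # normalizes the scale of the Q term
    return -(lam * q).mean() + F.mse_loss(pi, data_actions)
```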

πŸ”Ή POPO: Pessimistic Offline Policy Optimization πŸ˜• ​

Distributional value functions.

πŸ”Ή Offline Reinforcement Learning as Anti-Exploration πŸ‘ πŸ”₯ ​

The core idea is to subtract a prediction-based exploration bonus from the reward, instead of adding it for exploration.

πŸ”Ή MOPO: Model-based Offline Policy Optimization πŸ‘ πŸ”₯ πŸŒ‹ πŸ’₯

we propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics. We theoretically show that the algorithm maximizes a lower bound of the policy’s return under the true MDP. We also characterize the trade-off between the gain and risk of leaving the support of the batch data.
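
The core mechanism in one line: synthetic model rollouts are trained on with rewards penalized by the model's uncertainty, rΜƒ = rΜ‚ - Ξ»u(s, a). A rollout sketch, assuming `model_step` returns an uncertainty estimate such as the max ensemble std:

```python
def penalized_model_rollout(model_step, policy, s0, horizon, lam=1.0):
    # model_step(s, a) -> (s_next, r_hat, u), where u is the dynamics uncertainty.
    transitions, s = [], s0
    for _ in range(horizon):
        a = policy(s)
        s_next, r_hat, u = model_step(s, a)
        transitions.append((s, a, r_hat - lam * u, s_next))  # penalized reward
        s = s_next
    return transitions
```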

πŸ”Ή Domain Generalization for Robust Model-Based Offline Reinforcement Learning 😢

DIMORL: Since different demonstrators induce different data distributions, we show that this can be naturally framed as a domain generalization problem, with each demonstrator corresponding to a different domain.

πŸ”Ή MOReL: Model-Based Offline Reinforcement Learning πŸŒ‹ πŸ’₯ πŸ’§

This framework consists of two steps: (a) learning a pessimistic MDP (P-MDP) using the offline dataset; (b) learning a near-optimal policy in this P-MDP.

πŸ”Ή COMBO: Conservative Offline Model-Based Policy Optimization πŸ”₯ πŸŒ‹ πŸ’₯ ​ ​ ​

This results in a conservative estimate of the value function for out-of-support state-action tuples, without requiring explicit uncertainty estimation.

πŸ”Ή Regularizing a Model-based Policy Stationary Distribution to Stabilize Offline Reinforcement Learning πŸŒ‹ πŸ”₯

SDM-GAN: we regularize the undiscounted stationary distribution of the current policy towards the offline data during the policy optimization process. [ppt]

πŸ”Ή HYBRID VALUE ESTIMATION FOR OFF-POLICY EVALUATION AND OFFLINE REINFORCEMENT LEARNING πŸŒ‹

We propose Hybrid Value Estimation (HVE) to perform a more accurate value function estimation in the offline setting. It automatically adjusts the step length parameter to get a bias-variance trade-off.

πŸ”Ή DROMO: Distributionally Robust Offline Model-based Policy Optimization πŸ”₯

To extend the basic idea of regularization without uncertainty quantification, we propose distributionally robust offline model-based policy optimization (DROMO), which leverages the ideas in distributionally robust optimization to penalize a broader range of out-of-distribution state-action pairs beyond the standard empirical out-of-distribution Q-value minimization.

πŸ”Ή Behavioral Priors and Dynamics Models: Improving Performance and Domain Transfer in Offline RL πŸŒ‹

MABE: By adaptive behavioral prior, we mean a policy that approximates the behavior in the offline dataset while giving more importance to trajectories with high rewards.

πŸ”Ή Offline Reinforcement Learning with Fisher Divergence Critic Regularization πŸ”₯ πŸŒ‹ πŸ’§

We propose using a gradient penalty regularizer for the offset term and demonstrate its equivalence to Fisher divergence regularization, suggesting connections to the score matching and generative energy-based model literature.

πŸ”Ή Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning πŸ‘ πŸ”₯ ​ ​

UWAC: an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly.

πŸ”Ή EMaQ: Expected-Max Q-Learning Operator for Simple Yet Effective Offline and Online RL πŸ”₯ πŸŒ‹ πŸ’§

By introducing the Expect-Max Q-Learning operator, we present a novel theoretical setup that takes into account the proposal distribution Β΅(a|s) and the number of action samples N, and hence more closely matches the resulting practical algorithm.

πŸ”Ή Lyapunov Density Models: Constraining Distribution Shift in Learning-Based Control πŸ”₯ πŸŒ‹

We presented Lyapunov density models (LDMs), a tool that can ensure that an agent remains within the distribution of the training data.

πŸ”Ή OFFLINE REINFORCEMENT LEARNING WITH IN-SAMPLE Q-LEARNING πŸ”₯ πŸ”₯

We presented implicit Q-Learning (IQL), a general algorithm for offline RL that completely avoids any queries to values of out-of-sample actions during training while still enabling multi-step dynamic programming. Adopting Expectile regression. old
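
The key ingredient is the expectile-regression loss for the value function, fit only on in-sample (s, a) pairs; a minimal sketch with an assumed expectile `tau`:

```python
import torch

def expectile_loss(q_values, v_values, tau=0.7):
    # L_tau(u) = |tau - 1[u < 0]| * u^2 with u = Q(s, a) - V(s).
    # tau > 0.5 biases V toward the upper envelope of in-sample Q-values,
    # approximating a max over actions without querying out-of-sample actions.
    u = q_values - v_values
    weight = torch.abs(tau - (u < 0).float())
    return (weight * u.pow(2)).mean()
```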

πŸ”Ή Continuous Doubly Constrained Batch Reinforcement Learning πŸ”₯ πŸ”₯

CDC: The first regularizer combats the extra overestimation bias in regions that are out-of-distribution. The second regularizer is designed to hedge against the adverse effects of policy updates that severely diverge from the behavior policy.

πŸ”Ή Believe What You See: Implicit Constraint Approach for Offline Multi-Agent Reinforcement Learning πŸ‘ πŸ’§

ICQ: we propose a novel offline RL algorithm, named Implicit Constraint Q-learning (ICQ), which effectively alleviates the extrapolation error by only trusting the state-action pairs given in the dataset for value estimation.

πŸ”Ή Offline Model-based Adaptable Policy Learning πŸ‘ πŸ”₯ πŸŒ‹

MAPLE tries to model all possible transition dynamics in the out-of-support regions. A context encoder RNN is trained to produce latent codes given the episode history, and the encoder and policy are jointly optimized to maximize average performance across a large ensemble of pretrained dynamics models.

πŸ”Ή Supported Policy Optimization for Offline Reinforcement Learning 😢

We present Supported Policy OpTimization (SPOT), which is directly derived from the theoretical formalization of the density-based support constraint. SPOT adopts a VAE-based density estimator to explicitly model the support set of the behavior policy.

πŸ”Ή Weighted model estimation for offline model-based reinforcement learning πŸ”₯

This paper considers weighting with the state-action distribution ratio of offline data and simulated future data, which can be estimated relatively easily by standard density ratio estimation techniques for supervised learning.

πŸ”Ή Batch Reinforcement Learning with Hyperparameter Gradients πŸ‘ πŸ”₯ πŸŒ‹ πŸ”₯

BOPAH: Unlike prior work where this trade-off is controlled by hand-tuned hyperparameters (in a generalized KL-regularized RL framework), we propose a novel batch reinforcement learning approach, batch optimization of policy and hyperparameter (BOPAH), that uses a gradient-based optimization of the hyperparameter using held-out data.

πŸ”Ή OFFLINE REINFORCEMENT LEARNING WITH VALUE-BASED EPISODIC MEMORY πŸŒ‹ πŸ’§ πŸ”₯

We present a new offline V -learning method, EVL (expectile V -learning), and a novel offline RL framework, VEM (Value-based Episodic Memory). EVL learns the value function through the trade-offs between imitation learning and optimal value learning. VEM uses a memory-based planning scheme to enhance advantage estimation and conduct policy learning in a regression manner. IQL

πŸ”Ή Offline Reinforcement Learning with Soft Behavior Regularization πŸ”₯ πŸŒ‹

Soft Behavior-regularized Actor Critic (SBAC): we design a new behavior regularization scheme for offline RL that enables policy improvement guarantee and state-dependent policy regularization.

πŸ”Ή Offline Reinforcement Learning with Pseudometric Learning πŸ‘ πŸ”₯ πŸŒ‹

In the presence of function approximation, and under the assumption of limited coverage of the state-action space of the environment, it is necessary to enforce the policy to visit state-action pairs close to the support of logged transitions. In this work, we propose an iterative procedure to learn a pseudometric (closely related to bisimulation metrics) from logged transitions, and use it to define this notion of closeness.

πŸ”Ή Offline Reinforcement Learning with Reverse Model-based Imagination πŸ”₯ πŸ”₯

Reverse Offline Model-based Imagination (ROMI): We learn a reverse dynamics model in conjunction with a novel reverse policy, which can generate rollouts leading to the target goal states within the offline dataset.

πŸ”Ή Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble πŸ‘ πŸ”₯

SAC-N: we propose an uncertainty-based model-free offline RL method that effectively quantifies the uncertainty of the Q-value estimates by an ensemble of Q-function networks and does not require any estimation or sampling of the data distribution.
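
A minimal sketch of the ensemble-pessimism idea, assuming PyTorch critics that take a concatenated (obs, action) input; the function and argument names are illustrative, not the paper's implementation:

```python
import torch
import torch.nn as nn

def pessimistic_target(critics: list[nn.Module], next_obs, next_act, reward, done, gamma=0.99):
    # Stack Q-values from all N critics and take the element-wise minimum:
    # with a large N this acts as an uncertainty-aware penalty on OOD actions.
    q_values = torch.stack(
        [q(torch.cat([next_obs, next_act], dim=-1)) for q in critics], dim=0
    )
    min_q = q_values.min(dim=0).values
    # done is assumed to be a float tensor of 0/1 episode-termination flags
    return reward + gamma * (1.0 - done) * min_q
```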

πŸ”Ή Q-Ensemble for Offline RL: Don’t Scale the Ensemble, Scale the Batch Size

πŸ”Ή ROBUST OFFLINE REINFORCEMENT LEARNING FROM LOW-QUALITY DATA 😢

AdaPT: we propose an Adaptive Policy constrainT (AdaPT) method, which allows effective exploration on out-of-distribution actions by imposing an adaptive constraint on the learned policy.

πŸ”Ή Regularized Behavior Value Estimation πŸ‘ πŸ”₯ πŸ”₯

R-BVE uses a ranking regularisation term that favours actions in the dataset that lead to successful outcomes. (cf. CRR, MPO)

πŸ”Ή Active Offline Policy Selection πŸ‘ πŸŒ‹ ​ ​

Gaussian process over policy values; Kernel; Active offline policy selection with Bayesian optimization. We proposed a BO solution that integrates OPE estimates with evaluations obtained by interacting with env.

πŸ”Ή Offline Policy Selection under Uncertainty πŸ’¦ ​

πŸ”Ή Offline Learning from Demonstrations and Unlabeled Experience 😢 πŸ‘ πŸ”₯

We proposed offline reinforced imitation learning (ORIL) to enable learning from both demonstrations and a large unlabeled set of experiences without reward annotations.

πŸ”Ή Discriminator-Weighted Offline Imitation Learning from Suboptimal Demonstrations πŸ”₯ πŸŒ‹ πŸ’₯

DWBC: We introduce an additional discriminator to distinguish expert and non-expert data and propose a cooperation strategy to boost the performance of both tasks. This results in a new policy learning objective and, surprisingly, we find it is equivalent to a generalized BC objective, where the outputs of the discriminator serve as the weights of the BC loss function.

πŸ”Ή Discriminator-Guided Model-Based Offline Imitation Learning πŸ”₯

(DMIL) framework, which introduces a discriminator to simultaneously distinguish the dynamics correctness and suboptimality of model rollout data against real expert demonstrations.

πŸ”Ή CLARE: CONSERVATIVE MODEL-BASED REWARD LEARNING FOR OFFLINE INVERSE REINFORCEMENT LEARNING πŸŒ‹ πŸ’§

solves offline IRL efficiently via integrating β€˜conservatism’ into a learned reward function and utilizing an estimated dynamics model.

πŸ”Ή Offline Preference-Based Apprenticeship Learning πŸ‘ ​

OPAL: Given a database consisting of trajectories without reward labels, we query an expert for preference labels over trajectory segments from the database, learn a reward function from preferences, and then perform offline RL using rewards provided by the learned reward function.

πŸ”Ή Semi-supervised reward learning for offline reinforcement learning 😢 ​

We train a reward function on a pre-recorded dataset, use it to label the data and do offline RL.

πŸ”Ή LEARNING VALUE FUNCTIONS FROM UNDIRECTED STATE-ONLY EXPERIENCE πŸ”₯

This paper tackles the problem of learning value functions from undirected state-only experience (state transitions without action labels, i.e., (s, s', r) tuples).

πŸ”Ή Offline Inverse Reinforcement Learning

πŸ”Ή Augmented World Models Facilitate Zero-Shot Dynamics Generalization From a Single Offline Environment πŸ‘ πŸ”₯ πŸŒ‹

We augment a learned dynamics model with simple transformations that seek to capture potential changes in physical properties of the robot, leading to more robust policies.

πŸ”Ή SEMI-PARAMETRIC TOPOLOGICAL MEMORY FOR NAVIGATION

πŸ”Ή Mapping State Space using Landmarks for Universal Goal Reaching

πŸ”Ή Search on the Replay Buffer: Bridging Planning and Reinforcement Learning

πŸ”Ή Hallucinative Topological Memory for Zero-Shot Visual Planning

πŸ”Ή Sparse Graphical Memory for Robust Planning πŸ‘

SGM: aggregates states according to a novel two-way consistency objective, adapting classic state aggregation criteria to goal-conditioned RL: two states are redundant when they are interchangeable both as goals and as starting states.

πŸ”Ή Plan2vec: Unsupervised Representation Learning by Latent Plans 😢

Plan2vec constructs a weighted graph on an image dataset using near-neighbor distances, and then extrapolates this local metric to a global embedding by distilling path-integral over planned path.

πŸ”Ή World Model as a Graph: Learning Latent Landmarks for Planning πŸ‘

L3P: We devise a novel algorithm to learn latent landmarks that are scattered (in terms of reachability) across the goal space as the nodes on the graph.

πŸ”Ή Value Memory Graph: A Graph-Structured World Model for Offline Reinforcement Learning πŸ”₯

VMG: we design a graph-structured world model in offline reinforcement learning by building a directed-graph-based Markov decision process (MDP) with rewards allocated to each directed edge as an abstraction of the original continuous environment.

πŸ”Ή GOAL-CONDITIONED BATCH REINFORCEMENT LEARNING FOR ROTATION INVARIANT LOCOMOTION 😢

πŸ”Ή Offline Meta-Reinforcement Learning for Industrial Insertion 😢

We introduced an offline meta-RL algorithm, ODA, that can meta-learn an adaptive policy from offline data, quickly adapt based on a small number of user-provided demonstrations for a new task, and then further adapt through online finetuning.

πŸ”Ή Scaling data-driven robotics with reward sketching and batch reinforcement learning

πŸ”Ή OFFLINE RL WITH RESOURCE CONSTRAINED ONLINE DEPLOYMENT πŸ‘

Resource-constrained setting: We highlight the performance gap between policies trained using the full offline dataset and policies trained using limited features.

πŸ”Ή Reinforcement Learning from Imperfect Demonstrations πŸ”₯

We propose Normalized Actor-Critic (NAC) that effectively normalizes the Q-function, reducing the Q-values of actions unseen in the demonstration data. NAC learns an initial policy network from demonstrations and refines the policy in the environment.

πŸ”Ή Curriculum Offline Imitation Learning πŸ”₯

We propose Curriculum Offline Imitation Learning (COIL), which utilizes an experience picking strategy for imitating from adaptive neighboring policies with a higher return, and improves the current policy along curriculum stages.

πŸ”Ή Dealing with the Unknown: Pessimistic Offline Reinforcement Learning πŸ”₯

PessORL: penalize high values at unseen states in the dataset, and to cancel the penalization at in-distribution states.

πŸ”Ή Adversarially Trained Actor Critic for Offline Reinforcement Learning πŸ”₯ πŸŒ‹

We propose Adversarially Trained Actor Critic (ATAC) based on a two-player Stackelberg game framing of offline RL: A policy actor competes against an adversarially trained value critic, who finds data-consistent scenarios where the actor is inferior to the data-collection behavior policy. [robust policy improvement] [POSTER]

πŸ”Ή RVS: WHAT IS ESSENTIAL FOR OFFLINE RL VIA SUPERVISED LEARNING? 😢

Simply maximizing likelihood with a two-layer feedforward MLP is competitive with state-of-the-art results of substantially more complex methods based on TD learning or sequence modeling with Transformers. Carefully choosing model capacity (e.g., via regularization or architecture) and choosing which information to condition on (e.g., goals or rewards) are critical for performance. (Alternate title: The Essential Elements of Offline RL via Supervised Learning.)
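
A minimal sketch of this RvS-style recipe, assuming PyTorch, continuous actions, and a conditioning variable `cond` (a goal or an outcome such as reward-to-go); the class and loss below are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn

class ConditionedPolicy(nn.Module):
    """Two-layer MLP policy conditioned on (observation, outcome/goal)."""
    def __init__(self, obs_dim: int, cond_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, cond):
        return self.net(torch.cat([obs, cond], dim=-1))

def bc_loss(policy, obs, cond, act):
    # Simple maximum-likelihood surrogate: regress the dataset action.
    return ((policy(obs, cond) - act) ** 2).mean()
```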

πŸ”Ή IMPLICIT OFFLINE REINFORCEMENT LEARNING VIA SUPERVISED LEARNING 😢

IRvS: Implicit Behavior Cloning

πŸ”Ή Contrastive Learning as Goal-Conditioned Reinforcement Learning πŸ‘ πŸ”₯ πŸŒ‹

instead of adding representation learning parts to an existing RL algorithm, we show (contrastive) representation learning methods can be cast as RL algorithms in their own right.

πŸ”Ή When does return-conditioned supervised learning work for offline reinforcement learning? πŸ”₯

We find that RCSL (return-conditioned SL) returns the optimal policy under a set of assumptions that are stronger than those needed for the more traditional dynamic programming-based algorithms.

πŸ”Ή Implicit Behavioral Cloning πŸ”₯ πŸŒ‹ πŸ’₯

In this paper we showed that reformulating supervised imitation learning as a conditional energy-based modeling problem, with inference-time implicit regression, often greatly outperforms traditional explicit policy baselines.
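
A minimal sketch of the implicit-policy idea, assuming PyTorch, actions in a bounded box, and an InfoNCE-style objective with uniformly sampled counter-example actions; all names are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Energy E(s, a): dataset actions should receive low energy."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def info_nce_loss(energy, obs, expert_act, num_neg=64, act_low=-1.0, act_high=1.0):
    # Contrast the expert action (index 0) against uniformly sampled negatives.
    B, A = expert_act.shape
    neg = torch.rand(B, num_neg, A, device=expert_act.device) * (act_high - act_low) + act_low
    acts = torch.cat([expert_act.unsqueeze(1), neg], dim=1)          # (B, 1 + num_neg, A)
    obs_rep = obs.unsqueeze(1).expand(-1, acts.shape[1], -1)
    logits = -energy(obs_rep.reshape(-1, obs.shape[-1]), acts.reshape(-1, A)).view(B, -1)
    targets = torch.zeros(B, dtype=torch.long, device=expert_act.device)
    return nn.functional.cross_entropy(logits, targets)
```

At test time the policy is implicit: sample candidate actions and pick the one with the lowest energy (or refine with a few gradient steps on the action).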

πŸ”Ή Implicit Two-Tower Policies πŸ‘

Implicit Two-Tower (ITT) policies, where the actions are chosen based on the attention scores of their learnable latent representations with those of the input states.

πŸ”Ή Latent-Variable Advantage-Weighted Policy Optimization for Offline RL 😢

LAPO: we study an offline RL setup for learning from heterogeneous datasets where trajectories are collected using policies with different purposes, leading to a multi-modal data distribution.

πŸ”Ή AW-Opt: Learning Robotic Skills with Imitation and Reinforcement at Scale πŸ‘

Our aim is to test the scalability of prior IL + RL algorithms and devise a system based on detailed empirical experimentation that combines existing components in the most effective and scalable way.

πŸ”Ή Offline RL Policies Should be Trained to be Adaptive πŸ”₯ πŸŒ‹

APE-V: optimal policies for offline RL must be adaptive, depending not just on the current state but rather all the transitions seen so far during evaluation.

πŸ”Ή Deconfounded Imitation Learning πŸ”₯

We then introduce an algorithm for deconfounded imitation learning, which trains an inference model jointly with a latent-conditional policy. At test time, the agent alternates between updating its belief over the latent and acting under the belief.

πŸ”Ή Distance-Sensitive Offline Reinforcement Learning πŸ‘ πŸ”₯ πŸŒ‹

We propose a new method, DOGE (Distance-sensitive Offline RL with better GEneralization). DOGE marries dataset geometry with deep function approximators in offline RL, and enables exploitation in generalizable OOD areas rather than strictly constraining policy within data distribution.

πŸ”Ή RORL: Robust Offline Reinforcement Learning via Conservative Smoothing πŸ‘ πŸ”₯

We explicitly introduce regularization on the policy and the value function for states near the dataset and additional conservative value estimation on these OOD states.

πŸ”Ή On the Role of Discount Factor in Offline Reinforcement Learning πŸ”₯ πŸ”₯

This paper examines two distinct effects of discount factor in offline RL with theoretical analysis, namely the regularization effect and the pessimism effect.

πŸ”Ή LEARNING PSEUDOMETRIC-BASED ACTION REPRESENTATIONS FOR OFFLINE REINFORCEMENT LEARNING πŸ”₯ πŸŒ‹

BMA: This paper proposes an action representation learning framework for offline RL based on a pseudometric, which measures both the behavioral relation and the data-distributional relation between actions.

πŸ”Ή PLAS: Latent Action Space for Offline Reinforcement Learning πŸ‘

We propose to simply learn the Policy in the Latent Action Space (PLAS) such that this requirement (OOD action) is naturally satisfied.

πŸ”Ή LET OFFLINE RL FLOW: TRAINING CONSERVATIVE AGENTS IN THE LATENT SPACE OF NORMALIZING FLOWS 😢

CNF: we build upon recent works on learning policies in latent action spaces and use a special form of Normalizing Flows for constructing a generative model, which we use as a conservative action encoder. diffusion + RL

πŸ”Ή Challenges and Opportunities in Offline Reinforcement Learning from Visual Observations

πŸ”Ή BEHAVIOR PRIOR REPRESENTATION LEARNING FOR OFFLINE REINFORCEMENT LEARNING πŸ”₯

BPR: we first learn a state representation by mimicking actions from the dataset, and then train a policy on top of the fixed representation, using any off-the-shelf Offline RL algorithm.

πŸ”Ή S2P: State-conditioned Image Synthesis for Data Augmentation in Offline Reinforcement Learning πŸ‘

we firstly propose a generative model, S2P (State2Pixel), which synthesizes the raw pixel of the agent from its corresponding state. It enables bridging the gap between the state and the image domain in RL algorithms, and virtually exploring unseen image distribution via model-based transition in the state space.

πŸ”Ή AGENT-CONTROLLER REPRESENTATIONS: PRINCIPLED OFFLINE RL WITH RICH EXOGENOUS INFORMATION πŸ”₯

we propose to use multi-step inverse models, which have seen a great deal of interest in the RL theory community, to learn Agent-Controller Representations for Offline-RL (ACRO).

πŸ”Ή Back to the Manifold: Recovering from Out-of-Distribution States πŸ”₯ πŸŒ‹

We alleviate the distributional shift at the deployment time by introducing a recovery policy that brings the agent back to the training manifold whenever it steps out of the in-distribution states, e.g., due to an external perturbation.

πŸ”Ή State Deviation Correction for Offline Reinforcement Learning πŸ”₯

SDC: We first perturb the states sampled from the logged dataset, then simulate noisy next states on the basis of a dynamics model and the policy. We then train the policy to minimize the distances between the noisy next states and the offline dataset.

πŸ”Ή A Policy-Guided Imitation Approach for Offline Reinforcement Learning πŸ”₯ πŸ”₯

POR: During training, the guide-policy and execute-policy are learned using only data from the dataset, in a supervised and decoupled manner. During evaluation, the guide-policy guides the execute-policy by telling where it should go so that the reward can be maximized.

πŸ”Ή OFFLINE REINFORCEMENT LEARNING WITH ADAPTIVE BEHAVIOR REGULARIZATION πŸ”₯ πŸŒ‹ πŸ’₯

ABR: a novel offline RL algorithm that achieves an adaptive balance between cloning and improving over the behavior policy. By simply adding a sample-based regularizer to the Bellman backup, we construct an adaptively regularized objective for the policy improvement, which implicitly estimates the probability density of the behavior policy.

πŸ”Ή Dual Generator Offline Reinforcement Learning πŸ”₯ πŸŒ‹

DASCO: training two generators: one that maximizes return, with the other capturing the β€œremainder” of the data distribution in the offline dataset, such that the mixture of the two is close to the behavior policy.

πŸ”Ή Boosting Offline Reinforcement Learning via Data Rebalancing 😢

ReD (Return-based Data Rebalance)

πŸ”Ή Behaviour Discriminator: A Simple Data Filtering Method to Improve Offline Policy Learning 😢

We propose a behaviour discriminator (BD) concept, a novel and simple data filtering approach based on semi-supervised learning, which can accurately discern expert data from a mixed-quality dataset.

πŸ”Ή Robust Imitation of a Few Demonstrations with a Backwards Model πŸ”₯

BMIL: We train a generative backwards dynamics model and generate short imagined trajectories from states in the demonstrations. By imitating both demonstrations and these model rollouts, the agent learns the demonstrated paths and how to get back onto these paths.

πŸ”Ή FROM PLAY TO POLICY: CONDITIONAL BEHAVIOR GENERATION FROM UNCURATED ROBOT DATA πŸ”₯

we present Conditional Behavior Transformers (C-BeT), a method that combines the multi-modal generation ability of Behavior Transformer with future-conditioned goal specification.

πŸ”Ή CONFIDENCE-CONDITIONED VALUE FUNCTIONS FOR OFFLINE REINFORCEMENT LEARNING πŸ”₯ πŸŒ‹ πŸ’₯

CCVL: we propose learning value functions that additionally condition on the degree of conservatism, which we dub confidence-conditioned value functions. We derive a new form of a Bellman backup that simultaneously learns Q-values for any degree of confidence with high probability.

πŸ”Ή Designing an Offline Reinforcement Learning Objective from Scratch πŸ”₯ πŸŒ‹

DOS: We leverage the contrastive learning framework to design a scoring metric that gives high scores to policies that imitate the actions yielding relatively high returns while avoiding those yielding relatively low returns.

πŸ”Ή πŸ”Ή πŸ”Ή πŸ”Ή πŸ”Ή πŸ”Ή

β­• Designs from Data | offline MBO

πŸ”Ή Designs from Data: Offline Black-Box Optimization via Conservative Training see here

πŸ”Ή OFFLINE MODEL-BASED OPTIMIZATION VIA NORMALIZED MAXIMUM LIKELIHOOD ESTIMATION πŸŒ‹ πŸ’§

we consider data-driven optimization problems where one must maximize a function given only queries at a fixed set of points. provides a principled approach to handling uncertainty and out-of-distribution inputs.

πŸ”Ή Model Inversion Networks for Model-Based Optimization πŸ”₯

MINs: This work addresses data-driven optimization problems, where the goal is to find an input that maximizes an unknown score or reward function given access to a dataset of inputs with corresponding scores.

πŸ”Ή RoMA: Robust Model Adaptation for Offline Model-based Optimization πŸ‘

RoMA consists of two steps: (a) a pre-training strategy to robustly train the proxy model and (b) a novel adaptation procedure of the proxy model to have robust estimates for a specific set of candidate solutions.

πŸ”Ή Conservative Objective Models for Effective Offline Model-Based Optimization πŸ”₯ πŸŒ‹

COMs: We propose conservative objective models (COMs), a method that learns a model of the objective function which lower bounds the actual value of the ground-truth objective on outof-distribution inputs and uses it for optimization.

πŸ”Ή DATA-DRIVEN OFFLINE OPTIMIZATION FOR ARCHITECTING HARDWARE ACCELERATORS πŸ‘

PRIME: we develop such a data-driven offline optimization method for designing hardware accelerators. PRIME learns a conservative, robust estimate of the desired cost function, utilizes infeasible points and optimizes the design against this estimate without any additional simulator queries during optimization.

πŸ”Ή Conditioning by adaptive sampling for robust design πŸŒ‹

πŸ”Ή DESIGN-BENCH: BENCHMARKS FOR DATA-DRIVEN OFFLINE MODEL-BASED OPTIMIZATION πŸ”₯

Design-Bench, a benchmark for offline MBO with a unified evaluation protocol and reference implementations of recent methods.

πŸ”Ή User-Interactive Offline Reinforcement Learning πŸ”₯ πŸ”₯

LION: We propose an algorithm that allows the user to tune this hyperparameter (the proximity of the learned policy to the original policy) at runtime, thereby overcoming both of the above mentioned issues simultaneously.

πŸ”Ή CONFIDENCE-CONDITIONED VALUE FUNCTIONS FOR OFFLINE REINFORCEMENT LEARNING

CCVL: We derive a new form of a Bellman backup that simultaneously learns Q-values for any degree of confidence with high probability. By conditioning on confidence, our value functions enable adaptive strategies during online evaluation by controlling for confidence level using the history of observations thus far.

πŸ”Ή Comparing Model-free and Model-based Algorithms for Offline Reinforcement Learning 😢

We compare model-free, model-based, as well as hybrid offline RL approaches on various industrial benchmark (IB) datasets to test the algorithms in settings closer to real world problems, including complex noise and partially observable states.

πŸ”Ή Autofocused oracles for model-based design πŸ”₯ πŸ”₯

we now reformulate the MBD problem as a non-zero-sum game, which suggests an algorithmic strategy for iteratively updating the oracle within any MBO algorithm

πŸ”Ή Data-Driven Offline Decision-Making via Invariant Representation Learning πŸ”₯ πŸŒ‹ πŸ’₯

IOM: our approach for addressing distributional shift by enforcing invariance between the learned representations of the training dataset and optimized decisions.

πŸ”Ή Let Offline RL Flow: Training Conservative Agents in the Latent Space of Normalizing Flows

CNF: we build upon recent works on learning policies (PLAS) in latent action spaces and use a special form of Normalizing Flows for constructing a generative model, which we use as a conservative action encoder.

πŸ”Ή Towards good validation metrics for generative models in offline model-based optimisation πŸ‘

we propose a principled evaluation framework for model-based optimisation to measure how well a generative model can extrapolate.

πŸ”Ή πŸ”Ή πŸ”Ή πŸ”Ή πŸ”Ή πŸ”Ή

πŸ”Ή The Challenges of Exploration for Offline Reinforcement Learning 😢

With Explore2Offline, we propose to evaluate the quality of collected data by transferring the collected data and inferring policies with reward relabelling and standard offline RL algorithms

πŸ”Ή RISK-AVERSE OFFLINE REINFORCEMENT LEARNING πŸ‘ πŸ”₯

we present the Offline Risk-Averse Actor-Critic (O-RAAC), a model-free RL algorithm that is able to learn risk-averse policies in a fully offline setting.

πŸ”Ή One Risk to Rule Them All: A Risk-Sensitive Perspective on Model-Based Offline Reinforcement Learning

πŸ”Ή REVISITING DESIGN CHOICES IN OFFLINE MODEL-BASED REINFORCEMENT LEARNING πŸ”₯

we compare these heuristics (for model uncertainty), and design novel protocols to investigate their interaction with other hyperparameters, such as the number of models, or imaginary rollout horizon. Using these insights, we show that selecting these key hyperparameters using Bayesian Optimization produces superior configurations.

πŸ”Ή Latent Plans for Task-Agnostic Offline Reinforcement Learning πŸ‘

TACO-RL: we combine a low-level policy that learns latent skills via imitation learning and a high-level policy learned from offline reinforcement learning for skill-chaining the latent behavior priors.

πŸ”Ή DR3: VALUE-BASED DEEP REINFORCEMENT LEARNING REQUIRES EXPLICIT REGULARIZATION πŸ‘ πŸ”₯ πŸŒ‹ πŸ’₯

Our theoretical analysis shows that when existing models of implicit regularization are applied to temporal difference learning, the resulting derived regularizer favors degenerate solutions with excessive β€œaliasing”.

πŸ”Ή ACTOR-CRITIC ALIGNMENT FOR OFFLINE-TO-ONLINE REINFORCEMENT LEARNING πŸ‘

ACA: discarding Q-values learned offline as a means to combat distribution shift in offline2online RL

πŸ”Ή OFFLINE REINFORCEMENT LEARNING WITH CLOSED-FORM POLICY IMPROVEMENT OPERATORS πŸ‘ πŸ”₯

CFPI: the behavior constraint naturally motivates the use of first-order Taylor approximation, leading to a linear approximation of the policy objective.

πŸ”Ή CONTEXTUAL TRANSFORMER FOR OFFLINE REINFORCEMENT LEARNING πŸ”₯

we explore how prompts can help sequence-modeling based offline-RL algorithms --> extend the framework to the Meta-RL setting and propose Contextual Meta Transformer (CMT).

πŸ”Ή HYPER-DECISION TRANSFORMER FOR EFFICIENT ONLINE POLICY ADAPTATION 😢

HDT: augment the base DT with an adaptation module, whose parameters are initialized by a hyper-network. When encountering unseen tasks, the hyper-network takes a handful of demonstrations as inputs and initializes the adaptation module accordingly.

πŸ”Ή FROM PLAY TO POLICY: CONDITIONAL BEHAVIOR GENERATION FROM UNCURATED ROBOT DATA 😢

Conditional Behavior Transformers (C-BeT), a method that combines the multi-modal generation ability of Behavior Transformer with future-conditioned goal specification.

πŸ”Ή ACQL: AN ADAPTIVE CONSERVATIVE Q-LEARNING FRAMEWORK FOR OFFLINE REINFORCEMENT LEARNING πŸ‘ πŸ”₯ πŸ”₯

two weight functions, corresponding to the out-of-distribution (OOD) actions and actions in the dataset, are introduced to adaptively shape the Q-function.

πŸ”Ή ENTROPY-REGULARIZED MODEL-BASED OFFLINE REINFORCEMENT LEARNING πŸ‘ πŸ”₯ πŸŒ‹

EMO: we devised a hybrid loss function that minimizes the negative log-likelihood of the model on the distribution of the offline data while maximizing the entropy in areas where the support of the data is minimal or absent.

πŸ”Ή OPTIMAL TRANSPORT FOR OFFLINE IMITATION LEARNING 😢

OTR’s key idea is to use optimal transport to compute an optimal alignment between an unlabeled trajectory in the dataset and an expert demonstration to obtain a similarity measure that can be interpreted as a reward, which can then be used by an offline RL algorithm to learn the policy.

πŸ”Ή MIND THE GAP: OFFLINE POLICY OPTIMIZATION FOR IMPERFECT REWARDS πŸ”₯ πŸŒ‹

RGM: the upper layer optimizes a reward correction term that performs state-action visitation distribution matching w.r.t. a small set of expert data; and the lower layer solves a pessimistic RL problem with the corrected rewards. DICE

πŸ”Ή OFFLINE IMITATION LEARNING BY CONTROLLING THE EFFECTIVE PLANNING HORIZON πŸ”₯ πŸ’§

IGI: we analyze the effect of controlling the discount factor on offline IL and motivate that the discount factor can take a role of a regularizer to prevent the sampling error of the supplementary dataset from hurting the performance.

πŸ”Ή MUTUAL INFORMATION REGULARIZED OFFLINE REINFORCEMENT LEARNING πŸ‘ πŸ”₯

MISA constructs lower bounds of mutual information parameterized by the policy and Q-values. We show that optimizing this lower bound is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset.

πŸ”Ή ON THE IMPORTANCE OF THE POLICY STRUCTURE IN OFFLINE REINFORCEMENT LEARNING 😢 πŸ‘

V2AE (Value-Weighted Variational Auto-Encoder): The V2AE algorithm can be interpreted as an approach that divides the state-action space by learning the discrete latent variable and learns the corresponding sub-policies in each region.

πŸ”Ή CURIOSITY-DRIVEN UNSUPERVISED DATA COLLECTION FOR OFFLINE REINFORCEMENT LEARNING 😢

CUDC:

πŸ”Ή DISCOVERING GENERALIZABLE MULTI-AGENT COORDINATION SKILLS FROM MULTI-TASK OFFLINE DATA πŸ‘

ODIS: first extracts task-invariant coordination skills from offline multi-task data and learns to delineate different agent behaviors with the discovered coordination skills. Then we train a coordination policy to choose optimal coordination skills with the centralized training and decentralized execution paradigm.

πŸ”Ή SKILL DISCOVERY DECISION TRANSFORMER 😢

We proposed Skill DT, a variant of Generalized DT, to explore the capabilities of offline skill discovery with sequence modelling.

πŸ”Ή HARNESSING MIXED OFFLINE REINFORCEMENT LEARNING DATASETS VIA TRAJECTORY WEIGHTING 😢 πŸ‘

We show that state-of-the-art offline RL algorithms are overly constrained in mixed datasets with high RPSV (return positive-sided variance) and under-utilize the minority data.

πŸ”Ή EFFICIENT OFFLINE POLICY OPTIMIZATION WITH A LEARNED MODEL πŸ”₯

ROSMO: Instead of planning with the expensive MCTS, we use the learned model to construct an advantage estimation based on a one-step rollout. Policy improvements are towards the direction that maximizes the estimated advantage with regularization of the dataset. (MuZero Unplugged)

πŸ”Ή CONSERWEIGHTIVE BEHAVIORAL CLONING FOR RELIABLE OFFLINE REINFORCEMENT LEARNING 😢

ConserWeightive Behavioral Cloning (CWBC): trajectory weighting and conservative regularization.

πŸ”Ή TAMING POLICY CONSTRAINED OFFLINE REINFORCEMENT LEARNING FOR NON-EXPERT DEMONSTRATIONS πŸ”₯ πŸŒ‹

we first introduce gradient penalty over the learned value function to tackle the exploding Q-function gradients induced by the failed closeness constraint on non-expert states. + critic weighted constraint relaxation.

πŸ”Ή POLICY EXPANSION FOR BRIDGING OFFLINE-TO-ONLINE REINFORCEMENT LEARNING πŸ‘ πŸ”₯

PEX: After learning the offline policy, we use it as one candidate policy in a policy set, and further learn another policy that will be responsible for further learning as an expansion to the policy set. The two policies will be composed in an adaptive manner for interacting with the environment.

πŸ”Ή WHEN DATA GEOMETRY MEETS DEEP FUNCTION: GENERALIZING OFFLINE REINFORCEMENT LEARNING πŸ”₯ πŸŒ‹

DOGE marries dataset geometry with deep function approximators in offline RL, and enables exploitation in generalizable OOD areas rather than strictly constraining policy within data distribution.

πŸ”Ή THE IN-SAMPLE SOFTMAX FOR OFFLINE REINFORCEMENT LEARNING πŸ”₯ πŸ”₯

In-Sample Actor-Critic: POLICY OPTIMIZATION USING THE IN-SAMPLE SOFTMAX

πŸ”Ή IN-SAMPLE ACTOR CRITIC FOR OFFLINE REINFORCEMENT LEARNING πŸ”₯

In-sample Actor Critic (IAC): conduct in-sample learning by sampling-importance resampling.

πŸ”Ή OFFLINE Q-LEARNING ON DIVERSE MULTI-TASK DATA BOTH SCALES AND GENERALIZES

This work shows that offline Q-learning can scale to high-capacity models trained on large, diverse datasets.

πŸ”Ή PRE-TRAINING FOR ROBOTS: LEVERAGING DIVERSE MULTITASK DATA VIA OFFLINE RL

PTR: a framework based on offline RL that attempts to effectively learn new tasks by combining pre-training on existing robotic datasets with rapid fine-tuning on a new task.

πŸ”Ή DEEP AUTOREGRESSIVE DENSITY NETS VS NEURAL ENSEMBLES FOR MODEL-BASED OFFLINE REINFORCEMENT LEARNING

we ask which dynamics models, estimating their own uncertainty, are best suited for conservatism-based model-based offline RL algorithms.

πŸ”Ή SPARSE Q-LEARNING: OFFLINE REINFORCEMENT LEARNING WITH IMPLICIT VALUE REGULARIZATION πŸ‘ πŸ”₯ πŸŒ‹ πŸ’₯

Implicit Value Regularization (IVR) framework + Sparse Q-learning (SQL).

πŸ”Ή EXTREME Q-LEARNING: MAXENT RL WITHOUT ENTROPY πŸŒ‹ πŸ’§

Using EVT, we derive our Extreme Q-Learning framework and consequently online and offline MaxEnt Q-learning algorithms, that do not explicitly require access to a policy or its entropy.

πŸ”Ή IS CONDITIONAL GENERATIVE MODELING ALL YOU NEED FOR DECISION-MAKING? 😢

Decision Diffuser: a conditional generative model for sequential decision making.

πŸ”Ή SPRINT: SCALABLE SEMANTIC POLICY PRETRAINING VIA LANGUAGE INSTRUCTION RELABELING 😢

πŸ”Ή Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models πŸ”₯

DIAL: we utilize semi-supervised language labels leveraging the semantic understanding of CLIP to propagate knowledge onto large datasets of unlabelled demonstration data and then train language-conditioned policies on the augmented datasets.

πŸ”Ή PSEUDOMETRIC GUIDED ONLINE QUERY AND UPDATE FOR OFFLINE REINFORCEMENT LEARNING πŸ”₯

PGO2 has a structural design between the Q-neural network and the Siamese network, which guarantees simultaneous Q-network updating and pseudometric learning, promoting Q-network fine-tuning. In the inference phase, PGO2 solves convex optimizations to identify optimal query actions.

πŸ”Ή LIGHTWEIGHT UNCERTAINTY FOR OFFLINE REINFORCEMENT LEARNING VIA BAYESIAN POSTERIOR πŸ”₯

we propose a lightweight uncertainty quantifier based on approximate Bayesian inference in the last layer of the Q-network, which estimates the Bayesian posterior with minimal parameters in addition to the ordinary Q-network. Moreover, to avoid mode collapse in OOD samples and improve diversity in the Q-posterior, we introduce a repulsive force for OOD predictions in training.

πŸ”Ή Q-ENSEMBLE FOR OFFLINE RL: DON’T SCALE THE ENSEMBLE, SCALE THE BATCH SIZE

πŸ”Ή EFFECTIVE OFFLINE REINFORCEMENT LEARNING VIA CONSERVATIVE STATE VALUE ESTIMATION

CSVE:

πŸ”Ή CONTRASTIVE VALUE LEARNING: IMPLICIT MODELS FOR SIMPLE OFFLINE RL πŸ”₯ πŸŒ‹ πŸ’₯

CVL: learn a different type of model for offline RL, a model which (1) will not require predicting high-dimensional observations and (2) can be directly used to estimate Q-values without requiring either model-based rollouts or model-free temporal difference learning.

πŸ”Ή FINE-TUNING OFFLINE POLICIES WITH OPTIMISTIC ACTION SELECTION πŸ”₯

O3F: A key insight of our method is that we collect optimistic data without changing the training objective. To collect such exploratory data, we aim to use the knowledge embedded in the Q-function to direct exploration, i.e., selecting actions that are estimated to be better than the ones given by the policy.

πŸ”Ή SEMI-SUPERVISED OFFLINE REINFORCEMENT LEARNING WITH ACTION-FREE TRAJECTORIES 😢

SS-ORL contains three simple and scalable steps: (1) train a multi-transition inverse dynamics model on labelled data, which predicts actions based on transition sequences, (2) fill in proxy-actions for unlabelled data, and finally (3) train an offline RL agent on the combined dataset.
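
A minimal sketch of step (1), simplified to a single-transition inverse dynamics model (the paper uses multi-transition sequences); assumes PyTorch and continuous actions, with illustrative names:

```python
import torch
import torch.nn as nn

class InverseDynamics(nn.Module):
    """Predict the action taken between two consecutive observations."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, next_obs):
        return self.net(torch.cat([obs, next_obs], dim=-1))

def idm_loss(model, obs, next_obs, act):
    # Trained on the action-labelled subset; later used to fill in proxy
    # actions for the action-free trajectories before running offline RL.
    return ((model(obs, next_obs) - act) ** 2).mean()
```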

πŸ”Ή A CONNECTION BETWEEN ONE-STEP RL AND CRITIC REGULARIZATION IN REINFORCEMENT LEARNING πŸ‘ πŸ”₯ πŸŒ‹ πŸ’₯

applying a multi-step critic regularization method with a regularization coefficient of 1 yields the same policy as one-step RL.

πŸ”Ή DICHOTOMY OF CONTROL: SEPARATING WHAT YOU CAN CONTROL FROM WHAT YOU CANNOT πŸ”₯

DoC: conditioning the policy on a latent variable representation of the future, and designing a mutual information constraint that removes any information from the latent variable associated with randomness in the environment.

πŸ”Ή CORRECTING DATA DISTRIBUTION MISMATCH IN OFFLINE META-REINFORCEMENT LEARNING WITH FEW-SHOT ONLINE ADAPTATION πŸŒ‹ πŸ’§

GCC: To align adaptation context with the meta-training distribution, GCC utilizes greedy task inference, which diversely samples β€œtask hypotheses” and selects a hypothesis with the highest return to update the belief

πŸ”Ή OFFLINE REINFORCEMENT LEARNING FROM HETEROSKEDASTIC DATA VIA SUPPORT CONSTRAINTS πŸ‘ πŸŒ‹ πŸ”₯

CQL (ReDS): the learned policy should be free to choose per state how closely to follow the behavior policy to maximize long-term return, as long as the learned policy stays within the support of the behavior policy.

πŸ”Ή OFFLINE REINFORCEMENT LEARNING VIA WEIGHTED f-DIVERGENCE πŸ”₯ πŸ”₯

DICE: we presented DICE via weighted f-divergence, a framework to control the degree of regularization on each state-action by adopting weight k to f-divergence.

πŸ”Ή Adaptive Behavior Cloning Regularization for Stable Offline-to-Online Reinforcement Learning 😢

We propose to adaptively weigh the behavior cloning loss during online fine-tuning based on the agent’s performance and training stability.

πŸ”Ή HYBRID RL: USING BOTH OFFLINE AND ONLINE DATA CAN MAKE RL EFFICIENT πŸ”₯

Hybrid Q-Learning or Hy-Q: we prove that the algorithm is both computationally and statistically efficient whenever the offline dataset supports a high-quality policy and the environment has bounded bilinear rank.

πŸ”Ή UniMASK: Unified Inference in Sequential Decision Problems πŸ”₯

We show that a single UniMASK model is often capable of carrying out many tasks with performance similar to or better than single-task models.

πŸ”Ή OFFLINE REINFORCEMENT LEARNING WITH CLOSED-FORM POLICY IMPROVEMENT OPERATORS πŸ”₯ πŸ‘

The behavior constraint naturally motivates the use of first-order Taylor approximation, leading to a linear approximation of the policy objective. As practical datasets are usually collected by heterogeneous policies, we model the behavior policies as a Gaussian Mixture, giving rise to a closed-form policy improvement operator.

πŸ”Ή STATE-AWARE PROXIMAL PESSIMISTIC ALGORITHMS FOR OFFLINE REINFORCEMENT LEARNING πŸ’§

State-Aware Conservative Q-Learning (SA-CQL):

πŸ”Ή IN-SAMPLE ACTOR CRITIC FOR OFFLINE REINFORCEMENT LEARNING πŸ”₯

IAC: utilizes sampling-importance resampling to execute in-sample policy evaluation. IAC only uses the target Q-values of the actions in the dataset to evaluate the trained policy, thus avoiding extrapolation error.

πŸ”Ή Future-conditioned Unsupervised Pretraining for Decision Transformer πŸ”₯ πŸŒ‹

PDT: this feature can be easily incorporated into a return-conditioned framework for online finetuning, by assigning return values to possible futures and sampling future embeddings based on their respective values.

Exploration

Causal Inference

Supervised RL & Goal-conditioned Policy

πŸ”Ή LEARNING TO REACH GOALS VIA ITERATED SUPERVISED LEARNING 😢 πŸ‘ ​

GCSL: an agent continually relabels and imitates the trajectories it generates to progressively learn goal-reaching behaviors from scratch. See more in RVS and in https://www.youtube.com/watch?v=sVPm7zOrBxM&ab_channel=RAIL

πŸ”Ή RETHINKING GOAL-CONDITIONED SUPERVISED LEARNING AND ITS CONNECTION TO OFFLINE RL πŸ”₯ πŸ‘ πŸŒ‹

We propose Weighted GCSL (WGCSL), in which we introduce an advanced compound weight consisting of three parts (1) discounted weight for goal relabeling, (2) goal-conditioned exponential advantage weight, and (3) best advantage weight.
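
A minimal sketch of such a compound-weighted goal-conditioned BC loss, assuming PyTorch and that the advantage and the relabeling horizon gap are computed elsewhere; the weighting below is a simplified illustration, not the authors' exact formulation:

```python
import torch

def weighted_gcsl_loss(log_prob, advantage, horizon_gap, gamma=0.98, beta=1.0, clip=10.0):
    """log_prob: log pi(a | s, relabeled_goal); advantage: A(s, a, g) from a learned value fn;
    horizon_gap: (t' - t) between the state and the relabeled future goal."""
    w_discount = gamma ** horizon_gap                       # discounted relabeling weight
    w_adv = torch.clamp(torch.exp(beta * advantage), max=clip)  # exponentiated-advantage weight
    return -(w_discount * w_adv * log_prob).mean()
```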

πŸ”Ή Learning Latent Plans from Play πŸ”₯

Play-GCBC; Play-LMP: To learn control from play, we introduce Play-LMP, a self-supervised method that learns to organize play behaviors in a latent space, then reuse them at test time to achieve specific goals.

πŸ”Ή Reward-Conditioned Policies πŸ‘ πŸ”₯ πŸŒ‹ ​

Non-expert trajectories collected from suboptimal policies can be viewed as optimal supervision, not for maximizing the reward, but for matching the reward of the given trajectory. Any experience collected by an agent can be used as optimal supervision when conditioned on the quality of a policy.

πŸ”Ή Training Agents using Upside-Down Reinforcement Learning πŸ”₯

UDRL: The goal of learning is no longer to maximize returns in expectation, but to learn to follow commands that may take various forms such as β€œachieve total reward R in next T time steps” or β€œreach state S in fewer than T time steps”.

πŸ”Ή All You Need Is Supervised Learning: From Imitation Learning to Meta-RL With Upside Down RL πŸ‘

Given the increased interest in the RL-as-SL paradigm, this work aims to construct a more general purpose agent/learning algorithm, but with more concrete implementation details and links to existing RL concepts than prior work.

πŸ”Ή Hierarchical Reinforcement Learning With Timed Subgoals

πŸ”Ή DEEP IMITATIVE MODELS FOR FLEXIBLE INFERENCE, PLANNING, AND CONTROL πŸ‘ πŸ”₯

We propose β€œImitative Models” to combine the benefits of IL and goal-directed planning. Imitative Models are probabilistic predictive models of desirable behavior able to plan interpretable expert-like trajectories to achieve specified goals.

πŸ”Ή ViKiNG: Vision-Based Kilometer-Scale Navigation with Geographic Hints

πŸ”Ή Simplifying Deep Reinforcement Learning via Self-Supervision 😢

SSRL: We demonstrate that, without policy gradient or value estimation, an iterative procedure of β€œlabeling” data and supervised regression is sufficient to drive stable policy improvement.

πŸ”Ή Search on the Replay Buffer: Bridging Planning and Reinforcement Learning πŸ”₯ πŸ‘ ​ ​

combines the strengths of planning and reinforcement learning

πŸ”Ή Phasic Self-Imitative Reduction for Sparse-Reward Goal-Conditioned Reinforcement Learning πŸ”₯ πŸŒ‹

PAIR: In the online phase, we perform RL training and collect rollout data while in the offline phase, we perform SL on those successful trajectories from the dataset. Task reduction.

πŸ”Ή SOLVING COMPOSITIONAL REINFORCEMENT LEARNING PROBLEMS VIA TASK REDUCTION πŸ”₯ πŸŒ‹

SIR: Task reduction tackles a hard-to-solve task by actively reducing it to an easier task whose solution is known by the RL agent.

πŸ”Ή DYNAMICAL DISTANCE LEARNING FOR SEMI-SUPERVISED AND UNSUPERVISED SKILL DISCOVERY

dynamical distances: a measure of the expected number of time steps to reach a given goal state from any other states

πŸ”Ή Contextual Imagined Goals for Self-Supervised Robotic Learning πŸ‘ ​​ ​ ​

using the context-conditioned generative model to set goals that are appropriate to the current scene.

πŸ”Ή Reverse Curriculum Generation for Reinforcement Learning πŸ‘ πŸ”₯ ​

Finding the optimal start-state distribution. Our method automatically generates a curriculum of start states that adapts to the agent’s performance, leading to efficient training on goal-oriented tasks.

πŸ”Ή Goal-Aware Prediction: Learning to Model What Matters πŸ‘ πŸ”₯ πŸ’₯ Introduction is good!

we propose to direct prediction towards task relevant information, enabling the model to be aware of the current task and encouraging it to only model relevant quantities of the state space, resulting in a learning objective that more closely matches the downstream task.

πŸ”Ή C-LEARNING: LEARNING TO ACHIEVE GOALS VIA RECURSIVE CLASSIFICATION πŸ‘ πŸ’¦ πŸŒ‹ πŸ’₯

This Q-function is not useful for predicting or controlling the future state distribution. Fundamentally, this problem arises because the relationship between the reward function, the Q function, and the future state distribution in prior work remains unclear. πŸ‘» [DIAYN?]

on-policy ---> off-policy ---> goal-conditioned.

πŸ”Ή LEARNING TO UNDERSTAND GOAL SPECIFICATIONS BY MODELLING REWARD πŸ‘ πŸ‘ ​

ADVERSARIAL GOAL-INDUCED LEARNING FROM EXAMPLES

A framework within which instruction-conditional RL agents are trained using rewards obtained not from the environment, but from reward models which are jointly trained from expert examples.

πŸ”Ή Intrinsically Motivated Goal-Conditioned Reinforcement Learning: a Short Survey πŸŒ‹ πŸŒ‹ πŸ’§

This paper proposes a typology of these methods [intrinsically motivated processes (IMP): knowledge-based IMP + competence-based IMP; goal-conditioned RL agents] at the intersection of deep RL and developmental approaches, surveys recent approaches and discusses future avenues.

SEE: Language as a Cognitive Tool to Imagine Goals in Curiosity-Driven Exploration

πŸ”Ή Self-supervised Learning of Distance Functions for Goal-Conditioned Reinforcement Learning

πŸ”Ή PARROT: DATA-DRIVEN BEHAVIORAL PRIORS FOR REINFORCEMENT LEARNING πŸ‘

We propose a method for pre-training behavioral priors that can capture complex input-output relationships observed in successful trials from a wide range of previously seen tasks.

πŸ‘» see model-based ddl

πŸ”Ή LEARNING WHAT TO DO BY SIMULATING THE PAST πŸ‘ ​

we propose the Deep Reward Learning by Simulating the Past (Deep RLSP) algorithm.

πŸ”Ή Weakly-Supervised Reinforcement Learning for Controllable Behavior πŸ‘

two phase approach that learns a disentangled representation, and then uses it to guide exploration, propose goals, and inform a distance metric.

πŸ”Ή Replacing Rewards with Examples: Example-Based Policy Search via Recursive Classification πŸŒ‹ πŸ’₯

we derive a method based on recursive classification that eschews auxiliary reward functions and instead directly learns a value function from transitions and successful outcomes.

πŸ”Ή [C-learning: Learning to achieve goals via recursive classification]

πŸ”Ή Example-Based Offline Reinforcement Learning without Rewards

πŸ”Ή Outcome-Driven Reinforcement Learning via Variational Inference πŸ‘ πŸ”₯ πŸ’§

by framing the problem of achieving desired outcomes as variational inference, we can derive an off-policy RL algorithm, a reward function learnable from environment interactions, and a novel Bellman backup that contains a state–action dependent dynamic discount factor for the reward and bootstrap.

πŸ”Ή Discovering Diverse Solutions in Deep Reinforcement Learning πŸ‘ ​

learn infinitely many solutions by training a policy conditioned on a continuous or discrete low-dimensional latent variable.

πŸ”Ή Goal-Conditioned Reinforcement Learning with Imagined Subgoals πŸ‘ πŸ”₯ πŸŒ‹ ​ ​ ​

This high-level policy predicts intermediate states halfway to the goal using the value function as a reachability metric. We don't require the policy to reach these subgoals explicitly. Instead, we use them to define a prior policy, and incorporate this prior into a KL-constrained policy optimization scheme to speed up and regularize training.

πŸ”Ή Goal-Space Planning with Subgoal Models πŸ‘ πŸ”₯ πŸŒ‹

Goal-Space Planning (GSP): The key idea is to plan in a much smaller space of subgoals, constraining background planning to a set of (abstract) subgoals, learning only local, subgoal-conditioned models, and using these (high-level) subgoal values to update state values.

πŸ”Ή Discovering Generalizable Skills via Automated Generation of Diverse Tasks 😢

As opposed to prior work on unsupervised discovery of skills which incentivizes the skills to produce different outcomes in the same environment, our method pairs each skill with a unique task produced by a trainable task generator. Procedural content generation (PCG).

πŸ”Ή Unbiased Methods for Multi-Goal RL πŸ˜• πŸ‘ πŸ’§

First, we vindicate HER by proving that it is actually unbiased in deterministic environments, such as many optimal control settings. Next, for stochastic environments in continuous spaces, we tackle sparse rewards by directly taking the infinitely sparse reward limit.

πŸ”Ή Goal-Aware Cross-Entropy for Multi-Target Reinforcement Learning πŸ‘

GACE: that can be utilized in a self-supervised way using auto-labeled goal states alongside reinforcement learning.

πŸ”Ή DisCo RL: Distribution-Conditioned Reinforcement Learning for General-Purpose Policies πŸ”₯

Contextual policies provide this capability in principle, but the representation of the context determines the degree of generalization and expressivity. Categorical contexts preclude generalization to entirely new tasks. Goal-conditioned policies may enable some generalization, but cannot capture all tasks that might be desired.

πŸ”Ή Demonstration-Conditioned Reinforcement Learning for Few-Shot Imitation πŸ”₯

Given a training set consisting of demonstrations, reward functions and transition distributions for multiple tasks, the idea is to define a policy that takes demonstrations and current state as inputs, and to train this policy to maximize the average of the cumulative reward over the set of training tasks.

πŸ”Ή C-LEARNING: HORIZON-AWARE CUMULATIVE ACCESSIBILITY ESTIMATION πŸ˜• πŸ’§

we introduce the concept of cumulative accessibility functions, which measure the reachability of a goal from a given state within a specified horizon.

πŸ”Ή C-PLANNING: AN AUTOMATIC CURRICULUM FOR LEARNING GOAL-REACHING TASKS πŸ”₯

Frame the learning of the goal-conditioned policies as expectation maximization: the E-step corresponds to planning an optimal sequence of waypoints using graph search, while the M-step aims to learn a goal-conditioned policy to reach those waypoints.

πŸ”Ή Imitating Past Successes can be Very Suboptimal πŸ‘ πŸ”₯ πŸŒ‹

we prove that existing outcome-conditioned imitation learning methods do not necessarily improve the policy; rather, in some settings they can decrease the expected reward. Nonetheless, we show that a simple modification results in a method that does guarantee policy improvement, under some assumptions.

πŸ”Ή Bisimulation Makes Analogies in Goal-Conditioned Reinforcement Learning πŸ”₯ πŸŒ‹

We propose a new form of state abstraction called goal-conditioned bisimulation that captures functional equivariance, allowing for the reuse of skills to achieve new goals.

πŸ”Ή Goal-Conditioned Q-Learning as Knowledge Distillation πŸ”₯ πŸŒ‹

ReenGAGE: the current Q-value function and the target Q-value estimate are both functions of the goal, and we would like to train the Q-value function to match its target for all goals.

++DATA++

πŸ”Ή Connecting the Dots Between MLE and RL for Sequence Prediction πŸ‘

A rich set of other algorithms such as RAML, SPG, and data noising, have also been developed from different perspectives. This paper establishes a formal connection between these algorithms. We present a generalized entropy regularized policy optimization formulation, and show that the apparently distinct algorithms can all be reformulated as special instances of the framework, with the only difference being the configurations of a reward function and a couple of hyperparameters.

πŸ”Ή Learning Data Manipulation for Augmentation and Weighting πŸ‘ πŸ”₯

We have developed a new method of learning different data manipulation schemes with the same single algorithm. Different manipulation schemes reduce to just different parameterization of the data reward function. The manipulation parameters are trained jointly with the target model parameters. (Equivalence between Data and Reward, Gradient-based Reward Learning)

Goal-relabeling & Self-imitation

πŸ”Ή Rewriting History with Inverse RL: Hindsight Inference for Policy Improvement

HIPI: MaxEnt RL and MaxEnt inverse RL optimize the same multi-task RL objective with respect to trajectories and tasks, respectively.

πŸ”Ή HINDSIGHT FORESIGHT RELABELING FOR META-REINFORCEMENT LEARNING πŸ”₯ πŸŒ‹

Hindsight Foresight Relabeling (HFR): We construct a relabeling distribution using the combination of hindsight, which is used to relabel trajectories using reward functions from the training task distribution, and foresight, which takes the relabeled trajectories and computes the utility of each trajectory for each task.

πŸ”Ή Generalized Hindsight for Reinforcement Learning πŸ‘

Generalized Hindsight: an approximate inverse reinforcement learning technique for relabeling behaviors with the right tasks.

πŸ”Ή GENERALIZED DECISION TRANSFORMER FOR OFFLINE HINDSIGHT INFORMATION MATCHING πŸ‘ πŸ”₯

We present Generalized Decision Transformer (GDT) for solving any HIM (hindsight information matching) problem, and show how different choices for the feature function and the anti-causal aggregator not only recover DT as a special case, but also lead to novel Categorical DT (CDT) and Bi-directional DT (BDT) for matching different statistics of the future.

πŸ”Ή Hindsight Experience Replay | Curriculum-guided Hindsight Experience Replay | COMPETITIVE EXPERIENCE REPLAY πŸ”₯ | Energy-Based Hindsight Experience Prioritization | DHER: HINDSIGHT EXPERIENCE REPLAY FOR DYNAMIC GOALS
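
The entries in this block build on hindsight relabeling; below is a minimal sketch of the classic "future" relabeling strategy, assuming dict-style transitions where `achieved_goal` is the goal reached after executing the action, and an externally supplied `reward_fn` (illustrative names, not any paper's code):

```python
import random

def her_relabel(episode, reward_fn, k=4):
    """episode: list of dicts with keys obs, action, achieved_goal, goal.
    Returns extra transitions whose goal is replaced by a goal achieved later
    in the same episode, with the reward recomputed for the new goal."""
    relabeled = []
    for t, tr in enumerate(episode[:-1]):
        future_steps = random.sample(
            range(t + 1, len(episode)), min(k, len(episode) - t - 1)
        )
        for f in future_steps:
            new_goal = episode[f]["achieved_goal"]
            relabeled.append({
                **tr,
                "goal": new_goal,
                "reward": reward_fn(tr["achieved_goal"], new_goal),
            })
    return relabeled
```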

πŸ”Ή Diversity-based Trajectory and Goal Selection with Hindsight Experience Replay πŸ‘

DTGSH: 1) a diversity-based trajectory selection module to sample valuable trajectories for the further goal selection; 2) a diversity-based goal selection module to select transitions with diverse goal states from the previously selected trajectories.

πŸ”Ή Exploration via Hindsight Goal Generation πŸ‘ πŸ”₯ ​

a novel algorithmic framework that generates valuable hindsight goals which are easy for an agent to achieve in the short term and are also potential for guiding the agent to reach the actual goal in the long term.

πŸ”Ή UNDERSTANDING HINDSIGHT GOAL RELABELING REQUIRES RETHINKING DIVERGENCE MINIMIZATION πŸ‘ πŸ”₯ πŸŒ‹ πŸ’§

we develop a unified objective for goal-reaching that explains such a connection, from which we can derive goal-conditioned supervised learning (GCSL) and the reward function in hindsight experience replay (HER) from first principles.

πŸ”Ή CURIOUS: Intrinsically Motivated Modular Multi-Goal Reinforcement Learning πŸ‘ ​

This paper proposes CURIOUS, an algorithm that leverages 1) a modular Universal Value Function Approximator with hindsight learning to achieve a diversity of goals of different kinds within a unique policy and 2) an automated curriculum learning mechanism that biases the attention of the agent towards goals maximizing the absolute learning progress.

πŸ”Ή Hindsight Generative Adversarial Imitation Learning πŸ”₯

achieves imitation learning without requiring demonstrations. [see self-imitation learning]

πŸ”Ή MHER: Model-based Hindsight Experience Replay πŸ‘

Replacing original goals with virtual goals generated from interaction with a trained dynamics model.

πŸ”Ή Policy Continuation with Hindsight Inverse Dynamics πŸ‘ πŸ”₯ ​ ​

This approach learns from Hindsight Inverse Dynamics based on Hindsight Experience Replay.

πŸ”Ή USHER: Unbiased Sampling for Hindsight Experience Replay πŸ”₯ πŸŒ‹

We propose an asymptotically unbiased importance-sampling-based algorithm to address this problem without sacrificing performance on deterministic environments.

πŸ”Ή Experience Replay Optimization πŸ‘ πŸ”₯

Self-imitation; experience replay: we propose a novel experience replay optimization (ERO) framework which alternately updates two policies: the agent policy, and the replay policy. The agent is updated to maximize the cumulative reward based on the replayed data, while the replay policy is updated to provide the agent with the most useful experiences.

πŸ”Ή MODEL-AUGMENTED PRIORITIZED EXPERIENCE REPLAY 😢

We propose a novel experience replay method, which we call model-augmented priority experience replay (MaPER), that employs new learnable features driven from components in model-based RL (MbRL) to calculate the scores on experiences.

πŸ”Ή TOPOLOGICAL EXPERIENCE REPLAY πŸ”₯

TER: If the data sampling strategy ignores the precision of Q-value estimate of the next state, it can lead to useless and often incorrect updates to the Q-values.

πŸ”Ή Memory Augmented Policy Optimization for Program Synthesis and Semantic Parsing πŸ‘

Our key idea is to express the expected return objective as a weighted sum of two terms: an expectation over the high-reward trajectories inside a memory buffer, and a separate expectation over trajectories outside of the buffer.

πŸ”Ή RETRIEVAL-AUGMENTED REINFORCEMENT LEARNING πŸ‘

We augment an RL agent with a retrieval process (parameterized as a neural network) that has direct access to a dataset of experiences. The retrieval process is trained to retrieve information from the dataset that may be useful in the current context.

πŸ”Ή VARIATIONAL ORACLE GUIDING FOR REINFORCEMENT LEARNING πŸ”₯

Variational latent oracle guiding (VLOG): An important but under-explored aspect is how to leverage oracle observation (the information that is invisible during online decision making, but is available during offline training) to facilitate learning.

πŸ”Ή WISH YOU WERE HERE: HINDSIGHT GOAL SELECTION FOR LONG-HORIZON DEXTEROUS MANIPULATION 😢

We extend hindsight relabelling mechanisms to guide exploration along task-specific distributions implied by a small set of successful demonstrations.

πŸ”Ή Hindsight Task Relabelling: Experience Replay for Sparse Reward Meta-RL 😢

HTR: we present a formulation of hindsight relabeling for meta-RL, which relabels experience during meta-training to enable learning to learn entirely using sparse reward.

πŸ”Ή Remember and Forget for Experience Replay πŸ‘

ReF-ER (1) skips gradients computed from experiences that are too unlikely with the current policy and (2) regulates policy changes within a trust region of the replayed behaviors.

πŸ”Ή BENCHMARKING SAMPLE SELECTION STRATEGIES FOR BATCH REINFORCEMENT LEARNING πŸ‘

We compare six variants of PER (temporal-difference error, n-step return, self-imitation learning objective, pseudo-count, uncertainty, and likelihood) based on various heuristic priority metrics that focus on different aspects of the offline learning setting.

πŸ”Ή An Equivalence between Loss Functions and Non-Uniform Sampling in Experience Replay πŸ”₯

We show that any loss function evaluated with non-uniformly sampled data can be transformed into another uniformly sampled loss function with the same expected gradient.
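
A minimal sketch of a familiar special case of this equivalence, assuming PyTorch: transitions drawn with probabilities `sample_probs` are re-weighted by 1/(N·p) so that the expected gradient matches the uniformly sampled squared TD loss (names are illustrative):

```python
import torch

def importance_weighted_loss(td_errors: torch.Tensor, sample_probs: torch.Tensor, buffer_size: int):
    # Transition i was drawn with probability sample_probs[i]; weighting its term by
    # 1 / (buffer_size * sample_probs[i]) makes the prioritized batch an unbiased
    # estimate of the uniformly sampled squared TD loss.
    weights = 1.0 / (buffer_size * sample_probs)
    return (weights * td_errors.pow(2)).mean()
```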

πŸ”Ή Self-Imitation Learning via Generalized Lower Bound Q-learning πŸ”₯

To provide a formal motivation for the potential performance gains provided by self-imitation learning, we show that n-step lower bound Q-learning achieves a trade-off between fixed point bias and contraction rate, drawing close connections to the popular uncorrected n-step Q-learning.

πŸ”Ή Understanding Multi-Step Deep Reinforcement Learning: A Systematic Study of the DQN Target 😢

we combine the n-step action-value algorithms Retrace, Q-learning, Tree Backup, Sarsa, and Q(Οƒ) with an architecture analogous to DQN. It suggests that off-policy correction is not always necessary for learning from samples from the experience replay buffer.

πŸ”Ή Adaptive Trade-Offs in Off-Policy Learning πŸ”₯

We take a unifying view of this space of algorithms (off-policy learning algorithms ), and consider their trade-offs of three fundamental quantities: update variance, fixed-point bias, and contraction rate.

  • Imitation Learning (See Upper)

    πŸ”Ή To Follow or not to Follow: Selective Imitation Learning from Observations πŸ‘ ​

    imitating every step in the demonstration often becomes infeasible when the learner and its environment are different from the demonstration.

  • reward function

    πŸ”Ή QUANTIFYING DIFFERENCES IN REWARD FUNCTIONS πŸŒ‹ πŸ’¦ ​ ​

    we introduce the Equivalent-Policy Invariant Comparison (EPIC) distance to quantify the difference between two reward functions directly, without training a policy. We prove EPIC is invariant on an equivalence class of reward functions that always induce the same optimal policy.

πŸ”Ή IN-CONTEXT REINFORCEMENT LEARNING WITH ALGORITHM DISTILLATION πŸ”₯ πŸ”₯

We propose Algorithm Distillation (AD), a method for distilling reinforcement learning (RL) algorithms into neural networks by modeling their training histories with a causal sequence model.

Model-based RL & world models

πŸ”Ή A SURVEY ON MODEL-BASED REINFORCEMENT LEARNING

πŸ”Ή Learning Latent Dynamics for Planning from Pixels ​​ πŸ’¦ πŸ’¦ ​

πŸ”Ή DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION πŸ’¦ ​

πŸ”Ή CONTRASTIVE LEARNING OF STRUCTURED WORLD MODELS πŸ”₯ πŸŒ‹ ​ ​

πŸ”Ή Learning Predictive Models From Observation and Interaction πŸ”₯ ​related work is good!

By combining interaction and observation data, our model is able to learn to generate predictions for complex tasks and new environments without costly expert demonstrations.

πŸ”Ή medium Tutorial on Model-Based Methods in Reinforcement Learning (icml2020) πŸ’¦ ​

rail Model-Based Reinforcement Learning: Theory and Practice πŸ’¦ ​​ ​ ​

πŸ”Ή What can I do here? A Theory of Affordances in Reinforcement Learning πŸ‘ πŸ’§

β€œaffordances” to describe the fact that certain states enable an agent to do certain actions, in the context of embodied agents. In this paper, we develop a theory of affordances for agents who learn and plan in Markov Decision Processes.

πŸ”Ή When to Trust Your Model: Model-Based Policy Optimization πŸ”₯ πŸŒ‹ πŸ’§ πŸ’₯ ​

MBPO: we study the role of model usage in policy optimization both theoretically and empirically.
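
A minimal sketch of the branched-rollout mechanism MBPO builds on, under toy assumptions (a hand-written 1-D `learned_model` and random `policy` stand in for the learned ensemble and the SAC policy): short model rollouts are branched from real states and stored in a separate model buffer for policy optimization.

```python
import numpy as np

rng = np.random.default_rng(0)

def learned_model(s, a):            # stand-in for an ensemble dynamics model
    return s + a + 0.01 * rng.normal(), -abs(s)   # next_state, reward

def policy(s):                      # stand-in for the current policy
    return np.clip(-0.5 * s + 0.1 * rng.normal(), -1, 1)

real_buffer = [rng.normal() for _ in range(100)]   # states collected in the real env
model_buffer, k = [], 5                            # k = branched rollout length

for _ in range(200):
    s = real_buffer[rng.integers(len(real_buffer))]  # branch point: a real state
    for _ in range(k):
        a = policy(s)
        s_next, r = learned_model(s, a)
        model_buffer.append((s, a, r, s_next))
        s = s_next

print(len(model_buffer), "imagined transitions available for off-policy policy optimization")
```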

πŸ”Ή Visual Foresight: Model-based deep reinforcement learning for vision-based robotic control πŸ‘

We presented an algorithm that leverages self-supervision from visual prediction to learn a deep dynamics model on images, and show how it can be embedded into a planning framework to solve a variety of robotic control tasks.

πŸ”Ή LEARNING STATE REPRESENTATIONS VIA RETRACING IN REINFORCEMENT LEARNING πŸ‘

CCWM: a self-supervised instantiation of β€œlearning via retracing” for joint representation learning and generative model learning under the model-based RL setting.

πŸ”Ή Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization πŸ”₯ πŸŒ‹ πŸ’₯

we propose a novel concept of deployment efficiency, measuring the number of distinct data-collection policies that are used during policy learning. ​

πŸ”Ή Context-aware Dynamics Model for Generalization in Model-Based Reinforcement Learning πŸ‘ πŸ”₯ ​ ​

The intuition is that the true context of the underlying MDP can be captured from recent experiences. Learning a global model that can generalize across different dynamics is challenging, so we decompose the task of learning a global dynamics model into two stages: (a) learning a context latent vector that captures the local dynamics, then (b) predicting the next state conditioned on it.

πŸ”Ή Trajectory-wise Multiple Choice Learning for Dynamics Generalization in Reinforcement Learning πŸ‘ ​

The main idea is updating the most accurate prediction head to specialize each head in certain environments with similar dynamics, i.e., clustering environments.

πŸ”Ή Optimism is All You Need: Model-Based Imitation Learning From Observation Alone πŸ’¦ ​

πŸ”Ή PlanGAN: Model-based Planning With Sparse Rewards and Multiple Goals πŸ‘

train an ensemble of conditional generative models (GANs) to generate plausible trajectories that lead the agent from its current state towards a specified goal. We then combine these imagined trajectories into a novel planning algorithm in order to achieve the desired goal as efficiently as possible.

πŸ”Ή MODEL-ENSEMBLE TRUST-REGION POLICY OPTIMIZATION πŸ‘

we propose to use an ensemble of models to maintain the model uncertainty and regularize the learning process.

πŸ”Ή Sample Efficient Reinforcement Learning via Model-Ensemble Exploration and Exploitation 😢

MEEE, a model-ensemble method that consists of optimistic exploration and weighted exploitation.

πŸ”Ή Regularizing Model-Based Planning with Energy-Based Models πŸ‘ πŸ”₯

We focus on planning with learned dynamics models and propose to regularize it using energy estimates of state transitions in the environment. ---> probabilistic ensembles with trajectory sampling (PETS), DAE regularization;

πŸ”Ή Model-Based Planning with Energy-Based Models πŸ”₯

We show that energy-based models (EBMs) are a promising class of models to use for model-based planning. EBMs naturally support inference of intermediate states given start and goal state distributions.

πŸ”Ή Can Autonomous Vehicles Identify, Recover From, and Adapt to Distribution Shifts? πŸ‘

RIP: Our method can detect and recover from some distribution shifts, reducing the overconfident and catastrophic extrapolations in OOD scenes.

πŸ”Ή Model-Based Reinforcement Learning via Latent-Space Collocation πŸ”₯

LatCo: It is easier to solve long-horizon tasks by planning sequences of states rather than just actions, as the effects of actions greatly compound over time and are harder to optimize.

πŸ”Ή Reinforcement Learning with Action-Free Pre-Training from Videos 😢

APV: we pre-train an action-free latent video prediction model, and then utilize the pre-trained representations for efficiently learning action-conditional world models on unseen environments.

πŸ”Ή Regularizing Trajectory Optimization with Denoising Autoencoders πŸ”₯

The idea is that we want to reward familiar trajectories and penalize unfamiliar ones because the model is likely to make larger errors for the unfamiliar ones.

πŸ”Ή Bridging Imagination and Reality for Model-Based Deep Reinforcement Learning πŸ‘ πŸ”₯ πŸ’§

BIRD: our basic idea is to leverage information from real trajectories to endow policy improvement on imaginations with awareness of discrepancy between imagination and reality.

πŸ”Ή ON-POLICY MODEL ERRORS IN REINFORCEMENT LEARNING πŸ‘

We present on-policy corrections (OPC) that combine real-world data and a learned model in order to get the best of both worlds. The core idea is to exploit the real-world data for on-policy predictions and use the learned model only to generalize to different actions.

πŸ”Ή ALGORITHMIC FRAMEWORK FOR MODEL-BASED DEEP REINFORCEMENT LEARNING WITH THEORETICAL GUARANTEES πŸ‘ πŸŒ‹ πŸ’§ πŸ’¦ πŸ”₯

SLBO: We design a meta-algorithm with a theoretical guarantee of monotone improvement to a local maximum of the expected reward. The meta-algorithm iteratively builds a lower bound of the expected reward based on the estimated dynamical model and sample trajectories, and then maximizes the lower bound jointly over the policy and the model.

πŸ”Ή Model-Augmented Q-Learning 😢

We propose to estimate not only the Q-values but also both the transition and the reward with a shared network. We further utilize the estimated reward from the model estimators for Q-learning, which promotes interaction between the estimators.

πŸ”Ή Monotonic Robust Policy Optimization with Model Discrepancy πŸ‘ πŸŒ‹ πŸ’₯

We propose a robust policy optimization approach, named MRPO, for improving both the average and worst-case performance of policies. We theoretically derived a lower bound for the worst-case performance of a given policy over all environments, and formulated an optimization problem to optimize the policy and sampling distribution together, subject to constraints that bounded the update step in policy optimization and statistical distance between the worst and average case environments.

πŸ”Ή Policy Gradient Method For Robust Reinforcement Learning

[poster]

πŸ”Ή Trust the Model When It Is Confident: Masked Model-based Actor-Critic πŸ‘ πŸŒ‹

We derive a general performance bound for model-based RL and theoretically show that the divergence between the return in the model rollouts and that in the real environment can be reduced with restricted model usage.

πŸ”Ή MBDP: A Model-based Approach to Achieve both Robustness and Sample Efficiency via Double Dropout Planning πŸ‘ πŸ”₯

MBDP: Model-Based Double-dropout Planning (MBDP) consists of two kinds of dropout mechanisms, where the rollout-dropout aims to improve the robustness with a small cost of sample efficiency, while the model-dropout is designed to compensate for the lost efficiency at a slight expense of robustness.

πŸ”Ή PILCO (probabilistic inference for learning control) Deep PILCO πŸ‘

πŸ”Ή Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models πŸ‘

Employing uncertainty-aware dynamics models: we propose a new algorithm called probabilistic ensembles with trajectory sampling (PETS) that combines uncertainty-aware deep network dynamics models with sampling-based uncertainty propagation.
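
A minimal sketch of the PETS planning loop under toy assumptions (a 1-D system and a hand-written `ensemble_step` standing in for the learned probabilistic ensemble): action sequences are propagated with trajectory sampling through ensemble members and refined with CEM, and only the first action is executed (MPC style).

```python
import numpy as np

rng = np.random.default_rng(0)
H, POP, ELITE, ITERS, ENSEMBLE = 10, 64, 8, 5, 5

def ensemble_step(member, s, a):            # stand-in for a probabilistic dynamics model
    drift = 0.9 + 0.02 * member             # each member predicts slightly differently
    return drift * s + a + 0.05 * rng.normal()

def reward(s, a):
    return -(s ** 2) - 0.1 * a ** 2         # drive the state toward zero

def plan(s0):
    mean, std = np.zeros(H), np.ones(H)
    for _ in range(ITERS):
        actions = np.clip(mean + std * rng.normal(size=(POP, H)), -1, 1)
        returns = np.zeros(POP)
        for p in range(POP):                 # trajectory sampling: one member per particle
            member, s = rng.integers(ENSEMBLE), s0
            for t in range(H):
                returns[p] += reward(s, actions[p, t])
                s = ensemble_step(member, s, actions[p, t])
        elites = actions[np.argsort(returns)[-ELITE:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-3
    return mean[0]                           # MPC: execute only the first action

print(plan(s0=2.0))
```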

πŸ”Ή Plan Online, Learn Offline: Efficient Learning and Exploration via Model-Based Control πŸ‘ πŸ”₯

POLO utilizes a global value function approximation scheme, a local trajectory optimization subroutine, and an optimistic exploration scheme.

πŸ”Ή Learning Off-Policy with Online Planning πŸ”₯ πŸŒ‹

LOOP: We provide a theoretical analysis of this method, suggesting a tradeoff between model errors and value function errors and empirically demonstrate this tradeoff to be beneficial in deep reinforcement learning. H-step

πŸ”Ή Calibrated Model-Based Deep Reinforcement Learning πŸ”₯

This paper explores which uncertainties are needed for model-based reinforcement learning and argues that good uncertainties must be calibrated, i.e. their probabilities should match empirical frequencies of predicted events.

πŸ”Ή Model Imitation for Model-Based Reinforcement Learning πŸ‘ πŸŒ‹

We propose to learn the transition model by matching the distributions of multi-step rollouts sampled from the transition model and the real ones via WGAN. We theoretically show that matching the two can minimize the difference of cumulative rewards between the real transition and the learned one.

πŸ”Ή Model-based Policy Optimization with Unsupervised Model Adaptation πŸ”₯

We derive a lower bound of the expected return, which inspires a bound maximization algorithm by aligning the simulated and real data distributions. To this end, we propose a novel model-based RL framework, AMPO, which introduces unsupervised model adaptation to minimize the integral probability metric (IPM) between feature distributions from real and simulated data.

πŸ”Ή Bidirectional Model-based Policy Optimization πŸ‘ πŸ”₯ πŸ”₯

We propose to additionally construct a backward dynamics model to reduce the reliance on accuracy in forward model predictions: Bidirectional Model-based Policy Optimization (BMPO) to utilize both the forward model and backward model to generate short branched rollouts for policy optimization.

πŸ”Ή Backward Imitation and Forward Reinforcement Learning via Bi-directional Model Rollouts πŸ”₯

BIFRL: the agent treats backward rollout traces as expert demonstrations for the imitation of excellent behaviors, and then collects forward rollout transitions for policy reinforcement.

πŸ”Ή Self-Consistent Models and Values πŸ”₯

We investigate a way of augmenting model-based RL, by additionally encouraging a learned model and value function to be jointly self-consistent.

πŸ”Ή MODEL-AUGMENTED ACTOR-CRITIC: BACKPROPAGATING THROUGH PATHS πŸ”₯ πŸŒ‹

MAAC: We exploit the fact that the learned simulator is differentiable and optimize the policy with the analytical gradient. The objective is theoretically analyzed in terms of the model and value error, and we derive a policy improvement expression with respect to those terms.

πŸ”Ή How to Learn a Useful Critic? Model-based Action-Gradient-Estimator Policy Optimization πŸ”₯

MAGE backpropagates through the learned dynamics to compute gradient targets in temporal difference learning, leading to a critic tailored for policy improvement.

πŸ”Ή Model-Based Value Expansion for Efficient Model-Free Reinforcement Learning

πŸ”Ή Discriminator Augmented Model-Based Reinforcement Learning πŸ‘ πŸŒ‹

Our approach trains a discriminative model to assess the quality of sampled transitions during planning, and upweight or downweight value estimates computed from high and low quality samples, respectively. We can learn biased dynamics models with advantageous properties, such as reduced value estimation variance during planning.

πŸ”Ή Variational Model-based Policy Optimization πŸ‘ πŸ”₯ πŸŒ‹ πŸ’§

Jointly learn and improve model and policy using a universal objective function: We propose model-based and model-free policy iteration (actor-critic) style algorithms for the E-step and show how the variational distribution learned by them can be used to optimize the M-step in a fully model-based fashion.

πŸ”Ή Model-Based Reinforcement Learning via Imagination with Derived Memory πŸ”₯

IDM: It enables the agent to learn policy from enriched diverse imagination with prediction-reliability weight, thus improving sample efficiency and policy robustness

πŸ”Ή MISMATCHED NO MORE: JOINT MODEL-POLICY OPTIMIZATION FOR MODEL-BASED RL πŸ”₯ πŸ”₯

We propose a single objective for jointly training the model and the policy, such that updates to either component increases a lower bound on expected return.

πŸ”Ή Operator Splitting Value Iteration πŸ”₯ πŸŒ‹ πŸ’₯

Inspired by the splitting approach in numerical linear algebra, we introduce Operator Splitting Value Iteration (OS-VI) for both Policy Evaluation and Control problems. OS-VI achieves a much faster convergence rate when the model is accurate enough.

πŸ”Ή Model-Based Reinforcement Learning via Meta-Policy Optimization πŸ”₯ πŸŒ‹

MB-MPO: Using an ensemble of learned dynamic models, MB-MPO meta-learns a policy that can quickly adapt to any model in the ensemble with one policy gradient step, which foregoes the strong reliance on accurate learned dynamics models.

πŸ”Ή A RELATIONAL INTERVENTION APPROACH FOR UNSUPERVISED DYNAMICS GENERALIZATION IN MODELBASED REINFORCEMENT LEARNING πŸ‘ πŸ”₯

We propose an intervention module to identify the probability of two estimated factors belonging to the same environment, and a relational head to cluster those estimated ZΜ‚'s that are from the same environments with high probability, thus reducing redundant information unrelated to the environment.

πŸ”Ή Value-Aware Loss Function for Model-based Reinforcement Learning πŸŒ‹

Estimating a generative model that minimizes a probabilistic loss, such as the log-loss, is an overkill because it does not take into account the underlying structure of decision problem and the RL algorithm that intends to solve it. We introduce a loss function that takes the structure of the value function into account.

πŸ”Ή Iterative Value-Aware Model Learning πŸŒ‹ πŸ’₯

Iterative VAML, that benefits from the structure of how the planning is performed (i.e., through approximate value iteration) to devise a simpler optimization problem.

πŸ”Ή Configurable Markov Decision Processes πŸŒ‹ πŸ’₯ πŸ’§

In Conf-MDPs the environment dynamics can be partially modified to improve the performance of the learning agent.

πŸ”Ή Bridging Worlds in Reinforcement Learning with Model-Advantage πŸŒ‹

we show relationships between the proposed model advantage and generalization in RL β€” using which we provide guarantees on the gap in performance of an agent in new environments.

πŸ”Ή Model-Advantage Optimization for Model-Based Reinforcement Learning πŸ‘ πŸ”₯ πŸ’§

a novel value-aware objective that is an upper bound on the absolute performance difference of a policy across two models.

πŸ”Ή Policy-Aware Model Learning for Policy Gradient Methods πŸ”₯ πŸŒ‹

Decision-Aware Model Learning: We focus on policy gradient planning algorithms and derive new loss functions for model learning that incorporate how the planner uses the model.

πŸ”Ή Gradient-Aware Model-Based Policy Search πŸ”₯ πŸŒ‹

Beyond Maximum Likelihood Model Estimation in Model-based Policy Search ppt

πŸ”Ή Model-Based Reinforcement Learning with Value-Targeted Regression πŸ˜•

πŸ”Ή Decision-Aware Model Learning for Actor-Critic Methods: When Theory Does Not Meet Practice 😢

we show empirically that combining Actor-Critic and value-aware model learning can be quite difficult and that naive approaches such as maximum likelihood estimation often achieve superior performance with less computational cost.

πŸ”Ή The Value Equivalence Principle for Model-Based Reinforcement Learning πŸŒ‹ πŸ’§

We introduced the principle of value equivalence: two models are value equivalent with respect to a set of functions and a set of policies if they yield the same updates of the former on the latter. Value equivalence formalizes the notion that models should be tailored to their future use and provides a mechanism to incorporate such knowledge into the model learning process.

πŸ”Ή Proper Value Equivalence πŸ’¦

We start by generalizing the concept of VE to order-k counterparts defined with respect to k applications of the Bellman operator. This leads to a family of VE classes that increase in size as k β†’ ∞. In the limit, all functions become value functions, and we have a special instantiation of VE which we call proper VE or simply PVE.

πŸ”Ή Minimax Model Learning πŸŒ‹ πŸ’§ πŸ’₯

our approach allows for greater robustness under model misspecification or distribution shift induced by learning/evaluating policies that are distinct from the data-generating policy.

πŸ”Ή On Effective Scheduling of Model-based Reinforcement Learning πŸ”₯

AutoMBPO: we aim to investigate how to appropriately schedule these hyperparameters, i.e., real data ratio, model training frequency, policy training iteration, and rollout length, to achieve optimal performance of Dyna-style MBRL algorithms.

πŸ”Ή Live in the Moment: Learning Dynamics Model Adapted to Evolving Policy πŸ‘

Policy-adaptation Model-based Actor-Critic (PMAC), which learns a policy-adapted dynamics model based on a policy-adaptation mechanism. This mechanism dynamically adjusts the historical policy mixture distribution to ensure the learned model can continually adapt to the state-action visitation distribution of the evolving policy.

πŸ”Ή When to Update Your Model: Constrained Model-based Reinforcement Learning πŸ”₯ πŸŒ‹ πŸ’§

CMLO: learning models from a dynamically-varying number of explorations benefits the eventual returns.

πŸ”Ή Model-based Safe Deep Reinforcement Learning via a Constrained Proximal Policy Optimization Algorithm

We use an ensemble of neural networks with different initializations to tackle epistemic and aleatoric uncertainty issues faced during environment model learning.

πŸ”Ή DREAMERPRO: RECONSTRUCTION-FREE MODEL-BASED REINFORCEMENT LEARNING WITH PROTOTYPICAL REPRESENTATIONS

combining the prototypical representation learning with temporal dynamics learning.

πŸ”Ή PROTOTYPICAL CONTEXT-AWARE DYNAMICS GENERALIZATION FOR HIGH-DIMENSIONAL MODEL-BASED REINFORCEMENT LEARNING πŸ”₯

ProtoCAD: extracts useful contextual information with the help of the prototypes clustered over batch and benefits model-based RL in two folds: 1) It utilizes a temporally consistent prototypical regularizer; 2) A context representation is designed and can significantly improve the dynamics generalization ability.

πŸ”Ή IS MODEL ENSEMBLE NECESSARY? MODEL-BASED RL VIA A SINGLE MODEL WITH LIPSCHITZ REGULARIZED VALUE FUNCTION

We hypothesize that the key functionality of the probabilistic dynamics model ensemble is to regularize the Lipschitz condition of the value function using generated samples.

β­• Zero-Order Trajectory Optimizers / Planning

πŸ”Ή Sample-efficient Cross-Entropy Method for Real-time Planning πŸ’§

Cross-Entropy Method (CEM):

πŸ”Ή Extracting Strong Policies for Robotics Tasks from Zero-Order Trajectory Optimizers πŸ’§

Adaptive Policy EXtraction (APEX):

β­• dynamic distance learning

πŸ”Ή MODEL-BASED VISUAL PLANNING WITH SELF-SUPERVISED FUNCTIONAL DISTANCES πŸ‘

We present a self-supervised method for model-based visual goal reaching, which uses both a visual dynamics model as well as a dynamical distance function learned using model-free RL. Related work!

β­• model-based offline

πŸ”Ή Representation Balancing MDPs for Off-Policy Policy Evaluation πŸ’§ ​

πŸ”Ή REPRESENTATION BALANCING OFFLINE MODEL-BASED REINFORCEMENT LEARNING πŸ’§ ​

πŸ”Ή Skill-based Model-based Reinforcement Learning πŸ‘

SkiMo: enables planning in the skill space using a skill dynamics model, which directly predicts skill outcomes rather than predicting all the small details of intermediate states step by step.

πŸ”Ή MODEL-BASED REINFORCEMENT LEARNING WITH MULTI-STEP PLAN VALUE ESTIMATION 😢

MPPVE: We employ the multi-step plan value estimation, which evaluates the expected discounted return after executing a sequence of action plans at a given state, and updates the policy by directly computing the multi-step policy gradient via plan value estimation.

πŸ”Ή Conservative Dual Policy Optimization for Efficient Model-Based Reinforcement Learning πŸ’§

CDPO:

πŸ”Ή Double Check Your State Before Trusting It: Confidence-Aware Bidirectional Offline Model-Based Imagination 😢

CABI: generates reliable samples and can be combined with any model-free offline RL method

πŸ”Ή VARIATIONAL LATENT BRANCHING MODEL FOR OFF-POLICY EVALUATION 😢

VLBM: tries to accurately capture the dynamics underlying the environments from offline training data that provide limited coverage of the state and action space; used for model-based OPE.

πŸ”Ή CONSERVATIVE BAYESIAN MODEL-BASED VALUE EXPANSION FOR OFFLINE POLICY OPTIMIZATION πŸ”₯ πŸŒ‹

CBOP: that trades off model-free and model-based estimates during the policy evaluation step according to their epistemic uncertainties, and facilitates conservatism by taking a lower bound on the Bayesian posterior value estimate

πŸ”Ή LATENT VARIABLE REPRESENTATION FOR REINFORCEMENT LEARNING πŸ’§

LV-Rep

πŸ”Ή PESSIMISTIC MODEL-BASED ACTOR-CRITIC FOR OFFLINE REINFORCEMENT LEARNING: THEORY AND ALGORITHMS πŸ’§

πŸ”Ή MODEM: ACCELERATING VISUAL MODEL-BASED REINFORCEMENT LEARNING WITH DEMONSTRATIONS πŸ”₯

We identify key ingredients for leveraging demonstrations in model learning – policy pretraining, targeted exploration, and oversampling of demonstration data – which forms the three phases of our model-based RL framework.

Training RL & Just Fast & Embedding? & OPE(DICE)

πŸ”Ή Reinforcement Learning: Theory and Algorithms πŸŒ‹ πŸ’¦

πŸ”Ή Leave no Trace: Learning to Reset for Safe and Autonomous Reinforcement Learning 😢 ​

πŸ”Ή Predictive Information Accelerates Learning in RL πŸ‘ ​

We train Soft Actor-Critic (SAC) agents from pixels with an auxiliary task that learns a compressed representation of the predictive information of the RL environment dynamics using a contrastive version of the Conditional Entropy Bottleneck (CEB) objective.

πŸ”Ή Speeding up Reinforcement Learning with Learned Models πŸ’¦ ​

πŸ”Ή DYNAMICS-AWARE EMBEDDINGS πŸ‘ ​

A forward prediction objective for simultaneously learning embeddings of states and action sequences.

πŸ”Ή DIVIDE-AND-CONQUER REINFORCEMENT LEARNING πŸ‘

we develop a novel algorithm that instead partitions the initial state space into β€œslices”, and optimizes an ensemble of policies, each on a different slice.

πŸ”Ή Continual Learning of Control Primitives: Skill Discovery via Reset-Games πŸ‘ πŸ”₯

We do this by exploiting the insight that the need to β€œreset" an agent to a broad set of initial states for a learning task provides a natural setting to learn a diverse set of β€œreset-skills".

πŸ”Ή DIFFERENTIABLE TRUST REGION LAYERS FOR DEEP REINFORCEMENT LEARNING πŸ‘ πŸ’§

We derive trust region projections based on the Kullback-Leibler divergence, the Wasserstein L2 distance, and the Frobenius norm for Gaussian distributions. Related work is good!

πŸ”Ή BENCHMARKS FOR DEEP OFF-POLICY EVALUATION πŸ‘ πŸ”₯ πŸ’§

DOPE is designed to measure the performance of OPE methods by 1) evaluating on challenging control tasks with properties known to be difficult for OPE methods, but which occur in real-world scenarios, 2) evaluating across a range of policies with different values, to directly measure performance on policy evaluation, ranking and selection, and 3) evaluating in ideal and adversarial settings in terms of dataset coverage and support.

πŸ”Ή Universal Off-Policy Evaluation πŸ˜•

We take the first steps towards a universal off-policy estimator (UnO) that estimates and bounds the entire distribution of returns, and then derives estimates and simultaneous bounds for all parameters of interest.

πŸ”Ή Trajectory-Based Off-Policy Deep Reinforcement Learning πŸ‘ πŸ”₯ πŸ’§ ​

Incorporation of previous rollouts via importance sampling greatly improves data-efficiency, whilst stochastic optimization schemes facilitate the escape from local optima.

πŸ”Ή Off-Policy Policy Gradient with State Distribution Correction πŸ’§

πŸ”Ή DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections πŸ”₯ πŸŒ‹ πŸ’₯

Off-Policy Policy Evaluation (OPE) ---> Learning Stationary Distribution Corrections ---> Off-Policy Estimation with Multiple Unknown Behavior Policies. DualDICE estimates the discounted stationary distribution corrections.

πŸ”Ή AlgaeDICE: Policy Gradient from Arbitrary Experience πŸ‘ πŸŒ‹ πŸ’§ ​ ​ ​

We introduce a new formulation of max-return optimization that allows the problem to be re-expressed by an expectation over an arbitrary behavior-agnostic and off-policy data distribution. ALgorithm for policy Gradient from Arbitrary Experience via DICE (AlgaeDICE).

πŸ”Ή GENDICE: GENERALIZED OFFLINE ESTIMATION OF STATIONARY VALUES πŸ‘ πŸ”₯ πŸ”₯ ​​ ​ ​

Our approach is based on estimating a ratio that corrects for the discrepancy between the stationary and empirical distributions, derived from fundamental properties of the stationary distribution, and exploiting constraint reformulations based on variational divergence minimization.

πŸ”Ή GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values πŸ˜• ​

πŸ”Ή Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation πŸ‘ πŸ”₯ πŸ’§ ​

The key idea is to apply importance sampling on the average visitation distribution of single steps of state-action pairs, instead of the much higher dimensional distribution of whole trajectories.
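
As a reminder of the identity this line of work builds on (standard notation, not the paper's exact derivation): the policy value can be importance-weighted with a ratio of state-action visitation distributions instead of a product of per-step policy ratios,

```latex
\frac{1}{1-\gamma}\,\mathbb{E}_{(s,a)\sim d^{\pi}}\big[r(s,a)\big]
\;=\;
\frac{1}{1-\gamma}\,\mathbb{E}_{(s,a)\sim d^{\mu}}\!\left[\frac{d^{\pi}(s,a)}{d^{\mu}(s,a)}\,r(s,a)\right],
```

which avoids the exponential-in-horizon variance of trajectory-wise importance weights.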

πŸ”Ή Off-Policy Evaluation via the Regularized Lagrangian πŸ”₯ πŸ˜• πŸ’§ ​

we unify these estimators (DICE) as regularized Lagrangians of the same linear program.

πŸ”Ή OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation πŸ‘ πŸŒ‹

Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy and does not rely on policy-gradients, unlike previous offline RL algorithms.

πŸ”Ή SMODICE: Versatile Offline Imitation Learning via State Occupancy Matching

πŸ”Ή DEMODICE: OFFLINE IMITATION LEARNING WITH SUPPLEMENTARY IMPERFECT DEMONSTRATIONS πŸŒ‹ πŸ”₯ πŸ‘

An algorithm for offline IL from expert and imperfect demonstrations that achieves state-of-the-art performance on various offline IL tasks.

πŸ”Ή OFF-POLICY CORRECTION FOR ACTOR-CRITIC ALGORITHMS IN DEEP REINFORCEMENT LEARNING 😢

AC-Off-POC: Through a novel discrepancy measure computed by the agent’s most recent action decisions on the states of the randomly sampled batch of transitions, the approach does not require actual or estimated action probabilities for any policy and offers an adequate one-step importance sampling.

πŸ”Ή A Deep Reinforcement Learning Approach to Marginalized Importance Sampling with the Successor Representation πŸ”₯ πŸ”₯

We bridge the gap between MIS and deep RL by observing that the density ratio can be computed from the successor representation of the target policy. The successor representation can be trained through deep RL methodology and decouples the reward optimization from the dynamics of the environment, making the resulting algorithm stable and applicable to high-dimensional domains.

πŸ”Ή Policy-Adaptive Estimator Selection for Off-Policy Evaluation πŸ”₯ πŸ”₯

PAS-IF: synthesizes appropriate subpopulations by minimizing the squared distance between the importance ratio induced by the true evaluation policy and that induced by the pseudo evaluation policy (in OPE), which we call the importance fitting step.

πŸ”Ή How Far I’ll Go: Offline Goal-Conditioned Reinforcement Learning via f-Advantage Regression πŸŒ‹

Goal-conditioned f-Advantage Regression (GoFAR), a novel regression-based offline GCRL algorithm derived from a state-occupancy matching perspective; the key intuition is that the goal-reaching task can be formulated as a state-occupancy matching problem between a dynamics-abiding imitator agent and an expert agent that directly teleports to the goal.

πŸ”Ή Minimax Weight and Q-Function Learning for Off-Policy Evaluation πŸ”₯ πŸ’§ ​ ​

Minimax Weight Learning (MWL); Minimax Q-Function Learning. Doubly Robust Extension and Sample Complexity of MWL & MQL.

πŸ”Ή Minimax Value Interval for Off-Policy Evaluation and Policy Optimization πŸ‘ πŸ”₯ πŸŒ‹ πŸ’§

we derive the minimax value intervals by slightly altering the derivation of two recent methods [1], one of β€œweight-learning” style (Sec. 4.1) and one of β€œvalue-learning” style (Sec. 4.2), and show that under certain conditions, they merge into a single unified value interval whose validity only relies on either Q or W being well-specified (Sec. 4.3).

πŸ”Ή Reinforcement Learning via Fenchel-Rockafellar Duality πŸ”₯ πŸŒ‹ πŸ’§ ​ ​ ​

Policy Evaluation: LP form of Q ---> policy evaluation via the Lagrangian ---> change the problem before applying duality (constant function, f-divergence, Fenchel-Rockafellar duality); Policy Optimization: policy gradient ---> offline policy gradient via the Lagrangian ---> Fenchel-Rockafellar duality for the regularized optimization (regularization with the KL divergence) ---> imitation learning; RL with the LP form of V: max-likelihood policy learning ---> policy evaluation with the V-LP; Undiscounted Settings

πŸ”Ή ADVANTAGE-WEIGHTED REGRESSION: SIMPLE AND SCALABLE OFF-POLICY REINFORCEMENT LEARNING πŸ‘ πŸ”₯ πŸ’₯

Our proposed approach, which we refer to as advantage-weighted regression (AWR), consists of two standard supervised learning steps: one to regress onto target values for a value function, and another to regress onto weighted target actions for the policy. [see MPO]
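
A minimal sketch of the two supervised steps on toy data (linear function approximators and plain least squares; `beta` and the array names are illustrative): fit V to return targets, then regress onto the buffer actions with weights exp(advantage / beta).

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(256, 4))                 # states from the replay buffer
A = rng.normal(size=(256, 1))                 # actions taken
G = rng.normal(size=(256, 1))                 # Monte-Carlo / TD(lambda) return targets
beta = 1.0                                    # temperature

# Step 1: value regression -- fit V(s) to the return targets.
W_v, *_ = np.linalg.lstsq(S, G, rcond=None)
V = S @ W_v

# Step 2: advantage-weighted policy regression -- regress onto the buffer
# actions, weighting each sample by exp(advantage / beta).
adv = (G - V).ravel()
w = np.exp(np.clip(adv / beta, -10, 10))      # clip for numerical stability
sw = np.sqrt(w)[:, None]                      # weighted least squares via sqrt-weights
W_pi, *_ = np.linalg.lstsq(S * sw, A * sw, rcond=None)

print("value weights", W_v.ravel(), "policy weights", W_pi.ravel())
```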

πŸ”Ή Relative Entropy Policy Search πŸ”₯ ​

REPS: it allows an exact policy update and may use data generated while following an unknown policy to generate a new, better policy.

πŸ”Ή Overcoming Exploration in Reinforcement Learning with Demonstrations πŸ”₯

We present a system to utilize demonstrations along with reinforcement learning to solve complicated multi-step tasks. Q-Filter. BC.

πŸ”Ή Fitted Q-iteration by Advantage Weighted Regression πŸ‘ πŸ”₯ πŸ”₯ ​

we show that by using a soft-greedy action selection, the policy improvement step used in FQI can be simplified to an inexpensive advantage-weighted regression. <--- greedy action selection in continuous action spaces.

πŸ”Ή Q-Value Weighted Regression: Reinforcement Learning with Limited Data πŸ”₯ πŸŒ‹

QWR: We replace the value function critic of AWR with a Q-value function. AWR --> QWR.

πŸ”Ή SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning πŸ”₯ πŸ‘

SUNRISE integrates two key ingredients: (a) ensemble-based weighted Bellman backups, which re-weight target Q-values based on uncertainty estimates from a Q-ensemble, and (b) an inference method that selects actions using the highest upper-confidence bounds for efficient exploration. [Rainbow]
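
A minimal sketch of both ingredients with a toy discrete Q-ensemble (the sigmoid-based weight and the `lam` coefficient follow the general recipe described above, but the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N_ENS, N_ACT = 5, 3
q_next = rng.normal(size=(N_ENS, N_ACT))          # ensemble Q-values at the next state
r, gamma, T = 1.0, 0.99, 10.0                      # reward, discount, temperature
lam = 1.0                                          # UCB exploration coefficient

# (a) Weighted Bellman backup: down-weight targets whose next-state Q-value
#     is uncertain (high std across the ensemble).
a_next = q_next.mean(axis=0).argmax()
target_std = q_next[:, a_next].std()
weight = 1.0 / (1.0 + np.exp(target_std * T)) + 0.5   # sigmoid-based weight in (0.5, 1]
target = r + gamma * q_next[:, a_next].mean()
print("backup weight", weight, "weighted target", weight * target)

# (b) UCB exploration: pick the action with the highest mean + lam * std.
mean, std = q_next.mean(axis=0), q_next.std(axis=0)
print("UCB action", (mean + lam * std).argmax())
```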

πŸ”Ή Revisiting Rainbow: Promoting more Insightful and Inclusive Deep Reinforcement Learning Research πŸŒ‹

πŸ”Έ Reducing Variance in Temporal-Difference Value Estimation via Ensemble of Deep Networks πŸ”₯

we propose MeanQ, a simple ensemble method that estimates target values as ensemble means.

πŸ”Ή Explaining Off-Policy Actor-Critic From A Bias-Variance Perspective πŸ˜•

To understand an off-policy actor-critic algorithm, we show the policy evaluation error on the expected distribution of transitions decomposes into the Bellman error, the bias from policy mismatch, and the variance from sampling.

πŸ”Ή SOPE: Spectrum of Off-Policy Estimators πŸ‘ πŸ”₯ πŸŒ‹

Combining Trajectory-Based and Density-Based Importance Sampling: We present a new perspective on off-policy evaluation connecting two popular estimators, PDIS and SIS, and show that PDIS and SIS lie as endpoints on the Spectrum of Off-Policy Estimators, SOPE(n), which interpolates between them.

πŸ”Ή Finite-Sample Analysis of Off-Policy TD-Learning via Generalized Bellman Operators πŸ”₯ πŸŒ‹ πŸ’§

Generalized Bellman Operator: Q^Ο€(Ξ»), Tree-Backup(Ξ») (henceforth denoted by TB(Ξ»)), Retrace(Ξ»), and Q-trace.

πŸ”Ή Efficient Continuous Control with Double Actors and Regularized Critics πŸ”₯

DARC: We show that double actors help relieve overestimation bias in DDPG if built upon single critic, and underestimation bias in TD3 if built upon double critics. (they enhance the exploration ability of the agent.)

πŸ”Ή A Unified Off-Policy Evaluation Approach for General Value Function πŸ’§

GenTD:

Model Selection:

πŸ”Ή Pessimistic Model Selection for Offline Deep Reinforcement Learning πŸ”₯ πŸ’§

We propose a pessimistic model selection (PMS) approach for offline DRL with a theoretical guarantee, which features a provably effective framework for finding the best policy among a set of candidate models.

πŸ”Ή REVISITING BELLMAN ERRORS FOR OFFLINE MODEL SELECTION πŸ”₯

Supervised Bellman Validation (SBV)

πŸ”Ή PARAMETER-BASED VALUE FUNCTIONS πŸ‘ ​

Parameter-Based Value Functions (PBVFs) whose inputs include the policy parameters.

πŸ”Ή Reinforcement Learning without Ground-Truth State πŸ‘

relabeling the original goal with the achieved goal to obtain positive rewards
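
A minimal sketch of this relabeling idea in its HER-style form, under toy assumptions (1-D "achieved" positions and a "future" relabeling strategy; not this paper's exact procedure): replacing the original goal with a later achieved state turns some sparse-reward transitions into positive ones.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_reward(achieved, goal, tol=0.05):
    return 0.0 if np.linalg.norm(achieved - goal) < tol else -1.0

# toy episode: the agent never reaches the original goal at 1.0
episode = [dict(achieved=np.array([0.1 * t]), goal=np.array([1.0])) for t in range(10)]

relabeled = []
for t, tr in enumerate(episode):
    future = episode[rng.integers(t, len(episode))]["achieved"]   # "future" strategy
    relabeled.append(dict(achieved=tr["achieved"], goal=future,
                          reward=sparse_reward(tr["achieved"], future)))

print(sum(tr["reward"] == 0.0 for tr in relabeled), "relabeled transitions now have reward 0")
```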

πŸ”Ή Ecological Reinforcement Learning πŸ‘

πŸ”Ή Control Frequency Adaptation via Action Persistence in Batch Reinforcement Learning πŸ‘ ​

πŸ”Ή Taylor Expansion Policy Optimization πŸ‘ πŸ’₯ πŸŒ‹ πŸ’§ ​

a policy optimization formalism that generalizes prior work (e.g., TRPO) as a first-order special case. We also show that Taylor expansions intimately relate to off-policy evaluation.

πŸ”Ή Policy Information Capacity: Information-Theoretic Measure for Task Complexity in Deep Reinforcement Learning πŸ‘ ​

πŸ”Ή Deep Reinforcement Learning with Robust and Smooth Policy πŸ‘

Motivated by the fact that many environments with continuous state space have smooth transitions, we propose to learn a smooth policy that behaves smoothly with respect to states. We develop a new framework β€” Smooth Regularized Reinforcement Learning (SR2L), where the policy is trained with smoothness-inducing regularization.

πŸ”Ή If MaxEnt RL is the Answer, What is the Question? πŸ‘ πŸ”₯ πŸŒ‹

πŸ”Ή Maximum Entropy RL (Provably) Solves Some Robust RL Problems πŸ”₯ πŸŒ‹

Our main contribution is a set of proofs showing that standard MaxEnt RL optimizes lower bounds on several possible robust objectives, reflecting a degree of robustness to changes in the dynamics and to certain changes in the reward.

πŸ”Ή Your Policy Regularizer is Secretly an Adversary πŸ’¦

πŸ”Ή Estimating Q(s, s') with Deep Deterministic Dynamics Gradients πŸ‘ πŸ”₯

We highlight the benefits of this approach in terms of value function transfer, learning within redundant action spaces, and learning off-policy from state observations generated by sub-optimal or completely random policies.

πŸ”Ή RANDOMIZED ENSEMBLED DOUBLE Q-LEARNING: LEARNING FAST WITHOUT A MODEL πŸ”₯

REDQ: (i) an Update-To-Data (UTD) ratio >> 1; (ii) an ensemble of Q functions; (iii) in-target minimization across a random subset of Q functions from the ensemble.
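
A minimal sketch of the REDQ target under toy assumptions (a random stand-in for the N critics; `G` is the update-to-data ratio): for each of the G updates per environment step, the Bellman target takes a min over a freshly sampled subset of M critics.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, G = 10, 2, 20                      # ensemble size, in-target subset size, UTD ratio
gamma = 0.99

def q_ensemble(s, a):                    # stand-in for N critic networks
    return rng.normal(loc=a, scale=0.1, size=N)

for update in range(G):                  # G gradient updates per environment step
    r, s_next, a_next = 1.0, 0.0, 0.5    # a (fake) sampled transition plus a policy action
    idx = rng.choice(N, size=M, replace=False)
    target = r + gamma * q_ensemble(s_next, a_next)[idx].min()
    # ... each of the N critics would regress toward `target` here ...
print("last target", target)
```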

πŸ”Ή DROPOUT Q-FUNCTIONS FOR DOUBLY EFFICIENT REINFORCEMENT LEARNING 😢

To make REDQ more computationally efficient, we propose a method of improving computational efficiency called Dr.Q, which is a variant of REDQ that uses a small ensemble of dropout Q-functions.

πŸ”Ή Disentangling Dynamics and Returns: Value Function Decomposition with Future Prediction πŸ‘ ​

we propose a two-step understanding of value estimation from the perspective of future prediction, through decomposing the value function into a reward-independent future dynamics part and a policy-independent trajectory return part.

πŸ”Ή DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction

πŸ”Ή Regret Minimization Experience Replay in Off-Policy Reinforcement Learning πŸ”₯ πŸŒ‹

ReMERN and ReMERT: We start from the regret minimization objective, and obtain an optimal prioritization strategy for Bellman update that can directly maximize the return of the policy. The theory suggests that data with higher hindsight TD error, better on-policiness and more accurate Q value should be assigned with higher weights during sampling.

πŸ”Ή Off-Policy Policy Gradient Algorithms by Constraining the State Distribution Shift πŸ‘

Existing off-policy gradient based methods do not correct for the state distribution mismatch. In this work we show that, instead of computing the ratio over state distributions, we can minimize the KL between the target and behaviour state distributions to account for the state distribution shift in off-policy learning.

πŸ”Ή Fast Efficient Hyperparameter Tuning for Policy Gradient Methods 😢

Hyperparameter Optimisation on the Fly (HOOF): The main idea is to use existing trajectories sampled by the policy gradient method to optimise a one-step improvement objective, yielding a sample- and computationally-efficient algorithm that is easy to implement.

πŸ”Ή REWARD SHIFTING FOR OPTIMISTIC EXPLORATION AND CONSERVATIVE EXPLOITATION 😢

We bring the key insight that a positive reward shifting leads to conservative exploitation, while a negative reward shifting leads to curiosity-driven exploration.
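
A quick way to see why (standard discounted-return algebra, not the paper's full argument): shifting every reward by a constant b shifts all Q-values by b/(1-Ξ³), so with near-zero critic initialization a negative b makes the initialization relatively optimistic (exploration) while a positive b makes it relatively pessimistic (conservative exploitation).

```latex
Q^{\pi}_{r+b}(s,a) \;=\; \mathbb{E}\Big[\sum_{t\ge 0}\gamma^{t}\,\big(r_t+b\big)\Big]
\;=\; Q^{\pi}_{r}(s,a) \;+\; \frac{b}{1-\gamma}.
```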

πŸ”Ή Exploiting Reward Shifting in Value-Based Deep RL

πŸ”Ή Heuristic-Guided Reinforcement Learning πŸ”₯ πŸŒ‹

HuRL: We show how heuristic-guided RL induces a much shorter-horizon subproblem that provably solves the original task. Our framework can be viewed as a horizon-based regularization for controlling bias and variance in RL under a finite interaction budget.

πŸ”Ή Using a Logarithmic Mapping to Enable Lower Discount Factors in Reinforcement Learning πŸ”₯ πŸ’§

Our results provide strong evidence for our hypothesis that large differences in action-gap sizes are detrimental to the performance of approximate RL.

πŸ”Ή ORCHESTRATED VALUE MAPPING FOR REINFORCEMENT LEARNING πŸ”₯

We present a general convergent class of reinforcement learning algorithms that is founded on two distinct principles: (1) mapping value estimates to a different space using arbitrary functions from a broad class, and (2) linearly decomposing the reward signal into multiple channels.

πŸ”Ή Discount Factor as a Regularizer in Reinforcement Learning πŸ”₯ πŸ‘

We show an explicit equivalence between using a reduced discount factor and adding an explicit regularization term to the algorithm’s loss.

πŸ”Ή Learning to Score Behaviors for Guided Policy Optimization πŸ”₯ πŸ’§

We show that by utilizing the dual formulation of the WD, we can learn score functions over policy behaviors that can in turn be used to lead policy optimization towards (or away from) (un)desired behaviors.

πŸ”Ή Dual Policy Distillation πŸŒ‹

DPD: a student-student framework in which two learners operate on the same environment to explore different perspectives of the environment and extract knowledge from each other to enhance their learning.

πŸ”Ή Jump-Start Reinforcement Learning πŸŒ‹

JSRL: an algorithm that employs two policies to solve tasks: a guide-policy, and an exploration-policy. By using the guide-policy to form a curriculum of starting states for the exploration-policy, we are able to efficiently improve performance on a set of simulated robotic tasks.

πŸ”Ή Distilling Policy Distillation πŸŒ‹ πŸ‘

We sought to highlight some of the strengths, weaknesses, and potential mathematical inconsistencies in different variants of distillation used for policy knowledge transfer in reinforcement learning.

πŸ”Ή Regularized Policies are Reward Robust πŸ”₯ πŸ’§ ​

we find that the optimal policy found by a regularized objective is precisely an optimal policy of a reinforcement learning problem under a worst-case adversarial reward.

πŸ”Ή Reinforcement Learning as One Big Sequence Modeling Problem πŸ‘ πŸ”₯ πŸ’§ ​ ​ ​

Addressing RL as a sequence modeling problem significantly simplifies a range of design decisions: we no longer require separate behavior policy constraints, as is common in prior work on offline model-free RL, and we no longer require ensembles or other epistemic uncertainty estimators, as is common in prior work on model-based RL.

πŸ”Ή Decision Transformer: Reinforcement Learning via Sequence Modeling πŸ”₯ πŸŒ‹

πŸ”Ή How Crucial is Transformer in Decision Transformer? 😢

These results suggest that the strength of the Decision Transformer for continuous control tasks may lie in the overall sequential modeling architecture and not in the Transformer per se.

πŸ”Ή Prompting Decision Transformer for Few-Shot Policy Generalization πŸ”₯ πŸŒ‹

We propose a Prompt-based Decision Transformer (Prompt-DT), which leverages the sequential modeling ability of the Transformer architecture and the prompt framework to achieve few-shot adaptation in offline RL.

πŸ”Ή Bootstrapped Transformer for Offline Reinforcement Learning 😢

Bootstrapped Transformer, which incorporates the idea of bootstrapping and leverages the learned model to self-generate more offline data to further boost the sequence model training. CABI (Double Check Your State Before Trusting It)

πŸ”Ή On-Policy Deep Reinforcement Learning for the Average-Reward Criterion πŸ‘ πŸ”₯

By addressing the average-reward criterion directly, we then derive a novel bound which depends on the average divergence between the two policies and Kemeny’s constant.

πŸ”Ή Average-Reward Reinforcement Learning with Trust Region Methods πŸ‘ πŸ”₯

Firstly, we develop a unified trust region theory with discounted and average criteria. With the average criterion, a novel performance bound within the trust region is derived with the Perturbation Analysis (PA) theory. Secondly, we propose a practical algorithm named Average Policy Optimization (APO), which improves the value estimation with a novel technique named Average Value Constraint.

πŸ”Ή Trust Region Policy Optimization πŸ‘ πŸ”₯ πŸ’₯ ​ ​

πŸ”Ή Benchmarking Deep Reinforcement Learning for Continuous Control πŸ‘ πŸ”₯ ​ ​

πŸ”Ή P3O: Policy-on Policy-off Policy Optimization πŸ”₯

This paper develops a simple algorithm named P3O that interleaves off-policy updates with on-policy updates.

πŸ”Ή Policy Gradients Incorporating the Future πŸ”₯

we consider the problem of incorporating information from the entire trajectory in model-free online and offline RL algorithms, enabling an agent to use information about the future to accelerate and improve its learning.

πŸ”Ή Generalizable Episodic Memory for Deep Reinforcement Learning πŸ‘ πŸ”₯

GEM: we propose Generalizable Episodic Memory, which effectively organizes the state-action values of episodic memory in a generalizable manner and supports implicit planning on memorized trajectories.

πŸ”Ή Generalized Proximal Policy Optimization with Sample Reuse πŸ‘ πŸ”₯ πŸŒ‹

GePPO: We combine the theoretically supported stability benefits of on-policy algorithms with the sample efficiency of off-policy algorithms. We develop policy improvement guarantees that are suitable for the off-policy setting, and connect these bounds to the clipping mechanism used in Proximal Policy Optimization.

πŸ”Ή Zeroth-Order Supervised Policy Improvement πŸ”₯ πŸ”₯ ​

The policy learning of ZOSPI has two steps: 1), it samples actions and evaluates those actions with a learned value estimator, and 2) it learns to perform the action with the highest value through supervised learning.
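
A minimal sketch of the two ZOSPI steps under toy assumptions (1-D actions, a hand-written quadratic `q_value` standing in for the learned critic, a linear policy): sample candidate actions, pick the highest-valued one per state, and fit the policy to those actions with supervised regression.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_value(s, a):                       # stand-in for the learned Q-function
    return -(a - 0.3 * s) ** 2

s = rng.normal(size=(128, 1))            # a batch of states
pi_w = np.zeros((1, 1))                  # linear deterministic policy a = s @ pi_w

# 1) sample candidate actions and evaluate them with the critic
cands = rng.uniform(-1, 1, size=(128, 16))
best = cands[np.arange(128), q_value(s, cands).argmax(axis=1)][:, None]

# 2) supervised learning: regress the policy toward the best sampled actions
for _ in range(200):
    grad = 2 * s.T @ (s @ pi_w - best) / len(s)
    pi_w -= 0.1 * grad
print("learned slope", pi_w.ravel())     # should end up roughly near 0.3
```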

πŸ”Ή SAMPLE EFFICIENT ACTOR-CRITIC WITH EXPERIENCE REPLAY πŸ‘ πŸ”₯

including truncated importance sampling with bias correction, stochastic dueling network architectures, and a new trust region policy optimization.

πŸ”Ή Safe and efficient off-policy reinforcement learning πŸ”₯ πŸ’§ ​ ​

Retrace(Ξ»): low variance, safe, and efficient.

πŸ”Ή Relative Entropy Regularized Policy Iteration πŸ”₯ πŸ”₯

The algorithm alternates between Q-value estimation, local policy improvement and parametric policy fitting; hard constraints control the rate of change of the policy. A decoupled update for the mean and covariance of a Gaussian policy avoids premature convergence. [see MPO]

πŸ”Ή Q-Learning for Continuous Actions with Cross-Entropy Guided Policies πŸ‘

Our approach trains the Q-function using iterative sampling with the Cross-Entropy Method (CEM), while training a policy network to imitate CEM’s sampling behavior.

πŸ”Ή SUPERVISED POLICY UPDATE FOR DEEP REINFORCEMENT LEARNING πŸ‘ πŸ”₯ πŸ”₯ ​ ​ ​

FORWARD AGGREGATE AND DISAGGREGATE KL CONSTRAINTS; BACKWARD KL CONSTRAINT; L CONSTRAINT;

πŸ”Ή Maximizing Ensemble Diversity in Deep Q-Learning 😢

Reducing overestimation bias by increasing representation dissimilarity in ensemble based deep q-learning.

πŸ”Ή Value-driven Hindsight Modelling πŸ˜•

we propose to learn what to model in a way that can directly help value prediction.

πŸ”Ή Dual Policy Iteration πŸ‘ πŸ”₯

DPI: We present and analyze Dual Policy Iterationβ€”a framework that alternatively computes a non-reactive policy via more advanced and systematic search, and updates a reactive policy via imitating the non-reactive one. [MPO, AWR]

πŸ”Ή Regret Minimization for Partially Observable Deep Reinforcement Learning πŸ˜• ​

πŸ”Ή THE IMPORTANCE OF PESSIMISM IN FIXED-DATASET POLICY OPTIMIZATION πŸŒ‹ πŸ’₯ πŸ‘» πŸ’¦ ​

Algorithms can follow the pessimism principle, which states that we should choose the policy which acts optimally in the worst possible world. We show why pessimistic algorithms can achieve good performance even when the dataset is not informative of every policy, and derive families of algorithms which follow this principle.

πŸ”Ή Bridging the Gap Between Value and Policy Based Reinforcement Learning πŸ”₯ πŸ’₯ πŸŒ‹ ​

we develop a new RL algorithm, Path Consistency Learning (PCL), that minimizes a notion of soft consistency error along multi-step action sequences extracted from both on- and off-policy traces.
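
A sketch of the multi-step soft consistency that PCL penalizes (notation adapted; ΞΈ, Ο†, Ο„ are the policy parameters, value parameters, and entropy temperature):

```latex
C(s_{t:t+d};\theta,\phi) \;=\; -\,V_\phi(s_t) \;+\; \gamma^{d} V_\phi(s_{t+d})
\;+\; \sum_{i=0}^{d-1} \gamma^{i}\Big( r_{t+i} - \tau \log \pi_\theta(a_{t+i}\mid s_{t+i}) \Big),
```

and PCL minimizes the squared consistency error over sub-trajectories drawn from both on-policy rollouts and the replay buffer.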

πŸ”Ή Equivalence Between Policy Gradients and Soft Q-Learning πŸ‘ πŸ’§ ​

The soft Q-learning loss gradient can be interpreted as a policy gradient term plus a baseline-error-gradient term, corresponding to policy gradient instantiations such as A3C.

πŸ”Ή An operator view of policy gradient methods πŸ”₯

We use this framework to introduce operator-based versions of well-known policy gradient methods.

πŸ”Ή MAXIMUM REWARD FORMULATION IN REINFORCEMENT LEARNING πŸ’§

We formulate an objective function to maximize the expected maximum reward along a trajectory, derive a novel functional form of the Bellman equation, introduce the corresponding Bellman operators, and provide a proof of convergence.

πŸ”Ή Why Should I Trust You, Bellman? The Bellman Error is a Poor Replacement for Value Error πŸ”₯

The magnitude of the Bellman error is smaller for biased value functions due to cancellations caused from both sides of the Bellman equation. The relationship between Bellman error and value error is broken if the dataset is missing relevant transitions.

πŸ”Ή CONVERGENT AND EFFICIENT DEEP Q NETWORK ALGORITHM πŸ˜•

We show that DQN can indeed diverge and cease to operate in realistic settings. we propose a convergent DQN (C-DQN) that is guaranteed to converge.

πŸ”Ή LEARNING SYNTHETIC ENVIRONMENTS AND REWARD NETWORKS FOR REINFORCEMENT LEARNING πŸ‘

We use bi-level optimization to evolve SEs and RNs: the inner loop trains the RL agent, and the outer loop trains the parameters of the SE / RN via an evolution strategy.

πŸ”Ή IS HIGH VARIANCE UNAVOIDABLE IN RL? A CASE STUDY IN CONTINUOUS CONTROL

πŸ”Ή Reinforcement Learning with a Terminator πŸ”₯ πŸ”₯

We define the Termination Markov Decision Process (TerMDP), an extension of the MDP framework, in which episodes may be interrupted by an external non-Markovian observer.

πŸ”Ή Truly Deterministic Policy Optimization πŸ˜• πŸ‘

We proposed a deterministic policy gradient method (TDPO) based on the use of a deterministic Vine (DeVine) gradient estimator and the Wasserstein metric. We proved monotonic payoff guarantees for our method, and defined a novel surrogate for policy optimization.

πŸ”Ή Automated Reinforcement Learning (AutoRL): A Survey and Open Problems πŸ’¦

πŸ”Ή CGAR: Critic Guided Action Redistribution in Reinforcement Leaning 😢

the Q value predicted by the critic is a better signal to redistribute the action originally sampled from the policy distribution predicted by the actor.

πŸ”Ή Value Function Decomposition for Iterative Design of Reinforcement Learning Agents 😢

SAC-D: a variant of soft actor-critic (SAC) adapted for value decomposition. We also introduce decomposition-based tools that exploit this information, including a new reward influence metric, which measures each reward component’s effect on agent decision-making.

πŸ”Ή Emphatic Algorithms for Deep Reinforcement Learning πŸ”₯

πŸ”Ή Off-Policy Evaluation for Large Action Spaces via Embeddings

we propose a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space. [poster]

πŸ”Ή PerSim: Data-efficient Offline Reinforcement Learning with Heterogeneous Agents via Personalized Simulators πŸ”₯

a model-based offline RL approach which first learns a personalized simulator for each agent by collectively using the historical trajectories across all agents, prior to learning a policy.

πŸ”Ή Gradient Temporal-Difference Learning with Regularized Corrections πŸ”₯

πŸ”Ή The Primacy Bias in Deep Reinforcement Learning πŸ”₯

"Your assumptions are your windows on the world. Scrub them off every once in a while, or the light won’t come in." [poster]

πŸ”Ή Memory-Constrained Policy Optimization 😢

In addition to using the proximity of one single old policy as the first trust region as done by prior works, we propose to form a second trust region through the construction of another virtual policy that represents a wide range of past policies.

πŸ”Ή A Temporal-Difference Approach to Policy Gradient Estimation πŸ”₯ πŸŒ‹

TDRC: By using temporal-difference updates of the gradient critic from an off-policy data stream, we develop the first estimator that side-steps the distribution shift issue in a model-free way. [poster]

πŸ”Ή gamma-models: Generative Temporal Difference Learning for Infinite-Horizon Prediction πŸ”₯ πŸŒ‹ πŸ’₯

Our goal is to make long-horizon predictions without the need to repeatedly apply a single-step model.

πŸ”Ή Generalised Policy Improvement with Geometric Policy Composition πŸŒ‹ πŸ’§

GGPI:

πŸ”Ή Taylor Expansions of Discount Factors πŸŒ‹ πŸ’§

We study the effect that this discrepancy of discount factors has during learning, and discover a family of objectives that interpolate value functions of two distinct discount factors.

πŸ”Ή Learning Retrospective Knowledge with Reverse Reinforcement Learning

Since such questions (how much fuel do we expect a car to have given it is at B at time t?) emphasize the influence of possible past events on the present, we refer to their answers as retrospective knowledge. We show how to represent retrospective knowledge with Reverse GVFs, which are trained via Reverse RL. [see GenTD]

πŸ”Ή A Generalized Bootstrap Target for Value-Learning, Efficiently Combining Value and Feature Predictions πŸ”₯

We focus on bootstrapping targets used when estimating value functions, and propose a new backup target, the lambda-return mixture, which implicitly combines value-predictive knowledge (used by TD methods) with (successor) feature-predictive knowledgeβ€”with a parameter lambda capturing how much to rely on each.

πŸ”Ή A Deeper Look at Discounting Mismatch in Actor-Critic Algorithms πŸ”₯

We then propose to interpret the discounting in the critic in terms of a bias-variance-representation trade-off and provide supporting empirical results. In the second scenario, we consider optimizing a discounted objective (gamma < 1) and propose to interpret the omission of the discounting in the actor update from an auxiliary task perspective and provide supporting empirical results.

πŸ”Ή An Analytical Update Rule for General Policy Optimization πŸ‘ πŸ”₯ πŸ”₯

The contributions of this paper include: (1) a new theoretical result that tightens existing bounds for local policy search using trust-region methods; (2) a closed-form update rule for general stochastic policies with monotonic improvement guarantee; [poster]

πŸ”Ή Deep Reinforcement Learning at the Edge of the Statistical Precipice πŸ”₯ πŸŒ‹

With the aim of increasing the field’s confidence in reported results with a handful of runs, we advocate for reporting interval estimates of aggregate performance and propose performance profiles to account for the variability in results, as well as present more robust and efficient aggregate metrics, such as interquartile mean scores, to achieve small uncertainty in results. [rliable]

πŸ”Ή Safe Policy Improvement Approaches and their Limitations

SPIBB

πŸ”Ή BSAC: Bayesian Strategy Network Based Soft Actor-Critic in Deep Reinforcement Learning 😢

The BSAC model organizes several sub-policies as a joint policy.

πŸ”Ή Collect & Infer - a fresh look at data-efficient Reinforcement Learning πŸ‘

Collect and Infer, which explicitly models RL as two separate but interconnected processes, concerned with data collection and knowledge inference respectively.

πŸ”Ή A DATASET PERSPECTIVE ON OFFLINE REINFORCEMENT LEARNING πŸ‘ πŸ”₯

we define characteristics of behavioral policies as exploratory for yielding high expected information in their interaction with the Markov Decision Process (MDP) and as exploitative for having high expected return. Understanding the Effects of Dataset Characteristics on Offline Reinforcement Learning

πŸ”Ή Distributional Actor-Critic Ensemble for Uncertainty-Aware Continuous Control πŸ”₯

UA-DDPG: It exploits epistemic uncertainty to accelerate exploration and aleatoric uncertainty to learn a risk-sensitive policy (also known as risk-averse RL, safe RL, and conservative RL).

πŸ”Ή On the Reuse Bias in Off-Policy Reinforcement Learning πŸ”₯ πŸŒ‹

BIRIS: We further provide a high-probability upper bound of the Reuse Bias, and show that controlling one term of the upper bound can control the Reuse Bias by introducing the concept of stability for off-policy algorithms

πŸ”Ή Low-Rank Modular Reinforcement Learning via Muscle Synergy πŸ”₯

SOLAR: exploits the redundant nature of DoF in robot control. Actuators are grouped into synergies by an unsupervised learning method, and a synergy action is learned to control multiple actuators in synchrony. In this way, we achieve a low-rank control at the synergy level.

πŸ”Ή INAPPLICABLE ACTIONS LEARNING FOR KNOWLEDGE TRANSFER IN REINFORCEMENT LEARNING πŸ”₯

SDAS-MDP: Knowing this information (inapplicable actions) can help reduce the sample complexity of RL algorithms by masking the inapplicable actions from the policy distribution to only explore actions relevant to finding an optimal policy.

πŸ”Ή Rethinking Value Function Learning for Generalization in Reinforcement Learning 😢

Dynamics-aware Delayed-Critic Policy Gradient (DDCPG): a policy gradient algorithm that implicitly penalizes value estimates by optimizing the value network less frequently with more training data than the policy network.

πŸ”Ή VARIATIONAL LATENT BRANCHING MODEL FOR OFF-POLICY EVALUATION πŸ‘ πŸ”₯

VLBM leverages and extends the variational inference framework with the recurrent state alignment (RSA), which is designed to capture as much information underlying the limited training data, by smoothing out the information flow between the variational (encoding) and generative (decoding) part of VLBM. Moreover, we also introduce the branching architecture to improve the model’s robustness against randomly initialized model weights.

MARL

πŸ”Ή Counterfactual Multi-Agent Policy Gradients πŸ”₯

COMA: to address the challenges of multi-agent credit assignment, it uses a counterfactual baseline that marginalises out a single agent’s action, while keeping the other agents’ actions fixed.
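
A minimal sketch (assumed tensor shapes, not the authors' code) of COMA's counterfactual advantage: the centralised critic is evaluated for every action of agent i with the other agents' actions fixed, and the policy-weighted average serves as the baseline.

```python
import torch

def coma_advantage(q_values, pi_i, action_i):
    """
    q_values: (batch, n_actions_i) critic outputs Q(s, (u^{-i}, a)) for each action a of agent i
    pi_i:     (batch, n_actions_i) agent i's policy probabilities
    action_i: (batch,) integer action agent i actually took
    """
    baseline = (pi_i * q_values).sum(dim=-1)                        # E_{a ~ pi_i} Q(s, (u^{-i}, a))
    q_taken = q_values.gather(-1, action_i.unsqueeze(-1)).squeeze(-1)
    return q_taken - baseline                                       # counterfactual advantage
```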

πŸ”Ή Value-Decomposition Networks For Cooperative Multi-Agent Learning πŸ‘

VDN: aims to learn an optimal linear value decomposition from the team reward signal, by back-propagating the total Q gradient through deep neural networks representing the individual component value functions.
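
A minimal sketch of the VDN decomposition: the joint value is the sum of per-agent utilities, so the TD gradient on Q_tot flows back through every individual Q_i.

```python
import torch

def vdn_q_tot(agent_qs):              # agent_qs: (batch, n_agents) chosen-action values Q_i
    return agent_qs.sum(dim=-1)       # Q_tot = sum_i Q_i
```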

πŸ”Ή QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning πŸ”₯ πŸŒ‹

QMIX employs a network that estimates joint action-values as a complex non-linear combination of per-agent values that condition only on local observations.
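
A minimal sketch (not the authors' implementation) of QMIX's monotonic mixing: per-agent values are combined by a mixing network whose weights are produced by hypernetworks conditioned on the global state and forced non-negative, which guarantees dQ_tot/dQ_i >= 0.

```python
import torch
import torch.nn as nn

class QMixer(nn.Module):
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(b, 1)     # Q_tot(s, u)

if __name__ == "__main__":
    mixer = QMixer(n_agents=3, state_dim=10)
    print(mixer(torch.randn(4, 3), torch.randn(4, 10)).shape)  # torch.Size([4, 1])
```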

πŸ”Ή Best Possible Q-Learning πŸŒ‹

BQL: Best Possible Operator

πŸ”Ή TRUST REGION POLICY OPTIMISATION IN MULTI-AGENT REINFORCEMENT LEARNING πŸŒ‹

We extend the theory of trust region learning to MARL. Central to our findings are the multi-agent advantage decomposition lemma and the sequential policy update scheme. Based on these, we develop Heterogeneous-Agent Trust Region Policy Optimisation (HATRPO) and Heterogeneous-Agent Proximal Policy Optimisation (HAPPO) algorithms.

πŸ”Ή Beyond Rewards: a Hierarchical Perspective on Offline Multiagent Behavioral Analysis πŸ‘ πŸ”₯

MOHBA: We introduce a model-agnostic method for discovery of behavior clusters in multiagent domains, using variational inference to learn a hierarchy of behaviors at the joint and local agent levels.

πŸ”Ή Discovered Policy Optimisation πŸ”₯

we explore the Mirror Learning space by meta-learning a β€œdrift” function. We refer to the result as Learnt Policy Optimisation (LPO). By analysing LPO we gain original insights into policy optimisation which we use to formulate a novel, closed-form RL algorithm, Discovered Policy Optimisation (DPO).

Constrained RL

πŸ”Ή A Unified View of Entropy-Regularized Markov Decision Processes πŸ”₯ πŸ’§ ​

using the conditional entropy of the joint state-action distributions as regularization yields a dual optimization problem closely resembling the Bellman optimality equations

πŸ”Ή A Theory of Regularized Markov Decision Processes πŸ‘πŸŒ‹ πŸ’₯ πŸ’§

We have introduced a general theory of regularized MDPs, where the usual Bellman evaluation operator is modified by either a fixed convex function or a Bregman divergence between consecutive policies. We have shown how many (variations of) existing algorithms could be derived from this general algorithmic scheme, and also analyzed and discussed the related propagation of errors.

πŸ”Ή Mirror Learning: A Unifying Framework of Policy Optimisation πŸ‘ 😢

we introduce a novel theoretical framework, named Mirror Learning, which provides theoretical guarantees to a large class of algorithms, including TRPO and PPO.

πŸ”Ή Munchausen Reinforcement Learning πŸ‘ ​ πŸ”₯ πŸŒ‹ πŸ’§

Yet, another estimate could be leveraged to bootstrap RL: the current policy. Our core contribution stands in a very simple idea: adding the scaled log-policy to the immediate reward.
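
A minimal sketch (numpy, assumed shapes and hyperparameters) of the Munchausen-DQN target: the only change to a soft, entropy-regularized DQN target is adding the scaled, clipped log-policy of the taken action to the immediate reward.

```python
import numpy as np

def munchausen_target(q_next, q_curr, a, r, done,
                      gamma=0.99, tau=0.03, alpha=0.9, l0=-1.0):
    """q_next: (B, A) target-net Q(s', .); q_curr: (B, A) target-net Q(s, .); a: (B,) int actions."""
    def log_softmax(q):                              # log pi, with pi = softmax(q / tau)
        z = q / tau - np.max(q / tau, axis=1, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

    log_pi_next = log_softmax(q_next)
    pi_next = np.exp(log_pi_next)
    soft_v = (pi_next * (q_next - tau * log_pi_next)).sum(axis=1)       # soft value of s'
    log_pi_taken = log_softmax(q_curr)[np.arange(len(a)), a]
    munchausen_bonus = alpha * np.clip(tau * log_pi_taken, l0, 0.0)     # scaled log-policy
    return r + munchausen_bonus + gamma * (1.0 - done) * soft_v
```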

πŸ”Ή Leverage the Average: an Analysis of KL Regularization in Reinforcement Learning πŸ‘ πŸ”₯ πŸŒ‹ πŸ’₯ πŸ’₯ πŸ’§ ​

Convex conjugacy for KL and entropy regularization; 1) Mirror Descent MPI: SAC, Soft Q-learning, Softmax DQN, mellowmax policy, TRPO, MPO, DPP, CVI πŸ’§; 2) Dual Averaging MPI πŸ’§

πŸ”Ή Proximal Iteration for Deep Reinforcement Learning πŸ”₯

Our contribution is to employ Proximal Iteration for optimization in deep RL.

πŸ”Ή Theoretical Analysis of Efficiency and Robustness of Softmax and Gap-Increasing Operators in Reinforcement Learning πŸ‘ πŸ’§

We propose and analyze conservative value iteration (CVI), which unifies value iteration, soft value iteration, advantage learning, and dynamic policy programming.

πŸ”Ή Momentum in Reinforcement Learning πŸ‘ πŸ”₯

We derive Momentum Value Iteration (MoVI), a variation of Value iteration that incorporates this momentum idea. Our analysis shows that this allows MoVI to average errors over successive iterations.

πŸ”Ή Geometric Value Iteration: Dynamic Error-Aware KL Regularization for Reinforcement Learning πŸ‘ πŸ”₯ πŸŒ‹

we propose a novel algorithm, Geometric Value Iteration (GVI), that features a dynamic error-aware KL coefficient design with the aim of mitigating the impact of errors on performance. Our experiments demonstrate that GVI can effectively exploit the trade-off between learning speed and robustness over uniform averaging of a constant KL coefficient.

πŸ”Ή Near Optimal Policy Optimization via REPS πŸŒ‹ πŸ’§

Relative entropy policy search (REPS)

πŸ”Ή On Pathologies in KL-Regularized Reinforcement Learning from Expert Demonstrations πŸ”₯ πŸ’§

we show that KL-regularized reinforcement learning with behavioral reference policies derived from expert demonstrations can suffer from pathological training dynamics that can lead to slow, unstable, and suboptimal online learning.

πŸ”Ή ON COVARIATE SHIFT OF LATENT CONFOUNDERS IN IMITATION AND REINFORCEMENT LEARNING πŸŒ‹ πŸ’§

We consider the problem of using expert data with unobserved confounders for imitation and reinforcement learning.

πŸ”Ή Constrained Policy Optimization πŸ‘ πŸ”₯ πŸ”₯ πŸŒ‹ ​

We propose Constrained Policy Optimization (CPO), the first general-purpose policy search algorithm for constrained reinforcement learning with guarantees for near-constraint satisfaction at each iteration.

πŸ”Ή Reward Constrained Policy Optimization πŸ‘ πŸ”₯ ​ ​

we present a novel multi-timescale approach for constrained policy optimization, called β€˜Reward Constrained Policy Optimization’ (RCPO), which uses an alternative penalty signal to guide the policy towards a constraint satisfying one.
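
A minimal sketch (illustrative names) of RCPO's multi-timescale idea: the policy is trained on a penalized reward r - lambda * c, while the Lagrange multiplier is updated on a slower timescale from the observed constraint violation.

```python
def rcpo_penalized_reward(r, c, lam):
    """Reward signal seen by the policy-gradient step."""
    return r - lam * c

def update_multiplier(lam, avg_cost, cost_limit, lr_lambda=1e-3):
    """Projected (non-negative) gradient ascent on the dual variable."""
    return max(0.0, lam + lr_lambda * (avg_cost - cost_limit))
```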

πŸ”Ή PROJECTION-BASED CONSTRAINED POLICY OPTIMIZATION πŸ‘ πŸ”₯ ​ ​

the first step performs a local reward improvement update, while the second step reconciles any constraint violation by projecting the policy back onto the constraint set.

πŸ”Ή First Order Constrained Optimization in Policy Space πŸ”₯ πŸ‘

Using data generated from the current policy, FOCOPS first finds the optimal update policy by solving a constrained optimization problem in the nonparameterized policy space. FOCOPS then projects the update policy back into the parametric policy space.

πŸ”Ή Reinforcement Learning with Convex Constraints πŸ”₯ πŸŒ‹ πŸ’§

we propose an algorithmic scheme that can handle a wide class of constraints in RL tasks, specifically, any constraints that require expected values of some vector measurements (such as the use of an action) to lie in a convex set.

πŸ”Ή Batch Policy Learning under Constraints πŸŒ‹ πŸ’§ ​ ​

propose a flexible meta-algorithm that admits any batch reinforcement learning and online learning procedure as subroutines.

πŸ”Ή A Primal-Dual Approach to Constrained Markov Decision Processes πŸŒ‹ πŸ’§

πŸ”Ή Reward is enough for convex MDPs πŸ”₯ πŸŒ‹ πŸ’₯ πŸ’§ ​

It is easy to see that Convex MDPs in which goals are expressed as convex functions of stationary distributions cannot, in general, be formulated in this manner (maximising a cumulative reward).

πŸ”Ή Challenging Common Assumptions in Convex Reinforcement Learning πŸ‘ πŸ”₯

We show that erroneously optimizing the infinite trials objective in place of the actual finite trials one, as it is usually done, can lead to a significant approximation error.

πŸ”Ή DENSITY CONSTRAINED REINFORCEMENT LEARNING πŸ‘ πŸŒ‹ ​

We prove the duality between the density function and Q function in CRL and use it to develop an effective primal-dual algorithm to solve density constrained reinforcement learning problems.

πŸ”Ή Control Regularization for Reduced Variance Reinforcement Learning πŸ”₯ πŸŒ‹

CORERL: we regularize the behavior of the deep policy to be similar to a policy prior, i.e., we regularize in function space. We show that functional reg. yields a bias-variance trade-off, and propose an adaptive tuning strategy to optimize this trade-off.
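
A minimal sketch (hedged, with illustrative names) of functional regularization toward a control prior: the executed action is a convex combination of the learned policy's action and the prior controller's action, with lambda trading variance reduction against bias.

```python
import numpy as np

def regularized_action(u_learned, u_prior, lam):
    """Larger lam pulls behavior toward the prior (lower variance, more bias)."""
    return (np.asarray(u_learned) + lam * np.asarray(u_prior)) / (1.0 + lam)
```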

πŸ”Ή REGULARIZATION MATTERS IN POLICY OPTIMIZATION - AN EMPIRICAL STUDY ON CONTINUOUS CONTROL 😢

We present the first comprehensive study of regularization techniques with multiple policy optimization algorithms on continuous control tasks.

πŸ”Ή REINFORCEMENT LEARNING WITH SPARSE REWARDS USING GUIDANCE FROM OFFLINE DEMONSTRATION πŸ”₯ πŸŒ‹ πŸ’₯

The proposed algorithm, which we call the Learning Online with Guidance Offline (LOGO) algorithm, merges a policy improvement step with an additional policy guidance step by using the offline demonstration data.

πŸ”Ή MIRROR DESCENT POLICY OPTIMIZATION πŸ‘ πŸ”₯

We derive on-policy and off-policy variants of MDPO (mirror descent policy optimization), while emphasizing important design choices motivated by the existing theory of MD in RL.

πŸ”Ή BREGMAN GRADIENT POLICY OPTIMIZATION πŸ”₯ πŸ”₯

We propose a Bregman gradient policy optimization (BGPO) algorithm based on both the basic momentum technique and mirror descent iteration.

πŸ”Ή Safe Policy Improvement by Minimizing Robust Baseline Regret [see more in offline_rl]

πŸ”Ή Safe Policy Improvement with Baseline Bootstrapping πŸ‘ πŸ”₯ πŸŒ‹

Our approach, called SPI with Baseline Bootstrapping (SPIBB), is inspired by the knows-what-it-knows paradigm: it bootstraps the trained policy with the baseline when the uncertainty is high.
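
A minimal sketch (tabular, illustrative) of SPIBB-style bootstrapping: on state-action pairs whose dataset counts fall below a threshold the improved policy copies the baseline, and only the remaining probability mass may be moved greedily.

```python
import numpy as np

def spibb_greedy_step(q, pi_baseline, counts, n_min):
    """q, pi_baseline, counts: (S, A) arrays. Deviates from the baseline only where counts >= n_min."""
    pi = np.array(pi_baseline, dtype=float)
    for s in range(q.shape[0]):
        free = counts[s] >= n_min                     # actions we are allowed to re-weight
        if free.any():
            mass = pi[s, free].sum()                  # probability mass we may move
            pi[s, free] = 0.0
            pi[s, np.argmax(np.where(free, q[s], -np.inf))] += mass   # greedy on well-known actions
    return pi
```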

πŸ”Ή Safe Policy Improvement with Soft Baseline Bootstrapping πŸ‘ πŸ”₯ πŸŒ‹

Instead of binarily classifying the state-action pairs into two sets (the uncertain and the safe-to-train-on ones), we adopt a softer strategy that controls the error in the value estimates by constraining the policy change according to the local model uncertainty.

πŸ”Ή SPIBB-DQN: Safe batch reinforcement learning with function approximation

πŸ”Ή Safe policy improvement with estimated baseline bootstrapping

πŸ”Ή Incorporating Explicit Uncertainty Estimates into Deep Offline Reinforcement Learning πŸ”₯

deep-SPIBB: Evaluation step regularization + Uncertainty.

πŸ”Ή Accelerating Safe Reinforcement Learning with Constraint-mismatched Baseline Policies πŸ”₯ πŸŒ‹

SPACE: We propose an iterative policy optimization algorithm that alternates between maximizing expected return on the task, minimizing distance to the baseline policy, and projecting the policy onto the constraint satisfying set.

πŸ”Ή Conservative and Adaptive Penalty for Model-Based Safe Reinforcement Learning πŸŒ‹ πŸ”₯

We propose Conservative and Adaptive Penalty (CAP), a model-based safe RL framework that accounts for potential modeling errors by capturing model uncertainty and adaptively exploiting it to balance the reward and the cost objectives.

πŸ”Ή Learning to be Safe: Deep RL with a Safety Critic 😢

We propose to learn how to be safe in one set of tasks and environments, and then use that learned intuition to constrain future behaviors when learning new, modified tasks.

πŸ”Ή CONSERVATIVE SAFETY CRITICS FOR EXPLORATION πŸ”₯ πŸŒ‹ πŸ’₯

CSC: we target the problem of safe exploration in RL by learning a conservative safety estimate of environment states through a critic, and provably upper bound the likelihood of catastrophic failures at every training iteration.
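
A minimal sketch (illustrative names; `policy_sample` and `q_safe` are assumed callables) of safety-critic gating in the spirit of CSC: candidate actions are resampled until the conservative failure estimate falls below a threshold.

```python
import numpy as np

def safe_action(policy_sample, q_safe, state, eps=0.1, max_tries=100, rng=None):
    """Rejection-sample actions whose conservative failure estimate exceeds eps."""
    rng = rng or np.random.default_rng()
    a = policy_sample(state, rng)
    for _ in range(max_tries):
        if q_safe(state, a) <= eps:        # conservative estimate of failure likelihood
            return a
        a = policy_sample(state, rng)
    return a                               # fall back to the last sample
```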

πŸ”Ή Conservative Distributional Reinforcement Learning with Safety Constraints πŸ”₯

We propose the CDMPO algorithm to solve safety-constrained RL problems. Our method incorporates a conservative exploration strategy as well as a conservative distribution function. CSC + distributional RL + MPO + WAPID

πŸ”Ή CUP: A Conservative Update Policy Algorithm for Safe Reinforcement Learning πŸ‘ πŸ”₯ πŸŒ‹ πŸ’₯

(i) We provide a rigorous theoretical analysis to extend the surrogate functions to generalized advantage estimator (GAE). GAE significantly reduces variance empirically while maintaining a tolerable level of bias, which is an efficient step for us to design CUP; (ii) The proposed bounds are tighter than existing works, i.e., using the proposed bounds as surrogate functions are better local approximations to the objective and safety constraints. (iii) The proposed CUP provides a non-convex implementation via first-order optimizers, which does not depend on any convex approximation.

πŸ”Ή Constrained Variational Policy Optimization for Safe Reinforcement Learning πŸ”₯ πŸ’§

CVPO: [poster]

πŸ”Ή A Review of Safe Reinforcement Learning: Methods, Theory and Applications πŸ’¦

πŸ”Ή MESA: Offline Meta-RL for Safe Adaptation and Fault Tolerance 😢

We cast safe exploration as an offline meta-RL problem, where the objective is to leverage examples of safe and unsafe behavior across a range of environments to quickly adapt learned risk measures to a new environment with previously unseen dynamics.

πŸ”Ή Safe Driving via Expert Guided Policy Optimization πŸ‘ πŸ”₯

We develop a novel EGPO method which integrates the guardian in the loop of reinforcement learning. The guardian is composed of an expert policy to generate demonstration and a switch function to decide when to intervene.

πŸ”Ή EFFICIENT LEARNING OF SAFE DRIVING POLICY VIA HUMAN-AI COPILOT OPTIMIZATION πŸ”₯

Human-AI Copilot Optimization (HACO): Human can take over the control and demonstrate to the agent how to avoid probably dangerous situations or trivial behaviors.

πŸ”Ή SAFER: DATA-EFFICIENT AND SAFE REINFORCEMENT LEARNING THROUGH SKILL ACQUISITION πŸ”₯

We propose SAFEty skill pRiors, a behavioral prior learning algorithm that accelerates policy learning on complex control tasks, under safety constraints. Through principled contrastive training on safe and unsafe data, SAFER learns to extract a safety variable from offline data that encodes safety requirements, as well as the safe primitive skills over abstract actions in different scenarios.

πŸ”Ή Sim-to-Lab-to-Real: Safe Reinforcement Learning with Shielding and Generalization Guarantees πŸ”₯

We propose the Sim-to-Lab-to-Real framework that combines Hamilton-Jacobi reachability analysis and PAC-Bayes generalization guarantees to safely close the sim2real gap. Joint training of a performance and a backup policy in Sim training (1st stage) ensures safe exploration during Lab training (2nd stage).

πŸ”Ή Reachability Constrained Reinforcement Learning πŸŒ‹

this paper proposes the reachability CRL (RCRL) method by using reachability analysis to establish the novel self-consistency condition and characterize the feasible sets. The feasible sets are represented by the safety value function.

πŸ”Ή Robust psi-Divergence MDPs πŸ”₯

we develop a novel solution framework for robust MDPs with s-rectangular ambiguity sets that decomposes the problem into a sequence of robust Bellman updates and simplex projections.

Multi-Objective RL

πŸ”Ή Offline Constrained Multi-Objective Reinforcement Learning via Pessimistic Dual Value Iteration πŸ”₯

πŸ”Ή Optimistic Linear Support and Successor Features as a Basis for Optimal Policy Transfer πŸ”₯ πŸ”₯

We showed that any transfer learning problem within the SF framework can be mapped into an equivalent problem of learning multiple policies in MORL under linear preferences. We then introduced a novel SF-based extension of the OLS algorithm (SFOLS) to iteratively construct a set of policies whose SFs form a CCS. [poster]

πŸ”Ή Q-PENSIEVE: BOOSTING SAMPLE EFFICIENCY OF MULTI-OBJECTIVE RL THROUGH MEMORY SHARING OF Q-SNAPSHOTS πŸ”₯ πŸ‘

we propose Q-Pensieve, a policy improvement scheme that stores a collection of Q-snapshots to jointly determine the policy update direction and thereby enables data sharing at the policy level.

πŸ”Ή A Generalized Algorithm for Multi-Objective Reinforcement Learning and Policy Adaptation πŸ”₯ πŸŒ‹ πŸ’₯

Envelope MOQ-Learning: We propose a generalized version of the Bellman equation to learn a single parametric representation for optimal policies over the space of all possible preferences.
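
A minimal sketch (assumed tensor shapes) of the envelope optimality filter: the target for preference w takes, among candidate preferences w' and actions a', the vector Q-value whose scalarisation w Β· Q(s', a', w') is largest.

```python
import torch

def envelope_target(r_vec, q_next, w, gamma=0.99):
    """
    r_vec:  (B, d) vector reward
    q_next: (B, W, A, d) vector Q-values at s' for W candidate preferences and A actions
    w:      (B, d) preference the sample is trained on
    """
    scalar = torch.einsum('bd,bwad->bwa', w, q_next)          # w . Q(s', a', w')
    idx = scalar.view(scalar.size(0), -1).argmax(dim=1)       # best (w', a') per sample
    B, W, A, d = q_next.shape
    best = q_next.view(B, W * A, d)[torch.arange(B), idx]     # (B, d) envelope Q-vector
    return r_vec + gamma * best
```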

πŸ”Ή PD-MORL: PREFERENCE-DRIVEN MULTIOBJECTIVE REINFORCEMENT LEARNING ALGORITHM πŸ”₯ πŸŒ‹

We observe that the preference vectors have similar directional angles to the corresponding vectorized Q-values for a given state. Using this insight, we utilize the cosine similarity between the preference vector and vectorized Q-values in the Bellman optimality operator to guide the training.
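
A minimal sketch (illustrative only) of the directional cue PD-MORL builds on: among candidate vector Q-values, prefer the one whose direction is most aligned with the preference vector.

```python
import torch
import torch.nn.functional as F

def most_aligned(w, q_candidates):
    """w: (d,) preference; q_candidates: (N, d). Index of the Q vector best aligned with w."""
    return F.cosine_similarity(q_candidates, w.unsqueeze(0), dim=-1).argmax()
```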

πŸ”Ή Trust Region Policy Optimization with Optimal Transport Discrepancies: Duality and Algorithm for Continuous Actions πŸ˜•

we explore optimal transport discrepancies (which include the Wasserstein distance) to define trust regions, and we propose a novel algorithm – Optimal Transport Trust Region Policy Optimization (OT-TRPO) – for continuous state-action spaces. We circumvent the infinite-dimensional optimization problem for PO by providing a one-dimensional dual reformulation for which strong duality holds.

Distributional RL

Continual Learning

πŸ”Ή Continual Learning with Deep Generative Replay πŸ’§ 😢

We propose the Deep Generative Replay, a novel framework with a cooperative dual model architecture consisting of a deep generative model (β€œgenerator”) and a task solving model (β€œsolver”).

πŸ”Ή online learning; regret πŸ’¦ ​

πŸ”Ή RESET-FREE LIFELONG LEARNING WITH SKILL-SPACE PLANNING πŸ”₯

We propose Lifelong Skill Planning (LiSP), an algorithmic framework for non-episodic lifelong RL based on planning in an abstract space of higher-order skills. We learn the skills in an unsupervised manner using intrinsic rewards and plan over the learned skills using a learned dynamics model.

πŸ”Ή Don’t Start From Scratch: Leveraging Prior Data to Automate Robotic Reinforcement Learning πŸ‘

Our main contribution is demonstrating that incorporating prior data into a reinforcement learning system simultaneously addresses several key challenges in real-world robotic RL: sample-efficiency, zero-shot generalization, and autonomous non-episodic learning.

πŸ”Ή A State-Distribution Matching Approach to Non-Episodic Reinforcement Learning πŸ‘ πŸ”₯

Assuming access to a few demonstrations, we propose a new method, MEDAL, that trains the backward policy to match the state distribution in the provided demonstrations. [poster]

πŸ”Ή You Only Live Once: Single-Life Reinforcement Learning via Learned Reward Shaping πŸ”₯ πŸŒ‹

SLRL: we propose QWALE, which addresses the dearth of supervision by employing a distribution-matching strategy that leverages the agent’s prior experience as guidance in novel situations.

πŸ”Ή FULLY ONLINE META-LEARNING WITHOUT TASK BOUNDARIES πŸ”₯

we propose a Fully Online MetaLearning (FOML) algorithm, which does not require any ground truth knowledge about the task boundaries and stays fully online without resetting back to pre-trained weights.

πŸ”Ή Learn the Time to Learn: Replay Scheduling in Continual Learning πŸ‘

Storing historical data is cheap in many real-world applications, yet replaying all historical data would be prohibited due to processing time constraints. In such settings, we propose learning the time to learn for a continual learning system, in which we learn replay schedules over which tasks to replay at different time steps.

Self-paced & Curriculum RL

πŸ”Ή Self-Paced Contextual Reinforcement Learning πŸŒ‹ πŸ’¦ ​ ​

We introduce a novel relative entropy reinforcement learning algorithm that gives the agent the freedom to control the intermediate task distribution, allowing for its gradual progression towards the target context distribution.

πŸ”Ή Self-Paced Deep Reinforcement Learning πŸŒ‹ πŸ’¦ ​ ​

In this paper, we propose an answer by interpreting the curriculum generation as an inference problem, where distributions over tasks are progressively learned to approach the target task. This approach leads to an automatic curriculum generation, whose pace is controlled by the agent, with solid theoretical motivation and easily integrated with deep RL algorithms.

πŸ”Ή Learning with AMIGO: Adversarially Motivated Intrinsic Goals πŸ‘ Lil'Log-Curriculum πŸ‘

(Intrinsic motivation + Curriculum learning)

πŸ”Ή Information Directed Reward Learning for Reinforcement Learning πŸ”₯ πŸŒ‹ πŸ’₯

IDRL: uses a Bayesian model of the reward and selects queries that maximize the information gain about the difference in return between plausibly optimal policies.

πŸ”Ή Actively Learning Costly Reward Functions for Reinforcement Learning 😢

ACRL, an extension to standard reinforcement learning methods in the context of (computationally) expensive rewards, which models the reward of given applications using machine learning models.

Foundation models

πŸ”Ή Human-Timescale Adaptation in an Open-Ended Task Space πŸ”₯ πŸŒ‹

AdA: Adaptation emerges from three ingredients: (1) meta-reinforcement learning across a vast, smooth and diverse task distribution, (2) a policy parameterised as a large-scale attention-based memory architecture, and (3) an effective automated curriculum that prioritises tasks at the frontier of an agent’s capabilities.

πŸ”Ή VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models πŸ”₯ πŸ”₯

VOXPOSER extracts language-conditioned affordances and constraints from LLMs and grounds them to the perceptual space using VLMs, using a code interface and without additional training to either component.

πŸ”Ή RoCo: Dialectic Multi-Robot Collaboration with Large Language Models πŸ”₯

πŸ”Ή REWARD DESIGN WITH LANGUAGE MODELS πŸ”₯

πŸ”Ή GPT-4 Technical Report

πŸ”Ή Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

πŸ”Ή Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards

πŸ”Ή Language to Rewards for Robotic Skill Synthesis πŸ”₯ πŸŒ‹ πŸ’₯

Using reward as the intermediate interface generated by LLMs, we can effectively bridge the gap between high-level language instructions or corrections to low-level robot actions.

πŸ”Ή Code as Policies: Language Model Programs for Embodied Control πŸ”₯

Given examples (via few-shot prompting), robots can use code-writing large language models (LLMs) to translate natural language commands into robot policy code which process perception outputs, parameterize control primitives, recursively generate code for undefined functions, and generalize to new tasks.

πŸ”Ή β€œNo, to the Right” – Online Language Corrections for Robotic Manipulation via Shared Autonomy 😢

Language-Informed Latent Actions with Corrections (LILAC)

πŸ”Ή Lila: Language-informed latent actions πŸ”₯

LILA learns to use language to modulate this controller, providing users with a language-informed control space: given an instruction like β€œplace the cereal bowl on the tray,” LILA may learn a 2-DoF space where one dimension controls the distance from the robot’s end-effector to the bowl, and the other dimension controls the robot’s end-effector pose relative to the grasp point on the bowl.

Quadruped

Optimization

Galaxy Forest

​ 🌌 ❄️ πŸŒ€ 🌊 πŸŒ‹ 🌍 🌎 🌏 πŸ“– 🎯 πŸ’Ž πŸ‹ 🎧 πŸ“Œ πŸ›°οΈ πŸ“‘ πŸš€ 🌠 πŸŒ„ 🚩 🍺 🍡 πŸ“… β›³ βŒ› πŸ“· πŸ“Ÿ 🎈 πŸ† 🍎 🍚 ​

Aha

Alpha



Blog & Corp. & Legend

OpenAI Spinning Up, OpenAI Blog, OpenAI Baselines, DeepMind, BAIR, Stanford AI Lab,

Lil'Log, Andrej Karpathy blog, The Gradient, RAIL - course - RL, RAIL - cs285, inFERENCe,

covariant, RL_theory_book,

UCB: Tuomas Haarnoja, Pieter Abbeel, Sergey Levine, Abhishek Gupta, Coline Devin, YuXuan (Andrew) Liu, Rein Houthooft, Glen Berseth,

UCSD: Xiaolong Wang,

CMU: Benjamin Eysenbach, Ruslan Salakhutdinov,

Stanford: Chelsea Finn, [Tengyu Ma], [Tianhe Yu], [Rui Shu],

NYU: Rob Fergus,

MIT: Bhairav Mehta, Leslie Kaelbling, Joseph J. Lim,

Caltech: Joseph Marino, Yisong Yue Homepage,

DeepMind: David Silver, Yee Whye Teh [Homepage], Alexandre Galashov, Leonard Hasenclever [GS], Siddhant M. Jayakumar, Zhongwen Xu, Markus Wulfmeier [HomePage], Wojciech Zaremba, Aviral Kumar,

Google: Ian Fischer, Danijar Hafner [Homepage], Ofir Nachum, Yinlam Chow, Shixiang Shane Gu, [Mohammad Ghavamzadeh]

Montreal: Anirudh Goyal Homepage,

Toronto: Jimmy Ba; Amir-massoud Farahmand;

Columbia: Yunhao (Robin) Tang,

OpenAI:

THU: Chongjie Zhang [Homepage], Yi Wu, Mingsheng Long [Homepage],

PKU: Zongqing Lu,

NJU: Yang Yu,

TJU: Jianye Hao,

HIT: PR&IS research center,

Salesforce : Alexander Trott,

Flowers Lab (INRIA): CΓ©dric Colas,

NeurIPS,
