Behavioral Priors and Dynamics Models:

Improving Performance and Domain Transfer in Offline RL

Catherine Cang, Aravind Rajeswaran, Pieter Abbeel, Michael Laskin

UC Berkeley, Facebook AI Research


We introduce Offline Model-based RL with Adaptive Behavioral Priors (MABE). Our algorithm is based on the finding that dynamics models, which support within-domain generalization, and behavioral priors, which support cross-domain generalization, are complementary. When combined, they substantially improve the performance and generalization of offline RL policies.

Motivation

Offline Reinforcement Learning (RL) aims to extract near-optimal policies from imperfect offline data without additional environment interactions. Extracting policies from diverse offline datasets has the potential to expand the range of applicability of RL by making the training process safer, faster, and more streamlined. We investigate how to improve the performance of offline RL algorithms, their robustness to the quality of offline data, and their generalization capabilities.

Highlights

1) MABE achieves the highest score on 7 of the 9 D4RL benchmark tasks we study, as well as the highest average score, when compared to prior work.

2) MABE is a model-based algorithm that does not require uncertainty estimation, and thus has the potential for wider applicability. MABE can optionally incorporate uncertainty estimation when it is available, although the resulting gains are marginal.

3) MABE has the unique capability of combining behavioral priors and dynamics models from different domains, thereby enabling effective cross-domain transfer.

Behavioral Priors with MBRL

Our method consists of three main components, sketched in the code examples below:

1) A neural network dynamics model of the environment.

2) An advantage-weighted behavioral prior model.

3) Policy improvement with model-based RL, regularized towards the learned behavioral prior.
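As a concrete illustration of components 1) and 2), below is a minimal PyTorch sketch under simplifying assumptions: the dynamics model is a diagonal-Gaussian MLP over state deltas and rewards, and the behavioral prior is a Gaussian policy fit by advantage-weighted maximum likelihood on pre-computed advantage estimates. All names and hyperparameters here are illustrative, not the exact MABE implementation.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

class GaussianDynamics(nn.Module):
    """Gaussian MLP over (state delta, reward), trained by maximum likelihood."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = mlp(obs_dim + act_dim, 2 * (obs_dim + 1))

    def loss(self, obs, act, next_obs, rew):
        mean, log_std = self.net(torch.cat([obs, act], -1)).chunk(2, dim=-1)
        target = torch.cat([next_obs - obs, rew], -1)  # rew has shape (N, 1)
        # Gaussian negative log-likelihood, up to additive constants.
        return (0.5 * ((target - mean) / log_std.exp()) ** 2 + log_std).mean()

class GaussianPolicy(nn.Module):
    """Diagonal-Gaussian head, used for both the behavioral prior and the policy."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = mlp(obs_dim, 2 * act_dim)

    def dist(self, obs):
        mean, log_std = self.net(obs).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())

def behavioral_prior_loss(prior, batch, temperature=1.0):
    """Advantage-weighted maximum likelihood: transitions with higher estimated
    advantage receive exponentially larger weight, so the prior imitates the
    better behaviors in the offline dataset."""
    weights = torch.softmax(batch["advantage"] / temperature, dim=0)
    log_prob = prior.dist(batch["obs"]).log_prob(batch["act"]).sum(-1)
    return -(weights * log_prob).sum()
```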

Our method offers behavioral regularization as an alternative to uncertainty estimation for robust model-based offline RL.
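Component 3) can then be sketched as a KL-regularized actor update on states drawn from short rollouts of the learned dynamics model. This is a sketch under assumed details, reusing GaussianPolicy from above; the critic, the rollout generation, and the choice of KL weight are elided, and the exact form of the regularizer in MABE may differ.

```python
import torch

def policy_loss(policy, prior, critic, model_batch, kl_weight=1.0):
    """One actor step: maximize the critic's value estimate while penalizing
    KL divergence from the behavioral prior at model-generated states."""
    obs = model_batch["obs"]                # states from short model rollouts
    pi = policy.dist(obs)
    act = pi.rsample()                      # reparameterized action sample
    q_value = critic(obs, act).squeeze(-1)  # learned Q-function (not shown)
    with torch.no_grad():                   # the behavioral prior stays frozen
        prior_dist = prior.dist(obs)
    kl = torch.distributions.kl_divergence(pi, prior_dist).sum(-1)
    return (-q_value + kl_weight * kl).mean()
```

Regularizing towards the advantage-weighted prior, rather than penalizing estimated model uncertainty, is what allows the method to avoid explicit uncertainty estimation.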

Results on D4RL Benchmark

We evaluate MABE on D4RL, a common offline RL benchmark, using three agents: halfcheetah, hopper, and walker2d. For each agent, we evaluate on three datasets of varying expertise: medium, mixed, and medium-expert. Compared against leading model-based and model-free methods, MABE achieves the top score in 7 of the 9 environments, as well as the highest average normalized score of 77.5.
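For reference, D4RL normalized scores map the return of a random policy to 0 and the return of an expert policy to 100 on each task:

```python
def d4rl_normalized_score(policy_return, random_return, expert_return):
    """D4RL normalization: 0 = random-policy return, 100 = expert-policy return."""
    return 100.0 * (policy_return - random_return) / (expert_return - random_return)
```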



Cross-Domain and Cross-Task Transfer

Prior work has shown that offline MBRL is capable of in-domain generalization to new tasks. Here, we investigate the cross-domain transfer capabilities of MABE and MOPO. We show that, given multiple datasets with different dynamics and behaviors, MABE can generalize to a new task in the target domain that was not present in the target domain's offline data.

Our input data consists of two datasets: the agent walking forward on normal terrain, and the agent walking backwards on low-friction (icy) terrain. The target task is walking backwards on normal terrain, which combines the dynamics of the first dataset with the behavior of the second. Prior methods like MOPO can utilize only one of the datasets owing to the dynamics mismatch between the normal and icy terrains. In contrast, MABE can learn from both datasets simultaneously.
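Concretely, the transfer setup can be summarized by how the two datasets are split across MABE's components. The sketch below reuses the illustrative classes from the earlier code and elides the training loops themselves; the dataset names are descriptive placeholders.

```python
def build_mabe_transfer(obs_dim, act_dim, forward_normal_data, backwards_icy_data):
    """Illustrative split of the two datasets across MABE's components."""
    # Dynamics model: fit on the target-domain data (walking forward on normal
    # terrain), so model rollouts reflect the terrain used at evaluation time.
    dynamics = GaussianDynamics(obs_dim, act_dim)   # train on forward_normal_data
    # Behavioral prior: fit (advantage-weighted) on the source-domain data
    # (walking backwards on icy terrain), so it encodes the desired gait.
    prior = GaussianPolicy(obs_dim, act_dim)        # train on backwards_icy_data
    # Policy improvement then runs in the normal-terrain model while being
    # regularized towards the backwards-walking prior (see policy_loss above).
    return dynamics, prior
```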

In our experiments, MABE significantly outperforms prior algorithms across all three agents: halfcheetah, hopper, and walker2d.

MOPO Domain Transfer

Normalized Score: -0.09

In MOPO domain transfer, we train MOPO on the backwards icy dataset.

MOPO Task Transfer

Normalized Score: 0.28

In MOPO task transfer, we train MOPO on the forward normal dataset.

MABE Transfer

Normalized Score: 0.78

In MABE transfer, we train the dynamics model on the forward normal dataset and train the behavioral prior on the backwards icy dataset.

Summary

  • We present a framework that outperforms existing offline methods across datasets of varying levels of expertise

  • We offer an alternative to uncertainty estimation, which can be very difficult in certain environments and datasets, for learning robust offline RL policies

  • We show the potential for multi-dataset transfer by using a separate dynamics model and behavioral prior