SAC for Mujoco Environments (#6642)
michaelzhiluo authored and ericl committed Dec 31, 2019
1 parent cdc1ce4 commit 1cb3354
Showing 3 changed files with 48 additions and 7 deletions.
12 changes: 10 additions & 2 deletions doc/source/rllib-algorithms.rst
@@ -274,9 +274,17 @@ Soft Actor Critic (SAC)

SAC architecture (same as DQN)

- RLlib's soft-actor critic implementation is ported from the `official SAC repo <https://github.com/rail-berkeley/softlearning>`__ to better integrate with RLlib APIs. Note that SAC has two fields to configure for custom models: ``policy_model`` and ``Q_model``, and currently has no support for non-continuous action distributions. It is also currently *experimental*.
+ RLlib's soft-actor critic implementation is ported from the `official SAC repo <https://github.com/rail-berkeley/softlearning>`__ to better integrate with RLlib APIs. Note that SAC has two fields to configure for custom models: ``policy_model`` and ``Q_model``, and currently has no support for non-continuous action distributions.

- Tuned examples: `Pendulum-v0 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/regression_tests/pendulum-sac.yaml>`__
+ Tuned examples: `Pendulum-v0 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/regression_tests/pendulum-sac.yaml>`__, `HalfCheetah-v3 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/halfcheetah-sac.yaml>`__

**MuJoCo results @500k steps:** `more details <https://github.com/ray-project/rl-experiments>`__

============= ========== ===================
MuJoCo env    RLlib SAC  Haarnoja et al SAC
============= ========== ===================
HalfCheetah   8752       ~9000
============= ========== ===================

**SAC-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

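As a hedged illustration of the two custom-model fields called out above, the sketch below configures ``Q_model`` and ``policy_model`` with the same layer sizes as the tuned HalfCheetah example added in this commit; the ``SACTrainer`` import path and the ``Pendulum-v0`` environment are assumptions for this RLlib version, not part of the commit:

    # Illustrative sketch only: the config keys mirror the tuned YAML added in
    # this commit; the import path and environment choice are assumptions.
    import ray
    from ray.rllib.agents.sac import SACTrainer

    ray.init()

    trainer = SACTrainer(
        env="Pendulum-v0",  # any continuous-action env; discrete actions are unsupported
        config={
            "Q_model": {
                "hidden_activation": "relu",
                "hidden_layer_sizes": [256, 256],
            },
            "policy_model": {
                "hidden_activation": "relu",
                "hidden_layer_sizes": [256, 256],
            },
        })

    result = trainer.train()  # runs one training iteration
    print(result["episode_reward_mean"])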
6 changes: 1 addition & 5 deletions rllib/agents/sac/sac_model.py
@@ -81,7 +81,6 @@ def __init__(self,
            shape=(num_outputs, ), name="model_out")
        self.actions = tf.keras.layers.Input(
            shape=(self.action_dim, ), name="actions")

        shift_and_log_scale_diag = tf.keras.Sequential([
            tf.keras.layers.Dense(
                units=hidden,
@@ -90,10 +89,7 @@ def __init__(self,
            for i, hidden in enumerate(actor_hiddens)
        ] + [
            tf.keras.layers.Dense(
-                units=tfp.layers.MultivariateNormalTriL.params_size(
-                    self.action_dim),
-                activation=None,
-                name="action_out")
+                units=2 * self.action_dim, activation=None, name="action_out")
        ])(self.model_out)

        shift, log_scale_diag = tf.keras.layers.Lambda(
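The single functional change above swaps ``tfp.layers.MultivariateNormalTriL.params_size(self.action_dim)`` for ``2 * self.action_dim``: the action head now emits only a mean (``shift``) and a per-dimension log scale, which the ``Lambda`` layer right after it splits apart. A minimal sketch of that shape reasoning, with an illustrative batch and NumPy standing in for the actual Keras tensors:

    # Shape sketch only; names and the batch are illustrative, not the model's tensors.
    import numpy as np

    action_dim = 6                               # HalfCheetah-v3 has 6-dimensional actions
    action_out = np.zeros((32, 2 * action_dim))  # output of Dense(units=2 * action_dim)

    # The downstream Lambda layer effectively performs this split into the mean
    # and the per-dimension log standard deviation of a diagonal Gaussian policy.
    shift, log_scale_diag = np.split(action_out, 2, axis=-1)
    assert shift.shape == (32, action_dim)
    assert log_scale_diag.shape == (32, action_dim)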
37 changes: 37 additions & 0 deletions rllib/tuned_examples/halfcheetah-sac.yaml
@@ -0,0 +1,37 @@
# Our implementation of SAC can reach 9k reward in 400k timesteps
halfcheetah_sac:
    env: HalfCheetah-v3
    run: SAC
    stop:
        episode_reward_mean: 9000
    config:
        horizon: 1000
        soft_horizon: False
        Q_model:
            hidden_activation: relu
            hidden_layer_sizes: [256, 256]
        policy_model:
            hidden_activation: relu
            hidden_layer_sizes: [256, 256]
        tau: 0.005
        target_entropy: auto
        no_done_at_end: True
        n_step: 1
        sample_batch_size: 1
        prioritized_replay: False
        train_batch_size: 256
        target_network_update_freq: 1
        timesteps_per_iteration: 1000
        learning_starts: 10000
        exploration_enabled: True
        optimization:
            actor_learning_rate: 0.0003
            critic_learning_rate: 0.0003
            entropy_learning_rate: 0.0003
        num_workers: 0
        num_gpus: 0
        clip_actions: False
        normalize_actions: True
        evaluation_interval: 1
        metrics_smoothing_episodes: 5
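Beyond RLlib's ``rllib train -f <yaml>`` entry point, a hedged sketch of launching the new tuned example from Python is shown below; the YAML-loading code is illustrative and assumes the ray repository root as the working directory plus a working MuJoCo/``mujoco_py`` install for ``HalfCheetah-v3``:

    # Hedged sketch: load the tuned-example YAML above and hand it to Tune.
    import yaml
    import ray
    from ray import tune

    with open("rllib/tuned_examples/halfcheetah-sac.yaml") as f:
        experiments = yaml.safe_load(f)

    name, spec = next(iter(experiments.items()))    # "halfcheetah_sac"
    config = dict(spec["config"], env=spec["env"])  # env sits at the experiment level in the YAML

    ray.init()
    tune.run(
        spec["run"],        # "SAC"
        name=name,
        stop=spec["stop"],  # {"episode_reward_mean": 9000}
        config=config,
    )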
