DDPG log output uses scientific notation too soon for episodes #545

Open · jkterry1 opened this issue Nov 8, 2019 · 1 comment
Labels: enhancement (New feature or request)

jkterry1 commented Nov 8, 2019

Here's an example intermittent print out from DDPG:

--------------------------------------
| reference_Q_mean        | 49.8     |
| reference_Q_std         | 6.61     |
| reference_action_mean   | -0.621   |
| reference_action_std    | 0.752    |
| reference_actor_Q_mean  | 50.4     |
| reference_actor_Q_std   | 6.42     |
| rollout/Q_mean          | 46.9     |
| rollout/actions_mean    | 0.137    |
| rollout/actions_std     | 0.806    |
| rollout/episode_steps   | 178      |
| rollout/episodes        | 2.13e+03 |
| rollout/return          | 82.2     |
| rollout/return_history  | 92       |
| total/duration          | 828      |
| total/episodes          | 2.13e+03 |
| total/epochs            | 1        |
| total/steps             | 379998   |
| total/steps_per_second  | 459      |
| train/loss_actor        | -69.6    |
| train/loss_critic       | 1.11     |
| train/param_noise_di... | 0        |
--------------------------------------

The rollout/episodes and total/episodes values are printed in scientific notation, which is harder to read at a glance even though the numbers aren't long enough to require it. It actually makes the values longer than they would be in plain form, while a bigger number like total/steps (379998) is still printed in full. None of the other RL methods do this. I know this is minor, but it's annoying enough to be worth fixing. Could someone please look into it?
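
For context, the values that switch to scientific notation are the ones that reach the logger as floats with more than three significant digits before the decimal point. Below is a minimal sketch of that behaviour, not the actual stable-baselines logger code; the format_value helper is hypothetical, and the "%-8.3g" format is an assumption about how the human-readable output rounds floats:

# Hypothetical illustration of the formatting behaviour described above.
# A "%.3g"-style format (3 significant digits) falls back to scientific
# notation once a float value needs more than 3 digits before the decimal
# point, while the same value stored as an int prints in full.

def format_value(val):
    """Mimic a human-readable logger column: floats via %-8.3g, ints via str."""
    if isinstance(val, float):
        return '%-8.3g' % val
    return str(val)

print(format_value(2130.0))   # '2.13e+03'  <- what the episode counter shows
print(format_value(2130))     # '2130'      <- what an int counter would show
print(format_value(379998))   # '379998'    <- total/steps stays readable (int)

Running this prints 2.13e+03 for the float but 2130 and 379998 for the ints, which matches the table above.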

Code to reproduce:

import gym
import numpy as np
import warnings
import os
from stable_baselines.results_plotter import load_results, ts2xy

best_mean_reward, n_steps = -np.inf, 0

def callback(_locals, _globals):
  """
  Callback called at each step (for DQN and others) or after n steps (see ACER or PPO2)
  :param _locals: (dict)
  :param _globals: (dict)
  """
  global n_steps, best_mean_reward
  # Print stats every 1000 calls
  if (n_steps + 1) % 1000 == 0:
      # Evaluate policy training performance
      x, y = ts2xy(load_results(log_dir), 'timesteps')
      if len(x) > 0:
          mean_reward = np.mean(y[-100:])
          """
          print(x[-1], 'timesteps')
          print("Best mean reward: {:.2f} - Last mean reward per episode: {:.2f}".format(best_mean_reward, mean_reward))
          """

          # New best model, you could save the agent here
          if mean_reward > best_mean_reward:
              best_mean_reward = mean_reward
              # Example for saving best model
              print("Saving new best model")
              _locals['self'].save(log_dir + 'best_model.pkl')
  n_steps += 1
  return True


log_dir = "/tmp/gym/"
os.makedirs(log_dir, exist_ok=True)


from stable_baselines.ddpg.policies import MlpPolicy
from stable_baselines.common.vec_env import VecVideoRecorder, DummyVecEnv
from stable_baselines.ddpg.noise import NormalActionNoise, OrnsteinUhlenbeckActionNoise, AdaptiveParamNoiseSpec
from stable_baselines import DDPG
from stable_baselines.bench import Monitor

env = gym.make('MountainCarContinuous-v0')
env.seed(42)
env = Monitor(env, log_dir, allow_early_resets=True)

n_actions = env.action_space.shape[-1]
param_noise = None
action_noise = OrnsteinUhlenbeckActionNoise(mean=np.zeros(n_actions), sigma=float(0.5) * np.ones(n_actions))

model = DDPG(MlpPolicy, env, verbose=1, param_noise=param_noise, action_noise=action_noise, tensorboard_log='./ddpg_tensorboard')
model.learn(total_timesteps=400000, callback=callback)

obs = env.reset()

# Render the trained agent for one episode
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
    if dones:
        break
env.close()
araffin added the enhancement (New feature or request) label on Nov 8, 2019
araffin (Collaborator) commented Nov 8, 2019

Hello,

I think this display may come from this line where the number of episodes becomes a float.
Feel free to submit a PR if you think this is useful. I consider this really minor and would recommend using SAC or TD3 anyway.
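
For anyone picking this up, here is a hypothetical sketch of the kind of fix implied above: cast the counters that are logically integers back to int just before they are handed to the logger, so any upstream averaging that turns them into floats no longer triggers scientific notation. The cast_integer_stats helper and the stats-dict shape are assumptions for illustration, not the actual DDPG code; only the key names are taken from the log output in the issue.

# Hypothetical sketch: force known integer counters back to int before logging.

INTEGER_KEYS = ('rollout/episodes', 'total/episodes', 'total/epochs', 'total/steps')

def cast_integer_stats(stats):
    """Return a copy of the stats dict with known integer counters cast to int."""
    cleaned = dict(stats)
    for key in INTEGER_KEYS:
        if key in cleaned:
            cleaned[key] = int(cleaned[key])
    return cleaned

# e.g. cast_integer_stats({'total/episodes': 2130.0})['total/episodes'] -> 2130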
