
[bug] PPO2 episode reward summaries are written incorrectly for VecEnvs #143

Open

shwang opened this issue Dec 22, 2018 · 16 comments

Labels: bug (Something isn't working)

@shwang

shwang commented Dec 22, 2018

Episode reward summaries are all concentrated together on a few steps, with jumps in between.

Zoomed out:
[image: episode_reward plot, zoomed out]

Zoomed in:
[image: episode_reward plot, zoomed in]

Every other summary looks fine:
[image: another summary plot]

To reproduce, run PPO2 on DummyVecEnv([lambda: gym.make("Pendulum-v0") for _ in range(8)]), e.g. with the script below.
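
A minimal script that reproduces it for me (assuming stable-baselines 2.x with TensorFlow 1.x; the log directory is arbitrary):

```python
import gym

from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv

# 8 copies of Pendulum-v0 stepped sequentially in one process
env = DummyVecEnv([lambda: gym.make("Pendulum-v0") for _ in range(8)])

model = PPO2("MlpPolicy", env, tensorboard_log="./ppo2_pendulum_tb/", verbose=1)
model.learn(total_timesteps=100000)
# the episode_reward curve in the resulting tensorboard run shows the clustered points above
```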

@araffin
Collaborator

araffin commented Jan 9, 2019

Hello,
I have also encountered that issue in the past... I did not investigate much, but I think I found that it came from using multiple environments.
Could you run two experiments that could provide some insight:

  • same experiment but with only one env
  • same experiment but with A2C (even though A2C does not yet work well in terms of performance on continuous actions)

@ernestum
Collaborator

I can confirm that it works with just one env. The relevant code is in total_episode_reward_logger, which PPO2 calls from its learn() method. To me it is entirely unclear what total_episode_reward_logger is doing exactly, and unfortunately I have no time to look into the issue.

@balintkozma

The root cause of this problem may be the following:

total_episode_reward_logger is "borrowed" from the A2C module and used incorrectly in PPO2.
It calculates the step counter for the add_summary call by adding the length of the episode to the current self.num_timesteps variable.
This is correct only if self.num_timesteps += self.n_batch is called after total_episode_reward_logger, as in A2C. If that line is called before the logger, it shifts the step counter by n_batch.
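
A self-contained toy illustration of the shift (made-up numbers, not the library code):

```python
# The logger writes a finished episode at roughly
#   step = num_timesteps_at_call + offset_of_the_episode_end_within_the_batch
n_envs, n_steps = 8, 128
n_batch = n_envs * n_steps  # 1024 timesteps collected per PPO2 update

def summary_step(num_timesteps_at_call, offset_in_batch):
    return num_timesteps_at_call + offset_in_batch

# A2C order: log first, then num_timesteps += n_batch -> the episode lands where it ended
correct = summary_step(0, 42)        # 42

# current PPO2 order: num_timesteps is already incremented when the logger runs
shifted = summary_step(n_batch, 42)  # 1066, i.e. shifted forward by n_batch

print(correct, shifted)
```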

@Miffyli
Collaborator

Miffyli commented Nov 14, 2019

@balintkozma
Hey. Could you make a pull request out of this? From a brief look, it seems one mostly has to move this block before the self.num_timesteps increment. The rest of the variables seem to be fine with this modification.

@balintkozma

@Miffyli

I created the PR; meanwhile, I found another problem:

If a shorter episode is added to the episode reward summary after a longer one, the graph will go backwards, because TensorBoard connects the points in the order they are added and the computed step counter will be smaller.

So the backwards-running curves in the zoomed-in picture can be avoided by sorting the finished episodes by length before they are added to the tensorboard summary.

Not implemented yet, I will create a separate PR.
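
A rough sketch of the sorting idea (toy numbers, not the actual PR code):

```python
# (computed_step, episode_reward) pairs for episodes finished in the same batch;
# the second episode was shorter, so its computed step is smaller even though it
# was appended later, which is what makes the curve run backwards
finished = [(1200, -950.0), (1050, -780.0), (1180, -820.0)]

# writing them in increasing step order keeps the tensorboard curve monotonic;
# print() stands in for writer.add_summary(...) here
for step, ep_reward in sorted(finished):
    print(step, ep_reward)
```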

@araffin
Collaborator

araffin commented Nov 14, 2019

Not implemented yet, I will create a separate PR.

Please do only one PR that solves this issue.

@Miffyli
Collaborator

Miffyli commented Nov 14, 2019

@balintkozma

Thanks for the quick reply!

I think that could also be fixed in the same PR, as these two are related...

Ninja'd by araffin.

@7Z0nE

7Z0nE commented Feb 11, 2020

There are many more issues with the timestep computation than just the call to total_episode_reward_logger. All the other plots are also wrong when running a VecEnv.
Orange: multiple environments, red: single environment

[image: old value prediction plot, multi-env vs. single-env]

I do not understand the computation of the current timestep:

timestep = self.num_timesteps // update_fac + ((self.noptepochs * self.n_batch + epoch_num * self.n_batch + start) // batch_size)

Am I missing something? To me, the only requirement on the timestep computation is that the values are plotted in the same order as they were computed.

I already fixed this for my own use and would make a pull request if appreciated.
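
For reference, a sketch of the kind of computation I ended up with (names are made up, this is not necessarily what should go into the library): spread the noptepochs * (n_batch // batch_size) minibatch updates of one rollout evenly over the n_batch environment steps that rollout covers, starting from num_timesteps at the beginning of the rollout, so the points stay in collection order.

```python
def minibatch_timestep(rollout_start, n_batch, noptepochs, batch_size, epoch_num, start):
    """x-coordinate for the minibatch beginning at `start` in epoch `epoch_num`.

    `rollout_start` is num_timesteps at the start of the rollout; the updates of one
    rollout are spread evenly over its n_batch environment steps.
    """
    minibatches_per_epoch = n_batch // batch_size
    update_idx = epoch_num * minibatches_per_epoch + start // batch_size
    updates_per_rollout = noptepochs * minibatches_per_epoch
    return rollout_start + (update_idx * n_batch) // updates_per_rollout

# 4 epochs over a 1024-step batch, minibatches of 256 -> 16 monotonic points in [0, 1024)
steps = [minibatch_timestep(0, 1024, 4, 256, epoch, start)
         for epoch in range(4) for start in range(0, 1024, 256)]
print(steps)
```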

@Miffyli
Collaborator

Miffyli commented Feb 11, 2020

I already fixed this for my own use and would make a pull request if appreciated.

If you have a solution that does not change too many parts at once, go ahead and make a PR out of it :). If it is a large change, it might need time/discussion before merging, as we try to focus on v3.0 at the moment.

@araffin
Collaborator

araffin commented Feb 11, 2020

Am I missing something? To me, the only requirement on the timestep computation is that the values are plotted in the same order as they were computed.

Looking at the issue again, the computation of timestep does not really make sense. A real fix would be to plot the average of those values instead of plotting each one of them...
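
Something along those lines (a sketch with toy values, not tested against the codebase): average the per-minibatch values of one update and write a single point at the current num_timesteps.

```python
import numpy as np

# toy per-minibatch value losses collected during one PPO2 optimization phase
minibatch_value_losses = [0.41, 0.38, 0.35, 0.33]

mean_value_loss = float(np.mean(minibatch_value_losses))
# this single value would then be written once per update, at self.num_timesteps,
# instead of one summary per minibatch at a computed timestep
print(mean_value_loss)
```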

@paolo-viceconte

paolo-viceconte commented Mar 11, 2020

Hi, I also encountered some of the issues described in the comments above. A recap follows.

PPO2 tensorboard visualization issues

If you run PPO2 with a single process training for 256 timesteps (N=1, T=256) and try to visualize the episode reward and the optimization statistics:

  1. the episode_reward is shifted by T (instead of being in [0,256], it is plotted in [256,512]) for the reason explained in [bug] PPO2 episode reward summaries are written incorrectly for VecEnvs #143 (comment)
  2. the loss statistics are associated with weird timesteps (e.g. [527,782]) obtained as a result of the timestep calculations highlighted in [bug] PPO2 episode reward summaries are written incorrectly for VecEnvs #143 (comment)

[image: tensorboard screenshots for the single-process case]

Moreover, if you try to plot data using multiple processes (for instance N=4 workers with T=256 timesteps per worker):

  1. the collected rewards are superimposed on the first T timesteps, followed by a jump of (N-1)*T timesteps in the plot

PPO2 tensorboard visualization: proposed solutions

I implemented the following solutions for the visualization issues:

  1. decreasing the timestep index by the batch size before plotting
  2. simplifying the logic for plotting the optimization statistics:
    • each optimization phase consists of K epochs over N*T//M minibatches (where M is the number of training timesteps per minibatch), so a fixed number of data points, namely K * N*T//M, is collected during each optimization
    • to keep the episode reward and the optimization statistics visually comparable, these K * N*T//M data points are distributed evenly over the N*T timesteps of the batch
  3. adding an offset for each process (see the sketch after this list)
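
A sketch of the per-process offset in point 3 (hypothetical helper names, not the exact code of the PR):

```python
# N = 4 workers, T = 256 timesteps per worker (the showcase above)
n_workers, t_per_worker = 4, 256

def plot_step(rollout_start, worker_idx, local_step):
    # x-coordinate for an episode of `worker_idx` ending at `local_step` of the rollout:
    # worker i is shifted by i * T, so the N curves sit side by side instead of
    # being superimposed on the first T timesteps
    return rollout_start + worker_idx * t_per_worker + local_step

print(plot_step(0, 0, 100))  # 100 -> first worker occupies [0, 256)
print(plot_step(0, 3, 100))  # 868 -> fourth worker occupies [768, 1024)
```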

As a result, in the showcases above:

  1. the episode_reward is correctly plotted in [0,256]
  2. the loss statistics are plotted in [0,256] as well, equally distributed

[image: tensorboard screenshots after the fix]

  3. the rewards collected by the N workers are plotted side by side

The modifications are few and straightforward. Regarding the side-by-side visualization of the rewards in the multiprocess case, do you believe that plotting the mean and variance of the collected data would instead be more appropriate?

If it is appreciated, I would open a PR with the implemented modifications, which I can update if the mean and variance solution is recommended.

@araffin
Collaborator

araffin commented Mar 15, 2020

@paolo-viceconte thanks, I'll try to take a look at what you did this week (unless @Miffyli can do it before); we have too many issues related to that function (cf. all the linked issues).

@Capitolhill

Hi,

I have been facing problems with diagnosing PPO2 training on multiple environments. In particular, the episode rewards look weird (see the image).
[image: episode reward plot]

Today I came across this issue. Is there an easy way to resolve these logging problems?

@Miffyli
Collaborator

Miffyli commented Jan 4, 2021

@Capitolhill As a quick fix, I suggest trying out stable-baselines3 which also has tensorboard support and is more actively maintained. Migration from SB2 is mostly as simple as replacing stable_baselines with stable_baselines3 in your code, and for more troublesome cases we have a migration guide here.
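
For example, the reproduction script from the top of this issue would become something like this (a sketch; note that PPO2 was renamed to PPO in SB3):

```python
import gym

from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

# same setup as before, with the new import root and PPO2 -> PPO
env = DummyVecEnv([lambda: gym.make("Pendulum-v0") for _ in range(8)])

model = PPO("MlpPolicy", env, tensorboard_log="./ppo_pendulum_tb/", verbose=1)
model.learn(total_timesteps=100000)
```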

@Capitolhill

@Capitolhill As a quick fix, I suggest trying out stable-baselines3 which also has tensorboard support and is more actively maintained. Migration from SB2 is mostly as simple as replacing stable_baselines with stable_baselines3 in your code, and for more troublesome cases we have a migration guide here.

Thanks for the response. I am assuming the tensorboard logging issue for multiple envs has been resolved in SB3.

@araffin
Collaborator

araffin commented Jan 5, 2021

Thanks for the response. I am assuming the tensorboard logging issue for multiple envs has been resolved in SB3.

For a more complete answer, you can use the "legacy" logging in SB2 (cf. the docs), but it requires the use of the Monitor wrapper.
In SB3, the Monitor wrapper approach is now the default and solves that issue ;)
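
For SB2, that looks roughly like this (a sketch; the Monitor file path and log directory are arbitrary):

```python
import gym

from stable_baselines import PPO2
from stable_baselines.bench import Monitor
from stable_baselines.common.vec_env import DummyVecEnv

# each underlying env is wrapped in Monitor so episode lengths/rewards are recorded
def make_env():
    return Monitor(gym.make("Pendulum-v0"), filename=None)

env = DummyVecEnv([make_env for _ in range(8)])
model = PPO2("MlpPolicy", env, tensorboard_log="./ppo2_tb/", verbose=1)
model.learn(total_timesteps=100000)
```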
