[RLlib] APPO+new-stack (Atari benchmark) - Preparatory PR 01 #34743
Conversation
Signed-off-by: sven1977 <[email protected]>
@@ -171,6 +171,16 @@ py_test(
    args = ["--dir=tuned_examples/appo"]
)

py_test(
Added this learning test for the new stack.
@@ -13,19 +13,6 @@
torch, nn = try_import_torch()

def get_ppo_loss(fwd_in, fwd_out):
Not used anywhere in the code anymore: Removed! :)
@@ -522,14 +522,14 @@ def compile_results(

# We put the stats for all modules under the ALL_MODULES key. e.g. average of
# the gradients across all modules will go here.
mean_grads = [
    np.mean(grad)
mean_abs_grads = [
MEAN(ABS(grads)) is a much better metric than MEAN(grads), as the latter is always very close to zero regardless of the nature of the gradients (positive and negative components cancel each other out).
What we should probably do here is compute gradient norms. We can stick with this for now, as it is similar, but in the future the norm is probably the way to go.
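A small numeric sketch of why this matters (toy gradient arrays, not RLlib code): the plain mean cancels out, while the mean absolute value and the L2 norm both reflect gradient magnitude.

```python
import numpy as np

# Toy gradients: large magnitudes, but roughly symmetric around zero.
grads = [np.array([5.0, -5.0, 4.0, -4.0]), np.array([3.0, -3.0])]

mean_grads = [float(np.mean(g)) for g in grads]              # ~0 regardless of scale
mean_abs_grads = [float(np.mean(np.abs(g))) for g in grads]  # reflects magnitude
grad_norms = [float(np.linalg.norm(g)) for g in grads]       # L2-norm alternative
```

Here `mean_grads` is `[0.0, 0.0]` for both arrays even though the gradients are large, while `mean_abs_grads` gives `[4.5, 3.0]`.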
@@ -864,8 +864,6 @@ def _compute_actions_helper(
    dist_inputs = None

elif is_overridden(self.action_sampler_fn):
    dist_inputs = None
These vars were not used. Not sure how this survived our linter for so long.
for target in self.target_models.values():
    target.load_state_dict(model_state_dict)
if self.config.get("_enable_rl_module_api", False):
Preparation for the upcoming move of APPO (torch) to the new API stack.
@@ -1,31 +0,0 @@
cartpole-appo-learner:
Gave this file a better, more descriptive/accurate name.
@@ -232,8 +232,8 @@ class _ActorState:
def __init__(
    self,
    actors: Optional[List[ActorHandle]] = None,
    max_remote_requests_in_flight_per_actor: Optional[int] = 2,
These are not Optional. They must be ints. Yes, they have default (int) values, but they cannot be None.
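A minimal sketch of the typing point being made (hypothetical function names, not the actual `_ActorState` signature): a parameter with a default value is not automatically `Optional`; `Optional[int]` specifically claims that `None` is a valid value.

```python
from typing import Optional

# Misleading annotation: Optional[int] says None is a legal value,
# which would crash any downstream arithmetic on this parameter.
def poll_actors_bad(max_in_flight: Optional[int] = 2) -> int:
    return max_in_flight

# Correct annotation: "defaults to 2, but must always be an int".
def poll_actors_good(max_in_flight: int = 2) -> int:
    return max_in_flight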
Some questions; thanks for this cleanup.
rllib/core/learner/learner.py

Returns:
    The constructed module.
    The constructed MultiAgentRLModule.
Nit (not necessary to address): "A constructed MultiAgentRLModule"
rllib/core/learner/scaling_config.py

@@ -13,7 +14,8 @@ class LearnerGroupScalingConfig:
    training will run on a single CPU.
num_gpus_per_worker: The number of GPUs to allocate per worker. If
    num_workers=0, any number greater than 0 will run the training on a single
    GPU. A value of zero will run the training on a single CPU.
    GPU. A value of zero will run the training on a single CPU. Fractional
Fractional values aren't really supported. Fractional GPU usage will cause CUDA async errors, AFAICT.
Also, a value of zero will run the training on num_cpus_per_worker CPUs, not a single CPU.
Thanks, will fix. ...
done
Actually, there is one test where we do set num_gpus_per_worker to 0.5, in test_learner_group.py:

LOCAL_SCALING_CONFIGS = {
    "local-cpu": LearnerGroupScalingConfig(num_workers=0, num_gpus_per_worker=0),
    "local-gpu": LearnerGroupScalingConfig(num_workers=0, num_gpus_per_worker=0.5),
}
I think that's why I fixed the typehint, but yeah, it makes sense that torch/tf DDP wouldn't like it. I fixed the comment and said it's not allowed due to xyz.
vtrace: True
use_kl_loss: False
full_action_space: false
repeat_action_probability: 0.0  # deterministic
Hmm, what is this? What is the repeat action probability by default, and how much does it matter?
The repeat action prob by default is 0.25, meaning that in 25% of all env.step(action) calls, the env doesn't actually apply the given action but repeats the previous one. This means that, by default, Atari envs are stochastic, but our benchmarks have always used the deterministic setting.
See here for more details: https://gymnasium.farama.org/environments/atari/#version-history-and-naming-schemes
We used to run against "PongNoFrameskip-v4", where the v4 indicates repeat_action_probability=0.0 and the NoFrameskip means frameskip=1 (no frame skipping).
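The "sticky actions" behavior described above can be sketched like this (an illustrative toy, not ALE's actual implementation):

```python
import random

def sticky_action(requested, previous, repeat_action_probability=0.25, rng=random):
    # With probability p, the env ignores the requested action and
    # repeats the previous one, making transitions stochastic.
    # p=0.0 (the setting used in these benchmarks) is fully deterministic.
    if rng.random() < repeat_action_probability:
        return previous
    return requested
```

With repeat_action_probability=0.0 the requested action is always applied; with 1.0 the previous action is always repeated.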
full_action_space: false
repeat_action_probability: 0.0  # deterministic
vtrace: true
use_kl_loss: false
do we need to disable the kl loss? Is this something that we were doing in the old stack?
I noticed in the new benchmarks that the KL loss might destabilize the run, but this observation was made while several other issues had not yet been fixed, so it could have been a red herring. Either way, we don't currently use the KL loss in our old-stack benchmarks, so I would like to keep it switched off until we have fully investigated a possible benefit of using this term.
env_config:
  frameskip: 1  # no frameskip
  frameskip: 1
  full_action_space: false
what is the full action space? Does disabling it blow up the action space significantly?
See here: https://gymnasium.farama.org/environments/atari/#action-space
The default is False anyway, so there is no real need to set this here, but I added it for explicitness.
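For reference, the sizes involved (numbers from the ALE/Gymnasium docs linked above; the Pong action list is illustrative):

```python
# Every Atari game shares the same 18-action "full" space; with
# full_action_space=False (the default), each game exposes only its
# minimal legal action set, e.g. 6 actions for Pong.
FULL_ACTION_SPACE_SIZE = 18
PONG_MINIMAL_ACTIONS = ["NOOP", "FIRE", "RIGHT", "LEFT", "RIGHTFIRE", "LEFTFIRE"]
```

So enabling full_action_space would triple Pong's action space, mostly with no-op-equivalent actions, which slows exploration for no benefit.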
num_gpus_per_learner_worker: 0
num_cpus_per_learner_worker: 1

_enable_rl_module_api: true
super nit: can you put _enable_xxx flags close to each other?
done
# Run with Learner API.
_enable_learner_api: true
grad_clip_by_global_norm: 10.0
# Use a single Learner worker on the GPU.
the comment doesn't match the code.
done
APPO+new-stack (Atari benchmark) - Preparatory PR 01
This is a breakdown PR (1st of N) from this one here, which is too large to merge: #34363
Why are these changes needed?
Related issue number
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.