[RLlib] Learner API: Policies using RLModules (for sampler only) do not need loss/stats/mixins. #34445
Conversation
Signed-off-by: sven1977 <[email protected]>
I'm all for removing the loss functions, but can we come up with tests for the compute-loss functions while deleting these?
Closing this for now, as we do still need the loss functions in the RLM policies in case a user runs with the settings: enable_rl_module_api=True, but enable_learner_api=False.
Signed-off-by: sven1977 <[email protected]>
Left some comments.
rllib/algorithms/algorithm_config.py
Outdated
@@ -853,11 +853,15 @@ def validate(self) -> None:
        )

        # Learner API requires RLModule API.
        if self._enable_learner_api and not self._enable_rl_module_api:
        if not (
How about if self._enable_learner_api != self._enable_rl_module_api?
That'd probably work as well :D
fixed
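The check the reviewer suggests can be sketched as follows. The class and error message here are illustrative stand-ins (only the two flag names come from the diff), not RLlib's actual implementation:

```python
class AlgorithmConfigSketch:
    """Minimal stand-in for AlgorithmConfig, just enough to show the check."""

    def __init__(self, enable_learner_api: bool, enable_rl_module_api: bool):
        self._enable_learner_api = enable_learner_api
        self._enable_rl_module_api = enable_rl_module_api

    def validate(self) -> None:
        # The two flags must be toggled together: the Learner API requires
        # RLModules, and RLModule-without-Learner setups are being phased out.
        if self._enable_learner_api != self._enable_rl_module_api:
            raise ValueError(
                "`_enable_learner_api` and `_enable_rl_module_api` must both "
                "be True or both be False!"
            )
```

With this, `AlgorithmConfigSketch(True, True).validate()` and `AlgorithmConfigSketch(False, False).validate()` pass, while mixed settings raise.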
check(learner_group_loss, policy_loss)
learner_group.update(train_batch.as_multi_agent())
Do you think we'd still need this test? It was just checking that the RLM policy loss == the learner loss coming from learner_group.update(). I don't know if this test is relevant anymore with this PR. The more important question: how should we test learner-loss correctness for each algorithm?
Should we just remove all xxx_policy_rlm.py files? This boilerplate code could be part of the base classes (EagerTF2 and torch_v2)?
Yeah, even better. Let's get rid of it. ...
Signed-off-by: sven1977 <[email protected]>
self.enable_connectors = True

# LR-schedule checking.
Added this check (due diligence). We now do learning-rate scheduling properly and centrally, inside the core Learner class.
return APPOTorchPolicyWithRLModule
else:
    from ray.rllib.algorithms.appo.appo_torch_policy import APPOTorchPolicy
No need for special policy classes anymore
@@ -371,6 +371,27 @@ def validate(self) -> None:
    # Check `entropy_coeff` for correctness.
    if self.entropy_coeff < 0.0:
        raise ValueError("`entropy_coeff` must be >= 0.0!")
    # Entropy coeff schedule checking.
Same checking for entropy coeff schedules. Scheduling is now handled centrally in the ImpalaLearner base class.
This is super similar to the lr schedule. Should we make this snippet of code shared between any type of scheduler and parametrize it in a way that can be used in both places?
Done, added a new, framework-agnostic Scheduler API that handles validation, single fixed values, and (piecewise) schedules. It is now used for lr and entropy schedules (possibly more soon, as we add more algos to the new Learner API).
self.curr_entropy_coeffs_per_module = defaultdict(
    lambda: self._get_tensor_variable(self.hps.entropy_coeff)
    lambda: self._get_tensor_variable(self.hps.entropy_coeff_schedule[0][1])
This is a new concept for all schedules (lr, entropy, etc.): Do NOT use the non-schedule value anymore anywhere, not even for the initial value. Instead, the schedule is forced (via validate()) to have a 0-timestep entry, which defines the initial value. This is much more consistent and transparent for users: either I use lr OR I use lr_schedule, but never both, and there is no value leakage between the two anymore. Proper validation code and descriptive error messages have been added.
Where do we enforce that only one of lr vs. lr_schedule can be specified?
Also, please do not hard-code 0/1 indices here. For better readability:
_, init_entropy_coeff = next(iter(self.hps.entropy_coeff_schedule))
See my comment below. We check this now in AlgorithmConfig (where it belongs).
I can add a comment explaining the [0][1], but I have to hard-code this! The first entry must start from ts=0 (to force more explicitness as to which value we start with). The definition of schedule configs is: a list of tuples of the form ([ts], [value to be reached at that ts]).
Yes, I agree, but you don't need to hard-code [0][1]. The check already happens in validate() of AlgorithmConfig, so here we can assume it has passed. What I want is readability, which is why
_, init_entropy_coeff = next(iter(self.hps.entropy_coeff_schedule))
is better than init_entropy_coeff = self.hps.entropy_coeff_schedule[0][1].
You can classify this comment as nit :)
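For illustration, the two spellings being debated, on an assumed example schedule (the variable name and values here are made up):

```python
entropy_coeff_schedule = [(0, 0.01), (1_000_000, 0.0)]

# Hard-coded indices (what the diff does); works, but opaque:
init_a = entropy_coeff_schedule[0][1]

# Reviewer's suggested unpacking of the first (timestep, value) pair:
_, init_b = next(iter(entropy_coeff_schedule))

assert init_a == init_b == 0.01
```

Both rely on validate() having already guaranteed a timestep-0 first entry; the second just makes the "(timestep, value)" shape visible at the use site.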
if not enable_learner_api:
    off_policy_ness = check_off_policyness(results, upper_limit=2.0)
    print(f"off-policy'ness={off_policy_ness}")
# added back to the IMPALA Learner.
TODO: @avnishn :)
@@ -325,6 +325,27 @@ def validate(self) -> None:
    # Check `entropy_coeff` for correctness.
    if self.entropy_coeff < 0.0:
        raise ValueError("`entropy_coeff` must be >= 0.0")
    # Entropy coeff schedule checking.
see IMPALA above
See, we should avoid duplicate code. This should just be a template for all schedulers. We will parametrize it such that it can be rendered for an entropy scheduler vs. an lr scheduler vs. a foo scheduler.
Done, added the new Scheduler API for all these instances.
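A minimal sketch of what such a framework-agnostic Scheduler could look like. All names here are hypothetical (the actual RLlib Scheduler API may differ); the sketch only captures the rules discussed above: a single fixed value OR a piecewise schedule, with a mandatory timestep-0 entry:

```python
from typing import List, Optional, Tuple


class Scheduler:
    """Sketch of a framework-agnostic scheduler (hypothetical names).

    Accepts either a single fixed value OR a piecewise-linear schedule of
    (timestep, value) pairs whose first entry must be at timestep 0.
    """

    def __init__(
        self,
        fixed_value: Optional[float] = None,
        schedule: Optional[List[Tuple[int, float]]] = None,
    ):
        # Exactly one of the two settings must be given (no value leakage).
        if (fixed_value is None) == (schedule is None):
            raise ValueError("Provide either `fixed_value` OR `schedule`, not both!")
        # The schedule must define its own initial value at ts=0.
        if schedule is not None and schedule[0][0] != 0:
            raise ValueError("`schedule` must start with a timestep-0 entry!")
        self.fixed_value = fixed_value
        self.schedule = schedule

    def get(self, ts: int) -> float:
        """Returns the scheduled value at global timestep `ts`."""
        if self.schedule is None:
            return self.fixed_value
        for (t0, v0), (t1, v1) in zip(self.schedule, self.schedule[1:]):
            if ts < t1:
                # Linear interpolation between the surrounding entries.
                return v0 + (ts - t0) / (t1 - t0) * (v1 - v0)
        # Past the last entry: hold the final value.
        return self.schedule[-1][1]
```

For example, Scheduler(schedule=[(0, 0.01), (100_000, 0.0)]).get(50_000) returns 0.005, halfway between the two entries.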
@@ -25,7 +25,7 @@

class PPOTfLearner(PPOLearner, TfLearner):
    """Implements tf-specific PPO loss logic on top of PPOLearner.
    """Implements curr_entropy_coeffs_per_moduletf-specific PPO loss logic on top of PPOLearner.
typo, will fix.
done
@@ -179,6 +179,7 @@ def test_rl_module_api(self):
    .framework("torch")
    .rollouts(enable_connectors=True)
    .rl_module(_enable_rl_module_api=True)
    .training(_enable_learner_api=True)
Had to fix a bunch of test cases that were still using learner=False, but RLModule=True. This is not possible anymore.
Signed-off-by: sven1977 <[email protected]>
Dude, this is amazing. Best PR of the week. A couple of comments and it's good to go.
rllib/algorithms/algorithm_config.py
Outdated
self.enable_connectors = True

# LR-schedule checking.
# For the new Learner API stack, any schedule must start at ts 0 to avoid
# ambiguity (user might think that the `lr` setting plays a role as well
The sentence in parentheses doesn't make sense. What is it that the user might think?
What will happen to the regular lr that I pass in? Will that get ignored?
New behavior:
- Either lr is used (lr_schedule is None),
- or lr_schedule is used (not None), in which case lr is ignored.

Old behavior, example: lr=100, lr_schedule=[[1M, 0.001], [2M, 0.0003]]
-> for the first 1M steps, we use lr=100, then 0.001, linearly decreasing to 0.0003 (at ts=2M).

I find this behavior very confusing b/c it depends on the timesteps I define in my schedule (if I define ts=0 in my schedule, then lr is never used b/c the schedule kicks in right away, but if I start my schedule definition at some ts > 0, then lr is used for the first n timesteps).

We could be even more strict and force users to set either one of the settings to None. Only allowed: lr=None with lr_schedule not None, OR lr not None with lr_schedule=None.
return ImpalaTF1Policy
else:
    from ray.rllib.algorithms.a3c.a3c_tf_policy import A3CTFPolicy
    return A3CTFPolicy
If we remove A3C from the algorithms, what would happen to IMPALA?
Good point. We can just copy it into a new Impala[Tf|Torch]Policy class.
@@ -104,7 +109,9 @@ def __init__(
    )
    ValueNetworkMixin.__init__(self, config)
    KLCoeffMixin.__init__(self, config)
Why is the construction of KLCoeffMixin not guarded by not enable_learner_api?
Long story. I left it in b/c this mixin alters the get_state/set_state logic, so I would have to hack these methods as well. In short: taking this out would make everything even more complicated than it already is. We should just put Notes everywhere explaining that these are temporary solutions and that soon, Policy will no longer be used in the new stack, neither for learning nor for sampling.
rllib/algorithms/ppo/ppo_learner.py
Outdated
self.curr_entropy_coeffs_per_module = defaultdict(
    lambda: self._get_tensor_variable(self.hps.entropy_coeff)
    lambda: self._get_tensor_variable(self.hps.entropy_coeff_schedule[0][1])
See my comment above for readability.
EntropyCoeffSchedule.__init__(
    self, config["entropy_coeff"], config["entropy_coeff_schedule"]
)
LearningRateSchedule.__init__(self, config["lr"], config["lr_schedule"])
KLCoeffMixin.__init__(self, config)
Shouldn't we put all coeff mixins in the if condition?
Explained above: Some mixins add extra logic to base methods, e.g. get_state, which access variables constructed in the c'tor. So if we skip the c'tor, we'd also have to skip these method calls, hack them, or take out the mixin altogether, which would require a new class. Either way, this seemed the least painful, albeit temporary, solution for this problem.
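A toy version of the coupling being described (these are stand-in classes, not RLlib's actual KLCoeffMixin or Policy): the mixin both sets state in its constructor AND extends get_state(), so skipping the constructor breaks the method too.

```python
class KLCoeffMixin:
    """Toy mixin: couples constructor state to get_state() (illustrative)."""

    def __init__(self, config):
        self.kl_coeff = config["kl_coeff"]

    def get_state(self):
        state = super().get_state()
        # Depends on self.kl_coeff existing, i.e. on __init__ having run.
        state["kl_coeff"] = self.kl_coeff
        return state


class PolicyBase:
    def get_state(self):
        return {"weights": "..."}


class MyPolicy(KLCoeffMixin, PolicyBase):
    def __init__(self, config):
        # Skipping this call would make get_state() raise AttributeError,
        # which is why the constructor call stays in for now.
        KLCoeffMixin.__init__(self, config)


state = MyPolicy({"kl_coeff": 0.2}).get_state()
```

Guarding only the constructor behind a flag while leaving get_state() inherited is exactly the "hack these methods as well" problem mentioned above.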
@@ -126,10 +126,13 @@ def test_ppo_compilation_and_schedule_mixins(self):
    config.training(model=get_model_config(fw, lstm=lstm))

    algo = config.build(env=env)
    learner = algo.learner_group._learner
Can we use some public api for this?
I just noticed the same pattern is used for optim. I think this is bad as it relies on the internals of learner_group.
We can add a TODO for now.
I'm not sure whether this is a good idea. What if the LearnerGroup doesn't have a local Learner? Here in this test, we know for sure that we are dealing with a local Learner inside the LearnerGroup. Doing this here is just for a very specific inspection job, nothing else. I don't really want users to poke around the internals like this.
Alternatively, we need more discussion on whether we'd like to add utilities like foreach_learner, etc., as we already have for WorkerSet on the sampling side.
Adding a TODO.
done
Yeah. The only downside I see is that a test written this way has to be upgraded whenever we change the internal implementation of LearnerGroup. For example, say that tomorrow I rename the _learner attribute to _local_learner: this test would then break, even though the testing strategy is still valid. Sometimes I think we should never write tests against the private behaviors of modules, only against the public behaviors.
@@ -274,6 +274,7 @@ def policy_mapping_fn(agent_id, episode, worker, **kwargs):
    # Train the "main" policy to play really well using self-play.
    results = None
    if not args.from_checkpoint:
        create_checkpoints = not bool(os.environ.get("RLLIB_ENABLE_RL_MODULE", False))
Why should this not create checkpoints when RLModule is enabled? Does this test break? I think we should add a TODO at the very least. :)
Checkpoints don't work yet with this config. Not sure why. I guess this is @avnishn's domain. I'm getting a NotImplementedError on some method that's supposed to return a path.
Yeah, can we make sure we add a TODO (Avnish, Sven)? I just don't want us to silently lose test coverage.
Signed-off-by: sven1977 <[email protected]>
File "/ray/python/ray/rllib/core/learner/torch/torch_learner.py", line 79, in configure_optimizer_per_module
    lr = self.curr_lr_per_module[module_id]
Signed-off-by: sven1977 <[email protected]>
@@ -805,7 +805,7 @@ py_test(
py_test(
    name = "test_algorithm_config",
    tags = ["team:rllib", "algorithms_dir", "algorithms_dir_generic"],
    size = "small",
    size = "medium",
Seems to timeout sometimes, when CI is very busy.
@@ -0,0 +1,157 @@
from collections import defaultdict
New API to use for anything that can be scheduled and needs to track an in-graph (torch|tf) variable based on the current timestep.
Learner API: Policies using RLModules (for sampler only) do not need loss/stats/mixins.
Cleans up all Policy classes that were built particularly for being run with _enable_learner_api + _enable_rl_module_api both set to True.
Clear out the loss functions of those policies (no longer needed as these Policies are only run on the sampler/worker side).
Same for stats_fn and all mixin parent classes (such as LearningRateSchedule, KL-stuff, EntropyCoeffSchedule, TargetNetworkMixin, etc.).
Added a new APPOTorchPolicyRLM for future PR that will enable APPO on the new API stack + torch.
Why are these changes needed?
Related issue number
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.