[RLlib] Learner API: Policies using RLModules (for sampler only) do not need loss/stats/mixins. #34445

Merged

Conversation


@sven1977 sven1977 commented Apr 15, 2023

Learner API: Policies using RLModules (for sampler only) do not need loss/stats/mixins.

  • Cleans up all Policy classes that were built specifically to be run with both _enable_learner_api and _enable_rl_module_api set to True.

  • Clears out the loss functions of those policies (no longer needed, as these policies now only run on the sampler/worker side).

  • Does the same for stats_fn and all mixin parent classes (such as LearningRateSchedule, the KL-related mixins, EntropyCoeffSchedule, TargetNetworkMixin, etc.).

  • Adds a new APPOTorchPolicyRLM for a future PR that will enable APPO on the new API stack with torch.

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

avnishn (Member) commented Apr 15, 2023

I'm all for removing the loss functions, but can we come up with tests for the compute-loss functions while deleting these?

sven1977 (Contributor Author):

Closing this for now, as we still need the loss functions in the RLMPolicies in case a user runs with the settings enable_rl_module_api=True but enable_learner_api=False.

@sven1977 sven1977 closed this Apr 25, 2023
@sven1977 sven1977 reopened this May 4, 2023
@sven1977 sven1977 assigned kouroshHakha and unassigned avnishn May 4, 2023
@kouroshHakha kouroshHakha (Contributor) left a comment

left some comments.

@@ -853,11 +853,15 @@ def validate(self) -> None:
)

# Learner API requires RLModule API.
if self._enable_learner_api and not self._enable_rl_module_api:
if not (
Contributor:

how about if self._enable_learner_api != self._enable_rl_module_api?

sven1977 (Contributor Author):

That'd probably work as well :D

sven1977 (Contributor Author):

fixed
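
(For reference, a minimal sketch of the both-or-neither check being discussed; the helper name below is hypothetical, not the exact merged diff:)

# Hypothetical sketch: enforce that the two API flags are toggled together.
def check_api_flags(enable_learner_api: bool, enable_rl_module_api: bool) -> None:
    if enable_learner_api != enable_rl_module_api:
        raise ValueError(
            "`_enable_learner_api` and `_enable_rl_module_api` must be set to "
            "the same value (both True or both False)!"
        )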


check(learner_group_loss, policy_loss)
learner_group.update(train_batch.as_multi_agent())
Contributor:

Do you think we'd still need this test? It was just checking that the RLM policy loss equals the learner loss coming from learner_group.update(); I don't know if this test is still relevant with this PR. The more important question is: how should we test learner-loss correctness for each algorithm?

Contributor:

Should we just remove all xxx_policy_rlm.py files? This boilerplate code could be part of the base classes (EagerTF2 and torch_v2)?

sven1977 (Contributor Author):

Yeah, even better. Let's get rid of it. ...

@sven1977 sven1977 added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label May 6, 2023
self.enable_connectors = True

# LR-schedule checking.
sven1977 (Contributor Author):

Added this check (due diligence). We now do learning-rate scheduling properly and centrally, inside the core Learner class.
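
(Roughly, a check like the following; this is a sketch only, assuming the [[ts, value], ...] schedule format described later in this thread, not the exact RLlib code:)

# Sketch of an lr-schedule sanity check; `lr_schedule` is assumed to be a
# list of [timestep, value] pairs (or None when a fixed `lr` is used).
def validate_lr_schedule(lr_schedule) -> None:
    if lr_schedule is None:
        return  # The fixed `lr` setting will be used instead.
    if not isinstance(lr_schedule, (list, tuple)) or not lr_schedule:
        raise ValueError("`lr_schedule` must be a non-empty list of [ts, value] pairs!")
    if lr_schedule[0][0] != 0:
        raise ValueError(
            "The first entry of `lr_schedule` must be at timestep 0 so that the "
            "initial learning rate is defined unambiguously!"
        )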

return APPOTorchPolicyWithRLModule
else:
from ray.rllib.algorithms.appo.appo_torch_policy import APPOTorchPolicy
from ray.rllib.algorithms.appo.appo_torch_policy import APPOTorchPolicy
sven1977 (Contributor Author):

No need for special policy classes anymore.

@@ -371,6 +371,27 @@ def validate(self) -> None:
# Check `entropy_coeff` for correctness.
if self.entropy_coeff < 0.0:
raise ValueError("`entropy_coeff` must be >= 0.0!")
# Entropy coeff schedule checking.
sven1977 (Contributor Author):

Same checking for entropy coeff schedules. Scheduling is now handled centrally in the ImpalaLearner base class.

Contributor:

This is super similar to the lr schedule. Should we make this snippet of code shared between any type of scheduler and parametrize it in a way that can be used in both places?

sven1977 (Contributor Author):

Done; added a new, framework-agnostic Scheduler API that handles validation, single fixed values, and (piecewise) schedules. It is now used for the lr and entropy schedules (possibly more soon, as we add more algos to the new Learner API).
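
(A rough sketch of the concept only; this is NOT the actual RLlib Scheduler class, just an illustration of a framework-agnostic, piecewise-linear scheduler with the validation rules described in this thread:)

from typing import List, Optional, Tuple

class SimpleScheduler:
    """Illustrative piecewise-linear scheduler (sketch, not RLlib's Scheduler)."""

    def __init__(
        self,
        fixed_value: Optional[float] = None,
        schedule: Optional[List[Tuple[int, float]]] = None,
    ):
        # Exactly one of `fixed_value` or `schedule` must be given.
        if (fixed_value is None) == (schedule is None):
            raise ValueError("Provide exactly one of `fixed_value` or `schedule`!")
        if schedule is not None and schedule[0][0] != 0:
            raise ValueError("The schedule must contain an entry for timestep 0!")
        self.fixed_value = fixed_value
        self.schedule = schedule

    def get(self, timestep: int) -> float:
        """Returns the (linearly interpolated) value for the given timestep."""
        if self.schedule is None:
            return self.fixed_value
        if timestep >= self.schedule[-1][0]:
            return self.schedule[-1][1]  # Past the last breakpoint: final value.
        for (t0, v0), (t1, v1) in zip(self.schedule[:-1], self.schedule[1:]):
            if t0 <= timestep < t1:
                return v0 + (timestep - t0) / (t1 - t0) * (v1 - v0)
        return self.schedule[0][1]

Hypothetical usage: SimpleScheduler(schedule=[(0, 0.01), (1_000_000, 0.0)]).get(ts) would then look identical for lr, entropy, or any other scheduled value.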

self.curr_entropy_coeffs_per_module = defaultdict(
lambda: self._get_tensor_variable(self.hps.entropy_coeff)
lambda: self._get_tensor_variable(self.hps.entropy_coeff_schedule[0][1])
sven1977 (Contributor Author):

This is a new concept for all schedules (lr, entropy, etc.): do NOT use the non-schedule value anywhere anymore, not even for the initial value. Instead, the schedule is forced (via validate()) to have a 0-timestep entry, which defines the initial value. This is much more consistent and transparent for users: either I use lr OR I use lr_schedule, but never both, and there is no value leakage between the two anymore. Proper validation code and descriptive error messages have been added.
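
(For example — a hedged sketch; the config keys lr / lr_schedule and entropy_coeff / entropy_coeff_schedule are the ones discussed in this PR, but the numeric values are purely illustrative:)

from ray.rllib.algorithms.ppo import PPOConfig

# Either a fixed value (the schedule stays None) ...
config = PPOConfig().training(lr=0.0003)

# ... or a schedule whose first entry must be at timestep 0 (it defines the
# initial value); the plain `lr` setting is then ignored. The same convention
# applies to `entropy_coeff` vs. `entropy_coeff_schedule`.
config = PPOConfig().training(lr_schedule=[[0, 0.0003], [1_000_000, 0.00005]])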

Contributor:

Where do we enforce that only one of the two can be specified (lr vs. lr_schedule)?

Please do not hard-code 0/1 indices here. For better readability:

_, init_entropy_coeff = next(iter(self.hps.entropy_coeff_schedule))

sven1977 (Contributor Author):

See my comment below. We now check this in AlgorithmConfig (where it belongs).
I can add a comment explaining the [0][1], but I have to hard-code this! The first entry must start at ts=0 (to force more explicitness as to which value we start with). Schedule configs are defined as a list of tuples of the form ([ts], [value to be reached at that ts]).

kouroshHakha (Contributor), May 8, 2023:

Yes, I agree, but you don't need to hard-code [0][1]. The check is already happening in validate() of the algorithm config, so here we can assume it has already passed. What I need now is readability, which is why

_, init_entropy_coeff = next(iter(self.hps.entropy_coeff_schedule))

is better than init_entropy_coeff = self.hps.entropy_coeff_schedule[0][1]

Contributor:

You can classify this comment as nit :)

if not enable_learner_api:
off_policy_ness = check_off_policyness(results, upper_limit=2.0)
print(f"off-policy'ness={off_policy_ness}")
# added back to the IMPALA Learner.
sven1977 (Contributor Author):

TODO: @avnishn :)

@@ -325,6 +325,27 @@ def validate(self) -> None:
# Check `entropy_coeff` for correctness.
if self.entropy_coeff < 0.0:
raise ValueError("`entropy_coeff` must be >= 0.0")
# Entropy coeff schedule checking.
sven1977 (Contributor Author):

see IMPALA above

Contributor:

See, we should avoid duplicate code. This should just be a template for all schedulers; we will parametrize it such that it can be rendered for the entropy scheduler vs. the lr scheduler vs. any other ("foo") scheduler.

sven1977 (Contributor Author):

done, added new Scheduler API for all these instances

@@ -25,7 +25,7 @@


class PPOTfLearner(PPOLearner, TfLearner):
"""Implements tf-specific PPO loss logic on top of PPOLearner.
"""Implements curr_entropy_coeffs_per_moduletf-specific PPO loss logic on top of PPOLearner.
sven1977 (Contributor Author):

typo, will fix.

sven1977 (Contributor Author):

done

@@ -179,6 +179,7 @@ def test_rl_module_api(self):
.framework("torch")
.rollouts(enable_connectors=True)
.rl_module(_enable_rl_module_api=True)
.training(_enable_learner_api=True)
sven1977 (Contributor Author):

Had to fix a bunch of test cases that were still using learner=False, but RLModule=True. This is not possible anymore.

@kouroshHakha kouroshHakha (Contributor) left a comment

Dude, this is amazing. Best PR of the week. A couple of comments and it's good to go.

self.enable_connectors = True

# LR-schedule checking.
# For the new Learner API stack, any schedule must start at ts 0 to avoid
# ambiguity (user might think that the `lr` setting plays a role as well
Contributor:

The sentence in parentheses doesn't make sense? What is it that the user might think?

Contributor:

what will happen to the regular lr that I pass in? will that get ignored?

sven1977 (Contributor Author):

New behavior:

  • either lr is used (lr_schedule is None).
  • or lr_schedule is used (not None), in which case lr is ignored.

Old behavior:
example: lr=100 lr_schedule=[[1M, 0.001], [2M, 0.0003]]
-> for the first 1M steps, we use lr=100, then 0.001, linearly decreasing to 0.0003 (at ts=2M)

I find this behavior very confusing b/c it then depends on the timesteps I define in my schedule (if I define ts=0 in my schedule, then lr is never used b/c the schedule kicks in right away, but if I start my schedule definition at some ts > 0, then lr is used for the first n timesteps).

We could be even more strict and force users to set one of the two settings to None.
Only allowed: lr=None and lr_schedule is not None, OR lr is not None and lr_schedule=None.


return ImpalaTF1Policy
else:
from ray.rllib.algorithms.a3c.a3c_tf_policy import A3CTFPolicy
return A3CTFPolicy
Contributor:

If we remove A3C from the algorithms, what would happen to IMPALA?

sven1977 (Contributor Author):

Good point. We can just copy it into a new Impala[Tf|Torch]Policy class.


@@ -104,7 +109,9 @@ def __init__(
)
ValueNetworkMixin.__init__(self, config)
KLCoeffMixin.__init__(self, config)
Contributor:

Why is the KLCoeffMixin construction not guarded by not enable_learner_api (like the other mixins)?

sven1977 (Contributor Author):

Long story. I left it in b/c this mixin alters the get_state/set_state logic, so I would have to hack those methods as well. In short: taking this out would make everything even more complicated than it already is. We should just put notes everywhere explaining that these are temporary solutions and that soon, Policy will no longer be used in the new stack at all, neither for learning nor for sampling.


self.curr_entropy_coeffs_per_module = defaultdict(
lambda: self._get_tensor_variable(self.hps.entropy_coeff)
lambda: self._get_tensor_variable(self.hps.entropy_coeff_schedule[0][1])
Contributor:

See my comment above for readability.

EntropyCoeffSchedule.__init__(
self, config["entropy_coeff"], config["entropy_coeff_schedule"]
)
LearningRateSchedule.__init__(self, config["lr"], config["lr_schedule"])
KLCoeffMixin.__init__(self, config)
Contributor:

Shouldn't we put all coeff mixins in the if condition?

sven1977 (Contributor Author):

Explained above: some mixins add extra logic to certain base methods, e.g. get_state, which accesses variables constructed in the c'tor. So if we skip the c'tor, we'd also have to skip these method calls, hack them, OR take out the mixin altogether, which would require a new class. Either way, this seemed to be the least painful, albeit temporary, solution for this problem.

@@ -126,10 +126,13 @@ def test_ppo_compilation_and_schedule_mixins(self):
config.training(model=get_model_config(fw, lstm=lstm))

algo = config.build(env=env)
learner = algo.learner_group._learner
Contributor:

Can we use some public API for this?

Contributor:

I just noticed the same pattern is used for optim. I think this is bad as it relies on the internals of learner_group.

Contributor:

We can add a TODO for now.

sven1977 (Contributor Author):

I'm not sure whether this is a good idea. What if the LearnerGroup doesn't have a local Learner? Here in this test, we know for sure that we are dealing with a local Learner inside the LearnerGroup. Doing this here is just for a very specific inspection job, nothing else. I don't really want users to poke around the internals like this.

Alternatively, we need more discussion on whether we would like to add utilities like foreach_learner, etc.. as we already have this for WorkerSet on the sampling side.

Adding a TODO.

sven1977 (Contributor Author):

done

Contributor:

Yeah. The only downside I see is that this test, written this way, has to be updated if we change the internal implementation of LearnerGroup. For example, let's say that tomorrow, instead of a _learner attribute, I use a _local_learner attribute. This test will then break, while the testing strategy is still valid. Idk, sometimes I think we should never write tests against private behaviors of modules; we should test only the public behaviors.

@@ -274,6 +274,7 @@ def policy_mapping_fn(agent_id, episode, worker, **kwargs):
# Train the "main" policy to play really well using self-play.
results = None
if not args.from_checkpoint:
create_checkpoints = not bool(os.environ.get("RLLIB_ENABLE_RL_MODULE", False))
Contributor:

why should this not create checkpoints when RLModule is enabled? Does this test break? I think we should add a TODO at the very least. :)

sven1977 (Contributor Author):

Checkpoints don't work yet with this config. Not sure why. I guess this is @avnishn's domain. I'm getting a NotImplementedError on some method that's supposed to return a path.

Contributor:

Yeah, can we make sure we add a TODO (Avnish, Sven)? I just don't want to lose test coverage without noticing.

@ollie-iterators:

File "/ray/python/ray/rllib/core/learner/torch/torch_learner.py", line 79, in configure_optimizer_per_module

  | lr = self.curr_lr_per_module[module_id]
AttributeError: 'PPOTorchLearner' object has no attribute 'curr_lr_per_module'

@@ -805,7 +805,7 @@ py_test(
py_test(
name = "test_algorithm_config",
tags = ["team:rllib", "algorithms_dir", "algorithms_dir_generic"],
size = "small",
size = "medium",
sven1977 (Contributor Author):

Seems to time out sometimes when CI is very busy.

@@ -0,0 +1,157 @@
from collections import defaultdict
sven1977 (Contributor Author):

New API to use for anything that can be scheduled and needs to track an in-graph (torch|tf) variable based on the current timestep.
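
(A hedged usage sketch of the "track an in-graph variable" idea; the names below are illustrative stand-ins, not the real Learner internals:)

import torch

# A plain-Python schedule computes the current value, and a torch tensor is
# updated in place so that a (possibly traced/compiled) loss always reads the
# latest value. A toy linear decay is used as a stand-in for the real schedule.
def entropy_coeff_at(ts: int) -> float:
    return max(0.0, 0.01 * (1.0 - ts / 1_000_000))

entropy_coeff_var = torch.tensor(entropy_coeff_at(0))

for ts in (0, 250_000, 500_000, 1_000_000):
    entropy_coeff_var.fill_(entropy_coeff_at(ts))  # in-place update of the in-graph variable
    print(ts, float(entropy_coeff_var))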

@sven1977 sven1977 merged commit 78b58a9 into ray-project:master May 8, 2023
1 check passed
architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023