[RLlib] Learner API: Policies using RLModules (for sampler only) do not need loss/stats/mixins. #34445
Conversation
Signed-off-by: sven1977 <[email protected]>
I'm all for removing the loss functions, but can we come up with tests for the compute-loss functions while deleting these?
Closing this for now, as we do still need the loss functions in the RLM policies in case a user runs with the settings: enable_rl_module_api=True, but enable_learner_api=False.
Signed-off-by: sven1977 <[email protected]>
Left some comments.
rllib/algorithms/algorithm_config.py
Outdated
@@ -853,11 +853,15 @@ def validate(self) -> None:
        )

        # Learner API requires RLModule API.
        if self._enable_learner_api and not self._enable_rl_module_api:
        if not (
How about if self._enable_learner_api != self._enable_rl_module_api?
That'd probably work as well :D
fixed
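The check the reviewer suggests can be sketched as follows. The class and error message here are illustrative stand-ins (only the two flag names come from the diff), not RLlib's actual implementation:

```python
class AlgorithmConfigSketch:
    """Minimal stand-in for AlgorithmConfig, just enough to show the check."""

    def __init__(self, enable_learner_api: bool, enable_rl_module_api: bool):
        self._enable_learner_api = enable_learner_api
        self._enable_rl_module_api = enable_rl_module_api

    def validate(self) -> None:
        # The two flags must be toggled together: the Learner API requires
        # RLModules, and RLModule-without-Learner setups are being phased out.
        if self._enable_learner_api != self._enable_rl_module_api:
            raise ValueError(
                "`_enable_learner_api` and `_enable_rl_module_api` must both "
                "be True or both be False!"
            )
```

With this, `AlgorithmConfigSketch(True, True).validate()` and `AlgorithmConfigSketch(False, False).validate()` pass, while mixed settings raise.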
check(learner_group_loss, policy_loss)
learner_group.update(train_batch.as_multi_agent())
Do you think we'd still need this test? It was just checking that the RLM policy loss == the learner loss coming from learner_group.update(). I don't know if this test is relevant anymore with this PR. The more important question: how should we test learner-loss correctness for each algorithm?
Should we just remove all xxx_policy_rlm.py files? This boilerplate code could be part of the base classes (EagerTF2 and torch_v2)?
Yeah, even better. Let's get rid of it. ...
Signed-off-by: sven1977 <[email protected]>
self.enable_connectors = True

# LR-schedule checking.
Added this check (due diligence). We now do learning-rate scheduling properly and centrally, inside the core Learner class.
return APPOTorchPolicyWithRLModule
else:
    from ray.rllib.algorithms.appo.appo_torch_policy import APPOTorchPolicy
No need for special policy classes anymore
@@ -371,6 +371,27 @@ def validate(self) -> None:
    # Check `entropy_coeff` for correctness.
    if self.entropy_coeff < 0.0:
        raise ValueError("`entropy_coeff` must be >= 0.0!")
    # Entropy coeff schedule checking.
Same checking for entropy coeff schedules. Scheduling is now handled centrally in the ImpalaLearner base class.
This is super similar to the lr schedule. Should we make this snippet of code shared between any type of scheduler and parametrize it in a way that can be used in both places?
Done, added a new, framework-agnostic Scheduler API that handles validation, single fixed values, and (piecewise) schedules. It is now used for lr and entropy schedules (possibly more soon, as we add more algos to the new Learner API).
self.curr_entropy_coeffs_per_module = defaultdict(
    lambda: self._get_tensor_variable(self.hps.entropy_coeff)
    lambda: self._get_tensor_variable(self.hps.entropy_coeff_schedule[0][1])
This is a new concept for all schedules (lr, entropy, etc.): Do NOT use the non-schedule value anymore anywhere, not even for the initial value. Instead, the schedule is forced (via validate()) to have a 0-timestep entry, which defines the initial value. This is much more consistent and transparent for users: either I use lr OR I use lr_schedule, but never both, and there is no value leakage between the two anymore. Proper validation code and descriptive error messages have been added.
Where do we enforce that only one of lr vs. lr_schedule can be specified?
Also, please do not hard-code 0/1 indices here. For better readability:
_, init_entropy_coeff = next(iter(self.hps.entropy_coeff_schedule))
See my comment below. We check this now in AlgorithmConfig (where it belongs).
I can add a comment explaining the [0][1], but I have to hard-code this! The first entry must start from ts=0 (to force more explicitness as to which value we start with). The definition of schedule configs is: a list of tuples of the form ([ts], [value to be reached at that ts]).
Yes, I agree, but you don't need to hard-code [0][1]. The check already happens in validate() of AlgorithmConfig, so here we can assume it has passed. What I want is readability, which is why
_, init_entropy_coeff = next(iter(self.hps.entropy_coeff_schedule))
is better than init_entropy_coeff = self.hps.entropy_coeff_schedule[0][1].
You can classify this comment as nit :)
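For illustration, the two spellings being debated, on an assumed example schedule (the variable name and values here are made up):

```python
entropy_coeff_schedule = [(0, 0.01), (1_000_000, 0.0)]

# Hard-coded indices (what the diff does); works, but opaque:
init_a = entropy_coeff_schedule[0][1]

# Reviewer's suggested unpacking of the first (timestep, value) pair:
_, init_b = next(iter(entropy_coeff_schedule))

assert init_a == init_b == 0.01
```

Both rely on validate() having already guaranteed a timestep-0 first entry; the second just makes the "(timestep, value)" shape visible at the use site.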
if not enable_learner_api:
    off_policy_ness = check_off_policyness(results, upper_limit=2.0)
    print(f"off-policy'ness={off_policy_ness}")
# added back to the IMPALA Learner.
TODO: @avnishn :)
@@ -325,6 +325,27 @@ def validate(self) -> None:
    # Check `entropy_coeff` for correctness.
    if self.entropy_coeff < 0.0:
        raise ValueError("`entropy_coeff` must be >= 0.0")
    # Entropy coeff schedule checking.
see IMPALA above
See, we should avoid duplicate code. This should just be a template for all schedulers. We will parametrize it such that it can be rendered for an entropy scheduler vs. an lr scheduler vs. a foo scheduler.
Done, added the new Scheduler API for all these instances.
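A minimal sketch of what such a framework-agnostic Scheduler could look like. All names here are hypothetical (the actual RLlib Scheduler API may differ); the sketch only captures the rules discussed above: a single fixed value OR a piecewise schedule, with a mandatory timestep-0 entry:

```python
from typing import List, Optional, Tuple


class Scheduler:
    """Sketch of a framework-agnostic scheduler (hypothetical names).

    Accepts either a single fixed value OR a piecewise-linear schedule of
    (timestep, value) pairs whose first entry must be at timestep 0.
    """

    def __init__(
        self,
        fixed_value: Optional[float] = None,
        schedule: Optional[List[Tuple[int, float]]] = None,
    ):
        # Exactly one of the two settings must be given (no value leakage).
        if (fixed_value is None) == (schedule is None):
            raise ValueError("Provide either `fixed_value` OR `schedule`, not both!")
        # The schedule must define its own initial value at ts=0.
        if schedule is not None and schedule[0][0] != 0:
            raise ValueError("`schedule` must start with a timestep-0 entry!")
        self.fixed_value = fixed_value
        self.schedule = schedule

    def get(self, ts: int) -> float:
        """Returns the scheduled value at global timestep `ts`."""
        if self.schedule is None:
            return self.fixed_value
        for (t0, v0), (t1, v1) in zip(self.schedule, self.schedule[1:]):
            if ts < t1:
                # Linear interpolation between the surrounding entries.
                return v0 + (ts - t0) / (t1 - t0) * (v1 - v0)
        # Past the last entry: hold the final value.
        return self.schedule[-1][1]
```

For example, Scheduler(schedule=[(0, 0.01), (100_000, 0.0)]).get(50_000) returns 0.005, halfway between the two entries.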
@@ -25,7 +25,7 @@

class PPOTfLearner(PPOLearner, TfLearner):
    """Implements tf-specific PPO loss logic on top of PPOLearner.
    """Implements curr_entropy_coeffs_per_moduletf-specific PPO loss logic on top of PPOLearner.
typo, will fix.
done
@@ -179,6 +179,7 @@ def test_rl_module_api(self):
    .framework("torch")
    .rollouts(enable_connectors=True)
    .rl_module(_enable_rl_module_api=True)
    .training(_enable_learner_api=True)
Had to fix a bunch of test cases that were still using learner=False, but RLModule=True. This is not possible anymore.
Signed-off-by: sven1977 <[email protected]>
Dude, this is amazing. Best PR of the week. A couple of comments and it's good to go.
rllib/algorithms/algorithm_config.py
Outdated
self.enable_connectors = True

# LR-schedule checking.
# For the new Learner API stack, any schedule must start at ts 0 to avoid
# ambiguity (user might think that the `lr` setting plays a role as well
The sentence in parentheses doesn't make sense. What is it that the user might think?
What will happen to the regular lr that I pass in? Will that get ignored?
New behavior:
- Either lr is used (lr_schedule is None),
- or lr_schedule is used (not None), in which case lr is ignored.

Old behavior, example: lr=100, lr_schedule=[[1M, 0.001], [2M, 0.0003]]
-> for the first 1M steps, we use lr=100, then 0.001, linearly decreasing to 0.0003 (at ts=2M).

I find this behavior very confusing b/c it depends on the timesteps I define in my schedule (if I define ts=0 in my schedule, then lr is never used b/c the schedule kicks in right away, but if I start my schedule definition at some ts > 0, then lr is used for the first n timesteps).

We could be even more strict and force users to set either one of the settings to None. Only allowed: lr=None with lr_schedule not None, OR lr not None with lr_schedule=None.
return ImpalaTF1Policy
else:
    from ray.rllib.algorithms.a3c.a3c_tf_policy import A3CTFPolicy
    return A3CTFPolicy
If we remove A3C from the algorithms, what would happen to IMPALA?
Good point. We can just copy it into a new Impala[Tf|Torch]Policy class.
@@ -104,7 +109,9 @@ def __init__(
    )
    ValueNetworkMixin.__init__(self, config)
    KLCoeffMixin.__init__(self, config)
Why is the construction of KLCoeffMixin not guarded by not enable_learner_api?
Long story. I left it in b/c this mixin alters the get_state/set_state logic, so I would have to hack these methods as well. In short: taking this out would make everything even more complicated than it already is. We should just put Notes everywhere explaining that these are temporary solutions and that soon, Policy will no longer be used in the new stack, neither for learning nor for sampling.
rllib/algorithms/ppo/ppo_learner.py
Outdated
self.curr_entropy_coeffs_per_module = defaultdict(
    lambda: self._get_tensor_variable(self.hps.entropy_coeff)
    lambda: self._get_tensor_variable(self.hps.entropy_coeff_schedule[0][1])
See my comment above for readability.
EntropyCoeffSchedule.__init__(
    self, config["entropy_coeff"], config["entropy_coeff_schedule"]
)
LearningRateSchedule.__init__(self, config["lr"], config["lr_schedule"])
KLCoeffMixin.__init__(self, config)
Shouldn't we put all coeff mixins in the if condition?
Explained above: Some mixins add extra logic to base methods, e.g. get_state, which access variables constructed in the c'tor. So if we skip the c'tor, we'd also have to skip these method calls, hack them, or take out the mixin altogether, which would require a new class. Either way, this seemed the least painful, albeit temporary, solution for this problem.
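A toy version of the coupling being described (these are stand-in classes, not RLlib's actual KLCoeffMixin or Policy): the mixin both sets state in its constructor AND extends get_state(), so skipping the constructor breaks the method too.

```python
class KLCoeffMixin:
    """Toy mixin: couples constructor state to get_state() (illustrative)."""

    def __init__(self, config):
        self.kl_coeff = config["kl_coeff"]

    def get_state(self):
        state = super().get_state()
        # Depends on self.kl_coeff existing, i.e. on __init__ having run.
        state["kl_coeff"] = self.kl_coeff
        return state


class PolicyBase:
    def get_state(self):
        return {"weights": "..."}


class MyPolicy(KLCoeffMixin, PolicyBase):
    def __init__(self, config):
        # Skipping this call would make get_state() raise AttributeError,
        # which is why the constructor call stays in for now.
        KLCoeffMixin.__init__(self, config)


state = MyPolicy({"kl_coeff": 0.2}).get_state()
```

Guarding only the constructor behind a flag while leaving get_state() inherited is exactly the "hack these methods as well" problem mentioned above.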
@@ -126,10 +126,13 @@ def test_ppo_compilation_and_schedule_mixins(self):
    config.training(model=get_model_config(fw, lstm=lstm))

    algo = config.build(env=env)
    learner = algo.learner_group._learner
Can we use some public api for this?
I just noticed the same pattern is used for optim. I think this is bad as it relies on the internals of learner_group.
We can add a TODO for now.
I'm not sure whether this is a good idea. What if the LearnerGroup doesn't have a local Learner? Here in this test, we know for sure that we are dealing with a local Learner inside the LearnerGroup. Doing this here is just for a very specific inspection job, nothing else. I don't really want users to poke around the internals like this.
Alternatively, we need more discussion on whether we'd like to add utilities like foreach_learner, etc., as we already have for WorkerSet on the sampling side.
Adding a TODO.
done
Yeah. The only downside I see is that a test written this way has to be upgraded whenever we change the internal implementation of LearnerGroup. For example, say that tomorrow I rename the _learner attribute to _local_learner: this test would then break, even though the testing strategy is still valid. Sometimes I think we should never write tests against the private behaviors of modules, only against the public behaviors.
@@ -274,6 +274,7 @@ def policy_mapping_fn(agent_id, episode, worker, **kwargs):
    # Train the "main" policy to play really well using self-play.
    results = None
    if not args.from_checkpoint:
        create_checkpoints = not bool(os.environ.get("RLLIB_ENABLE_RL_MODULE", False))
Why should this not create checkpoints when RLModule is enabled? Does this test break? I think we should add a TODO at the very least. :)
Checkpoints don't work yet with this config. Not sure why. I guess this is @avnishn's domain. I'm getting a NotImplementedError on some method that's supposed to return a path.
Yeah, can we make sure we add a TODO (Avnish, Sven)? I just don't want us to silently lose test coverage.
Signed-off-by: sven1977 <[email protected]>
File "/ray/python/ray/rllib/core/learner/torch/torch_learner.py", line 79, in configure_optimizer_per_module
    lr = self.curr_lr_per_module[module_id]
Signed-off-by: sven1977 <[email protected]>
@@ -805,7 +805,7 @@ py_test(
py_test(
    name = "test_algorithm_config",
    tags = ["team:rllib", "algorithms_dir", "algorithms_dir_generic"],
    size = "small",
    size = "medium",
Seems to timeout sometimes, when CI is very busy.
@@ -0,0 +1,157 @@
from collections import defaultdict
New API to use for anything that can be scheduled and needs to track an in-graph (torch|tf) variable based on the current timestep.
Learner API: Policies using RLModules (for sampler only) do not need loss/stats/mixins.
Cleans up all Policy classes that were built particularly for being run with _enable_learner_api + _enable_rl_module_api both set to True.
Clear out the loss functions of those policies (no longer needed as these Policies are only run on the sampler/worker side).
Same for stats_fn and all mixin parent classes (such as LearningRateSchedule, KL-stuff, EntropyCoeffSchedule, TargetNetworkMixin, etc.).
Added a new APPOTorchPolicyRLM for future PR that will enable APPO on the new API stack + torch.
Why are these changes needed?
Related issue number
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.