[RLlib] Load state from load_state_path for rlmodule spec #35180

avnishn · 2023-05-09T18:48:44Z

Signed-off-by: Avnish [email protected]

Add the ability for RL modules and MARL modules to be created and have their states loaded immediately via the RL module spec.

Add tests for basic spec loading, and multinode uncheckpointing.

Why are these changes needed?

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

gjoliver

had a couple of questions I want to answer first.

release/release_tests.yaml

release/rllib_tests/checkpointing_tests/test_rl_module_spec_uncheckpointing.py

rllib/algorithms/algorithm_config.py

rllib/core/rl_module/marl_module.py

rllib/algorithms/algorithm.py

avnishn · 2023-05-24T19:18:21Z

TODOs:

- Add tests for the case where someone specifies modules to load.
- Make the release tests also run on the pull CI.

kouroshHakha

Nice PR man. I really enjoyed it. I have a couple of questions, none of them merge blockers tho. Let me know when I should push the button.

kouroshHakha · 2023-05-24T19:29:47Z

rllib/core/rl_module/marl_module.py

@@ -403,12 +400,24 @@ class MultiAgentRLModuleSpec:
 module_specs: The module specs for each individual module. It can be either a
 SingleAgentRLModuleSpec used for all module_ids or a dictionary mapping
 from module IDs to SingleAgentRLModuleSpecs for each individual module.
+ load_state_path: The path to the module state to load from. NOTE: This must be


Nice. Thanks for preemptively answering my questions ;)

dude @kouroshHakha, why is a path part of the spec?
would we save this path with the serialized module?

No, it's for when a user wants to load a new MARL module, or part of a MARL module, from an old (already checkpointed) one by setting this attribute.

I included the path as part of the MARL module spec because I want users to be able to load the module by specifying the path there. From a UX perspective it made the most sense to me.

The path does not get saved when we serialize the spec via a call to to_dict, and therefore isn't included in the checkpoint later.
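To illustrate the behavior described above, here is a minimal sketch of a spec whose load path is used at build time but left out of serialization. `ToyModuleSpec`, its fields, and `from_dict` are hypothetical stand-ins for illustration and are not RLlib's actual classes; only the names `load_state_path` and `to_dict` come from the discussion.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ToyModuleSpec:
    """Hypothetical stand-in for a module spec; not RLlib's actual class."""

    module_class_name: str
    model_config: Dict[str, Any]
    # Where to load state from when the module is built.
    # Deliberately left out of to_dict(), so it never reaches a checkpoint.
    load_state_path: Optional[str] = None

    def to_dict(self) -> Dict[str, Any]:
        # Serialize only the fields that describe the module itself.
        return {
            "module_class_name": self.module_class_name,
            "model_config": dict(self.model_config),
        }

    @classmethod
    def from_dict(cls, d: Dict[str, Any]) -> "ToyModuleSpec":
        # A restored spec has no load path unless the user sets one again.
        return cls(d["module_class_name"], dict(d["model_config"]))


spec = ToyModuleSpec("ToyPolicy", {"hidden": 64}, load_state_path="/tmp/old_ckpt")
serialized = spec.to_dict()
restored = ToyModuleSpec.from_dict(serialized)
```

In this sketch the path acts as a one-time instruction to the builder rather than as part of the module's identity, which matches the explanation given in the thread.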

That sounds perfect. Thanks for the explanation.

kouroshHakha · 2023-05-24T19:32:46Z

rllib/examples/learner/ppo_load_rl_modules.py

Loving the example ...

kouroshHakha · 2023-05-24T19:35:00Z

rllib/core/learner/tests/test_learner_group.py

@@ -363,6 +366,84 @@ def test_save_load_state(self):
 weights_after_1_update_with_break, weights_after_1_update_without_break
 )

+ def test_load_module_state(self):


Just want to point out that the resources / time required to run this test may increase after adding this unit test. If that happens, we may have to break the test down into smaller, isolated unit tests.

kouroshHakha · 2023-05-24T19:42:18Z

rllib/core/learner/learner_group.py

+ agent RLModules take precedence over the module states in the
+ MultiAgentRLModule checkpoint.
+
+ NOTE: At lease one of multi_agent_module_state or single_agent_module_states


I don't get this NOTE: there is no multi_agent_module_state or single_agent_module_states in the args of this method.

These may be leftovers from your dev history.
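The precedence rule quoted from the docstring above (per-agent RLModule states win over the states in the MultiAgentRLModule checkpoint) can be sketched as a simple dictionary override. The helper name `resolve_state_dirs`, its parameters, and the `<marl_ckpt_dir>/<module_id>` directory layout are assumptions made for illustration, not the PR's actual implementation.

```python
from typing import Dict, Iterable


def resolve_state_dirs(
    marl_module_ids: Iterable[str],
    marl_ckpt_dir: str,
    rl_module_ckpt_dirs: Dict[str, str],
) -> Dict[str, str]:
    """Map each module ID to the directory its state should be loaded from.

    Hypothetical helper: assumes the multi-agent checkpoint stores each
    module's state under <marl_ckpt_dir>/<module_id>.
    """
    # Start with every module coming from the multi-agent checkpoint.
    resolved = {mid: f"{marl_ckpt_dir}/{mid}" for mid in marl_module_ids}
    # Per-module checkpoints take precedence and override those entries.
    resolved.update(rl_module_ckpt_dirs)
    return resolved


dirs = resolve_state_dirs(
    marl_module_ids=["a", "b"],
    marl_ckpt_dir="/ckpt/marl",
    rl_module_ckpt_dirs={"b": "/ckpt/single_b"},
)
```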

release/release_tests.yaml

gjoliver

some nits.

gjoliver · 2023-05-24T21:52:14Z

rllib/core/learner/learner_group.py

+ # also in the RLModule checkpoints.
+ if modules_to_load:
+ for module_id in rl_module_ckpt_dirs.keys():
+ if module_id in modules_to_load:


if any([dir in modules_to_load for dir in rl_module_ckpt_dirs.keys()])
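As a rough comparison of the two forms, the sketch below shows the explicit loop next to the suggested `any(...)` check. The body of the `if` is not shown in the excerpt, so this assumes the code only needs to know whether any module ID appears in both collections; the sample module IDs and paths are made up.

```python
from pathlib import Path

# Made-up sample data for illustration only.
rl_module_ckpt_dirs = {
    "policy_a": Path("/ckpts/policy_a"),
    "policy_b": Path("/ckpts/policy_b"),
}
modules_to_load = {"policy_b", "policy_c"}


def has_overlap_loop(ckpt_dirs, to_load):
    # Explicit loop, in the shape of the code under review.
    for module_id in ckpt_dirs.keys():
        if module_id in to_load:
            return True
    return False


def has_overlap_any(ckpt_dirs, to_load):
    # Suggested form: a generator inside any() stops at the first match.
    return any(module_id in to_load for module_id in ckpt_dirs)


# If the offending IDs are needed (e.g. for an error message), a set
# intersection returns them directly.
overlap = sorted(set(rl_module_ckpt_dirs) & modules_to_load)
```

Passing a generator expression instead of a list avoids building the intermediate list, and iterating the dict directly is equivalent to iterating `.keys()`.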

rllib/core/learner/learner_group.py

gjoliver · 2023-05-24T21:55:33Z

rllib/core/learner/learner_group.py

+ path / RLMODULE_STATE_DIR_NAME
+ )
+ else:
+ assert len(self._workers) == self._worker_manager.num_healthy_actors()


Should we move the else logic into a separate util function?

Yeah, lemme go ahead and do that; this function has gotten too big.

rllib/core/learner/learner_group.py

gjoliver · 2023-05-24T21:58:14Z

rllib/core/rl_module/marl_module.py

dude @kouroshHakha, why is a path part of the spec?
would we save this path with the serialized module?

avnishn · 2023-05-26T01:36:51Z

OK, I finished my TODOs, but I still need to address Jun's nits and some of the comments that accidentally got left in, and then this should be ready to go.

gjoliver · 2023-05-26T22:48:24Z

rllib/core/learner/learner_group.py

+ for module_id, path in rl_module_ckpt_dirs.items():
+ w.module[module_id].load_state(path / RLMODULE_STATE_DIR_NAME)
+
+ # remove the temporary directories on the worker if any were created


curious, do you really need to remove these, given that we used tempfile for them?

they are a tempfile, but doesn't /tmp/ only get cleared after one week?

I'm not using the with: scope for the tempfile, so the directories won't be automatically removed by the tempfile library.
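The difference can be seen with the standard library alone. The excerpt does not show which tempfile call the PR uses, so the `mkdtemp()` half of this sketch is an assumption about the "no with: scope" case being described.

```python
import os
import shutil
import tempfile

# mkdtemp() creates a directory and leaves cleanup entirely to the caller.
manual_dir = tempfile.mkdtemp()
exists_before_cleanup = os.path.isdir(manual_dir)
shutil.rmtree(manual_dir)  # without this call, the directory stays on disk
exists_after_cleanup = os.path.isdir(manual_dir)

# TemporaryDirectory used as a context manager removes the directory on exit.
with tempfile.TemporaryDirectory() as scoped_dir:
    exists_inside_with = os.path.isdir(scoped_dir)
exists_after_with = os.path.isdir(scoped_dir)
```

When the directory has to outlive a single scope, for example because another process or a remote worker reads from it later, the context manager is awkward to use and explicit removal is a reasonable choice.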

gjoliver · 2023-05-26T22:49:11Z

rllib/core/rl_module/marl_module.py

That sounds perfect. Thanks for the explanation.

avnishn · 2023-05-30T02:21:25Z

This should be good to merge. The tests that failed on CI are unrelated or flaky. The GCE release test failed because the cluster failed to come up, but the AWS cluster came up and the tests passed.

[RLlib] Load state from load_state_path for rlmodule spec

39d0783

Signed-off-by: Avnish <[email protected]>

avnishn requested review from sven1977, gjoliver, ArturNiederfahrenhorst, smorad, maxpumperla, kouroshHakha and krfricke as code owners May 9, 2023 18:48

avnishn added 4 commits May 9, 2023 13:10

Temp

289f189

Signed-off-by: Avnish <[email protected]>

Temp

d9f7c1f

Signed-off-by: Avnish <[email protected]>

Module spec state loading, change load state save state behavior to a…

6117d79

…ccept a dir instead of a path Signed-off-by: Avnish <[email protected]>

Lint

d1a4fc4

Signed-off-by: Avnish <[email protected]>

gjoliver reviewed May 10, 2023

rllib/core/rl_module/marl_module.py

avnishn added 10 commits May 10, 2023 10:28

Add more tests, address comments

033b61e

Signed-off-by: Avnish <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into add_…

5484853

…state_loading_rl_module_spec

Temp

d494a07

Signed-off-by: Avnish <[email protected]>

Temp

de8ae63

Signed-off-by: avnishn <[email protected]>

E2e tests for uncheckpointing modules, tf warning

72cc71a

add end to end tests for the marl module uncheckpointing with the ppo algorithm. Move the kl checking in ppo tf module because it is causing a tf auto graph error for some reason Signed-off-by: avnishn <[email protected]>

Enable release test

7943f8f

Signed-off-by: avnishn <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into add_…

8af1255

…state_loading_rl_module_spec

Fix kl warning

2d202fe

Signed-off-by: avnishn <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into add_…

d70c05f

…state_loading_rl_module_spec

Fix kl key checking

6f7606b

Signed-off-by: Avnish <[email protected]>

gjoliver requested changes May 12, 2023

avnishn added 4 commits May 20, 2023 20:46

Merge branch 'master' of https://github.com/ray-project/ray into add_…

552608b

…state_loading_rl_module_spec

Add module loading across nodes to learner group

0077c05

Signed-off-by: Avnish <[email protected]>

Address comments, remove module spec checkpointing uncheckpointing

65f6e6f

Signed-off-by: Avnish <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into add_…

58d4f5d

…state_loading_rl_module_spec

avnishn commented May 22, 2023

rllib/algorithms/algorithm.py

Undo file modifications

62505c7

Signed-off-by: Avnish <[email protected]>

avnishn added 5 commits May 23, 2023 10:06

Merge branch 'master' of https://github.com/ray-project/ray into add_…

142b128

…state_loading_rl_module_spec

Get end to end loading for single agent rl modules working

63c0647

Signed-off-by: Avnish <[email protected]>

Change resource requiremetns algo config test

bbc0569

Signed-off-by: avnishn <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into add_…

a6d3c53

…state_loading_rl_module_spec

Fix broken algorithm config test'

1c601c2

Signed-off-by: Avnish <[email protected]>

kouroshHakha approved these changes May 24, 2023

gjoliver reviewed May 24, 2023

avnishn added 2 commits May 25, 2023 10:23

Squash changes into one commit

6ae2fd6

Signed-off-by: avnishn <[email protected]>

Merge branch 'add_state_loading_rl_module_spec' of https://github.com…

a4106cc

…/avnishn/ray into add_state_loading_rl_module_spec

avnishn requested a review from a team as a code owner May 25, 2023 17:24

avnishn added 2 commits May 25, 2023 18:33

Address some comments, add test for modules_to_load, add tests to pul…

396d50c

…l ci Signed-off-by: avnishn <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into add_…

0085ca6

…state_loading_rl_module_spec

avnishn added 5 commits May 26, 2023 12:00

Change pytest to py_test

dafe913

Signed-off-by: Avnish <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into add_…

80b380d

…state_loading_rl_module_spec

Make test large to prevent timeouts from happening

35cb0c3

Signed-off-by: Avnish <[email protected]>

Address comments, add test for checking for modules_to_load error cases

8ccd94a

Signed-off-by: Avnish <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into add_…

f2d4ab7

…state_loading_rl_module_spec

avnishn added the do-not-merge Do not merge this PR! label May 26, 2023

gjoliver approved these changes May 26, 2023

avnishn added 2 commits May 29, 2023 14:07

Use dirs on nodes after file transfer

23b1f04

Signed-off-by: Avnish <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into add_…

8d0374d

…state_loading_rl_module_spec

avnishn removed the do-not-merge Do not merge this PR! label May 30, 2023

sven1977 merged commit 83179ab into ray-project:master May 30, 2023
2 checks passed

ArturNiederfahrenhorst added a commit to kouroshHakha/ray that referenced this pull request May 30, 2023

Revert changes from ray-project#35180

a31cbc9

Signed-off-by: Artur Niederfahrenhorst <[email protected]>

ArturNiederfahrenhorst mentioned this pull request May 30, 2023

[RLlib] Enable RL Modules and Learner API for PPO by default #32808

Merged

scv119 pushed a commit to scv119/ray that referenced this pull request Jun 16, 2023

[RLlib] Load state from load_state_path for rlmodule spec. (ray-proje…

6753744

…ct#35180)

arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023

[RLlib] Load state from load_state_path for rlmodule spec. (ray-proje…

5a92c1c

…ct#35180) Signed-off-by: e428265 <[email protected]>
