Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RLlib] Load state from load_state_path for rlmodule spec #35180

Merged
merged 44 commits into from
May 30, 2023

Conversation

avnishn
Copy link
Member

@avnishn avnishn commented May 9, 2023

Signed-off-by: Avnish [email protected]

Add ability for rl module and marl modules to be created and their states be loaded immediately via the rl module spec.

Add tests for basic spec loading, and multinode uncheckpointing.

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Avnish <[email protected]>
Signed-off-by: Avnish <[email protected]>
Signed-off-by: Avnish <[email protected]>
avnishn added 10 commits May 10, 2023 10:28
Signed-off-by: Avnish <[email protected]>
Signed-off-by: avnishn <[email protected]>
add end to end tests for the marl module uncheckpointing with the ppo
algorithm. Move the kl checking in ppo tf module
because it is causing a tf auto graph error for some reason

Signed-off-by: avnishn <[email protected]>
Signed-off-by: avnishn <[email protected]>
Signed-off-by: avnishn <[email protected]>
Signed-off-by: Avnish <[email protected]>
Copy link
Member

@gjoliver gjoliver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

had a couple of questions I want to answer first.

release/release_tests.yaml Outdated Show resolved Hide resolved
rllib/algorithms/algorithm_config.py Outdated Show resolved Hide resolved
rllib/core/rl_module/marl_module.py Outdated Show resolved Hide resolved
@avnishn
Copy link
Member Author

avnishn commented May 24, 2023

TODOS:

  • Add tests for if someone specifies modules to load
  • make release tests also run on the pull ci

Copy link
Contributor

@kouroshHakha kouroshHakha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice PR man. I really enjoyed it. I have a couple of qs, nothing that is merge blockers tho. Let me know when I should push the button.

@@ -403,12 +400,24 @@ class MultiAgentRLModuleSpec:
module_specs: The module specs for each individual module. It can be either a
SingleAgentRLModuleSpec used for all module_ids or a dictionary mapping
from module IDs to SingleAgentRLModuleSpecs for each individual module.
load_state_path: The path to the module state to load from. NOTE: This must be
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. Thanks for preemptively answering my questions ;)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dude @kouroshHakha, why is a path part of the spec?
would we save this path with the serialized module?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it's for when a user want to load up a new MARL module / part of a marl module from an old (already checkpointed) one via setting this attributed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I included the path as a part of the MARL Module spec because I want users to be able to load up the module by specifying the path over there. From a ux experience it made the most sense to me.

The path does not get saved when we serialize the spec via a call to to_dict, and therefore isn't included in the checkpoint later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's sounds perfect. thanks for the explanation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Loving the example ...

@@ -363,6 +366,84 @@ def test_save_load_state(self):
weights_after_1_update_with_break, weights_after_1_update_without_break
)

def test_load_module_state(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to point out that the resources / time required to run this test may increase after adding this unittest, If that happens we may have to break the test down to smaller isolated unittest.

agent RLModules take precedence over the module states in the
MultiAgentRLModule checkpoint.

NOTE: At lease one of multi_agent_module_state or single_agent_module_states
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get this NOTE: there is no multi_agent_module_state or single_agent_module_states in the args of this method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These may be left overs of your dev history

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

release/release_tests.yaml Show resolved Hide resolved
Copy link
Member

@gjoliver gjoliver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some nits.

# also in the RLModule checkpoints.
if modules_to_load:
for module_id in rl_module_ckpt_dirs.keys():
if module_id in modules_to_load:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if any([dir in modeles_to_load for dir in rl_module_ckpt_dirs.keys()])

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

rllib/core/learner/learner_group.py Show resolved Hide resolved
path / RLMODULE_STATE_DIR_NAME
)
else:
assert len(self._workers) == self._worker_manager.num_healthy_actors()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we write the else logics in a separate util function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah lemme go ahead and do that this function has gotten too big.

rllib/core/learner/learner_group.py Outdated Show resolved Hide resolved
@@ -403,12 +400,24 @@ class MultiAgentRLModuleSpec:
module_specs: The module specs for each individual module. It can be either a
SingleAgentRLModuleSpec used for all module_ids or a dictionary mapping
from module IDs to SingleAgentRLModuleSpecs for each individual module.
load_state_path: The path to the module state to load from. NOTE: This must be
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dude @kouroshHakha, why is a path part of the spec?
would we save this path with the serialized module?

@avnishn avnishn requested a review from a team as a code owner May 25, 2023 17:24
@avnishn
Copy link
Member Author

avnishn commented May 26, 2023

Ok I finished my todos, but I still need to update with regards to jun's nits, and some of the comments that accidentally got left in, and then this should be ready to go.

@avnishn avnishn added the do-not-merge Do not merge this PR! label May 26, 2023
for module_id, path in rl_module_ckpt_dirs.items():
w.module[module_id].load_state(path / RLMODULE_STATE_DIR_NAME)

# remove the temporary directories on the worker if any were created
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious, do you really need to remove these, given that we used tempfile for them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they are a tempfile, but doesn't /tmp/ only get cleared after one week?

I'm not using the with: scope for the tempfile, so the directories won't be automatically removed by tempfile library.

@@ -403,12 +400,24 @@ class MultiAgentRLModuleSpec:
module_specs: The module specs for each individual module. It can be either a
SingleAgentRLModuleSpec used for all module_ids or a dictionary mapping
from module IDs to SingleAgentRLModuleSpecs for each individual module.
load_state_path: The path to the module state to load from. NOTE: This must be
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's sounds perfect. thanks for the explanation.

@avnishn
Copy link
Member Author

avnishn commented May 30, 2023

This should be good to merge. The tests that failed on ci are unrelated or flakey. The GCE release test failed because the cluster failed to come up, but the AWS cluster came up and the tests passed.

@avnishn avnishn removed the do-not-merge Do not merge this PR! label May 30, 2023
@sven1977 sven1977 merged commit 83179ab into ray-project:master May 30, 2023
2 checks passed
ArturNiederfahrenhorst added a commit to kouroshHakha/ray that referenced this pull request May 30, 2023
Signed-off-by: Artur Niederfahrenhorst <[email protected]>
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants