
[RLlib] Learner group checkpointing #34379

Merged: 13 commits into ray-project:master on Apr 18, 2023

Conversation

avnishn (Member) commented Apr 13, 2023

Signed-off-by: Avnish [email protected]

Implement multinode learner group checkpointing and tests.

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

- stop creating multiple distributed tf strategies
- add multinode release test for checkpointing
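As context for the feature, a minimal usage sketch of the checkpoint round trip (assumptions: learner_group is an already-built LearnerGroup; load_state appears in the tests further down; save_state is assumed here as its save-side counterpart):

import tempfile

def checkpoint_round_trip(learner_group):
    # Persist the learner group's state (module weights, optimizer state, ...)
    # to a directory, then restore it from that directory.
    checkpoint_dir = tempfile.mkdtemp()
    learner_group.save_state(checkpoint_dir)   # assumed save-side API
    learner_group.load_state(checkpoint_dir)   # shown in the tests below
    return checkpoint_dir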

@@ -609,3 +609,7 @@ def as_multi_agent(self) -> "MultiAgentRLModule":
        marl_module = MultiAgentRLModule()
        marl_module.add_module(DEFAULT_POLICY_ID, self)
        return marl_module

    def unwrapped(self) -> "RLModule":
        """Returns the underlying module if this module is a wrapper."""
Contributor:

Can you specify what "wrapper" means here?
What are examples of RLModule wrappers?

avnishn (Member Author):

Torch RLModules get wrapped with the torch DDP RLModule wrapper.
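For illustration only (not this PR's code), a minimal sketch of what unwrapped() returns on a plain module versus on a wrapper; the wrapper class name is hypothetical:

class RLModule:  # simplified stand-in for the real base class
    def unwrapped(self) -> "RLModule":
        """Returns the underlying module if this module is a wrapper."""
        return self


class DDPWrapperRLModule(RLModule):  # hypothetical DDP-style wrapper
    def __init__(self, wrapped: RLModule):
        self._wrapped = wrapped

    def unwrapped(self) -> RLModule:
        # Wrappers override unwrapped() to expose the inner module.
        return self._wrapped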

avnishn (Member Author):

done

from ray.rllib.env.multi_agent_env import make_multi_agent
from ray.rllib.utils.test_utils import check


DEFAULT_POLICY_ID = "default_policy"
Contributor:

Nice! :)

@@ -26,6 +25,9 @@
Optimizer = Union["tf.keras.optimizers.Optimizer", "torch.optim.Optimizer"]


DEFAULT_POLICY_ID = "default_policy"
Contributor:

Why don't we import this from policy.py here?

avnishn (Member Author):

I want to avoid mixing policy code in the new stack.

        # the default strategy is a no-op that can be used in the local mode
        # cpu only case, build will override this if needed.
        self._strategy = tf.distribute.get_strategy()
        self._strategy = None
Contributor:

Can you leave the comment on what self._strategy is (or should be when not None)?

avnishn (Member Author):

The strategy is a tf.distribute strategy object that is used for the DDP logic.
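For readers unfamiliar with tf.distribute, a small generic TensorFlow sketch (not this PR's code) of how such a strategy object is used: variables are created under strategy.scope() and the update step is executed via strategy.run():

import tensorflow as tf

strategy = tf.distribute.get_strategy()  # default no-op strategy (local, CPU-only case)

with strategy.scope():
    # Variables that should be replicated/synced are created under the scope.
    layer = tf.keras.layers.Dense(4)

def update_step(x):
    return layer(x)

# The strategy runs the step on each replica (a no-op split for the default strategy).
out = strategy.run(update_step, args=(tf.ones((2, 8)),))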

avnishn (Member Author):

Added a param annotation.

avnishn (Member Author):

done

@@ -349,6 +347,25 @@ def remove_module(self, module_id: ModuleID) -> None:
        if self._enable_tf_function:
            self._update_fn = tf.function(self._do_update_fn, reduce_retracing=True)

    def _make_distributed_strategy(self):
        """Create a distributed strategy for the learner."""
Contributor:

Same, can you add a little more explanation here on what a "strategy" is and which types exist (an example?)?

avnishn (Member Author):

The strategy is a tf.distribute strategy object.

The different strategy types are handled within this function.
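As a rough illustration of the strategy types meant here (not the PR's actual implementation; the helper name and arguments are hypothetical):

import tensorflow as tf

def make_distributed_strategy(num_workers: int, use_gpu: bool):
    if num_workers > 1:
        # Multiple learner workers: data-parallel training across workers.
        return tf.distribute.MultiWorkerMirroredStrategy()
    if use_gpu:
        # Single worker with a GPU: pin variables and ops to that device.
        return tf.distribute.OneDeviceStrategy("/gpu:0")
    # Local, CPU-only mode: the default strategy is effectively a no-op.
    return tf.distribute.get_strategy()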

avnishn (Member Author):

done

@@ -483,7 +483,7 @@ def from_module(self, module: MultiAgentRLModule) -> "MultiAgentRLModuleSpec":
            The MultiAgentRLModuleSpec.
        """
        module_specs = {
            module_id: SingleAgentRLModuleSpec.from_module(rl_module)
            module_id: SingleAgentRLModuleSpec.from_module(rl_module.unwrapped())
Contributor:

Nit: Explain why we need to unwrap here. rl_module could be a framework-specific DDP wrapper?

avnishn (Member Author):

done.

@@ -19,7 +25,7 @@

REMOTE_SCALING_CONFIGS = {
    "remote-cpu": LearnerGroupScalingConfig(num_workers=1),
    "remote-gpu": LearnerGroupScalingConfig(num_workers=1, num_gpus_per_worker=0.5),
    "remote-gpu": LearnerGroupScalingConfig(num_workers=1, num_gpus_per_worker=1),
Contributor:

Why did we change this? Would it break if we used fractional GPUs here?

avnishn (Member Author):

This learner group actually won't even take fractional GPUs, so the fractional value was pointless. I changed it while I was doing some debugging.

learner_group.load_state(initial_learner_checkpoint_dir)
check(learner_group.get_weights(), initial_learner_group_weights)
learner_group.update(batch.as_multi_agent(), reduce_fn=None)
results_without_break = learner_group.update(
Contributor:

Could we check here again to see whether the weights after one update (based off the initial state) are the same as the weights of the original learner (after one update)?
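A minimal sketch of the suggested extra check, reusing names from the snippet above; original_learner_group is assumed to be a second learner group that was never checkpointed but was trained identically:

from ray.rllib.utils.test_utils import check

def check_update_equivalence(learner_group, original_learner_group, batch):
    # One update starting from the restored state ...
    restored_results = learner_group.update(batch.as_multi_agent(), reduce_fn=None)
    # ... and one update on the original learner group ...
    original_results = original_learner_group.update(batch.as_multi_agent(), reduce_fn=None)
    # ... should leave both with identical weights.
    check(learner_group.get_weights(), original_learner_group.get_weights())
    return restored_results, original_results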

sven1977 (Contributor) left a comment:

Awesome PR @avnishn! Thanks for covering this important feature in our release tests from here on.
Just a few nits, questions, and suggestions for better comments.

sven1977 (Contributor) left a comment:

LGTM now. Thanks!

amogkam merged commit 4995e14 into ray-project:master on Apr 18, 2023
gjoliver (Member) left a comment:

Let me talk to you offline about how you intend to use this.

def remove_dir(w):
    import shutil

    shutil.rmtree(worker_temp_dir)
Member:

Can you make this a member function on Worker as well, so you can do lambda w: w.remove_worker_temp_dir() below?
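A sketch of what that refactor could look like; this Worker class is a hypothetical stand-in for the test helper actor, with the temp dir stored on the instance:

import shutil
import tempfile

class Worker:  # hypothetical stand-in for the test helper actor
    def __init__(self):
        self._worker_temp_dir = tempfile.mkdtemp()

    def remove_worker_temp_dir(self) -> None:
        # Lets callers simply do: lambda w: w.remove_worker_temp_dir()
        shutil.rmtree(self._worker_temp_dir)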

import socket
import tempfile

hostname = socket.gethostname()
avnishn (Member Author):

ray.util.get_node_ip
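Assuming the note refers to ray.util.get_node_ip_address, the socket-based lookup could be replaced with something like:

import ray

# Assumed replacement for socket.gethostname(); resolves the node's IP via Ray.
node_ip = ray.util.get_node_ip_address()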

elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023