
[RLlib] Add Optimizer State To Learner get_state #34760

Merged

Conversation

@avnishn (Member) commented on Apr 25, 2023

Signed-off-by: Avnish [email protected]

As part of the TODOs, add the optimizer state to the Learner's get_state.

To be fair, I don't know who is ever going to need the optimizer state at runtime other than for testing, but now at least we support it, for completeness.

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -852,7 +852,12 @@ def set_state(self, state: Mapping[str, Any]) -> None:
         # having both can become confusing. Can we simplify this API requirement?
         self._check_is_built()
         # TODO: once we figure out the optimizer format, we can set/get the state
-        self._module.set_state(state.get("module_state", {}))
+        module_state = state["module_state"]
avnishn (Member Author):

I could probably check for the existence of these keys first, then error out if necessary.
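
For illustration, a minimal sketch (not the PR's actual code) of the kind of key check described here. The "optimizer_state" key name is an assumption; only "module_state" appears in the diff above.

```python
from typing import Any, Mapping


def validate_learner_state(state: Mapping[str, Any]) -> None:
    """Raise a descriptive error if expected top-level keys are missing."""
    # "module_state" is taken from the diff above; "optimizer_state" is assumed.
    required = ("module_state", "optimizer_state")
    missing = [key for key in required if key not in state]
    if missing:
        raise ValueError(
            f"Learner state is missing required keys {missing}; "
            f"got keys {list(state.keys())}."
        )
```

set_state() could call a helper like this right after self._check_is_built(), before indexing into the dict.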

@@ -36,7 +36,7 @@

 LOCAL_SCALING_CONFIGS = {
     "local-cpu": LearnerGroupScalingConfig(num_workers=0, num_gpus_per_worker=0),
-    "local-gpu": LearnerGroupScalingConfig(num_workers=0, num_gpus_per_worker=0.5),
+    "local-gpu": LearnerGroupScalingConfig(num_workers=0, num_gpus_per_worker=1),
avnishn (Member Author):

We don't actually support fractional GPUs, so this doesn't matter.

@@ -267,6 +267,25 @@ def _load_optimizers(self, path: Union[str, pathlib.Path]) -> None:
     def set_weights(self, weights: Mapping[str, Any]) -> None:
         self._module.set_state(weights)

+    @override(Learner)
+    def get_optimizer_weights(self) -> Mapping[str, Any]:
avnishn (Member Author):

I'm trying to find a way to reuse these functions when saving the optimizer state, but it's difficult since there is actually little overlap -- when saving the optimizer state, we save in native TensorFlow format instead of numpy.
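
To make the distinction concrete, here is a rough sketch (illustrative only, not the PR's implementation) of the in-memory path: converting an optimizer's state variables to numpy for a get_optimizer_weights()-style call, whereas checkpointing would instead go through TensorFlow's native save format.

```python
import tensorflow as tf
import tree  # dm-tree


def optimizer_state_to_numpy(optimizer_state):
    """Convert a (possibly nested) struct of tf Variables/Tensors to numpy.

    This is the in-memory representation get_optimizer_weights() could
    return; saving to disk would instead use TensorFlow's native format,
    which is why the two code paths share little logic.
    """

    def _to_numpy(item):
        if isinstance(item, tf.Variable) or tf.is_tensor(item):
            return item.numpy()
        # Leave non-tensor leaves (e.g. step counters) unchanged.
        return item

    return tree.map_structure(_to_numpy, optimizer_state)
```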

@@ -172,6 +172,21 @@ def mapping(item):
     return tree.map_structure(mapping, x)


+@PublicAPI
+def copy_and_move_to_device(x: TensorStructType, device: Optional[str] = None):
avnishn (Member Author):

I tried using convert_to_torch_tensor when reloading optimizer state dicts, but it was doing something funny that was causing some of my types to get improperly cast, which caused a precision error down the line. Instead I created this function, which probably also deserves its own test.

Contributor:

Is the funny thing something we should fix? Not saying that we should, just asking for your opinion.

Contributor:

Can we add docstrings here to clarify what sort of copy this is?
Also, what happens to items that are not torch Tensors?
Are some of the optimizer weights not torch tensors? Usually I'd expect this to error out if elements of the TensorStructType are not torch tensors.

avnishn (Member Author):

I've added a docstring, hoping it adds enough clarity :)
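
For reference, a minimal sketch of what a util like copy_and_move_to_device could look like, assuming it follows the tree.map_structure pattern visible in the diff context above (the actual RLlib implementation may differ): torch tensors are copied onto the target device with their dtype preserved, and non-tensor leaves pass through unchanged.

```python
from typing import Optional, Union

import torch
import tree  # dm-tree


def copy_and_move_to_device(x, device: Optional[Union[str, torch.device]] = None):
    """Copy every torch tensor in a (possibly nested) struct to `device`.

    Tensors are always copied (never modified in place) and keep their
    original dtype, avoiding the implicit casts mentioned above. Non-tensor
    leaves, e.g. the Python ints in an optimizer state dict, are returned
    unchanged rather than raising an error.
    """

    def mapping(item):
        if isinstance(item, torch.Tensor):
            # copy=True forces a new tensor even if `item` already lives on
            # the target device with the right dtype.
            return item.to(device=device, copy=True)
        return item

    return tree.map_structure(mapping, x)
```

A reloaded optimizer state dict, for example, could be passed through this before calling optimizer.load_state_dict() so that every tensor lands on the learner's device.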

@ArturNiederfahrenhorst (Contributor) left a comment:

Left some nits :) Thanks for the PR!

@ArturNiederfahrenhorst (Contributor) left a comment:

Cool! Thanks for the additional util! 😃

@gjoliver merged commit 0d59be7 into ray-project:master on Apr 28, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023
Labels: none yet
Projects: none yet
Linked issues that may be closed by merging: none yet
3 participants