
[tune] Fix reuse_actors error on actor cleanup for function trainables #42951

Merged 12 commits into ray-project:master on Feb 5, 2024

Conversation

@justinvyu (Contributor) commented Feb 2, 2024

Why are these changes needed?

This PR fixes a bug caused by reuse_actors. When an actor is successfully scheduled, but the Tune controller decides to reuse a cached actor in the meantime, the newly created actor is immediately terminated (via ActorManager.remove_actor). This calls the Trainable's stop method, which for function trainables raises an error from trying to join a training thread that hasn't started yet.
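For context, the underlying failure can be reproduced outside of Ray Tune. This is a minimal sketch (not Ray code) showing why joining a never-started thread fails:

import threading

def train_fn():
    pass

# The thread is created but never .start()-ed, mirroring a function trainable
# that is stopped before Trainable.step ever launched training.
training_thread = threading.Thread(target=train_fn)
try:
    training_thread.join(timeout=1)
except RuntimeError as exc:
    print(f"join() before start() fails: {exc}")  # "cannot join thread before it is started"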

This PR also disables reuse_actors by default for all trainable types, because reusing actors makes the total number of actors spawned throughout training nondeterministic and unstable. Many user issues stem from this unexpected reuse_actors behavior, so the fix for now is to disable it by default. Tuning many small trials and using schedulers that pause trials are the two use cases most affected by this default change, but the flag can always be turned back on if needed.
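For users who still want actor reuse, it can be opted back in explicitly. A minimal sketch assuming the current Tuner/TuneConfig API; the toy trainable and search space are illustrative only:

from ray import train, tune

def trainable(config):
    # Toy objective; replace with real training logic.
    train.report({"score": config["x"] ** 2})

tuner = tune.Tuner(
    trainable,
    param_space={"x": tune.uniform(0.0, 1.0)},
    tune_config=tune.TuneConfig(num_samples=8, reuse_actors=True),
)
results = tuner.fit()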

This PR also fixes a bug in the actor manager where the number of started actors was not incremented.

Related issue number

Closes #41557
Closes #42334

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@matthewdeng (Contributor) left a comment

Thanks for the cleanup!

Comment on lines 248 to 249
if self.training_thread.is_alive():
self.training_thread.join(timeout=timeout)
Contributor

Will there ever be times where we wouldn't want this to gracefully exit, and instead raise an error when the thread isn't alive?

@justinvyu (Contributor Author)

The only case where we call stop before starting the training thread is if:

  • The trainable actor gets scheduled properly.
  • We never tell the trainable to launch training via Trainable.step.
  • We want to terminate the trainable.

This only happens in the reuse_actors case where we decide that the actor is not needed after having successfully launched it.
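To make the pattern concrete, here is a hedged sketch of the guard being discussed, using a simplified stand-in class rather than the actual FunctionTrainable implementation:

import threading

class SimplifiedFunctionTrainable:
    """Illustrative stand-in, not the real Ray Tune class."""

    def __init__(self, train_fn):
        self.training_thread = threading.Thread(target=train_fn, daemon=True)
        self._started = False

    def step(self):
        # Training is launched lazily on the first step().
        if not self._started:
            self.training_thread.start()
            self._started = True

    def stop(self, timeout=10.0):
        # If the controller reuses a cached actor instead and stops this one
        # before step() ever ran, there is no live thread to join.
        if self.training_thread.is_alive():
            self.training_thread.join(timeout=timeout)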

python/ray/train/_internal/session.py (resolved)
@woshiyyya (Member) left a comment

Nice work!

@@ -401,6 +401,8 @@ def on_error(exception: Exception):

self._enqueue_cached_actor_tasks(tracked_actor=tracked_actor)

started_actors += 1
@woshiyyya (Member) commented Feb 2, 2024

At first, I was surprised that this didn't cause errors before. But after some investigation, it seems that the returned started_actors isn't used anywhere. Do we still need it as a return value?

@justinvyu (Contributor Author)

The return value isn't used anywhere, but the local variable is still used to break out of the loop that attempts to schedule actors.

I think this may not be a huge issue because actor scheduling is still limited by the resources available in the cluster. max_actors is set to 1 everywhere, so the fact that this counter wasn't working means we have been processing more than one actor-scheduling request per iteration of the event loop.

But that is probably ok:

for i in range(1):
    # one event-loop iteration schedules 5 actors

# vs.

for i in range(5):
    # five event-loop iterations, each scheduling 1 actor
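For illustration, here is a hypothetical sketch of the intended counting behavior (names and structure are illustrative, not the actual ActorManager code):

def schedule_pending_actors(pending_actors, try_start_actor, max_actors=1):
    started_actors = 0
    for actor in list(pending_actors):
        if started_actors >= max_actors:
            # Leave the remaining requests for the next event-loop iteration.
            break
        if try_start_actor(actor):
            started_actors += 1  # this increment was effectively missing before the fix
    return started_actors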

@woshiyyya (Member)

Got it. Previously this function could exhaust the cluster resources by trying to launch as many actors as possible. We should re-enable the max_actors restriction.

@justinvyu merged commit e3ce49a into ray-project:master on Feb 5, 2024
9 checks passed
@justinvyu deleted the reuse_actors_error_fix branch February 5, 2024 21:04
tterrysun pushed a commit to tterrysun/ray that referenced this pull request Feb 14, 2024
[tune] Fix reuse_actors error on actor cleanup for function trainables (ray-project#42951)

This PR fixes a bug where actor reuse stopped `FunctionTrainable` actors before their training thread had started. Additionally, this PR disables `reuse_actors` by default for all trainable types, due to the nondeterministic and unstable number of total actors spawned throughout training, which has been reported by many users.

---------

Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: tterrysun <[email protected]>
Labels: None yet
Projects: None yet
3 participants