[tune] Fix reuse_actors error on actor cleanup for function trainables #42951
Conversation
Thanks for the cleanup!
if self.training_thread.is_alive():
    self.training_thread.join(timeout=timeout)
Will there be times where we wouldn't want this to gracefully exit, and instead raise an error when the thread isn't alive?
The only case where we call stop before starting the training thread is if:
- The trainable actor gets scheduled properly.
- We never tell the trainable to launch training via Trainable.step.
- We want to terminate the trainable.

This only happens in the reuse_actors case, where we decide that the actor is not needed after having successfully launched it.
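For reference, CPython raises as soon as you join a thread that was never started, which is exactly the error this guard avoids. A minimal standalone illustration (not Tune code):

import threading

t = threading.Thread(target=lambda: None)
t.join()  # RuntimeError: cannot join thread before it is started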
Nice work!
@@ -401,6 +401,8 @@ def on_error(exception: Exception):
        self._enqueue_cached_actor_tasks(tracked_actor=tracked_actor)
+
+       started_actors += 1
At first, I was surprised that this didn't cause errors before. But after some investigation, it seems that the returned started_actors is not used anywhere. Do we still need it as a return value?
The return value is not used anywhere, but the local variable is still used to break out of the loop that attempts to schedule actors.
I think this may not be a huge issue, because actor scheduling is still limited by the resources available in the cluster. This max_actors is set to 1 everywhere, so the fact that the counter wasn't incremented means we have been processing more than one actor scheduling request in one iteration of the event loop. But that is probably ok:
for i in range(1):
    # schedule 5 actors

# vs.

for i in range(5):
    # schedule 1 actor
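For illustration, here is a sketch of the scheduling loop that the counter is meant to bound once the increment is restored. The helper names (_pending_actors, _start_actor) are hypothetical stand-ins; only started_actors and max_actors mirror the diff:

def schedule_actors(self, max_actors: int = 1) -> int:
    started_actors = 0
    for tracked_actor in list(self._pending_actors):
        if started_actors >= max_actors:
            # Without the `started_actors += 1` below, this bound was
            # never reached, so a single event-loop iteration could
            # start arbitrarily many actors (resources permitting).
            break
        self._start_actor(tracked_actor)
        started_actors += 1
    return started_actors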
Got it. Previously this function could exhaust the cluster resources and try to launch as many actors as possible. We should re-enable the restriction of max_actors.
Why are these changes needed?
This PR fixes a bug caused by reuse actors. When an actor is successfully scheduled, but the Tune controller decides to reuse a cached actor in the meantime, the newly created actor gets immediately terminated (via ActorManager.remove_actor). This results in calling the stop method of the Trainable, which in the case of function trainables raises an error from trying to join the training thread that hasn't started yet.

This PR also disables reuse_actors by default for all trainable types, due to the nondeterministic and unstable behavior of the total number of actors spawned throughout training. Many user issues come from this unexpected reuse_actors behavior, so the solution for now is to disable it by default. The "many small trials" tuning use case and schedulers that pause trials are the two use cases most impacted by this default change, but the flag can always be turned on if needed.

This PR also fixes a bug in the actor manager where the number of started actors was not incremented.
Related issue number
Closes #41557
Closes #42334
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.