
[tune] Fix reuse_actors error on actor cleanup for function trainables #42951

Merged 12 commits into ray-project:master on Feb 5, 2024

Conversation

@justinvyu (Contributor) commented Feb 2, 2024

Why are these changes needed?

This PR fixes a bug caused by reuse_actors. When an actor is successfully scheduled, but the Tune controller decides to reuse a cached actor in the meantime, the newly created actor is immediately terminated (via ActorManager.remove_actor). This calls the Trainable's stop method, which for function trainables raises an error from trying to join a training thread that hasn't started yet.
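For context, the underlying failure can be reproduced outside of Ray Tune. This is a minimal sketch (not Ray code) showing why joining a never-started thread fails:

import threading

def train_fn():
    pass

# The thread is created but never .start()-ed, mirroring a function trainable
# that is stopped before Trainable.step ever launched training.
training_thread = threading.Thread(target=train_fn)
try:
    training_thread.join(timeout=1)
except RuntimeError as exc:
    print(f"join() before start() fails: {exc}")  # "cannot join thread before it is started"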

This PR also disables reuse_actors by default for all trainable types, because reusing actors makes the total number of actors spawned throughout training nondeterministic and unstable. Many user issues stem from this unexpected reuse_actors behavior, so the fix for now is to disable it by default. Tuning many small trials and using schedulers that pause trials are the two use cases most affected by this default change, but the flag can always be turned back on if needed.
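For users who still want actor reuse, it can be opted back in explicitly. A minimal sketch assuming the current Tuner/TuneConfig API; the toy trainable and search space are illustrative only:

from ray import train, tune

def trainable(config):
    # Toy objective; replace with real training logic.
    train.report({"score": config["x"] ** 2})

tuner = tune.Tuner(
    trainable,
    param_space={"x": tune.uniform(0.0, 1.0)},
    tune_config=tune.TuneConfig(num_samples=8, reuse_actors=True),
)
results = tuner.fit()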

This PR also fixes a bug in the actor manager where the number of started actors was not incremented.

Related issue number

Closes #41557
Closes #42334

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@matthewdeng (Contributor) left a comment

Thanks for the cleanup!

Comment on lines 248 to 249
if self.training_thread.is_alive():
self.training_thread.join(timeout=timeout)
Contributor

Will there ever be times where we wouldn't want this to gracefully exit, and instead raise an error when the thread isn't alive?

@justinvyu (Contributor Author)

The only case where we call stop before starting the training thread is if:

  • The trainable actor gets scheduled properly.
  • We never tell the trainable to launch training via Trainable.step.
  • We want to terminate the trainable.

This only happens in the reuse_actors case where we decide that the actor is not needed after having successfully launched it.
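To make the pattern concrete, here is a hedged sketch of the guard being discussed, using a simplified stand-in class rather than the actual FunctionTrainable implementation:

import threading

class SimplifiedFunctionTrainable:
    """Illustrative stand-in, not the real Ray Tune class."""

    def __init__(self, train_fn):
        self.training_thread = threading.Thread(target=train_fn, daemon=True)
        self._started = False

    def step(self):
        # Training is launched lazily on the first step().
        if not self._started:
            self.training_thread.start()
            self._started = True

    def stop(self, timeout=10.0):
        # If the controller reuses a cached actor instead and stops this one
        # before step() ever ran, there is no live thread to join.
        if self.training_thread.is_alive():
            self.training_thread.join(timeout=timeout)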

python/ray/train/_internal/session.py (resolved)
@woshiyyya (Member) left a comment

Nice work!

@@ -401,6 +401,8 @@ def on_error(exception: Exception):

self._enqueue_cached_actor_tasks(tracked_actor=tracked_actor)

started_actors += 1
@woshiyyya (Member) commented Feb 2, 2024

At first, I was surprised that this didn't cause errors before. But after some investigation, it seems that the returned started_actors isn't used anywhere. Do we still need it as a return value?

@justinvyu (Contributor Author)

The return value isn't used anywhere, but the local variable is still used to break out of the loop that attempts to schedule actors.

I think this may not be a huge issue because actor scheduling is still limited by the resources available in the cluster. max_actors is set to 1 everywhere, so the fact that this counter wasn't working means we have been processing more than one actor-scheduling request per iteration of the event loop.

But that is probably ok:

for i in range(1):
    # one event-loop iteration schedules 5 actors

# vs.

for i in range(5):
    # five event-loop iterations, each scheduling 1 actor
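For illustration, here is a hypothetical sketch of the intended counting behavior (names and structure are illustrative, not the actual ActorManager code):

def schedule_pending_actors(pending_actors, try_start_actor, max_actors=1):
    started_actors = 0
    for actor in list(pending_actors):
        if started_actors >= max_actors:
            # Leave the remaining requests for the next event-loop iteration.
            break
        if try_start_actor(actor):
            started_actors += 1  # this increment was effectively missing before the fix
    return started_actors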

@woshiyyya (Member)

Got it. Previously this function could exhaust the cluster resources by trying to launch as many actors as possible. We should re-enable the max_actors restriction.

@justinvyu merged commit e3ce49a into ray-project:master on Feb 5, 2024
9 checks passed
@justinvyu deleted the reuse_actors_error_fix branch February 5, 2024 21:04
tterrysun pushed a commit to tterrysun/ray that referenced this pull request Feb 14, 2024
[tune] Fix reuse_actors error on actor cleanup for function trainables (ray-project#42951)

This PR fixes a bug where actor reuse stopped `FunctionTrainable` actors before their training thread had started. Additionally, this PR disables `reuse_actors` by default for all trainable types, due to the nondeterministic and unstable number of total actors spawned throughout training, which has been reported by many users.

---------

Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: tterrysun <[email protected]>
Labels: None yet
Projects: None yet
3 participants