Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Post _on_completed callbacks to io_service_ instead of running them synchronously #39112

Merged
merged 5 commits into from
Aug 31, 2023

Conversation

edoakes
Copy link
Contributor

@edoakes edoakes commented Aug 30, 2023

Why are these changes needed?

Attempts to fix the deadlock described in the linked issue by posting callbacks to the io_service_. This avoids running them while holding locks (such as from the task manager).

Related issue number

#39113 (will be closed once cherry-picked).

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
@edoakes
Copy link
Contributor Author

edoakes commented Aug 30, 2023

Kicked off release tests for core and serve: https://buildkite.com/ray-project/release-tests-pr/builds/51486

@rkooo567
Copy link
Contributor

Test test seems pretty promising

@rkooo567
Copy link
Contributor

rkooo567 commented Aug 30, 2023

previous release tests are here; https://buildkite.com/ray-project/release-tests-pr/builds/51486

(note: I didn't finish some of long running tests to run core tests)

@rkooo567
Copy link
Contributor

rkooo567 commented Aug 31, 2023

@edoakes can you take a look at failures from https://buildkite.com/ray-project/release-tests-pr/builds/51486#018a485f-94f3-4afa-96a5-3f3ea7cab789? (from logs its failures are pretty mysterious)

@rkooo567
Copy link
Contributor

Rerunning it with core https://buildkite.com/ray-project/release-tests-pr/builds/51571. I hope the wheel is correct this time...

@rkooo567
Copy link
Contributor

Test result so far seems pretty promising

Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The test result seems promising

src/ray/core_worker/core_worker.cc Outdated Show resolved Hide resolved
src/ray/core_worker/core_worker.cc Outdated Show resolved Hide resolved
src/ray/core_worker/core_worker.cc Outdated Show resolved Hide resolved
@edoakes edoakes changed the title [WIP][core] Post _on_completed callbacks to io_service_ instead of running them synchronously [core] Post _on_completed callbacks to io_service_ instead of running them synchronously Aug 31, 2023
Signed-off-by: Edward Oakes <[email protected]>
@edoakes
Copy link
Contributor Author

edoakes commented Aug 31, 2023

@rkooo567 there are a lot of failures that have tracebacks like this:

28338 ^[[2m^[[36m(pid=149561)^[[0m Traceback (most recent call last):
28339 ^[[2m^[[36m(pid=149561)^[[0m   File "python/ray/_raylet.pyx", line 1997, in ray._raylet.task_execution_handler
28340 ^[[2m^[[36m(pid=149561)^[[0m   File "python/ray/_raylet.pyx", line 1853, in ray._raylet.execute_task_with_cancellation_handler
28341 ^[[2m^[[36m(pid=149561)^[[0m   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/function_manager.py", line 574, in load_actor_class
28342 ^[[2m^[[36m(pid=149561)^[[0m     actor_class = self._load_actor_class_from_gcs(
28343 ^[[2m^[[36m(pid=149561)^[[0m   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/function_manager.py", line 677, in _load_actor_class_from_gcs
28344 ^[[2m^[[36m(pid=149561)^[[0m     actor_class = pickle.loads(pickled_class)
28345 ^[[2m^[[36m(pid=149561)^[[0m   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/serve/__init__.py", line 4, in <module>
28346 ^[[2m^[[36m(pid=149561)^[[0m     from ray.serve.api import (
28347 ^[[2m^[[36m(pid=149561)^[[0m   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/serve/api.py", line 15, in <module>
28348 ^[[2m^[[36m(pid=149561)^[[0m     from ray.serve.built_application import BuiltApplication
28349 ^[[2m^[[36m(pid=149561)^[[0m   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/serve/built_application.py", line 7, in <module>
28350 ^[[2m^[[36m(pid=149561)^[[0m     from ray.serve.deployment import Deployment
28351 ^[[2m^[[36m(pid=149561)^[[0m   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/serve/deployment.py", line 24, in <module>
28352 ^[[2m^[[36m(pid=149561)^[[0m     from ray.serve.context import get_global_client
28353 ^[[2m^[[36m(pid=149561)^[[0m   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/serve/context.py", line 12, in <module>
28354 ^[[2m^[[36m(pid=149561)^[[0m     from ray.serve._private.client import ServeControllerClient
28355 ^[[2m^[[36m(pid=149561)^[[0m   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/serve/_private/client.py", line 27, in <module>
28356 ^[[2m^[[36m(pid=149561)^[[0m     from ray.serve._private.deploy_utils import get_deploy_args
28357 ^[[2m^[[36m(pid=149561)^[[0m   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/serve/_private/deploy_utils.py", line 8, in <module>
28358 ^[[2m^[[36m(pid=149561)^[[0m     from ray.serve.schema import ServeApplicationSchema
28359 ^[[2m^[[36m(pid=149561)^[[0m   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/serve/schema.py", line 856, in <module>
28360 ^[[2m^[[36m(pid=149561)^[[0m     class ServeInstanceDetails(BaseModel, extra=Extra.forbid):
28361 ^[[2m^[[36m(pid=149561)^[[0m   File "pydantic/main.py", line 197, in pydantic.main.ModelMetaclass.__new__
28362 ^[[2m^[[36m(pid=149561)^[[0m   File "pydantic/fields.py", line 497, in pydantic.fields.ModelField.infer
28363 ^[[2m^[[36m(pid=149561)^[[0m   File "pydantic/fields.py", line 460, in pydantic.fields.ModelField._get_field_info
28364 ^[[2m^[[36m(pid=149561)^[[0m   File "pydantic/typing.py", line 120, in pydantic.typing.get_origin
28365 ^[[2m^[[36m(pid=149561)^[[0m   File "/home/ray/anaconda3/lib/python3.8/typing.py", line 1271, in get_origin
28366 ^[[2m^[[36m(pid=149561)^[[0m     def get_origin(tp):
28367 ^[[2m^[[36m(pid=149561)^[[0m   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/worker.py", line 778, in sigterm_handler
28368 ^[[2m^[[36m(pid=149561)^[[0m     sys.exit(1)
28369 ^[[2m^[[36m(pid=149561)^[[0m SystemExit: 1
28370 ^[[2m^[[36m(pid=149561)^[[0m AttributeError: 'Worker' object has no attribute 'core_worker'

Looks sort of suspicious (some others had this raised in terminate_asyncio_thread which looks related). Let me check other recent runs to see if this is new.

@edoakes
Copy link
Contributor Author

edoakes commented Aug 31, 2023

Same tracebacks are happening on the release branch, so I'm not concerned about them: https://buildkite.com/ray-project/release-tests-branch/builds/2103#018a44b2-6415-495b-8823-7824101415f8

Note that this test also forcibly kills replicas constantly, so seeing some tracebacks is expected.

@edoakes
Copy link
Contributor Author

edoakes commented Aug 31, 2023

All tests also passed on the serve cancellation PR with this change merged in: https://buildkite.com/ray-project/oss-ci-build-pr/builds/34426

@edoakes edoakes merged commit 9350b16 into ray-project:master Aug 31, 2023
2 checks passed
edoakes added a commit to edoakes/ray that referenced this pull request Aug 31, 2023
…ning them synchronously (ray-project#39112)

Fixes the deadlock described in the linked issue by posting callbacks to the `io_service_`. This avoids running them while holding locks (such as from the task manager).
edoakes added a commit to edoakes/ray that referenced this pull request Aug 31, 2023
…ning them synchronously (ray-project#39112)

Fixes the deadlock described in the linked issue by posting callbacks to the `io_service_`. This avoids running them while holding locks (such as from the task manager).

Signed-off-by: Edward Oakes <[email protected]>
GeneDer pushed a commit that referenced this pull request Sep 1, 2023
* [core] Decrement `ObjectRef._on_completed` callback when callback errors (#38673)

We manually increment and decrement the Python refcount for the callback to avoid Python GC'ing it while it's stored in the core worker. However, the decrement logic is skipped if the callback raises an exception. This change adds a `try: finally` block to always decrement the ref count.

The added test case fails without the change and passes w/ it.

Signed-off-by: Edward Oakes <[email protected]>

* [core] Post `_on_completed` callbacks to `io_service_` instead of running them synchronously (#39112)

Fixes the deadlock described in the linked issue by posting callbacks to the `io_service_`. This avoids running them while holding locks (such as from the task manager).

Signed-off-by: Edward Oakes <[email protected]>

---------

Signed-off-by: Edward Oakes <[email protected]>
LeonLuttenberger pushed a commit to jaidisido/ray that referenced this pull request Sep 5, 2023
…ning them synchronously (ray-project#39112)

Fixes the deadlock described in the linked issue by posting callbacks to the `io_service_`. This avoids running them while holding locks (such as from the task manager).
harborn pushed a commit to harborn/ray that referenced this pull request Sep 8, 2023
…ning them synchronously (ray-project#39112)

Fixes the deadlock described in the linked issue by posting callbacks to the `io_service_`. This avoids running them while holding locks (such as from the task manager).
jimthompson5802 pushed a commit to jimthompson5802/ray that referenced this pull request Sep 12, 2023
…ning them synchronously (ray-project#39112)

Fixes the deadlock described in the linked issue by posting callbacks to the `io_service_`. This avoids running them while holding locks (such as from the task manager).

Signed-off-by: Jim Thompson <[email protected]>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…ning them synchronously (ray-project#39112)

Fixes the deadlock described in the linked issue by posting callbacks to the `io_service_`. This avoids running them while holding locks (such as from the task manager).

Signed-off-by: Victor <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants