[CI] linux:https://:gcs_client_test is failing/flaky on master. #34344

Closed
cadedaniel opened this issue Apr 12, 2023 · 1 comment · Fixed by #34411 or #34656
Labels
core Issues that should be addressed in Ray Core

cadedaniel (Member) commented Apr 12, 2023

....
Generated from flaky test tracker. Please do not edit the signature in this section.
DataCaseName-linux:https://:gcs_client_test-END
....

cadedaniel added the flaky-tracker (Issue created via Flaky Test Tracker https://flaky-tests.ray.io/) and core (Issues that should be addressed in Ray Core) labels Apr 12, 2023
cadedaniel self-assigned this Apr 12, 2023
cadedaniel linked a pull request Apr 14, 2023 that will close this issue
rickyyx (Contributor) commented Apr 19, 2023

[image]

Doesn't look like it worked?

rickyyx reopened this Apr 19, 2023
pcmoritz pushed a commit that referenced this issue Apr 22, 2023
Why are these changes needed?

Right now the theory is as follows.

The pubsub io service is created and run inside the GcsServer. That means that if the pubsub io service is accessed after the GcsServer has been destructed, it will segfault.
Right now, upon teardown, when we call rpc::DrainAndResetExecutor, it also recreates the Executor thread pool.
Upon teardown, the following sequence segfaults: DrainAndResetExecutor is called -> the GcsServer's internal pubsub posts a new SendReply to the newly created thread pool -> GcsServer.reset() destructs the pubsub io service -> the SendReply is invoked from the newly created thread pool and touches the destructed io service.
NOTE: if you look at the failure, the segfault comes from the pubsub service:

#2 0x7f92034d9129 in ray::rpc::ServerCallImpl<ray::rpc::InternalPubSubGcsServiceHandler, ray::rpc::GcsSubscriberPollRequest, ray::rpc::GcsSubscriberPollReply>::HandleRequestImpl()::'lambda'(ray::Status, std::__1::function<void ()>, std::__1::function<void ()>)::operator()(ray::Status, std::__1::function<void ()>, std::__1::function<void ()>) const::'lambda'()::operator()() const /proc/self/cwd/bazel-out/k8-opt/bin/_virtual_includes/grpc_common_lib/ray/rpc/server_call.h:212:48
As a fix, I only drain the thread pool, and then reset it after all operations are fully cleaned up (the reset is only needed from tests). I think there's no need to reset it for regular process termination, e.g. for raylet, GCS, or core workers.
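
A minimal sketch of the idea (hypothetical Executor and thread-pool names, not Ray's actual rpc layer): splitting the old drain-and-recreate call into separate Drain and Reset steps means no work can be scheduled onto a fresh pool while the GcsServer is still tearing down.

```cpp
// Minimal sketch, not Ray's implementation: a hypothetical Executor built on
// boost::asio::thread_pool to contrast drain-and-recreate with drain-only teardown.
#include <memory>
#include <boost/asio/post.hpp>
#include <boost/asio/thread_pool.hpp>

class Executor {
 public:
  // Old behavior: drain, then immediately recreate the pool. A SendReply that the
  // GCS pubsub handler posts after this point runs on the *new* pool, possibly
  // after the GcsServer (and its pubsub io service) has already been destructed.
  void DrainAndReset() {
    pool_->join();
    pool_ = std::make_unique<boost::asio::thread_pool>(kNumThreads);
  }

  // Fixed behavior, split into two steps:
  void Drain() { pool_->join(); }  // 1) drain only; do not create a new pool yet.
  void Reset() {                   // 2) recreate only after teardown has finished
    pool_ = std::make_unique<boost::asio::thread_pool>(kNumThreads);  //    (tests only).
  }

  template <typename F>
  void Post(F &&f) {
    boost::asio::post(*pool_, std::forward<F>(f));
  }

 private:
  static constexpr std::size_t kNumThreads = 4;
  std::unique_ptr<boost::asio::thread_pool> pool_ =
      std::make_unique<boost::asio::thread_pool>(kNumThreads);
};
```

With the split, a test teardown can call Drain(), destruct the GcsServer, and only then call Reset(); regular process termination never needs the Reset() step at all.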

Related issue number

Closes #34344

Signed-off-by: SangBin Cho <[email protected]>
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this issue May 4, 2023
architkulkarni pushed a commit to architkulkarni/ray that referenced this issue May 16, 2023
rkooo567 removed the flaky-tracker (Issue created via Flaky Test Tracker https://flaky-tests.ray.io/) label Oct 18, 2023