Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Fix the segfault from Opencensus upon shutdown #36906

Merged
merged 5 commits into from
Jul 6, 2023

Conversation

rkooo567
Copy link
Contributor

@rkooo567 rkooo567 commented Jun 28, 2023

Why are these changes needed?

Seems like client_ in the metrics exporter is not lock-protected, but it is accessed by 2 thread (by a metrics export thread + task execution thread when ExportNow is called). I added a lock. I am not 100% sure if this will fix the issue because I was unable to reproduce the issue, but we can ask the user to try after merging this PR.

Related issue number

Fixes #36885

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: SangBin Cho <[email protected]>
@@ -146,7 +146,6 @@ CoreWorkerProcessImpl::CoreWorkerProcessImpl(const CoreWorkerOptions &options)

CoreWorkerProcessImpl::~CoreWorkerProcessImpl() {
RAY_LOG(INFO) << "Destructing CoreWorkerProcessImpl. pid: " << getpid();
RAY_LOG(DEBUG) << "Stats stop in core worker.";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we added the equivalent logs to stats::Shutdown

Copy link
Contributor

@rickyyx rickyyx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we do benchmarks on it?

And looks like from the stacktrace, the SIGSEV is on the request object? Might be an opencensus bug?

@rkooo567
Copy link
Contributor Author

rkooo567 commented Jul 6, 2023

And looks like from the stacktrace, the SIGSEV is on the request object? Might be an opencensus bug?

We are sending a request actually, and the problem is from protobuf, so it is unlikely related to Opencensus. It should be something shutdown ordering issue. I am not sure if my PR will actually fix the following issue, but client is not protected properly, and we should fix it anyway.

@rkooo567 rkooo567 merged commit 9b6a449 into ray-project:master Jul 6, 2023
2 checks passed
rkooo567 added a commit that referenced this pull request Jul 6, 2023
Seems like client_ in the metrics exporter is not lock-protected, but it is accessed by 2 thread (by a metrics export thread + task execution thread when ExportNow is called). I added a lock. I am not 100% sure if this will fix the issue because I was unable to reproduce the issue, but we can ask the user to try after merging this PR.
@rickyyx
Copy link
Contributor

rickyyx commented Jul 7, 2023

I am a bit concerned of the potential perf impact from the locks?

@rkooo567
Copy link
Contributor Author

@rickyyx the report code happens in a separate thread, and the main thread is only locked upon exit, so it should be okay. Let me know if you have other parts that could be problematic...

@zhe-thoughts
Copy link
Collaborator

@rkooo567 is there a cherry-pick PR? Thanks

rkooo567 added a commit to rkooo567/ray that referenced this pull request Jul 11, 2023
)

Seems like client_ in the metrics exporter is not lock-protected, but it is accessed by 2 thread (by a metrics export thread + task execution thread when ExportNow is called). I added a lock. I am not 100% sure if this will fix the issue because I was unable to reproduce the issue, but we can ask the user to try after merging this PR.
@rkooo567
Copy link
Contributor Author

created here; #37311

SongGuyang pushed a commit to alipay/ant-ray that referenced this pull request Jul 12, 2023
)

Seems like client_ in the metrics exporter is not lock-protected, but it is accessed by 2 thread (by a metrics export thread + task execution thread when ExportNow is called). I added a lock. I am not 100% sure if this will fix the issue because I was unable to reproduce the issue, but we can ask the user to try after merging this PR.

Signed-off-by: 久龙 <[email protected]>
bveeramani pushed a commit that referenced this pull request Jul 13, 2023
Seems like client_ in the metrics exporter is not lock-protected, but it is accessed by 2 thread (by a metrics export thread + task execution thread when ExportNow is called). I added a lock. I am not 100% sure if this will fix the issue because I was unable to reproduce the issue, but we can ask the user to try after merging this PR.
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
)

Seems like client_ in the metrics exporter is not lock-protected, but it is accessed by 2 thread (by a metrics export thread + task execution thread when ExportNow is called). I added a lock. I am not 100% sure if this will fix the issue because I was unable to reproduce the issue, but we can ask the user to try after merging this PR.

Signed-off-by: harborn <[email protected]>
harborn pushed a commit to harborn/ray that referenced this pull request Aug 17, 2023
)

Seems like client_ in the metrics exporter is not lock-protected, but it is accessed by 2 thread (by a metrics export thread + task execution thread when ExportNow is called). I added a lock. I am not 100% sure if this will fix the issue because I was unable to reproduce the issue, but we can ask the user to try after merging this PR.
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
)

Seems like client_ in the metrics exporter is not lock-protected, but it is accessed by 2 thread (by a metrics export thread + task execution thread when ExportNow is called). I added a lock. I am not 100% sure if this will fix the issue because I was unable to reproduce the issue, but we can ask the user to try after merging this PR.

Signed-off-by: e428265 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Core] OpenCensus segfault upon shutdown
4 participants