Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mark all CUDA ccalls as GC safe #2262

Merged
merged 11 commits into from
Feb 14, 2024
Merged

Mark all CUDA ccalls as GC safe #2262

merged 11 commits into from
Feb 14, 2024

Conversation

vchuravy
Copy link
Member

@vchuravy vchuravy commented Feb 9, 2024

Using https://docs.julialang.org/en/v1/stdlib/Profile/#Triggered-During-Execution I observed an issue in #2261 where
we acquire a lock in cuOccupancyMaxPotentialBlockSize and seemingly get live locked. Other threads
suspend in waiting for GC and no forward progress is being made.

Manually marking cuOccupancyMaxPotentialBlockSize as safe to execute GC next to seemingly fixes the issue,
we could add a convenience macro for doing this.


Update by @maleadt: I made all ccalls GC safe, made sure callbacks are GC unsafe again.

x-ref JuliaLang/julia#49933

Copy link

codecov bot commented Feb 9, 2024

Codecov Report

Attention: 7 lines in your changes are missing coverage. Please review.

Comparison is base (38e28b7) 71.31% compared to head (9327071) 71.33%.

❗ Current head 9327071 differs from pull request most recent head 7067f9b. Consider uploading reports for the commit 7067f9b to get more accurate results

Files Patch % Lines
lib/cudadrv/occupancy.jl 37.50% 5 Missing ⚠️
lib/utils/call.jl 92.85% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2262      +/-   ##
==========================================
+ Coverage   71.31%   71.33%   +0.02%     
==========================================
  Files         155      155              
  Lines       14859    14893      +34     
==========================================
+ Hits        10596    10624      +28     
- Misses       4263     4269       +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@maleadt
Copy link
Member

maleadt commented Feb 12, 2024

Pushed a macro version, and used that to mark all libcuda ccall's as GC-safe. I'm not confident this is allowed though, or at least not on all Julia versions, when doing a ccall that may call back into Julia.

Running the MWE from #2261 repeatedly on 1.8 doesn't seem to crash though.

lib/cudadrv/occupancy.jl Outdated Show resolved Hide resolved
lib/utils/call.jl Outdated Show resolved Hide resolved
@maleadt maleadt changed the title Mark cuOccupancyMaxPotentialBlockSize as GC threadsafe Mark all CUDA ccalls as GC safe Feb 13, 2024
@maleadt maleadt added the bugfix This gets something working again. label Feb 13, 2024
@maleadt maleadt force-pushed the vc/mark_gc_safe branch 3 times, most recently from e6ab9d9 to 986f167 Compare February 13, 2024 18:29
gc_state = @ccall(jl_gc_unsafe_enter()::Int8)
try
$body
finally
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was debating if the try/finally was necessary and decided that it wasn't. Since we can only catch exceptions in gc_unsafe. cc: @vtjnash for a double check on that.

Copy link
Contributor

@vtjnash vtjnash Feb 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The try setup seems a bit risky to be present inside jl_gc_safe_enter, a catch/finally already implicitly reset the gc state upon leave (normal or error)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The try happens after jl_gc_unsafe_enter?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then I don't see how this is reachable, since cfunction already had to do an unsafe-enter to be able to dispatch into the runtime

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see this is for v1.8. That seems okay. I am surprised that is still supported here

@maleadt maleadt force-pushed the vc/mark_gc_safe branch 3 times, most recently from 3f9e13c to 4779f9c Compare February 14, 2024 08:58
Don't do the ccall change for cuQuantum as it doesn't support
the latest cuTENSOR, so it would be tricky maintaining both
in tree.

[skip julia]
[skip cuda]
@maleadt maleadt merged commit fc1d509 into master Feb 14, 2024
1 check was pending
@maleadt maleadt deleted the vc/mark_gc_safe branch February 14, 2024 09:24
@maleadt maleadt linked an issue Feb 14, 2024 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix This gets something working again.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Multi-threaded code hanging forever with Julia 1.10
3 participants