
Rework context handling #2346

Merged
maleadt merged 21 commits into master from tb/context on Apr 26, 2024
Conversation

@maleadt (Member) commented Apr 25, 2024

Problem

CUDA contexts are annoying:

  • references to identical contexts can be constructed independently through different APIs (cuCtxCreate, cuCtxGetCurrent, etc.)
  • destroying a context means that all resources allocated in that context are now invalid, and cannot be used in any API call
  • after destroying a context, creating a new one may result in the same handle being reused
  • Julia can destroy objects out-of-order, e.g., first the CuContext, then a CuStream, even though the stream object had a reference to the context

All this significantly complicates our ability to determine whether objects are safe to use and/or need to be finalized. Currently, we solve this with a factory method that is guaranteed to return a unique context object for every session of a context handle (i.e., after handle destruction and recreation, this method returns a different object despite the handle being identical). Combined with targeted invalidation of that object from all known APIs that destroy a context, this makes it possible to automatically determine context validity in all derived objects storing a reference.

All this relies on multiple global dictionaries, which are slow and fragile: they have caused several thread-safety issues, and are problematic in finalizers, where we can't take locks to safely access the global dict. The approach also isn't guaranteed to be correct, especially when cooperating with other software that may call context APIs.
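To illustrate the fragility, the dictionary-based factory pattern looks roughly like this (a hypothetical sketch; the names and details are illustrative and do not match CUDA.jl's actual internals):

```julia
# Global state: a lock-protected dict mapping driver handles to context objects.
const context_lock = ReentrantLock()
const valid_contexts = Dict{Ptr{Cvoid},Any}()

mutable struct CtxObject
    handle::Ptr{Cvoid}
    valid::Bool          # flipped to false when the context is destroyed
end

# Factory: return the unique object for this handle, creating it on first use.
function get_context(handle::Ptr{Cvoid})
    lock(context_lock) do
        get!(valid_contexts, handle) do
            CtxObject(handle, true)
        end
    end
end

# Targeted invalidation, which must be called from every API known to destroy
# a context; missing one call site silently breaks validity tracking.
function invalidate!(ctx::CtxObject)
    lock(context_lock) do
        ctx.valid = false
        delete!(valid_contexts, ctx.handle)
    end
end
```

Note that both the factory and the invalidation path need the global lock, which is exactly what a finalizer cannot safely take.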

Solution

CUDA 12.0 provides a new driver API, cuCtxGetId, which returns a monotonically incrementing identifier that does change when a context is destroyed and re-allocated. This greatly simplifies the design:

  • we no longer need a single unique CuContext object, as we can uniquely identify the object by its identifier
  • we can simply check validity by ensuring we can fetch a context's ID, and that the ID matches what we stored at construction time

This makes it possible to demote CuContext to a simple immutable type, and get rid of all context-related global state, improving thread- and finalizer-safety, while making it much cheaper to store context objects in derived resources.
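To make the new scheme concrete, here is a minimal Julia sketch (illustrative only, and assuming a `libcuda` library handle is available; this is not the PR's actual code):

```julia
# CuContext stores the driver handle together with the ID that cuCtxGetId
# reported at construction time; validity is checked by re-querying the ID.
struct CuContext            # immutable: no mutable validity flag, no global dict
    handle::Ptr{Cvoid}
    id::UInt64
end

function CuContext(handle::Ptr{Cvoid})
    id = Ref{Culonglong}(0)
    # CUresult cuCtxGetId(CUcontext ctx, unsigned long long *ctxId)
    res = @ccall libcuda.cuCtxGetId(handle::Ptr{Cvoid},
                                    id::Ptr{Culonglong})::Cint
    res == 0 || error("cuCtxGetId failed with error $res")
    CuContext(handle, id[])
end

function Base.isvalid(ctx::CuContext)
    id = Ref{Culonglong}(0)
    res = @ccall libcuda.cuCtxGetId(ctx.handle::Ptr{Cvoid},
                                    id::Ptr{Culonglong})::Cint
    # invalid if the call fails (context destroyed), or if the ID changed
    # (the handle was reused by a newly created context)
    res == 0 && id[] == ctx.id
end
```

Since the ID is monotonically incrementing across context creations, a stale handle that the driver happens to reuse will report a different ID, so no bookkeeping outside the struct itself is needed.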

The flip side: CUDA.jl will require a CUDA 12.x-compatible driver. This seems acceptable to me, given the improvements in this PR and the fact that CUDA 12 has been out for quite a while. People relying on CUDA 11.x can always keep using CUDA.jl 5.x. If needed, we can even make additional releases of CUDA.jl 5.x if backport PRs are suggested.

cc @vchuravy

@maleadt added labels on Apr 25, 2024: enhancement (New feature or request), cuda libraries (Stuff about CUDA library wrappers.), performance (How fast can we go?)
codecov bot commented Apr 25, 2024

Codecov Report

Attention: Patch coverage is 79.24528%, with 11 lines in your changes missing coverage. Please review.

Project coverage is 60.33%. Comparing base (5dd6bb2) to head (a534c10).

❗ Current head a534c10 differs from the pull request's most recent head 51fb828. Consider uploading reports for commit 51fb828 to get more accurate results.

Files Patch % Lines
lib/cudadrv/stream.jl 42.85% 4 Missing ⚠️
src/memory.jl 70.00% 3 Missing ⚠️
lib/cudadrv/context.jl 92.85% 2 Missing ⚠️
lib/utils/call.jl 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2346      +/-   ##
==========================================
- Coverage   62.09%   60.33%   -1.77%     
==========================================
  Files         155      155              
  Lines       14965    14926      -39     
==========================================
- Hits         9293     9005     -288     
- Misses       5672     5921     +249     


@maleadt maleadt marked this pull request as ready for review April 25, 2024 13:13
Comment on lines +184 to +187
@test @allocated(current_context()) == 0
@test @allocated(context()) == 0
@test @allocated(stream()) == 0
@test @allocated(device()) == 0
maleadt (Member, Author):
Sorry @KristofferC...

A contributor replied:
Straight to blacklist ;)

Review thread on lib/cudadrv/context.jl (outdated, resolved)
maleadt commented Apr 25, 2024

Alternatively, maybe I should just get rid of the ability to reset contexts; this functionality isn't even handled properly by NVIDIA's own libraries...

maleadt commented Apr 26, 2024

Or, even better, we could support resetting contexts only on CUDA 12+. That should make it possible to keep compatibility with older drivers, as long as people don't reset the device.

maleadt commented Apr 26, 2024

Alright, CUDA 11.x support is back. Let's merge this once CI is green.

@maleadt maleadt merged commit 752571b into master Apr 26, 2024
1 check passed
@maleadt maleadt deleted the tb/context branch April 26, 2024 14:28
Labels: cuda libraries, enhancement, performance
3 participants