device_reset! does not seem to work anymore #1579

Closed

maleadt opened this issue Aug 18, 2022 · 2 comments
Labels: bug (Something isn't working), upstream (Somebody else's problem)

Comments

maleadt (Member) commented Aug 18, 2022

Resetting the device does not free memory:

julia> using CUDA

shell> nvidia-smi
Thu Aug 18 21:27:13 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 5000     On   | 00000000:2D:00.0 Off |                    0 |
| 33%   32C    P8    10W / 230W |      1MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

julia> CUDA.rand(1024,1024,1024);

shell> nvidia-smi
Thu Aug 18 21:27:27 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 5000     On   | 00000000:2D:00.0 Off |                    0 |
| 33%   34C    P2    44W / 230W |   4222MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2350891      C   ...1.8.0-rc4+0.x64/bin/julia     4219MiB |
+-----------------------------------------------------------------------------+

julia> CUDA.device_reset!()

shell> nvidia-smi
Thu Aug 18 21:27:35 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 5000     On   | 00000000:2D:00.0 Off |                    0 |
| 33%   34C    P2    44W / 230W |   4212MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2350891      C   ...1.8.0-rc4+0.x64/bin/julia     4209MiB |
+-----------------------------------------------------------------------------+

It seems like the context doesn't get invalidated anymore:

julia> current_context()
CuContext(0x00000000031e5ad0, instance a77a98e6cc90925c)

julia> device_reset!()

julia> current_context()
CuContext(0x00000000031e5ad0, instance ebcc36d4cbb34023)

julia> CUDA.isvalid(current_context())
true

Although we do seem to test that on CI (in the initialization test suite).
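
For reference, a check along these lines should catch the regression (a sketch of the kind of test meant above, not necessarily its exact code): the context captured before the reset should be invalid afterwards, while querying the context again should lazily yield a fresh, valid one.

using CUDA, Test

ctx = current_context()        # capture the active (primary) context
CUDA.device_reset!()

# the captured context should have been invalidated by the reset...
@test !CUDA.isvalid(ctx)

# ...and re-querying should lazily initialize a fresh, valid context
@test CUDA.isvalid(current_context())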

maleadt added the bug label on Aug 18, 2022
maleadt (Member, Author) commented Apr 11, 2024

Even with proper context invalidation, memory isn't freed:

julia> using CUDA
libcuda.cuInit(0) = CUDA_SUCCESS

# allocate some memory
julia> a = CuArray{UInt8}(undef, 1024*1024*1024*8)
libcuda.cuMemPoolCreate(Base.RefValue{Ptr{CUDA.CUmemPoolHandle_st}}, Base.RefValue{CUDA.CUmemPoolProps_st}) = CUDA_SUCCESS
 1: Ptr{CUDA.CUmemPoolHandle_st} @0x000000000384fba8
 2: CUDA.CUmemPoolProps_st(CUDA.CU_MEM_ALLOCATION_TYPE_PINNED, CUDA.CU_MEM_HANDLE_TYPE_NONE, CUDA.CUmemLocation_st(CUDA.CU_MEM_LOCATION_TYPE_DEVICE, 0), Ptr{Nothing} @0x0000000000000000, 0x0000000000000000, (0x00, ..., 0x00))
libcuda.cuMemAllocFromPoolAsync(Base.RefValue{CuPtr{Nothing}}, 1073741824, CuMemoryPool(Ptr{CUDA.CUmemPoolHandle_st} @0x000000000384fba8, CuContext(0x0000000001430b40, instance d0fba442db46d3b7)), CuStream(0x000000000151be30, CuContext(0x0000000001430b40, instance d0fba442db46d3b7)) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000000302800000)

# observe the used memory
julia> nvml_dev = NVML.Device(uuid(device()))
julia> Base.format_bytes(NVML.memory_info(nvml_dev).used)
"10.154 GiB"

# let's reset the device
julia> CUDA.device_reset!()
libcuda.cuDevicePrimaryCtxReset_v2(CuDevice(0)) = CUDA_SUCCESS

# memory is still used, minus the size of the context
julia> Base.format_bytes(NVML.memory_info(nvml_dev).used)
"9.735 GiB"

# nvidia-smi shows the same: not a single compute process, yet a lot of memory used
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX 6000 Ada Gene...    On  |   00000000:2D:00.0 Off |                  Off |
| 30%   43C    P8             25W /  300W |    9285MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

# let's simply exit without performing ANY additional API call
julia> exit()

# ... yet magically now the memory has re-appeared
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX 6000 Ada Gene...    On  |   00000000:2D:00.0 Off |                  Off |
| 30%   44C    P3             64W /  300W |       2MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
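
For convenience, the NVML queries above can be wrapped in a small helper. This is just a sketch reusing the calls already shown; it assumes NVML is available on the system.

# report the device memory used by this GPU as seen by NVML
# (includes context/driver overhead, unlike CUDA.jl's own pool statistics)
function used_gpu_memory(dev::CuDevice=device())
    nvml_dev = NVML.Device(uuid(dev))
    Base.format_bytes(NVML.memory_info(nvml_dev).used)
end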

maleadt added the upstream label on Apr 11, 2024
maleadt (Member, Author) commented Apr 27, 2024

Upstream let me know that stream-ordered memory allocations are explicitly not context-bound. CUDA.jl#master now keeps track of them past device reset and frees them manually when the GC collects those objects.
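
Conceptually, the fix looks something like the following. This is an illustrative sketch of the described approach, not CUDA.jl's actual internals; the registry and function names are hypothetical, and only the low-level driver calls (cuMemAllocAsync/cuMemFreeAsync) are real.

# hypothetical registry of live stream-ordered allocations; kept around
# past device_reset! because pool memory is not owned by the primary context
const outstanding = Set{CuPtr{Nothing}}()

function tracked_alloc(bytes::Integer)
    ptr = Ref{CuPtr{Nothing}}()
    CUDA.cuMemAllocAsync(ptr, bytes, stream())   # stream-ordered allocation
    push!(outstanding, ptr[])
    return ptr[]
end

function tracked_free(ptr::CuPtr{Nothing})
    # called from the owning object's finalizer; legal even after the
    # primary context was reset, since the allocation outlives the context
    if ptr in outstanding
        CUDA.cuMemFreeAsync(ptr, stream())
        delete!(outstanding, ptr)
    end
end

The key point is that freeing happens through the allocation handle itself rather than through context destruction, matching the upstream guidance that pool allocations are not context-bound.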

maleadt closed this as completed on Apr 27, 2024