device_reset! does not seem to work anymore #1579

Closed

maleadt opened this issue Aug 18, 2022 · 2 comments
Labels: bug (Something isn't working), upstream (Somebody else's problem)

Comments

maleadt (Member) commented Aug 18, 2022

Resetting the device does not free memory:

julia> using CUDA

shell> nvidia-smi
Thu Aug 18 21:27:13 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 5000     On   | 00000000:2D:00.0 Off |                    0 |
| 33%   32C    P8    10W / 230W |      1MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

julia> CUDA.rand(1024,1024,1024);

shell> nvidia-smi
Thu Aug 18 21:27:27 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 5000     On   | 00000000:2D:00.0 Off |                    0 |
| 33%   34C    P2    44W / 230W |   4222MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2350891      C   ...1.8.0-rc4+0.x64/bin/julia     4219MiB |
+-----------------------------------------------------------------------------+

julia> CUDA.device_reset!()

shell> nvidia-smi
Thu Aug 18 21:27:35 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 5000     On   | 00000000:2D:00.0 Off |                    0 |
| 33%   34C    P2    44W / 230W |   4212MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2350891      C   ...1.8.0-rc4+0.x64/bin/julia     4209MiB |
+-----------------------------------------------------------------------------+

It seems like the context doesn't get invalidated anymore:

julia> current_context()
CuContext(0x00000000031e5ad0, instance a77a98e6cc90925c)

julia> device_reset!()

julia> current_context()
CuContext(0x00000000031e5ad0, instance ebcc36d4cbb34023)

julia> CUDA.isvalid(current_context())
true

Although we do seem to test that on CI (in the initialization test suite).
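
For reference, a check along these lines should catch the regression (a sketch of the kind of test meant above, not necessarily its exact code): the context captured before the reset should be invalid afterwards, while querying the context again should lazily yield a fresh, valid one.

using CUDA, Test

ctx = current_context()        # capture the active (primary) context
CUDA.device_reset!()

# the captured context should have been invalidated by the reset...
@test !CUDA.isvalid(ctx)

# ...and re-querying should lazily initialize a fresh, valid context
@test CUDA.isvalid(current_context())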

maleadt added the bug label on Aug 18, 2022
maleadt (Member, Author) commented Apr 11, 2024

Even with proper context invalidation, memory isn't freed:

julia> using CUDA
libcuda.cuInit(0) = CUDA_SUCCESS

# allocate some memory
julia> a = CuArray{UInt8}(undef, 1024*1024*1024*8)
libcuda.cuMemPoolCreate(Base.RefValue{Ptr{CUDA.CUmemPoolHandle_st}}, Base.RefValue{CUDA.CUmemPoolProps_st}) = CUDA_SUCCESS
 1: Ptr{CUDA.CUmemPoolHandle_st} @0x000000000384fba8
 2: CUDA.CUmemPoolProps_st(CUDA.CU_MEM_ALLOCATION_TYPE_PINNED, CUDA.CU_MEM_HANDLE_TYPE_NONE, CUDA.CUmemLocation_st(CUDA.CU_MEM_LOCATION_TYPE_DEVICE, 0), Ptr{Nothing} @0x0000000000000000, 0x0000000000000000, (0x00, ..., 0x00))
libcuda.cuMemAllocFromPoolAsync(Base.RefValue{CuPtr{Nothing}}, 1073741824, CuMemoryPool(Ptr{CUDA.CUmemPoolHandle_st} @0x000000000384fba8, CuContext(0x0000000001430b40, instance d0fba442db46d3b7)), CuStream(0x000000000151be30, CuContext(0x0000000001430b40, instance d0fba442db46d3b7)) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000000302800000)

# observe the used memory
julia> nvml_dev = NVML.Device(uuid(device()))
julia> Base.format_bytes(NVML.memory_info(nvml_dev).used)
"10.154 GiB"

# let's reset the device
julia> CUDA.device_reset!()
libcuda.cuDevicePrimaryCtxReset_v2(CuDevice(0)) = CUDA_SUCCESS

# memory is still used, minus the size of the context
julia> Base.format_bytes(NVML.memory_info(nvml_dev).used)
"9.735 GiB"

# nvidia-smi shows the same: not a single compute process, yet a lot of memory used
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX 6000 Ada Gene...    On  |   00000000:2D:00.0 Off |                  Off |
| 30%   43C    P8             25W /  300W |    9285MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

# let's simply exit without performing ANY additional API call
julia> exit()

# ... yet magically now the memory has re-appeared
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX 6000 Ada Gene...    On  |   00000000:2D:00.0 Off |                  Off |
| 30%   44C    P3             64W /  300W |       2MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
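
For convenience, the NVML queries above can be wrapped in a small helper. This is just a sketch reusing the calls already shown; it assumes NVML is available on the system.

# report the device memory used by this GPU as seen by NVML
# (includes context/driver overhead, unlike CUDA.jl's own pool statistics)
function used_gpu_memory(dev::CuDevice=device())
    nvml_dev = NVML.Device(uuid(dev))
    Base.format_bytes(NVML.memory_info(nvml_dev).used)
end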

maleadt added the upstream label on Apr 11, 2024
maleadt (Member, Author) commented Apr 27, 2024

Upstream let me know that stream-ordered memory allocations are explicitly not context-bound. CUDA.jl#master now keeps track of them past device reset and frees them manually when the GC collects those objects.
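
Conceptually, the fix looks something like the following. This is an illustrative sketch of the described approach, not CUDA.jl's actual internals; the registry and function names are hypothetical, and only the low-level driver calls (cuMemAllocAsync/cuMemFreeAsync) are real.

# hypothetical registry of live stream-ordered allocations; kept around
# past device_reset! because pool memory is not owned by the primary context
const outstanding = Set{CuPtr{Nothing}}()

function tracked_alloc(bytes::Integer)
    ptr = Ref{CuPtr{Nothing}}()
    CUDA.cuMemAllocAsync(ptr, bytes, stream())   # stream-ordered allocation
    push!(outstanding, ptr[])
    return ptr[]
end

function tracked_free(ptr::CuPtr{Nothing})
    # called from the owning object's finalizer; legal even after the
    # primary context was reset, since the allocation outlives the context
    if ptr in outstanding
        CUDA.cuMemFreeAsync(ptr, stream())
        delete!(outstanding, ptr)
    end
end

The key point is that freeing happens through the allocation handle itself rather than through context destruction, matching the upstream guidance that pool allocations are not context-bound.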

maleadt closed this as completed on Apr 27, 2024