
Zero length allocation failure #194

Open · agerasev opened this issue Dec 4, 2023 · 5 comments

Comments

@agerasev

agerasev commented Dec 4, 2023

Hi!

I'm facing an issue with zero-length memory allocation (while trying to run candle on a GTX 970). Here is a minimal reproducer:

let dev = cudarc::driver::CudaDevice::new(0).unwrap();
dev.null::<f32>().unwrap();

On my machine it fails with DriverError(CUDA_ERROR_INVALID_VALUE, "invalid argument"). With this workaround it works fine.

I didn't find documentation for cuMemAlloc_v2, but the documentation for cuMemAlloc says:

If bytesize is 0, cuMemAlloc() returns CUDA_ERROR_INVALID_VALUE

Maybe cuMemAlloc_v2 shouldn't be called at all if num_bytes is zero?
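
For illustration, a guard along those lines might look like this. This is a minimal sketch, not cudarc's actual internals: the alloc_bytes name and the zero-pointer sentinel are assumptions, and it presumes cudarc's raw sys bindings plus a CUDA context already current on the calling thread.

use cudarc::driver::{sys, DriverError};

// Hypothetical wrapper: skip cuMemAlloc_v2 entirely for zero-length requests.
unsafe fn alloc_bytes(num_bytes: usize) -> Result<sys::CUdeviceptr, DriverError> {
    if num_bytes == 0 {
        // cuMemAlloc rejects bytesize == 0, so return a sentinel "empty"
        // pointer instead; it must never be dereferenced or passed to cuMemFree.
        return Ok(0);
    }
    let mut dptr: sys::CUdeviceptr = 0;
    sys::cuMemAlloc_v2(&mut dptr, num_bytes).result()?;
    Ok(dptr)
}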

@agerasev
Author

agerasev commented Dec 4, 2023

My system:

$ uname -a
Linux  6.1.0-13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.55-1 (2023-09-29) x86_64 GNU/Linux
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
$ nvidia-smi -q

==============NVSMI LOG==============

Timestamp                                 : Mon Dec  4 13:40:18 2023
Driver Version                            : 525.125.06
CUDA Version                              : 12.0

Attached GPUs                             : 1
GPU 00000000:03:00.0
    Product Name                          : NVIDIA GeForce GTX 970
    Product Brand                         : GeForce
    Product Architecture                  : Maxwell
    Display Mode                          : Enabled
    Display Active                        : Enabled
    Persistence Mode                      : Enabled
...

@coreylowman
Owner

coreylowman commented Jan 9, 2024

I think this function is behaving as it should - it's returning a result (and the unwrap turns it into a panic). I think this should probably be raised as an issue on candle's repo. Do you know where in candle it's coming from?

@agerasev
Author

agerasev commented Jan 9, 2024

Do you know where in candle it's coming from?

It can occur in many places in candle_core::cuda_backend where alloc or htod_copy is called. There are no checks for zero length there; the calls are assumed to succeed.

I think this function is behaving as it should - it's returning a result (and the unwrap turns it into a panic).

The problem is that this behavior is inconsistent: on most devices a zero-length allocation seems to succeed (and candle relies on this), but on the GTX 970 it fails.
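
As a stopgap, call sites could short-circuit before reaching the driver; a minimal sketch follows (the Option-based handling of the empty case is an illustrative assumption, not candle's actual code):

use std::sync::Arc;
use cudarc::driver::{CudaDevice, CudaSlice, DriverError};

// Hypothetical caller-side guard: never forward a zero-length request to
// the driver, since some devices reject it with CUDA_ERROR_INVALID_VALUE.
fn htod_copy_guarded(
    dev: &Arc<CudaDevice>,
    data: Vec<f32>,
) -> Result<Option<CudaSlice<f32>>, DriverError> {
    if data.is_empty() {
        // Represent an empty buffer without any device allocation.
        return Ok(None);
    }
    dev.htod_copy(data).map(Some)
}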

@coreylowman
Owner

I'm not really sure what we can do in this case; this seems like a driver-level issue. We don't have any device-specific code in cudarc, so I'm not sure what the outcome should be. I'm hesitant to return a null pointer (i.e. not actually call cuMemAlloc) because I don't really know what the downstream effects of that would be or how the CUDA driver interacts with such pointers.

Can you print out the CudaDevice in your example? I want to see if is_async is false:

let dev = cudarc::driver::CudaDevice::new(0).unwrap();
println!("{:?}", dev);

@agerasev
Author

Can you print out the CudaDevice in your example?

CudaDevice {
    cu_device: 0,
    cu_primary_ctx: 0x000055759b945ec0,
    stream: 0x0000000000000000,
    event: 0x000055759bc8d4f0,
    modules: RwLock {
        data: {},
        poisoned: false,
        ..
    },
    ordinal: 0,
    is_async: false,
}
