
Quantized matmul with CUDA sets the result to zero instead of properly computing it #529

Closed
saharNooby opened this issue Sep 17, 2023 · 5 comments

Comments


saharNooby commented Sep 17, 2023

SOLVED! Read the thread for the investigation details and the solution.


In rwkv.cpp, I'm updating ggml from commit a1d0ea7 to the most recent commit 8ca2c19.

After the update, FP32, FP16, and quantized inference on the CPU work, and FP32 and FP16 inference on the GPU (CUDA) also works.

However, quantized inference on the GPU (CUDA) does not work: it silently leaves the result tensors filled with zeros. I'm using the same offloading method that worked fine before: set the tensor's backend, then call ggml_cuda_transform_tensor.

Here's a minimal example that reproduces the behavior:

#include <ggml.h>
#include <ggml-cuda.h> // declares ggml_cuda_transform_tensor; adjust the include path to your ggml layout

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define SET_ELEMENT_F32(tensor, i, value) ((float *) tensor->data)[i] = value

void run_test(bool offload) {
    struct ggml_init_params params = {
        .mem_size   = 16 * 1024,
        .mem_buffer = NULL,
        .no_alloc   = false,
    };

    struct ggml_context * ctx = ggml_init(params);

    // ---

    struct ggml_tensor * x = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 32, 1);

    for (int i = 0; i < 32; i++) {
        SET_ELEMENT_F32(x, i, 1.0F * i);
    }

    // ---

    struct ggml_tensor * x_quantized = ggml_new_tensor_2d(ctx, GGML_TYPE_Q4_0, 32, 1);

    int64_t hist[16];
    ggml_quantize_chunk(x_quantized->type, (const float *) x->data, x_quantized->data, 0, 32, hist);

    if (offload) {
        x->backend = GGML_BACKEND_GPU;
        ggml_cuda_transform_tensor(x->data, x);
        
        x_quantized->backend = GGML_BACKEND_GPU;
        ggml_cuda_transform_tensor(x_quantized->data, x_quantized);
    }

    // ---

    struct ggml_tensor * y = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 32);

    for (int i = 0; i < 32; i++) {
        SET_ELEMENT_F32(y, i, 1.0F * i);
    }

    // ---

    struct ggml_tensor * mul0 = ggml_mul_mat(ctx, x, y);
    struct ggml_tensor * mul1 = ggml_mul_mat(ctx, x_quantized, y);

    struct ggml_cgraph graph = ggml_build_forward(mul0);

    ggml_build_forward_expand(&graph, mul1);

    struct ggml_cplan plan = ggml_graph_plan(&graph, 2);

    uint8_t * work_data = (uint8_t *) malloc(plan.work_size);
    plan.work_data = work_data;

    ggml_graph_compute(&graph, &plan);

    free(work_data);

    fprintf(stderr, "---\n");
    fprintf(stderr, "offload = %d\n", offload);
    fprintf(stderr, "FP32 result = %f\n", ((float *) mul0->data)[0]);
    fprintf(stderr, "Q4_0 result = %f\n", ((float *) mul1->data)[0]);

    ggml_free(ctx);
}

int main(void) {
    #ifdef GGML_USE_CUBLAS

    run_test(false);
    run_test(true);

    #endif

    return 0;
}

On my Windows 10 machine it prints:

---
offload = 0
FP32 result = 10416.000000
Q4_0 result = 10361.083984
---
offload = 1
FP32 result = 10416.000000
Q4_0 result = 0.000000

The FP32 result matches the exact dot product (the sum of i * i for i = 0..31 is 10416). I expect the Q4_0 result when offloading to be equal to the corresponding result when offload is not performed.

I'm 90% sure that this is not a bug in ggml and that I am doing something wrong. How can the code above be fixed?

@saharNooby (Author)

Hm, maybe it is a bug... If I replace line 5545 of ggml-cuda.cu with const bool use_mul_mat_vec_q = false && g_compute_capabilities[id] >= MIN_CC_DP4A && mul_mat_vec_q_implemented; (basically, disabling the quantized mat-vec kernel), I get a reasonable result:

---
offload = 0
FP32 result = 10416.000000
Q4_0 result = 10361.083984
---
offload = 1
FP32 result = 10416.000000
Q4_0 result = 10354.000000
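
For reference, the change amounts to short-circuiting the condition that selects the integer-intrinsic kernels. A before/after sketch (line number and surrounding variable names are from the ggml-cuda.cu revision I'm on and may differ in other checkouts):

// before: use the DP4A-based quantized mat-vec kernels when the device supports them
const bool use_mul_mat_vec_q = g_compute_capabilities[id] >= MIN_CC_DP4A && mul_mat_vec_q_implemented;

// after: always take the dequantize-based fallback path
const bool use_mul_mat_vec_q = false && g_compute_capabilities[id] >= MIN_CC_DP4A && mul_mat_vec_q_implemented;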

@Green-Sky (Contributor)

I think I came across the same error: ggerganov/llama.cpp#3202 (comment)

@saharNooby (Author)

@Green-Sky Thanks for pointing me to that. When building and running the file in Debug mode, I indeed get an assertion failure in vec_dot_q4_0_q8_1_impl:

rwkv.cpp\ggml\src\ggml-cuda.cu:1554: block: [0,0,0], thread: [0,0,0] Assertion `false` failed.
...
CUDA error 710 at rwkv.cpp\ggml\src\ggml-cuda.cu:6132: device-side assert triggered

I'll dig deeper and try to investigate why __CUDA_ARCH__ is less than MIN_CC_DP4A (610).
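
For context: __CUDA_ARCH__ is a compile-time value determined by the architectures nvcc generates code for, not by the GPU the binary later runs on. The quantized dot-product kernels in ggml-cuda.cu are guarded roughly like this (a simplified sketch, not the exact code), so if the file is compiled only for a virtual architecture below sm_61, the assert(false) branch is what ends up in the binary even though the physical GPU supports DP4A:

// Simplified sketch of the guard pattern; the real vec_dot_q4_0_q8_1_impl takes the
// packed quant blocks and scales, and computes a __dp4a()-based integer dot product.
static __device__ __forceinline__ float vec_dot_q4_0_q8_1_impl_sketch(/* quantized operands */) {
#if __CUDA_ARCH__ >= MIN_CC_DP4A // integer dot-product intrinsics need compute capability >= 6.1
    float result = 0.0f;
    // ... __dp4a()-based integer dot product, scaled by the block scales ...
    return result;
#else
    assert(false); // compiled for an arch without DP4A support: this is the assert that fires
    return 0.0f;   // unreachable, only to satisfy the compiler
#endif
}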

@saharNooby (Author)

I can now force it to work with an ugly crutch: in the main CMakeLists.txt file, I force the CUDA architecture with an nvcc compiler option:

add_compile_options("$<$<COMPILE_LANGUAGE:CUDA>:-arch=sm_70>")

IDK how to do it more... properly.

@saharNooby (Author) commented Sep 17, 2023

@JohannesGaessler Hi! Your recent comment was very helpful for me in debugging this issue. If possible, can you advise on how to properly configure CUDA archs in CMakeLists.txt (see the message above)?

EDIT: Sorry for pinging you. I had a set_property(TARGET ggml PROPERTY CUDA_ARCHITECTURES OFF) line further down in my CMakeLists.txt which was overriding any proper arch configuration.

Setting CMAKE_CUDA_ARCHITECTURES the normal way works now; the issue is resolved.
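
For anyone hitting the same problem, a sketch of what the fix looks like in CMake (the architecture list is only an example; pick the compute capabilities of the GPUs you target):

# Do NOT keep a line like this; in my case it left nvcc building for its default,
# pre-DP4A architecture, which is what made the device-side assert fire:
#   set_property(TARGET ggml PROPERTY CUDA_ARCHITECTURES OFF)

# Instead, rely on CMake's detection or set the architectures explicitly
# before the CUDA language is enabled / the ggml target is added:
if (NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
    set(CMAKE_CUDA_ARCHITECTURES "61;70;75") # example list; adjust to your GPUs
endif()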
