
Quantized matmul with CUDA sets the result to zero instead of properly computing it #529

Closed
saharNooby opened this issue Sep 17, 2023 · 5 comments

Comments


saharNooby commented Sep 17, 2023

SOLVED! Read the thread for the investigation details and the solution.


In rwkv.cpp, I'm updating ggml from commit a1d0ea7 to the most recent commit 8ca2c19.

After the update, FP32, FP16, and quantized inference on the CPU work, and FP32 and FP16 inference on the GPU (CUDA) also works.

However, quantized inference on the GPU (CUDA) does not work: it silently leaves the result tensors filled with zeros. I'm using the same offloading method that worked fine before: set the tensor's backend, then call ggml_cuda_transform_tensor.

Here's a minimal example that reproduces the behavior:

#include <ggml.h>
#include <ggml-cuda.h> // declares ggml_cuda_transform_tensor; adjust the include path to your ggml layout

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define SET_ELEMENT_F32(tensor, i, value) ((float *) tensor->data)[i] = value

void run_test(bool offload) {
    struct ggml_init_params params = {
        .mem_size   = 16 * 1024,
        .mem_buffer = NULL,
        .no_alloc   = false,
    };

    struct ggml_context * ctx = ggml_init(params);

    // ---

    struct ggml_tensor * x = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 32, 1);

    for (int i = 0; i < 32; i++) {
        SET_ELEMENT_F32(x, i, 1.0F * i);
    }

    // ---

    struct ggml_tensor * x_quantized = ggml_new_tensor_2d(ctx, GGML_TYPE_Q4_0, 32, 1);

    int64_t hist[16];
    ggml_quantize_chunk(x_quantized->type, (const float *) x->data, x_quantized->data, 0, 32, hist);

    if (offload) {
        x->backend = GGML_BACKEND_GPU;
        ggml_cuda_transform_tensor(x->data, x);
        
        x_quantized->backend = GGML_BACKEND_GPU;
        ggml_cuda_transform_tensor(x_quantized->data, x_quantized);
    }

    // ---

    struct ggml_tensor * y = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 32);

    for (int i = 0; i < 32; i++) {
        SET_ELEMENT_F32(y, i, 1.0F * i);
    }

    // ---

    struct ggml_tensor * mul0 = ggml_mul_mat(ctx, x, y);
    struct ggml_tensor * mul1 = ggml_mul_mat(ctx, x_quantized, y);

    struct ggml_cgraph graph = ggml_build_forward(mul0);

    ggml_build_forward_expand(&graph, mul1);

    struct ggml_cplan plan = ggml_graph_plan(&graph, 2);

    uint8_t * work_data = (uint8_t *) malloc(plan.work_size);
    plan.work_data = work_data;

    ggml_graph_compute(&graph, &plan);

    free(work_data);

    fprintf(stderr, "---\n");
    fprintf(stderr, "offload = %d\n", offload);
    fprintf(stderr, "FP32 result = %f\n", ((float *) mul0->data)[0]);
    fprintf(stderr, "Q4_0 result = %f\n", ((float *) mul1->data)[0]);

    ggml_free(ctx);
}

int main(void) {
    #ifdef GGML_USE_CUBLAS

    run_test(false);
    run_test(true);

    #endif

    return 0;
}

On my Windows 10 machine it prints:

---
offload = 0
FP32 result = 10416.000000
Q4_0 result = 10361.083984
---
offload = 1
FP32 result = 10416.000000
Q4_0 result = 0.000000

The FP32 result matches the exact dot product (the sum of i * i for i = 0..31 is 10416). I expect the Q4_0 result when offloading to be equal to the corresponding result when offload is not performed.

I'm 90% sure that this is not a bug in ggml and that I am doing something wrong. How can the code above be fixed?

@saharNooby (Author)

Hm, maybe it is a bug... If I replace line 5545 of ggml-cuda.cu with const bool use_mul_mat_vec_q = false && g_compute_capabilities[id] >= MIN_CC_DP4A && mul_mat_vec_q_implemented; (basically, disabling the quantized mat-vec kernel), I get a reasonable result:

---
offload = 0
FP32 result = 10416.000000
Q4_0 result = 10361.083984
---
offload = 1
FP32 result = 10416.000000
Q4_0 result = 10354.000000
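
For reference, the change amounts to short-circuiting the condition that selects the integer-intrinsic kernels. A before/after sketch (line number and surrounding variable names are from the ggml-cuda.cu revision I'm on and may differ in other checkouts):

// before: use the DP4A-based quantized mat-vec kernels when the device supports them
const bool use_mul_mat_vec_q = g_compute_capabilities[id] >= MIN_CC_DP4A && mul_mat_vec_q_implemented;

// after: always take the dequantize-based fallback path
const bool use_mul_mat_vec_q = false && g_compute_capabilities[id] >= MIN_CC_DP4A && mul_mat_vec_q_implemented;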

@Green-Sky (Contributor)

I think I came across the same error: ggerganov/llama.cpp#3202 (comment)

@saharNooby (Author)

@Green-Sky Thanks for pointing me to that. When building and running the file in Debug mode, I indeed get an assertion failure in vec_dot_q4_0_q8_1_impl:

rwkv.cpp\ggml\src\ggml-cuda.cu:1554: block: [0,0,0], thread: [0,0,0] Assertion `false` failed.
...
CUDA error 710 at rwkv.cpp\ggml\src\ggml-cuda.cu:6132: device-side assert triggered

I'll dig deeper and try to investigate why __CUDA_ARCH__ is less than MIN_CC_DP4A (610).
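
For context: __CUDA_ARCH__ is a compile-time value determined by the architectures nvcc generates code for, not by the GPU the binary later runs on. The quantized dot-product kernels in ggml-cuda.cu are guarded roughly like this (a simplified sketch, not the exact code), so if the file is compiled only for a virtual architecture below sm_61, the assert(false) branch is what ends up in the binary even though the physical GPU supports DP4A:

// Simplified sketch of the guard pattern; the real vec_dot_q4_0_q8_1_impl takes the
// packed quant blocks and scales, and computes a __dp4a()-based integer dot product.
static __device__ __forceinline__ float vec_dot_q4_0_q8_1_impl_sketch(/* quantized operands */) {
#if __CUDA_ARCH__ >= MIN_CC_DP4A // integer dot-product intrinsics need compute capability >= 6.1
    float result = 0.0f;
    // ... __dp4a()-based integer dot product, scaled by the block scales ...
    return result;
#else
    assert(false); // compiled for an arch without DP4A support: this is the assert that fires
    return 0.0f;   // unreachable, only to satisfy the compiler
#endif
}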

@saharNooby (Author)

I can now force it to work with an ugly crutch: in the main CMakeLists.txt file, I force the CUDA architecture with an nvcc compiler option:

add_compile_options("$<$<COMPILE_LANGUAGE:CUDA>:-arch=sm_70>")

IDK how to do it more... properly.

@saharNooby (Author) commented Sep 17, 2023

@JohannesGaessler Hi! Your recent comment was very helpful for me in debugging this issue. If possible, can you advise on how to properly configure CUDA archs in CMakeLists.txt (see the message above)?

EDIT: Sorry for pinging you. I had a set_property(TARGET ggml PROPERTY CUDA_ARCHITECTURES OFF) line further down in my CMakeLists.txt which was overriding any proper arch configuration.

Setting CMAKE_CUDA_ARCHITECTURES the normal way works now; the issue is resolved.
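
For anyone hitting the same problem, a sketch of what the fix looks like in CMake (the architecture list is only an example; pick the compute capabilities of the GPUs you target):

# Do NOT keep a line like this; in my case it left nvcc building for its default,
# pre-DP4A architecture, which is what made the device-side assert fire:
#   set_property(TARGET ggml PROPERTY CUDA_ARCHITECTURES OFF)

# Instead, rely on CMake's detection or set the architectures explicitly
# before the CUDA language is enabled / the ggml target is added:
if (NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
    set(CMAKE_CUDA_ARCHITECTURES "61;70;75") # example list; adjust to your GPUs
endif()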
