Some Issues with mul_mat (at least on small matrices) #478

Closed
joshuadahlunr opened this issue Aug 25, 2023 · 4 comments
Comments

@joshuadahlunr

Let me begin by apologizing if I have missed something incredibly basic, but I have tried to come at this problem from a multitude of angles. Copying the usage example from the documentation at the top of ggml.h, then expanding it to one of the simplest matrix multiplication problems possible, leads to code that looks something like this:

#include <ggml.h>
#include <array>
#include <cstring> // memcpy
#include <iostream>

int main() {
    std::array<float, 9> tData = {1, 2, 3, 4, 5, 6, 7, 8, 9}; // {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}} in row major order
    std::array<float, 9> t2Data = {9, 8, 7, 6, 5, 4, 3, 2, 1}; // {{9, 8, 7}, {6, 5, 4}, {3, 2, 1}} in row major order

    auto ctx = ggml_init({/*mem_size=*/5000, /*mem_buffer=*/nullptr, /*no_alloc=*/false});

    auto t = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 3, 3);
    auto t2 = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 3, 3);

    memcpy(t->data, tData.data(), 9 * sizeof(float));
    memcpy(t2->data, t2Data.data(), 9 * sizeof(float));

    auto f = ggml_mul_mat(ctx, t, t2);
    auto gf = ggml_build_forward(f); // build the forward graph ending at f

    ggml_graph_compute_with_ctx(ctx, &gf, 1); // run on a single thread

    for(size_t i = 0; i < 9; i++)
        std::cout << ((float*)f->data)[i] << ", ";
    std::cout << std::endl;
}

This prints 46, 118, 190, 28, 73, 118, 10, 28, 46, but WolframAlpha says that {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}} * {{9, 8, 7}, {6, 5, 4}, {3, 2, 1}} should be {{30, 24, 18}, {84, 69, 54}, {138, 114, 90}}, and no matter how you order the results in memory, the two sets of values don't even intersect!

I must be handling my data wrong! Replace mul_mat with mul and the result is 9, 16, 21, 24, 25, 24, 21, 16, 9; WolframAlpha says {{9, 16, 21}, {24, 25, 24}, {21, 16, 9}}, so my data is laid out correctly!

Try a simpler problem: multiply by the identity (t2 = {1, 0, 0, 0, 1, 0, 0, 0, 1}), and the result is 1, 4, 7, 2, 5, 8, 3, 6, 9, which is exactly t transposed, as if tensors were stored in column-major order. The documentation for mul_mat mentions that the second tensor is transposed internally; is that false? Do we need to transpose it ourselves?

Running t2Data through a custom row-major to column-major conversion function:

#include <algorithm>
#include <cstddef>
#include <span>
#include <vector>

// Copy a rows x cols row-major matrix into column-major order, in place.
template <typename T>
void row2col(std::span<T> matrix, size_t rows, size_t cols) {
    std::vector<T> tempMatrix(matrix.size());
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            tempMatrix[j * rows + i] = matrix[i * cols + j];
    std::copy(tempMatrix.begin(), tempMatrix.end(), matrix.begin());
}

before copying into t2's data, we get 30, 84, 138, 24, 69, 114, 18, 54, 90 and 1, 4, 7, 2, 5, 8, 3, 6, 9, which are the correct column-major results for each computation. Perfect, so we just need to transpose t2 and transpose the result:

auto f = ggml_transpose(ctx, ggml_mul_mat(ctx, t, ggml_transpose(ctx, t2)));

Running this gives us GGML_ASSERT: ggml.c:11058: nb10 == sizeof(float) Aborted (core dumped)

There are asserts ensuring that neither the inputs nor the output are permuted! Disabling the check (the OpenBLAS path only runs on large matrices anyway) and trying again on raw row-major data gives the same results. Trying again on a 2x2 matrix with t = {{2, -1}, {1, 3}} and t2 = {{1, 2}, {3, 4}}: WolframAlpha gives {{-1, 0}, {10, 14}}, ggml gives 0, 7, 2, 15, and with a "manually" transposed t2 it gives -1, 10, 0, 14.
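
Working the 2x2 case out by hand: the dot products of the rows of t with the rows of t2 are 2*1 + (-1)*2 = 0, 2*3 + (-1)*4 = 2, 1*1 + 3*2 = 7 and 1*3 + 3*4 = 15, i.e. t @ t2^T = {{0, 2}, {7, 15}}, and storing that transposed gives exactly the 0, 7, 2, 15 that ggml printed. Every result so far is consistent with ggml computing (t @ t2^T)^T.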

I am trying to figure out how whisper.cpp is implemented given these weird transposing rules. What kind of black magic goes into its graph to make this arcane input and output system work?

// The line of code before it (give or take some open and close braces)
cur = ggml_add(ctx0,
        ggml_mul(ctx0,
            ggml_repeat(ctx0, layer.attn_ln_0_w, cur),
            cur),
        ggml_repeat(ctx0, layer.attn_ln_0_b, cur));

// The first time mul_mat is used in whisper.cpp
struct ggml_tensor * Qcur = ggml_mul_mat(ctx0,
                        layer.attn_q_w,
                        cur);

Nothing?

I must have configured something weird, or there is something wrong with my machine. So I asked a coworker to run the minimal reproducible code from the top of this issue, built against a fresh clone of ggml... same result! Is there something incredibly simple I missed? Or is there a bug in mul_mat that only appears on small tensors?

@slaren
Collaborator

slaren commented Aug 27, 2023

Some operations require that the tensors are contiguous in memory. You can use ggml_cont to make the transposed tensor contiguous, for example:

auto f = ggml_transpose(ctx, ggml_mul_mat(ctx, t, ggml_cont(ctx, ggml_transpose(ctx, t2))));
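
Note that ggml_transpose only returns a strided view over the same buffer, so printing f->data linearly as in the snippet at the top will still show the values in their pre-transpose memory order. To read the final result in plain row-major order, you likely need a ggml_cont around the outer transpose as well (a sketch, using the same t and t2 as above):

auto f = ggml_cont(ctx, ggml_transpose(ctx, ggml_mul_mat(ctx, t, ggml_cont(ctx, ggml_transpose(ctx, t2)))));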

@ggerganov
Owner

The suggestion by @slaren should give you the expected answer.

To try to explain this better, let's say you want to multiply the matrices Z = X @ Y.
To obtain the element Z_ij at row i and column j, you compute the dot product of the i-th row of X with the j-th column of Y.

The oddity about ggml_mul_mat compared to a standard implementation is that it actually computes:

Z^T = X @ Y^T

In other words:

  • the dot product for the Z_ij element is computed between the i-th row of X and the j-th row of Y
  • the result for Z_ij is stored in Z_ji, i.e. the result is transposed

You can also check the test-mul-mat0.c example; it might be useful for understanding what is going on.
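
Applying this to the 3x3 example at the top of the issue: the dot products of the first row of X = {1, 2, 3} with the rows of Y are {1, 2, 3} · {9, 8, 7} = 46, {1, 2, 3} · {6, 5, 4} = 28 and {1, 2, 3} · {3, 2, 1} = 10. Continuing for the remaining rows gives X @ Y^T = {{46, 28, 10}, {118, 73, 28}, {190, 118, 46}}, and storing that transposed produces the flat buffer 46, 118, 190, 28, 73, 118, 10, 28, 46, which is exactly the output observed above.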

@joshuadahlunr
Author

Is there any particular rationale for this design? Also, if that formula you posted were added as a comment above mul_mat, it would have saved me 10 hours of experimentation!

@ggerganov
Owner

Is there any particular rationale for this design?

When I started this project I didn't have a good idea of what is typically done in other frameworks. I mainly wanted to perform the matrix multiplication using "row-by-row" dot products because I thought it would be cache-friendly that way.

The result is transposed because this way it is already in the correct memory order to be passed as the second argument to another matrix multiplication:

z_0 = ggml_mul_mat(ctx0, x_0, y);
z_1 = ggml_mul_mat(ctx0, x_1, z_0);
...

This is a common pattern in ML: the x's are typically model weights and the y's and z's are intermediate results.
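
Written out in the notation from above: z_0 = ggml_mul_mat(ctx0, x_0, y) stores (x_0 @ y^T)^T in memory, so the rows of z_0 are exactly the columns of x_0 @ y^T, which is the layout the next ggml_mul_mat expects from its second argument. Then z_1 = (x_1 @ z_0^T)^T = (x_1 @ x_0 @ y^T)^T, and the chain continues with no ggml_cont or ggml_transpose needed in between.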

Obviously, this has advantages and drawbacks. But it was easier to reason about the dot products and not worry about thrashing the L1 cache when implementing a naive matrix multiplication.

Docs are in a very bad state at the moment; hopefully we will update them sometime soon.
