
ggml : extend ggml_mul_mat to support non-F32 input for parameter b #455

Open
ggerganov opened this issue Aug 16, 2023 · 5 comments
Labels: refactoring

@ggerganov
Owner

ggerganov commented Aug 16, 2023

Currently, we always pass b to ggml_mul_mat as F32 and internally quantize it depending on the type of a.
There is no option that allows passing an already quantized b.

The primary goal of this task is to add such an option.
For more info, see: ggerganov/llama.cpp#2615 (comment)
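
For illustration, here is a minimal sketch of the current call pattern and the desired one. The shapes and types are arbitrary examples (not taken from a real model), and ctx is assumed to be an already initialized ggml_context:

// today: a quantized weight a and an F32 activation b; ggml converts b internally
// to the vec_dot_type associated with a->type (e.g. Q8_0 for a Q4_0 weight)
struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_Q4_0, 4096, 4096);
struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32,  4096, 32);
struct ggml_tensor * c = ggml_mul_mat(ctx, a, b);          // supported

// desired: accept a b that is already quantized, so the implicit conversion can be skipped
// struct ggml_tensor * b_q = ggml_new_tensor_2d(ctx, GGML_TYPE_Q8_0, 4096, 32);
// struct ggml_tensor * c_q = ggml_mul_mat(ctx, a, b_q);   // not supported today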

The primary focus will be ggml_mul_mat, but we can also think about a more general approach for the rest of the operators. For example, ggml_mul currently also works only with F32 input, which prevents having 1D F16 norm tensors. This is not a huge drawback since these tensors are usually small, but it would be nice to also support F16.
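
As a concrete example of the ggml_mul case (illustrative only; cur and n_embd are placeholder names, not from a specific model graph):

// today the 1D norm weight has to be kept in F32 for ggml_mul to accept it
struct ggml_tensor * norm_w = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_embd);
cur = ggml_mul(ctx, cur, norm_w);

// what F16 support would allow: keep the small 1D norm tensor in F16
// struct ggml_tensor * norm_w_f16 = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n_embd);
// cur = ggml_mul(ctx, cur, norm_w_f16);   // not supported today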

Additionally, we can extend ggml with parameters that control the implicit quantizations, i.e. disable / enable / change types, etc. This is a secondary objective, and I'm not 100% sure how it would work from an API POV.

@klosax
Contributor

klosax commented Aug 16, 2023

I guess this is the reason why 1D tensors need to be stored as F32 in the model files.

@klosax
Contributor

klosax commented Aug 16, 2023

This also seems to be needed for PR ggerganov/llama.cpp#2632

@ggerganov
Owner Author

ggerganov commented Aug 16, 2023

> I guess this is the reason why 1D tensors need to be stored as F32 in the model files.

Not exactly - it is a similar problem, but it is related to ggml_mul, not ggml_mul_mat.
I've expanded the issue description to mention this, and we can try to fix it within the scope of this task.

@slaren
Collaborator

slaren commented Aug 19, 2023

> Additionally, we can extend ggml with parameters that control the implicit quantizations, i.e. disable / enable / change types, etc. This is a secondary objective, and I'm not 100% sure how it would work from an API POV.

We could add a flag to ggml_context to control whether to automatically quantize b in ggml_mul_mat. Essentially, we would add this to ggml_mul_mat:

// convert b to the type that a's dot-product kernel expects,
// but only if the context opted in to implicit quantization
if (ctx->quantize_matmul && b->type != type_traits[a->type].vec_dot_type) {
    struct ggml_tensor * tmp_b = ggml_new_tensor(ctx, type_traits[a->type].vec_dot_type, b->n_dims, b->ne);
    b = ggml_cpy(ctx, b, tmp_b); // the copy node performs the actual conversion
}

The disadvantage is that it may add a bit of overhead by increasing the size of the graphs, but it shouldn't be too bad. To avoid this, we could set a flag in op_params instead, and let each backend handle the conversion, but that doesn't solve the memory management issue.
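
If this were exposed through the public API, usage might look something like the sketch below. The setter name and the flag itself are hypothetical, not existing ggml API:

// hypothetical public setter for the proposed context flag
void ggml_set_quantize_matmul(struct ggml_context * ctx, bool enable);

// usage sketch: opt in to the implicit conversion of b to a's vec_dot_type
// ggml_set_quantize_matmul(ctx, true);
// struct ggml_tensor * c = ggml_mul_mat(ctx, a, b);   // b converted once, inside the graph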

@slaren
Collaborator

slaren commented Jan 2, 2024

Somewhat related to this, it would be good to have a ggml_convert(src, new_type) or similar that is equivalent to ggml_cpy(src, ggml_new_tensor(new_type, 4, src->ne)), but only creates one tensor. Currently, converting a tensor to a different type requires adding two tensors to the context, and this may become a not-completely-insignificant overhead if most matrix multiplications require one or two conversions. It would also result in a slightly more intuitive API, since currently it is not too obvious that ggml_cpy can be used to convert a tensor to a different type.
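
A possible shape for that helper, as a sketch only (the name ggml_convert and its exact signature are the proposal here, not existing API):

// would behave like ggml_cpy(ctx, src, ggml_new_tensor(ctx, new_type, 4, src->ne)),
// but add only a single tensor to the context
struct ggml_tensor * ggml_convert(
        struct ggml_context * ctx,
        struct ggml_tensor  * src,
        enum   ggml_type      new_type);

// usage sketch: convert an F32 activation to the vec_dot_type of a quantized weight
// struct ggml_tensor * b_q = ggml_convert(ctx, b, GGML_TYPE_Q8_0);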
