CUDA: use tensor cores for MMQ #7676
Conversation
Force-pushed 2879780 to bf10e13
I implemented support for q4_0, q4_1, q5_0, q5_1, and q8_0 based on #7824. The performance currently looks like this: [performance tables: vs. master MMQ, vs. master cuBLAS]. My immediate next goals are to add support for k-quants, optimize the performance, and refactor and simplify the code (in that order).
There was a bug with out-of-bounds writes. That's why the server bench performance was bad in terms of request throughput: the generations were garbage and never hit the EOS token.
Force-pushed 54bf8d5 to 054d4ea
Review comment on ggml-cuda/mma.cuh:
```cpp
static __device__ __forceinline__ int get_j(const int /* l */) {
    const int ret = threadIdx.x / (K/2);
    __builtin_assume(ret >= 0); // hint to the compiler that ret is non-negative
    return ret;
}
```
There is a GGML_CUDA_ASSUME macro because __builtin_assume is not available on every version of the compiler.
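For reference, a minimal sketch of what such a guard can look like — the exact CUDART_VERSION cutoff here is an illustrative assumption, not taken from this PR:

```cpp
// Sketch: map GGML_CUDA_ASSUME to __builtin_assume only where the compiler
// supports it, and to a no-op otherwise. The version cutoff is illustrative.
#if defined(__CUDACC__) && CUDART_VERSION >= 11100
#define GGML_CUDA_ASSUME(x) __builtin_assume(x)
#else
#define GGML_CUDA_ASSUME(x)
#endif
```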
A bit slower than f16 cuBLAS with large batch sizes, but well worth it for the lower batch sizes and memory savings.
Btw, the server bench still produces an unexpectedly low number of iterations - 283 in the last run, with 212 of them being truncated. Maybe there is still some lingering issue.
Do you mean the run that the bot posted in this PR? That was prior to the fix. I was able to reproduce the issue when running the server benchmark locally, and my fix worked to restore the performance in terms of iterations/time.
The bot updates the post after each new successful commit. See the edit in this comment from ~1 hour ago:
Thank you, I wasn't aware that that's how the bot works. There seems to still have been an issue where (for some matrix shapes) the writeback returned too early. And because the exact kernel that is being run depends on the SM count of a GPU, I presumably just never encountered one of the problematic matrix shapes while testing.
This PR aims to add int8 tensor core support for the mul_mat_q kernels (legacy quants only for now). The supported hardware will be Turing or newer. So far there is only a prototype for q8_0, which on its own is still slower than FP16 cuBLAS but faster in terms of end-to-end performance because it needs less data conversion. Current performance: [benchmark table in the original PR description]
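For context, a simplified scalar sketch of a q8_0 block and where its scale enters a dot product (this follows ggml's block_q8_0 layout with QK8_0 = 32; it is not the actual kernel):

```cpp
#include <cuda_fp16.h>

#define QK8_0 32

// ggml's q8_0 block: one fp16 scale per 32 int8 values.
typedef struct {
    half   d;           // per-block scale
    int8_t qs[QK8_0];   // quantized values
} block_q8_0;

// Simplified scalar sketch: the inner sum is pure int8 arithmetic (the part
// int8 tensor cores can do), and only the block scales need to be converted
// to float - hence "less data conversion" than dequantizing everything to f16.
static __device__ float vec_dot_q8_0_sketch(const block_q8_0 * x, const block_q8_0 * y) {
    int sumi = 0;
    for (int i = 0; i < QK8_0; ++i) {
        sumi += x->qs[i] * y->qs[i];
    }
    return __half2float(x->d) * __half2float(y->d) * sumi;
}
```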
As of right now this PR must be compiled with LLAMA_CUDA_FORCE_MMQ. scripts/compare-llama-bench.py needs the fix added by #7673.

The way to make int8 tensor cores work is to write PTX code (the CUDA equivalent of assembly), because with the "high-level" WMMA interface you do not have a defined memory layout, which makes it impossible to correctly apply the scales of ggml quantized data blocks. I plan to wrap the PTX code in simple CUDA functions to hopefully make it easier to understand what it does.
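As a rough sketch of that plan, a wrapper around a single int8 tensor core instruction could look like the following. The m8n8k16 shape and register counts follow the PTX ISA documentation, but they are illustrative assumptions rather than the exact instructions this PR uses:

```cpp
// Illustrative wrapper: one m8n8k16 s8 x s8 -> s32 tensor core instruction.
// Per the PTX ISA, each thread of the warp holds 1 register of A (4 packed
// int8 values), 1 register of B, and 2 s32 accumulators.
static __device__ __forceinline__ void mma_m8n8k16_s8(
        int & d0, int & d1,          // s32 accumulators (read-modify-write)
        const int a0, const int b0)  // packed s8 fragments of A and B
{
#if __CUDA_ARCH__ >= 750 // int8 tensor cores require Turing or newer
    asm("mma.sync.aligned.m8n8k16.row.col.s32.s8.s8.s32 "
        "{%0, %1}, {%2}, {%3}, {%0, %1};"
        : "+r"(d0), "+r"(d1)
        : "r"(a0), "r"(b0));
#endif
}
```

Because the thread-to-element mapping of these fragments is fixed by the PTX ISA, a kernel built on such wrappers knows exactly which block scales belong to which accumulator, which is what the WMMA interface cannot guarantee.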