CUDA: int8 tensor cores for MMQ (q4_K, q5_K, q6_K) #7860
Conversation
Force-pushed from 48ecafb to dc0ef0c.
Force-pushed from dc0ef0c to 8cb2dbd.
Looks very good. Should MMQ be the default again?
@JohannesGaessler both of your comparison tables are vs. master MMQ.
Give me a bit more time to implement q2_K and q3_K and to optimize performance (particularly asynchronous data loading). Then I think MMQ will be universally faster. Also, in case you're not aware, int8 tensor cores are only available with Turing or newer (rather than Volta), so for V100s FP16 cuBLAS should still be the fastest option.
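For reference, a minimal sketch of the kind of runtime capability check this implies; the function name is illustrative, not an actual llama.cpp helper. Volta is compute capability 7.0, Turing 7.5.

```cpp
// Sketch only: decide at runtime whether a device has int8 tensor core MMA.
// Volta (7.0) does not; Turing (7.5) and newer do.
#include <cuda_runtime.h>

static bool example_int8_mma_supported(int device) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    const int cc = 100*prop.major + 10*prop.minor; // e.g. 700 for a V100, 750 for an RTX 2060
    return cc >= 750;
}
```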
Only for batch sizes 16, 32, and 64; for those I compared vs. master with and without MMQ forced.
For BS >= 512, FP16 cuBLAS is still faster even with tensor cores, is that correct?
It depends on the quantization format and hardware; q4_0 on an RTX 4090 seems to be the best-case scenario, where MMQ already appears to be faster even for large batch sizes. FP16 cuBLAS:
int8 tensor core MMQ:
@JohannesGaessler Are you planning to do the same for the IQ quants? It would be nice to run int8 on my P40 instead of FP32; the IQ quants have been very slow on that card.
I will prioritize the quantization formats that already have MMQ implementations (legacy, k-quants), but long-term I plan to also implement kernels for the other quantization formats.
Benchmark on Turing (RTX 2060, FA, default batch size, 4096 context, q4_K_S):
cuBLAS:
Force_MMQ:
Looks like MMQ's prompt processing is still quite a bit slower. I tested this with the most up-to-date build at the time of writing.
Did you use make or cmake to build the project? As of right now, cmake compiles for the wrong CUDA architectures, so the int8 tensor cores aren't actually going to be used.
Yep, you are correct. I'm using the CMake GUI for Windows. Is there anything I can do to compile it for Turing? Then I could rerun the test.
Lines 426 and 428 in CMakeLists.txt (where the CUDA architectures are set).
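For context, a rough sketch of why the build architecture matters; the macro name below is illustrative, not the exact one in llama.cpp. The int8 MMA path is gated on the compile-time architecture, so a build that only targets older GPUs silently falls back to the non-tensor-core kernels even on Turing or Ampere hardware.

```cpp
// Illustrative guard (hypothetical macro name): the int8 tensor core path is
// only compiled in when the target architecture is Turing (7.5) or newer.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 750
#define EXAMPLE_INT8_MMA_AVAILABLE
#endif
```

In practice that means the cmake build has to include compute capability 7.5 (or newer) in CMAKE_CUDA_ARCHITECTURES for an RTX 2060.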
Yep, that was it. It's faster than cuBLAS now, wow. Great result!! Also takes around 200 MB less VRAM, which is a great bonus. Thank you for the amazing work again!
You were using a batch size of 512, correct?
This PR adds int8 tensor core support for the q4_K, q5_K, and q6_K mul_mat_q kernels. Originally I wanted to put all k-quants into the same PR but in retrospect the MMQ code for q2_K and q3_K is kind of bad so I think it's in need of general refactoring before I try to add int8 tensor core support.
Performance vs. master MMQ
Performance vs. master FP16 cuBLAS
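To illustrate what the kernels build on, here is a minimal, self-contained sketch of an s8 × s8 → s32 tensor core multiply using the standard WMMA API. The MMQ kernels in this PR use their own lower-level mma primitives and tiling, so this is only a conceptual example; it needs a Turing-or-newer build (e.g. `-arch=sm_75`) and is launched with a single warp.

```cpp
// Conceptual sketch only, not the MMQ kernels from this PR: one warp multiplies
// a 16x16 int8 tile with a 16x16 int8 tile and accumulates into int32 on the
// tensor cores via nvcuda::wmma.
#include <mma.h>
using namespace nvcuda;

__global__ void int8_wmma_tile(const signed char * A, const signed char * B, int * C) {
    wmma::fragment<wmma::matrix_a,    16, 16, 16, signed char, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b,    16, 16, 16, signed char, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, int> c;

    wmma::fill_fragment(c, 0);
    wmma::load_matrix_sync(a, A, 16); // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(c, a, b, c);       // c += a * b on int8 tensor cores
    wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
}
```

Roughly speaking, the q4_K/q5_K/q6_K MMQ kernels unpack the quantized blocks into int8 tiles and apply the per-block scales around operations of this kind.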