
[SYCL] iq2_s #6052

Closed
wants to merge 29 commits
Conversation

@abhilash1910 (Collaborator) commented Mar 14, 2024:

Initial support for IQ2_S. Requires validation before merge.
@NeoZhangJianyu @airMeng @ggerganov @AidanBeltonS

@AidanBeltonS (Contributor) commented Mar 14, 2024:

Tested on an A100 GPU and Intel PVC.
This fails due to a missing case in ggml_sycl_op_dequantize_mul_mat_vec:

MUL_MAT(type_a=iq2_s,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[1,1]): GGML_ASSERT: /home/aidanbelton/source/temp/llama.cpp/ggml-sycl.cpp:14085: false
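For context, the failure pattern is a type switch that lacks an IQ2_S case. A minimal illustrative sketch (not the actual ggml-sycl.cpp code, which takes many more parameters) looks roughly like this:

    // Minimal sketch: the dequantize-vec path dispatches on src0->type, and any
    // type without an explicit case falls into the default branch, which asserts.
    #include "ggml.h"

    static void dequantize_mul_mat_vec_dispatch(enum ggml_type type) {
        switch (type) {
            case GGML_TYPE_Q4_0:
                /* launch the q4_0 dequantize-mul-mat-vec kernel */ break;
            case GGML_TYPE_Q8_0:
                /* launch the q8_0 dequantize-mul-mat-vec kernel */ break;
            // ... other types that already have kernels ...
            default:
                GGML_ASSERT(false); // GGML_TYPE_IQ2_S lands here -> the failure above
        }
    }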

@abhilash1910 (Collaborator, Author) commented:

Yes, this is known for other quants as well; I will fix it.

@NeoZhangJianyu (Collaborator) commented:

> Yes, this is known for other quants as well; I will fix it.

What's the performance change with this PR?
Can you test with llama-bench on an Intel GPU and compare the data with the legacy code?
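As a reference point, a typical llama-bench comparison could look like the line below; the model path and -ngl value are placeholders.

    ./llama-bench -m models/llama-2-7b.Q4_0.gguf -p 512 -n 128 -ngl 99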

@AidanBeltonS (Contributor) left a review comment:

Currently breaks both supported target vendors.

@@ -15435,7 +15630,7 @@ static void ggml_sycl_mul_mat(const ggml_tensor * src0, const ggml_tensor * src1
 #ifdef GGML_SYCL_FORCE_DMMV
     const bool use_mul_mat_vec_q = false;
 #else
-    const bool use_mul_mat_vec_q = min_compute_capability >= VER_4VEC && ggml_is_quantized(src0->type) && ggml_nrows(src1) == 1;
+    const bool use_mul_mat_vec_q = min_compute_capability >= VER_4VEC && ggml_is_quantized(src0->type);
@AidanBeltonS (Contributor) commented:

Suggested change:
-    const bool use_mul_mat_vec_q = min_compute_capability >= VER_4VEC && ggml_is_quantized(src0->type);
+    const bool use_mul_mat_vec_q = min_compute_capability >= VER_4VEC && ggml_is_quantized(src0->type) && ggml_nrows(src1) == 1;

This is a breaking change for both Intel and Nvidia targets: ggml_sycl_op_mul_mat_vec_q asserts GGML_ASSERT(ggml_nrows(src1) == 1);

@abhilash1910 (Collaborator, Author) commented:

Yes, actually the assertion should not be mandatory; requiring ggml_nrows(src1) == 1 won't allow the IQ stages to run.

@AidanBeltonS (Contributor) commented Mar 18, 2024:

I've tested with the assert suppressed. It fails on both Intel and Nvidia targets; in both cases the failing function lacks a case for the new quantization type.

Intel (Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1100 1.3 [1.3.28454]):
The failure is in ggml_sycl_op_dequantize_mul_mat_vec:
MUL_MAT(type_a=iq2_s,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1]): GGML_ASSERT: /home/aidanbelton/source/temp/llama.cpp/ggml-sycl.cpp:14217: false

Nvidia (NVIDIA A100-PCIE-40GB 8.0 [CUDA 12.2]):
The failure is in ggml_sycl_op_mul_mat_q:
MUL_MAT(type_a=iq2_s,type_b=f32,m=16,n=16,k=256,bs=[1,1],nr=[1,1]): GGML_ASSERT: /home/aidanbelton/source/temp/llama.cpp/ggml-sycl.cpp:14019: false
The issue is a missing case for the quantization type.

@abhilash1910 (Collaborator, Author) commented Mar 19, 2024:

Actually, for both cases it should go down neither of those paths. It is strange that it falls back to either dequantize_mul_mat_vec or op_mul_mat_q; for this type only the vectorised mul_mat_vec_q (mmvq) path should be called.
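A simplified sketch of the intended routing, written as an assumption mirroring the CUDA-side branching; choose_mul_mat_path and the has_*_kernel flags are hypothetical names used only for illustration, not code from this PR.

    #include "ggml.h"

    enum mul_mat_path { PATH_DMMV, PATH_MMVQ, PATH_MMQ, PATH_DEQUANT_MATMUL };

    // For a quantized src0 such as IQ2_S with a single-row src1, the vectorised
    // mmvq path should be taken, so the dmmv/mmq switches that have no IQ2_S
    // case are never reached.
    static enum mul_mat_path choose_mul_mat_path(const struct ggml_tensor * src0,
                                                 const struct ggml_tensor * src1,
                                                 bool has_mmvq_kernel,   // true for iq2_s in this PR
                                                 bool has_mmq_kernel) {  // false for iq2_s
        const bool quantized  = ggml_is_quantized(src0->type);
        const bool single_row = ggml_nrows(src1) == 1;

        if (quantized && single_row && has_mmvq_kernel) {
            return PATH_MMVQ;       // intended path for IQ2_S
        }
        if (quantized && single_row) {
            return PATH_DMMV;       // currently asserts: no IQ2_S case
        }
        if (quantized && has_mmq_kernel) {
            return PATH_MMQ;        // currently asserts: no IQ2_S case
        }
        return PATH_DEQUANT_MATMUL; // dequantize to float and use the regular matmul
    }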

@NeoZhangJianyu (Collaborator) commented:

@abhilash1910 I tested a basic model (llama2-7b Q4) with your code and it fails.
Could you check your code?

I feel your solution changes more of the legacy code instead of only touching the IQ2_S data type.
That makes it complex and easy to break.

Also, in ggml_sycl_mul_mat() the fp32 and fp16 branches are removed. I'm afraid this will impact the legacy (Q4) performance.
Could you run llama-bench to verify that Q4 performance is not reduced?

I suggest you update the code with the minimum change needed to impact the IQ2_S type only.
That will reduce the risk to the legacy code.

@abhilash1910 (Collaborator, Author) commented Mar 26, 2024:

> @abhilash1910 I tested a basic model (llama2-7b Q4) with your code and it fails. Could you check your code?
>
> I feel your solution changes more of the legacy code instead of only touching the IQ2_S data type. That makes it complex and easy to break.
>
> Also, in ggml_sycl_mul_mat() the fp32 and fp16 branches are removed. I'm afraid this will impact the legacy (Q4) performance. Could you run llama-bench to verify that Q4 performance is not reduced?
>
> I suggest you update the code with the minimum change needed to impact the IQ2_S type only. That will reduce the risk to the legacy code.

In this case the logic flow is kept unchanged w.r.t. the fp32/fp16 dtypes. I think it won't impact the legacy functionality; rather, it sends the src types to the correct branch (dmmv/mmq or mmvq) based on the branching rules implemented in the CUDA source. This would help enable other quants through the mmvq pathway as well.

@NeoZhangJianyu (Collaborator) commented:

  1. Does CI pass?
  2. Why change ggml_sycl_mul_mat()? The logic looks more complex than the latest ggml_cuda_mul_mat().
     I think the CUDA code changes are based on test results.
     The SYCL code needs the same testing before being updated, instead of being copied from CUDA.
     As far as I know, converting fp32 to fp16 in some ops gives no benefit in the SYCL backend.

This PR is meant to support IQ2_S, so I suggest keeping the scope as small as possible.
If the IQ2_S op can't be called for an IQ2_S model, we need to move IQ2_S to the right position so it is called, instead of changing the whole ggml_sycl_mul_mat() logic.

If the logic in ggml_sycl_mul_mat() is changed, please run CI and llama-bench and make sure the performance on Intel GPUs is not impacted. That is more work.

SYCL support for IQ2_S is one task; refactoring ggml_sycl_mul_mat() is another.
I suggest removing the second task from this PR to reduce the risk.

@abhilash1910 (Collaborator, Author) commented:

> 1. Does CI pass?
> 2. Why change ggml_sycl_mul_mat()? The logic looks more complex than the latest ggml_cuda_mul_mat(). I think the CUDA code changes are based on test results. The SYCL code needs the same testing before being updated, instead of being copied from CUDA. As far as I know, converting fp32 to fp16 in some ops gives no benefit in the SYCL backend.
>
> This PR is meant to support IQ2_S, so I suggest keeping the scope as small as possible. If the IQ2_S op can't be called for an IQ2_S model, we need to move IQ2_S to the right position so it is called, instead of changing the whole ggml_sycl_mul_mat() logic.
>
> If the logic in ggml_sycl_mul_mat() is changed, please run CI and llama-bench and make sure the performance on Intel GPUs is not impacted. That is more work.
>
> SYCL support for IQ2_S is one task; refactoring ggml_sycl_mul_mat() is another. I suggest removing the second task from this PR to reduce the risk.

I agree, the conditional branching is complex, so I have reverted to the older legacy code. Testing is in progress. If refactoring is needed later, I will do it in a separate PR. Yes, conversion to fp16 won't help much; only the changes allocated for iq2 and adjacent quants apply here.

@AidanBeltonS (Contributor) left a review comment:

Some mul_mat quantization tests are failing on both Intel and Nvidia hardware.

Intel failing quantization types (Data Center Max 1100): iq2_xxs, iq2_xs, iq2_s, iq3_xxs, iq1_s, iq3_s
Nvidia failing quantization types (A100): iq1_s

Failing Tests

Failing on PVC:

  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  
  MUL_MAT(type_a=iq2_s,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq2_s,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq2_s,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq2_s,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq2_s,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL

  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL

  MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL

  MUL_MAT(type_a=iq3_s,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq3_s,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq3_s,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq3_s,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq3_s,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL

  Failing on Nvidia:
  MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 14.391691029 > 0.000500000 FAIL
  MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[1,1]): [MUL_MAT] NMSE = 17.976997390 > 0.000500000 FAIL
  MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[2,1]): [MUL_MAT] NMSE = 9.404314750 > 0.000500000 FAIL
  MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,1]): [MUL_MAT] NMSE = 8.215036844 > 0.000500000 FAIL
  MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,1]): [MUL_MAT] NMSE = 10.141738006 > 0.000500000 FAIL

@@ -4436,6 +4436,24 @@ static void dequantize_block_q6_K(const void * __restrict__ vx, dst_t * __restri
 #endif
 }

+inline bool ggml_sycl_supports_mmq(enum ggml_type type) {
@AidanBeltonS (Contributor) commented:

This is an unused function; not sure if it should be added.
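For context, a plausible body for this helper, under the assumption that it mirrors the analogous CUDA-side check and simply whitelists the types that have MMQ kernels; the exact list in the PR may differ.

    // Hypothetical sketch of ggml_sycl_supports_mmq, assumed to whitelist the
    // quantization types that have MMQ (quantized matrix-matrix) kernels.
    inline bool ggml_sycl_supports_mmq(enum ggml_type type) {
        switch (type) {
            case GGML_TYPE_Q4_0:
            case GGML_TYPE_Q4_1:
            case GGML_TYPE_Q5_0:
            case GGML_TYPE_Q5_1:
            case GGML_TYPE_Q8_0:
            case GGML_TYPE_Q2_K:
            case GGML_TYPE_Q3_K:
            case GGML_TYPE_Q4_K:
            case GGML_TYPE_Q5_K:
            case GGML_TYPE_Q6_K:
                return true;
            default:
                return false;
        }
    }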

Comment on lines +13844 to +13845:

    //GGML_ASSERT(ggml_nrows(src1) == 1);
    //GGML_ASSERT(ne10 % QK8_1 == 0);

@AidanBeltonS (Contributor) commented:

Delete these commented-out lines.

@abhilash1910 mentioned this pull request Mar 28, 2024
@abhilash1910 (Collaborator, Author) commented Mar 28, 2024:

> Some mul_mat quantization tests are failing on both Intel and Nvidia hardware.
>
> Intel failing quantization types (Data Center Max 1100): iq2_xxs, iq2_xs, iq2_s, iq3_xxs, iq1_s, iq3_s
> Nvidia failing quantization types (A100): iq1_s
>
> Failing Tests

Yes, I see the failing tests on the Intel GPU; the iq1_s failure is known, as that type is buggy.

@NeoZhangJianyu (Collaborator) commented:

@abhilash1910
The IQ types in this PR are supported by #6521.
I think this PR could be closed.

Thank you for your support!

@abhilash1910 (Collaborator, Author) commented:

Fixed in #6521, thanks @NeoZhangJianyu.
