
[SYCL] iq2_s #6052

Closed
wants to merge 29 commits
Conversation

@abhilash1910 (Collaborator) commented Mar 14, 2024:

Initial support for IQ2_S. Requires validation before merge.
@NeoZhangJianyu @airMeng @ggerganov @AidanBeltonS

@AidanBeltonS (Contributor) commented Mar 14, 2024:

Tested on an A100 GPU and Intel PVC.
This fails due to a missing case in ggml_sycl_op_dequantize_mul_mat_vec:

MUL_MAT(type_a=iq2_s,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[1,1]): GGML_ASSERT: /home/aidanbelton/source/temp/llama.cpp/ggml-sycl.cpp:14085: false
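For context, the failure pattern is a type switch that lacks an IQ2_S case. A minimal illustrative sketch (not the actual ggml-sycl.cpp code, which takes many more parameters) looks roughly like this:

    // Minimal sketch: the dequantize-vec path dispatches on src0->type, and any
    // type without an explicit case falls into the default branch, which asserts.
    #include "ggml.h"

    static void dequantize_mul_mat_vec_dispatch(enum ggml_type type) {
        switch (type) {
            case GGML_TYPE_Q4_0:
                /* launch the q4_0 dequantize-mul-mat-vec kernel */ break;
            case GGML_TYPE_Q8_0:
                /* launch the q8_0 dequantize-mul-mat-vec kernel */ break;
            // ... other types that already have kernels ...
            default:
                GGML_ASSERT(false); // GGML_TYPE_IQ2_S lands here -> the failure above
        }
    }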

@abhilash1910 (Collaborator, Author) commented:

Yes, this is known for other quants as well; I will fix it.

@NeoZhangJianyu (Collaborator) commented:

> Yes, this is known for other quants as well; I will fix it.

What's the performance change with this PR?
Can you test with llama-bench on an Intel GPU and compare the data with the legacy code?
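As a reference point, a typical llama-bench comparison could look like the line below; the model path and -ngl value are placeholders.

    ./llama-bench -m models/llama-2-7b.Q4_0.gguf -p 512 -n 128 -ngl 99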

@AidanBeltonS (Contributor) left a review comment:

Currently breaks both supported target vendors.

@@ -15435,7 +15630,7 @@ static void ggml_sycl_mul_mat(const ggml_tensor * src0, const ggml_tensor * src1
 #ifdef GGML_SYCL_FORCE_DMMV
     const bool use_mul_mat_vec_q = false;
 #else
-    const bool use_mul_mat_vec_q = min_compute_capability >= VER_4VEC && ggml_is_quantized(src0->type) && ggml_nrows(src1) == 1;
+    const bool use_mul_mat_vec_q = min_compute_capability >= VER_4VEC && ggml_is_quantized(src0->type);
@AidanBeltonS (Contributor) commented:

Suggested change:
-    const bool use_mul_mat_vec_q = min_compute_capability >= VER_4VEC && ggml_is_quantized(src0->type);
+    const bool use_mul_mat_vec_q = min_compute_capability >= VER_4VEC && ggml_is_quantized(src0->type) && ggml_nrows(src1) == 1;

This is a breaking change for both Intel and Nvidia targets: ggml_sycl_op_mul_mat_vec_q asserts GGML_ASSERT(ggml_nrows(src1) == 1);

@abhilash1910 (Collaborator, Author) commented:

Yes, actually the assertion should not be mandatory; requiring ggml_nrows(src1) == 1 won't allow the IQ stages to run.

@AidanBeltonS (Contributor) commented Mar 18, 2024:

I've tested with the assert suppressed. It fails on both Intel and Nvidia targets; in both cases the failing function lacks a case for the new quantization type.

Intel (Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1100 1.3 [1.3.28454]):
The failure is in ggml_sycl_op_dequantize_mul_mat_vec:
MUL_MAT(type_a=iq2_s,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1]): GGML_ASSERT: /home/aidanbelton/source/temp/llama.cpp/ggml-sycl.cpp:14217: false

Nvidia (NVIDIA A100-PCIE-40GB 8.0 [CUDA 12.2]):
The failure is in ggml_sycl_op_mul_mat_q:
MUL_MAT(type_a=iq2_s,type_b=f32,m=16,n=16,k=256,bs=[1,1],nr=[1,1]): GGML_ASSERT: /home/aidanbelton/source/temp/llama.cpp/ggml-sycl.cpp:14019: false
The issue is a missing case for the quantization type.

@abhilash1910 (Collaborator, Author) commented Mar 19, 2024:

Actually, for both cases it should go down neither of those paths. It is strange that it falls back to either dequantize_mul_mat_vec or op_mul_mat_q; for this type only the vectorised mul_mat_vec_q (mmvq) path should be called.
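A simplified sketch of the intended routing, written as an assumption mirroring the CUDA-side branching; choose_mul_mat_path and the has_*_kernel flags are hypothetical names used only for illustration, not code from this PR.

    #include "ggml.h"

    enum mul_mat_path { PATH_DMMV, PATH_MMVQ, PATH_MMQ, PATH_DEQUANT_MATMUL };

    // For a quantized src0 such as IQ2_S with a single-row src1, the vectorised
    // mmvq path should be taken, so the dmmv/mmq switches that have no IQ2_S
    // case are never reached.
    static enum mul_mat_path choose_mul_mat_path(const struct ggml_tensor * src0,
                                                 const struct ggml_tensor * src1,
                                                 bool has_mmvq_kernel,   // true for iq2_s in this PR
                                                 bool has_mmq_kernel) {  // false for iq2_s
        const bool quantized  = ggml_is_quantized(src0->type);
        const bool single_row = ggml_nrows(src1) == 1;

        if (quantized && single_row && has_mmvq_kernel) {
            return PATH_MMVQ;       // intended path for IQ2_S
        }
        if (quantized && single_row) {
            return PATH_DMMV;       // currently asserts: no IQ2_S case
        }
        if (quantized && has_mmq_kernel) {
            return PATH_MMQ;        // currently asserts: no IQ2_S case
        }
        return PATH_DEQUANT_MATMUL; // dequantize to float and use the regular matmul
    }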

@NeoZhangJianyu (Collaborator) commented:

@abhilash1910 I tested a basic model (llama2-7b Q4) with your code and it fails.
Could you check your code?

I feel your solution changes more of the legacy code instead of only touching the IQ2_S data type.
That makes it complex and easy to break.

Also, in ggml_sycl_mul_mat() the fp32 and fp16 branches are removed. I'm afraid this will impact the legacy (Q4) performance.
Could you run llama-bench to verify that Q4 performance is not reduced?

I suggest you update the code with the minimum change needed to impact the IQ2_S type only.
That will reduce the risk to the legacy code.

@abhilash1910 (Collaborator, Author) commented Mar 26, 2024:

> @abhilash1910 I tested a basic model (llama2-7b Q4) with your code and it fails. Could you check your code?
>
> I feel your solution changes more of the legacy code instead of only touching the IQ2_S data type. That makes it complex and easy to break.
>
> Also, in ggml_sycl_mul_mat() the fp32 and fp16 branches are removed. I'm afraid this will impact the legacy (Q4) performance. Could you run llama-bench to verify that Q4 performance is not reduced?
>
> I suggest you update the code with the minimum change needed to impact the IQ2_S type only. That will reduce the risk to the legacy code.

In this case the logic flow is kept unchanged w.r.t. the fp32/fp16 dtypes. I think it won't impact the legacy functionality; rather, it sends the src types to the correct branch (dmmv/mmq or mmvq) based on the branching rules implemented in the CUDA source. This would help enable other quants through the mmvq pathway as well.

@NeoZhangJianyu (Collaborator) commented:

  1. Does CI pass?
  2. Why change ggml_sycl_mul_mat()? The logic looks more complex than the latest ggml_cuda_mul_mat().
     I think the CUDA code changes are based on test results.
     The SYCL code needs the same testing before being updated, instead of being copied from CUDA.
     As far as I know, converting fp32 to fp16 in some ops gives no benefit in the SYCL backend.

This PR is meant to support IQ2_S, so I suggest keeping the scope as small as possible.
If the IQ2_S op can't be called for an IQ2_S model, we need to move IQ2_S to the right position so it is called, instead of changing the whole ggml_sycl_mul_mat() logic.

If the logic in ggml_sycl_mul_mat() is changed, please run CI and llama-bench and make sure the performance on Intel GPUs is not impacted. That is more work.

SYCL support for IQ2_S is one task; refactoring ggml_sycl_mul_mat() is another.
I suggest removing the second task from this PR to reduce the risk.

@abhilash1910 (Collaborator, Author) commented:

> 1. Does CI pass?
> 2. Why change ggml_sycl_mul_mat()? The logic looks more complex than the latest ggml_cuda_mul_mat(). I think the CUDA code changes are based on test results. The SYCL code needs the same testing before being updated, instead of being copied from CUDA. As far as I know, converting fp32 to fp16 in some ops gives no benefit in the SYCL backend.
>
> This PR is meant to support IQ2_S, so I suggest keeping the scope as small as possible. If the IQ2_S op can't be called for an IQ2_S model, we need to move IQ2_S to the right position so it is called, instead of changing the whole ggml_sycl_mul_mat() logic.
>
> If the logic in ggml_sycl_mul_mat() is changed, please run CI and llama-bench and make sure the performance on Intel GPUs is not impacted. That is more work.
>
> SYCL support for IQ2_S is one task; refactoring ggml_sycl_mul_mat() is another. I suggest removing the second task from this PR to reduce the risk.

I agree, the conditional branching is complex, so I have reverted to the older legacy code. Testing is in progress. If refactoring is needed later, I will do it in a separate PR. Yes, conversion to fp16 won't help much; only the changes allocated for iq2 and adjacent quants apply here.

@AidanBeltonS (Contributor) left a review comment:

Some mul_mat quantization tests are failing on both Intel and Nvidia hardware.

Intel failing quantization types (Data Center Max 1100): iq2_xxs, iq2_xs, iq2_s, iq3_xxs, iq1_s, iq3_s
Nvidia failing quantization types (A100): iq1_s

Failing Tests

Failing on PVC:

  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  
  MUL_MAT(type_a=iq2_s,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq2_s,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq2_s,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq2_s,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq2_s,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL

  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL

  MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL

  MUL_MAT(type_a=iq3_s,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq3_s,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq3_s,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq3_s,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL
  MUL_MAT(type_a=iq3_s,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,1]): [MUL_MAT] NMSE = inf > 0.000500000 FAIL

  Failing on Nvidia:
  MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1]): [MUL_MAT] NMSE = 14.391691029 > 0.000500000 FAIL
  MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[1,1]): [MUL_MAT] NMSE = 17.976997390 > 0.000500000 FAIL
  MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[10,1],nr=[2,1]): [MUL_MAT] NMSE = 9.404314750 > 0.000500000 FAIL
  MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[1,1]): [MUL_MAT] NMSE = 8.215036844 > 0.000500000 FAIL
  MUL_MAT(type_a=iq1_s,type_b=f32,m=16,n=1,k=256,bs=[10,10],nr=[2,1]): [MUL_MAT] NMSE = 10.141738006 > 0.000500000 FAIL

@@ -4436,6 +4436,24 @@ static void dequantize_block_q6_K(const void * __restrict__ vx, dst_t * __restri
 #endif
 }

+inline bool ggml_sycl_supports_mmq(enum ggml_type type) {
@AidanBeltonS (Contributor) commented:

This is an unused function; not sure if it should be added.
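For context, a plausible body for this helper, under the assumption that it mirrors the analogous CUDA-side check and simply whitelists the types that have MMQ kernels; the exact list in the PR may differ.

    // Hypothetical sketch of ggml_sycl_supports_mmq, assumed to whitelist the
    // quantization types that have MMQ (quantized matrix-matrix) kernels.
    inline bool ggml_sycl_supports_mmq(enum ggml_type type) {
        switch (type) {
            case GGML_TYPE_Q4_0:
            case GGML_TYPE_Q4_1:
            case GGML_TYPE_Q5_0:
            case GGML_TYPE_Q5_1:
            case GGML_TYPE_Q8_0:
            case GGML_TYPE_Q2_K:
            case GGML_TYPE_Q3_K:
            case GGML_TYPE_Q4_K:
            case GGML_TYPE_Q5_K:
            case GGML_TYPE_Q6_K:
                return true;
            default:
                return false;
        }
    }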

Comment on lines +13844 to +13845:

    //GGML_ASSERT(ggml_nrows(src1) == 1);
    //GGML_ASSERT(ne10 % QK8_1 == 0);

@AidanBeltonS (Contributor) commented:

Delete these commented-out lines.

@abhilash1910 mentioned this pull request Mar 28, 2024
@abhilash1910 (Collaborator, Author) commented Mar 28, 2024:

> Some mul_mat quantization tests are failing on both Intel and Nvidia hardware.
>
> Intel failing quantization types (Data Center Max 1100): iq2_xxs, iq2_xs, iq2_s, iq3_xxs, iq1_s, iq3_s
> Nvidia failing quantization types (A100): iq1_s
>
> Failing Tests

Yes, I see the failing tests on the Intel GPU; the iq1_s failure is known, as that type is buggy.

@NeoZhangJianyu (Collaborator) commented:

@abhilash1910
The IQ types in this PR are supported by #6521.
I think this PR could be closed.

Thank you for your support!

@abhilash1910 (Collaborator, Author) commented:

Fixed in #6521, thanks @NeoZhangJianyu.
