
ggml_cuda_cpy support for 4d tensors and float16->float32 upcasting #686

Merged: 7 commits into ggerganov:master on Jan 29, 2024

Conversation

@balisujohn (Contributor) commented Jan 8, 2024

This PR adds support for 4d tensors to ggml_cuda_cpy. It seems to work in tortoise.cpp, but I only tested that it compiles in this codebase, so some tests may be necessary to ensure the behavior is correct. As with my other PR, I wasn't able to find any existing tests for the CUDA backend, so let me know how to add tests for this.

This PR also adds support for upcasting float16 to float32 in the ggml_cuda_cpy operation with the CUDA backend.

Fixes this issue: #672
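
For context, here is a minimal sketch of the kind of 4d-aware, stride-based upcasting copy this PR describes. The kernel name, parameter list, and launch layout below are illustrative only, following ggml's ne/nb element-count and byte-stride conventions; this is not the actual code in the PR:

```cuda
// Illustrative sketch, not the PR's kernel: a 4d copy that upcasts
// half -> float. Separate byte strides for src (nb00..nb03) and dst
// (nb10..nb13) allow either tensor to be non-contiguous.
#include <cuda_fp16.h>

__global__ void cpy_f16_to_f32_4d(
        const char * src, char * dst,
        const int ne0, const int ne1, const int ne2, const int ne3,
        const int nb00, const int nb01, const int nb02, const int nb03,
        const int nb10, const int nb11, const int nb12, const int nb13) {
    const long long i  = (long long) blockIdx.x * blockDim.x + threadIdx.x;
    const long long ne = (long long) ne0 * ne1 * ne2 * ne3;
    if (i >= ne) {
        return;
    }

    // unravel the flat index into 4d coordinates
    const int i0 = (int) (i % ne0);
    const int i1 = (int) ((i / ne0) % ne1);
    const int i2 = (int) ((i / ((long long) ne0 * ne1)) % ne2);
    const int i3 = (int) ( i / ((long long) ne0 * ne1 * ne2));

    // compute byte offsets independently for src and dst
    const half * x = (const half *) (src + i0*nb00 + i1*nb01 + i2*nb02 + i3*nb03);
    float      * y = (float      *) (dst + i0*nb10 + i1*nb11 + i2*nb12 + i3*nb13);

    *y = __half2float(*x);
}
```

With one thread per element, a launch such as `cpy_f16_to_f32_4d<<<(ne + 255)/256, 256>>>(...)` covers the whole tensor.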

@balisujohn (Contributor, Author)

Alright, I added tests for both the 4d copy behavior and the float16->float32 upcast. My new tests seem to pass, though someone else should probably review them. Please let me know if any more polish is needed.
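
As a rough illustration of what such a test exercises, the op can be driven through ggml's public C API. This is a CPU-only sketch; the CUDA backend path additionally requires backend and buffer setup not shown here:

```c
// Sketch: copy (and upcast) an f16 4d tensor into an f32 tensor of the
// same shape via ggml_cpy, which the CUDA backend lowers to ggml_cuda_cpy.
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * src = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, 4, 4, 4, 4);
    struct ggml_tensor * dst = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, 4, 4, 4, 4);

    // element-wise copy with f16 -> f32 conversion
    struct ggml_tensor * out = ggml_cpy(ctx, src, dst);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, out);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads =*/ 1);

    ggml_free(ctx);
    return 0;
}
```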

@balisujohn changed the title from "ggml_cuda_cpy support for 4d tensors" to "ggml_cuda_cpy support for 4d tensors and float16->float32 upcasting" on Jan 12, 2024
@balisujohn (Contributor, Author)

Uh, so one note: the 4d copy doesn't seem to work for copies from type GGML_TYPE_F32 to type GGML_TYPE_Q8_0.

@balisujohn (Contributor, Author)

(aside from the quant types, this is ready for review)

@slaren (Collaborator) commented Jan 18, 2024

I tried running test-backend-ops, but got several errors. The quant types need to be fixed.

Backend 2/2 (CUDA0)
  Backend name: CUDA0
  CPY(type_src=f32,type_dst=f32,ne=[256,10,10,1]): OK
  CPY(type_src=f32,type_dst=f16,ne=[256,10,10,1]): OK
  CPY(type_src=f32,type_dst=q4_0,ne=[256,10,10,1]): [CPY] NMSE = 1.910951328 > 0.000000100 sentinel mismatch: sent_2 FAIL
  CPY(type_src=f32,type_dst=q4_1,ne=[256,10,10,1]): [CPY] NMSE = 1.929971995 > 0.000000100 sentinel mismatch: sent_2 FAIL
  CPY(type_src=f32,type_dst=q5_0,ne=[256,10,10,1]): not supported [CUDA0]
  CPY(type_src=f32,type_dst=q5_1,ne=[256,10,10,1]): not supported [CUDA0]
  CPY(type_src=f32,type_dst=q8_0,ne=[256,10,10,1]): [CPY] NMSE = 1.921888075 > 0.000000100 sentinel mismatch: sent_2 sentinel mismatch: sent_3 FAIL
  CPY(type_src=f32,type_dst=q2_K,ne=[256,10,10,1]): not supported [CUDA0]
  CPY(type_src=f32,type_dst=q3_K,ne=[256,10,10,1]): not supported [CUDA0]
  CPY(type_src=f32,type_dst=q4_K,ne=[256,10,10,1]): not supported [CUDA0]
  CPY(type_src=f32,type_dst=q5_K,ne=[256,10,10,1]): not supported [CUDA0]
  CPY(type_src=f32,type_dst=q6_K,ne=[256,10,10,1]): not supported [CUDA0]
  CPY(type_src=f32,type_dst=iq2_xxs,ne=[256,10,10,1]): not supported [CUDA0] not supported [CPU]
  CPY(type_src=f32,type_dst=iq2_xs,ne=[256,10,10,1]): not supported [CUDA0] not supported [CPU]
  CPY(type_src=f16,type_dst=f32,ne=[256,10,10,1]): OK
  CPY(type_src=f32,type_dst=f32,ne=[4,4,4,4]): OK
  CPY(type_src=f32,type_dst=f16,ne=[4,4,4,4]): OK
  1180/1183 tests passed
  Backend CUDA0: FAIL

@iamlemec (Contributor) commented Jan 25, 2024

All the quantization tests pass if you replace line 4976 in ggml-cuda.cu:

const int dst_offset = i10*nb10 + i11*nb11 + i12*nb12 + i13*nb13;

with

const int dst_offset = (i10/qk)*nb10 + i11*nb11 + i12*nb12 + i13*nb13;

But then it seems to require that the first dimension of the dst tensor have a size divisible by the quantization block size. Is that a typical assumption?

I also tested this PR plus the fix on a bert.cpp revamp I'm working on, and it lets you do all the batched attention stuff on CUDA without reshaping tricks (and the numbers come out right).

Edit: Spoke too soon. That fixes the CPY stuff but breaks MOE on CUDA. I guess it doesn't satisfy the divisibility criterion?

@slaren (Collaborator) commented Jan 26, 2024

I think that's a fair assumption. The previous implementation also divided the calculation of i01 by qk.
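
To make the block arithmetic concrete, here is a minimal standalone illustration. The q8_0 values are assumptions: qk = 32 elements per block, 34 bytes per block (an fp16 scale plus 32 int8 values), with nb10 taken to be the per-block byte stride along dim 0:

```c
// Quantized rows are stored block by block: element i10 lives in block
// i10/qk, so byte offsets along dim 0 must step in whole blocks. This is
// also why ne[0] must be a multiple of qk.
#include <stdio.h>

int main(void) {
    const int qk   = 32;   // elements per q8_0 block (assumed)
    const int nb10 = 34;   // bytes per q8_0 block along dim 0 (assumed)
    const int i10  = 100;  // element index along dim 0

    const int wrong = i10*nb10;       // treats every element as a block: 3400
    const int fixed = (i10/qk)*nb10;  // block 3 -> byte offset 102

    printf("wrong=%d fixed=%d\n", wrong, fixed);
    return 0;
}
```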

@ggerganov (Owner)

> Spoke too soon. That fixes the CPY stuff but breaks MOE on CUDA. I guess it doesn't satisfy the divisibility criterion?

How does it break it?

@iamlemec (Contributor)

Huh, I just recompiled and ran it again, and all the tests pass, including MOE. So I guess we're all set!

@ggerganov (Owner)

Ah yes, the MOE test is expected to fail occasionally due to small variations that can lead to selecting different experts between the CPU and the GPU

@ggerganov (Owner)

@balisujohn Let's apply the proposed patch and we can merge

@ggerganov ggerganov requested a review from slaren January 29, 2024 08:13
@slaren slaren merged commit b2a5c34 into ggerganov:master Jan 29, 2024
4 checks passed