
ggml_cuda_cpy support for 4d tensors and float16->float32 upcasting #686

Merged: 7 commits into ggerganov:master on Jan 29, 2024

Conversation

@balisujohn (Contributor) commented Jan 8, 2024

This PR adds support for 4d tensors to ggml_cuda_cpy. It seems to work in tortoise.cpp, but I only tested that it compiles in this codebase, so some tests may be necessary to ensure the behavior is correct. As with my other PR, I wasn't able to find any existing tests for the CUDA backend, so let me know how to add tests for this.

This PR also adds support for upcasting float16 to float32 in the ggml_cuda_cpy operation with the CUDA backend.

Fixes this issue: #672
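
For context, here is a minimal sketch of the kind of 4d-aware, stride-based upcasting copy this PR describes. The kernel name, parameter list, and launch layout below are illustrative only, following ggml's ne/nb element-count and byte-stride conventions; this is not the actual code in the PR:

```cuda
// Illustrative sketch, not the PR's kernel: a 4d copy that upcasts
// half -> float. Separate byte strides for src (nb00..nb03) and dst
// (nb10..nb13) allow either tensor to be non-contiguous.
#include <cuda_fp16.h>

__global__ void cpy_f16_to_f32_4d(
        const char * src, char * dst,
        const int ne0, const int ne1, const int ne2, const int ne3,
        const int nb00, const int nb01, const int nb02, const int nb03,
        const int nb10, const int nb11, const int nb12, const int nb13) {
    const long long i  = (long long) blockIdx.x * blockDim.x + threadIdx.x;
    const long long ne = (long long) ne0 * ne1 * ne2 * ne3;
    if (i >= ne) {
        return;
    }

    // unravel the flat index into 4d coordinates
    const int i0 = (int) (i % ne0);
    const int i1 = (int) ((i / ne0) % ne1);
    const int i2 = (int) ((i / ((long long) ne0 * ne1)) % ne2);
    const int i3 = (int) ( i / ((long long) ne0 * ne1 * ne2));

    // compute byte offsets independently for src and dst
    const half * x = (const half *) (src + i0*nb00 + i1*nb01 + i2*nb02 + i3*nb03);
    float      * y = (float      *) (dst + i0*nb10 + i1*nb11 + i2*nb12 + i3*nb13);

    *y = __half2float(*x);
}
```

With one thread per element, a launch such as `cpy_f16_to_f32_4d<<<(ne + 255)/256, 256>>>(...)` covers the whole tensor.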

@balisujohn (Contributor, Author)

Alright, I added tests for both the 4d copy behavior and the float16->float32 upcast. My new tests seem to pass, though someone else should probably review them. Please let me know if any more polish is needed.
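
As a rough illustration of what such a test exercises, the op can be driven through ggml's public C API. This is a CPU-only sketch; the CUDA backend path additionally requires backend and buffer setup not shown here:

```c
// Sketch: copy (and upcast) an f16 4d tensor into an f32 tensor of the
// same shape via ggml_cpy, which the CUDA backend lowers to ggml_cuda_cpy.
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * src = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, 4, 4, 4, 4);
    struct ggml_tensor * dst = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, 4, 4, 4, 4);

    // element-wise copy with f16 -> f32 conversion
    struct ggml_tensor * out = ggml_cpy(ctx, src, dst);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, out);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads =*/ 1);

    ggml_free(ctx);
    return 0;
}
```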

@balisujohn changed the title from "ggml_cuda_cpy support for 4d tensors" to "ggml_cuda_cpy support for 4d tensors and float16->float32 upcasting" on Jan 12, 2024
@balisujohn (Contributor, Author)

Uh, so one note: the 4d copy doesn't seem to work for copies from type GGML_TYPE_F32 to type GGML_TYPE_Q8_0.

@balisujohn (Contributor, Author)

(aside from the quant types, this is ready for review)

@slaren (Collaborator) commented Jan 18, 2024

I tried running test-backend-ops, but got several errors. The quant types need to be fixed.

Backend 2/2 (CUDA0)
  Backend name: CUDA0
  CPY(type_src=f32,type_dst=f32,ne=[256,10,10,1]): OK
  CPY(type_src=f32,type_dst=f16,ne=[256,10,10,1]): OK
  CPY(type_src=f32,type_dst=q4_0,ne=[256,10,10,1]): [CPY] NMSE = 1.910951328 > 0.000000100 sentinel mismatch: sent_2 FAIL
  CPY(type_src=f32,type_dst=q4_1,ne=[256,10,10,1]): [CPY] NMSE = 1.929971995 > 0.000000100 sentinel mismatch: sent_2 FAIL
  CPY(type_src=f32,type_dst=q5_0,ne=[256,10,10,1]): not supported [CUDA0]
  CPY(type_src=f32,type_dst=q5_1,ne=[256,10,10,1]): not supported [CUDA0]
  CPY(type_src=f32,type_dst=q8_0,ne=[256,10,10,1]): [CPY] NMSE = 1.921888075 > 0.000000100 sentinel mismatch: sent_2 sentinel mismatch: sent_3 FAIL
  CPY(type_src=f32,type_dst=q2_K,ne=[256,10,10,1]): not supported [CUDA0]
  CPY(type_src=f32,type_dst=q3_K,ne=[256,10,10,1]): not supported [CUDA0]
  CPY(type_src=f32,type_dst=q4_K,ne=[256,10,10,1]): not supported [CUDA0]
  CPY(type_src=f32,type_dst=q5_K,ne=[256,10,10,1]): not supported [CUDA0]
  CPY(type_src=f32,type_dst=q6_K,ne=[256,10,10,1]): not supported [CUDA0]
  CPY(type_src=f32,type_dst=iq2_xxs,ne=[256,10,10,1]): not supported [CUDA0] not supported [CPU]
  CPY(type_src=f32,type_dst=iq2_xs,ne=[256,10,10,1]): not supported [CUDA0] not supported [CPU]
  CPY(type_src=f16,type_dst=f32,ne=[256,10,10,1]): OK
  CPY(type_src=f32,type_dst=f32,ne=[4,4,4,4]): OK
  CPY(type_src=f32,type_dst=f16,ne=[4,4,4,4]): OK
  1180/1183 tests passed
  Backend CUDA0: FAIL

@iamlemec (Contributor) commented Jan 25, 2024

All the quantization tests pass if you replace line 4976 in ggml-cuda.cu:

const int dst_offset = i10*nb10 + i11*nb11 + i12*nb12 + i13*nb13;

with

const int dst_offset = (i10/qk)*nb10 + i11*nb11 + i12*nb12 + i13*nb13;

But then it seems to require that the first dimension of the dst tensor have a size divisible by the quantization block size. Is that a typical assumption?

I also tested this PR plus the fix on a bert.cpp revamp I'm working on, and it lets you do all the batched attention stuff on CUDA without reshaping tricks (and the numbers come out right).

Edit: Spoke too soon. That fixes the CPY stuff but breaks MOE on CUDA. I guess it doesn't satisfy the divisibility criterion?

@slaren (Collaborator) commented Jan 26, 2024

I think that's a fair assumption. The previous implementation also divided the calculation of i01 by qk.
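
To make the block arithmetic concrete, here is a minimal standalone illustration. The q8_0 values are assumptions: qk = 32 elements per block, 34 bytes per block (an fp16 scale plus 32 int8 values), with nb10 taken to be the per-block byte stride along dim 0:

```c
// Quantized rows are stored block by block: element i10 lives in block
// i10/qk, so byte offsets along dim 0 must step in whole blocks. This is
// also why ne[0] must be a multiple of qk.
#include <stdio.h>

int main(void) {
    const int qk   = 32;   // elements per q8_0 block (assumed)
    const int nb10 = 34;   // bytes per q8_0 block along dim 0 (assumed)
    const int i10  = 100;  // element index along dim 0

    const int wrong = i10*nb10;       // treats every element as a block: 3400
    const int fixed = (i10/qk)*nb10;  // block 3 -> byte offset 102

    printf("wrong=%d fixed=%d\n", wrong, fixed);
    return 0;
}
```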

@ggerganov (Owner)

> Spoke too soon. That fixes the CPY stuff but breaks MOE on CUDA. I guess it doesn't satisfy the divisibility criterion?

How does it break it?

@iamlemec (Contributor)

Huh, I just recompiled and ran it again, and all the tests pass, including MOE. So I guess we're all set!

@ggerganov (Owner)

Ah yes, the MOE test is expected to fail occasionally due to small variations that can lead to selecting different experts between the CPU and the GPU

@ggerganov (Owner)

@balisujohn Let's apply the proposed patch and we can merge

@ggerganov ggerganov requested a review from slaren January 29, 2024 08:13
@slaren slaren merged commit b2a5c34 into ggerganov:master Jan 29, 2024
4 checks passed