
feat: cuda implementation for ggml_conv_transpose_1d #854

Open · balisujohn wants to merge 11 commits into master
Conversation

balisujohn (Contributor)

Adds a CUDA implementation for ggml_conv_transpose_1d.

We may later extend the CPU version of the op to allow padding (currently, padding must be zero).
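
For context, a minimal sketch of how the op is invoked through the public ggml API, assuming an initialized ggml_context * ctx; the shapes below are illustrative, and with this PR the padding argument p0 must be 0:

    // illustrative shapes: kernel is [K, Cout, Cin], input is [L, Cin]
    struct ggml_tensor * kernel = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, 3, 16, 8);
    struct ggml_tensor * input  = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 10, 8);
    // stride s0 = 2, padding p0 = 0 (required for now), dilation d0 = 1
    struct ggml_tensor * out    = ggml_conv_transpose_1d(ctx, kernel, input, 2, 0, 1);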

Currently a draft; posting now to prevent duplication of effort. I will remove draft status when ready for review.

@balisujohn balisujohn marked this pull request as draft June 10, 2024 07:00
balisujohn (Contributor Author) commented Jun 13, 2024

Alright, this is ready for review. The key questions/concerns I have are as follows:

I added some tests for ggml_conv_transpose_1d in test-conv-transpose-1d.cpp; some of these mirror the conv transpose 1d tests in test-conv-transpose.c. I have verified that the tests in test-conv-transpose-1d.cpp pass in both CUDA and CPU mode, but I'm not sure whether what's currently in the PR will run the tests in CUDA mode in CI/CD; I had to remove the manual #define GGML_USE_CUBLAS I had added in order for the CI/CD build to work. I'm curious how best to combine the new tests with what's already there, and how to make sure the tests I added actually run with CUDA in the CI/CD process.

There is an extremely bizarre problem I'm encountering when trying to use this op with tortoise: the result is all zeros. But if this line is changed from

    dst[global_index] = accumulator;

to

    dst[global_index] += accumulator;

then the result is no longer all zeros.
I'm going to try to isolate the error into a self-contained example in a ggml fork so other people can take a look at it.

Some more background: the last test in test-conv-transpose-1d.cpp uses tensors of the same shape and calls the op with the same params as the call that produces the all-zero result in tortoise, and it seems to work correctly, so I don't think the problem is just the shape of the tensors or the parameters to the op. It seems that src0 and src1 are pointing to zero-valued vectors in this case, so even though the kernel logic matches the op's semantics (sketched just below this comment), the accumulator is being set to zero. And for some reason dst is pre-populated with non-zero values, so both += and no assignment to dst result in non-zero values being present in the result.

(Edit: the weird bug is fixed.)
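
For readers unfamiliar with the op, here is a scalar reference sketch of the transposed-convolution semantics the tests exercise, for a single channel with zero padding and unit dilation; conv_transpose_1d_ref is a hypothetical helper for illustration, not ggml code:

    // Hypothetical single-channel reference, not ggml's implementation.
    // For p0 = 0 and d0 = 1, the output length is (ni - 1)*s0 + nk.
    void conv_transpose_1d_ref(const float * kernel, int nk,
                               const float * input,  int ni,
                               float * out, int s0) {
        const int no = (ni - 1)*s0 + nk;
        for (int o = 0; o < no; ++o) {
            out[o] = 0.0f;                             // dst must start zeroed...
        }
        for (int i = 0; i < ni; ++i) {
            for (int k = 0; k < nk; ++k) {
                out[i*s0 + k] += input[i] * kernel[k]; // ...because contributions overlap
            }
        }
    }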

@balisujohn balisujohn marked this pull request as ready for review June 13, 2024 22:41
slaren (Collaborator) commented Jun 14, 2024

The GitHub runners cannot run CUDA code. There is the ggml-ci, which can run CUDA tests, but it only monitors branches in this repository. A test should be added to test-backend-ops as well.
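
As a rough sketch of what such a case could look like, modeled on the test_case pattern already used in test-backend-ops.cpp (the exact base-class interface and member names here are illustrative, not verbatim):

    // Sketch only: follows the spirit of test-backend-ops.cpp's test_case pattern.
    struct test_conv_transpose_1d : public test_case {
        const std::array<int64_t, 4> ne_input;   // [L, Cin, 1, 1]
        const std::array<int64_t, 4> ne_kernel;  // [K, Cout, Cin, 1]
        const int s0; // stride
        const int p0; // padding (must be 0 for now)
        const int d0; // dilation

        test_conv_transpose_1d(std::array<int64_t, 4> ne_input, std::array<int64_t, 4> ne_kernel,
                               int s0, int p0, int d0)
            : ne_input(ne_input), ne_kernel(ne_kernel), s0(s0), p0(p0), d0(d0) {}

        ggml_tensor * build_graph(ggml_context * ctx) override {
            ggml_tensor * input  = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne_input.data());
            ggml_tensor * kernel = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne_kernel.data());
            return ggml_conv_transpose_1d(ctx, kernel, input, s0, p0, d0);
        }
    };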


Review comment on:

    int out_index = global_index / dst_ne0;

    int accumulator = 0;

int -> float

balisujohn (Contributor Author) replied:

Yeah, that seems to fix the issue I was experiencing; it's bizarre that the tests still passed even in CUDA mode with the types accidentally set to int. Thanks so much!



Review comment on:

    int kernel_weight = src0[kernel_offset + weight_idx];
    int input_value = src1[input_offset+i];

int -> float

balisujohn (Contributor Author)

I fixed the accumulator bug and added a test to test-backend-ops; ready for review again :^)

Review comment on:

    cudaStream_t stream = ctx.stream();

    GGML_ASSERT(src0->type == GGML_TYPE_F32);
    GGML_ASSERT( dst->type == GGML_TYPE_F32);

Owner:

Also assert contiguous src0 and src1.
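
A sketch of the requested asserts, using ggml's existing ggml_is_contiguous helper:

    GGML_ASSERT(ggml_is_contiguous(src0));
    GGML_ASSERT(ggml_is_contiguous(src1));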

ggerganov (Owner) left a comment:

I don't see an update to ggml_backend_cuda_supports_op() to indicate that this operation is now supported. While at it, update the rest of the GPU backends to return false in their corresponding supports_op() functions so that the tests do not fail.
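
Roughly the kind of change being requested; the exact signature and switch layout in ggml-cuda.cu differ, so treat this as a sketch:

    // Sketch: advertise the op in the CUDA backend's supports_op, and return
    // false for GGML_OP_CONV_TRANSPOSE_1D in the other GPU backends for now.
    static bool ggml_backend_cuda_supports_op(ggml_backend_t backend, const ggml_tensor * op) {
        switch (op->op) {
            // ... existing cases ...
            case GGML_OP_CONV_TRANSPOSE_1D:
                return true;
            default:
                return false;
        }
    }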

Maybe add a few more small tests to test-backend-ops, just in case.
