
add some new ops, fix some operators and add batch operations to certain operators. #747

Merged · 19 commits · Mar 3, 2024

Conversation

@leejet (Contributor) commented Feb 25, 2024

I tested it in the sd.cpp project, and everything works well.

@FSSRepo (Collaborator) commented Feb 27, 2024

@leejet I believe that before merging this, the respective Metal kernels should be created to maintain compatibility, although it doesn't really help much since sd.cpp doesn't benefit from the Metal backend as matrix operations are very slow.

@leejet (Contributor, Author) commented Feb 27, 2024

OK, I will find time to add support for Metal backend.

@leejet (Contributor, Author) commented Mar 2, 2024

@FSSRepo I have implemented Metal backend support for ggml_arange and ggml_timestep_embedding, and tested it with the test-arange and test-timestep-embedding test cases to ensure it works correctly. Could you find some time to review and merge this PR?

@FSSRepo (Collaborator) commented Mar 2, 2024

OK, I will test this PR with the latest sd.cpp.

@slaren (Collaborator) commented Mar 2, 2024

  • Does arange affect the performance significantly? Are there any downsides to constructing the tensor in a CPU buffer and copying it to the backend as an input?
  • Does timestep_embedding have any use outside of SD? And again, would it hurt performance significantly if it was implemented on the CPU and copied to the backend as an input?

@leejet (Contributor, Author) commented Mar 2, 2024

These two operations won't significantly improve performance, but in stable video diffusion they need to be performed multiple times within the UNet network, and their parameters may change with each execution. If I don't execute these operations as part of the graph, I would have to keep track of the tensors produced by each operation and their corresponding parameters, then copy the data in after ggml_gallocr_alloc_graph, which makes the process more complex. Implementing them as ggml operators makes the code more elegant.

@slaren (Collaborator) left a comment:

It would be good to have tests for these operators in test-backend-ops.

Review threads opened on: src/ggml-cuda.cu, src/ggml.c, tests/test-timestep_embedding.cpp (all resolved).
@slaren requested a review from ggerganov on March 2, 2024.
@ggerganov (Owner) left a comment:

After adding tests to test-backend-ops we can merge

Review threads opened on: include/ggml/ggml.h, src/ggml-cuda.cu, src/ggml-metal.m, src/ggml-metal.metal (all resolved).
src/ggml.c (review thread on this change):

    -    float variance = sum2 / (ne00 * ne01 * step);
    +    float variance = sum2;
@ggerganov (Owner) commented:

What was the reason for changing this normalization to happen inside the loop?

@leejet (Contributor, Author) commented:

Previously the code calculated the sum of (v * v) and then divided by (ne00 * ne01 * step). I was concerned that in certain situations this might lead to overflow or precision issues, so I changed it to accumulate (v * v) / (ne00 * ne01 * step) directly, which yields the variance without building a large intermediate sum.

@ggerganov (Owner) commented:

It shouldn't cause problems, since we accumulate into double (i.e. ggml_float). I've introduced a separate accumulator per row and restored the normalization at the end, in order to avoid an extra division in the loop.

leejet and others added 6 commits March 3, 2024 12:46
@leejet (Contributor, Author) commented Mar 3, 2024

I have modified the code based on the suggestions from the review and added tests to test-backend-ops. There are two uncertain points still awaiting feedback from the maintainer.

@ggerganov requested a review from slaren on March 3, 2024.
@slaren merged commit 2746808 into ggerganov:master on Mar 3, 2024.
4 checks passed

4 participants