[featured] Support for Metal #108
Comments
In ggml, efforts are underway to add the missing kernels in Metal. Perhaps, when this pull request is merged, I will add Metal support to stable diffusion.
Nice :) I'm only waiting for this so I can add it to my app: https://apps.apple.com/us/app/aikit-ai-tools-easy/id6470067977 Anyone can test the other features there for free, and SD in the future.
@paulocoutinhox you can try enabling the Metal backend with PR #104.
Hi, I'm trying it and it works on macOS with M1.
But on iOS, with exactly the same C++ code, I get:
It looks like Metal buffers cannot exceed 1024 MB on this device. It may be possible to work around the issue by splitting the weights into multiple buffers. From the ggml-backend side, we could add support for doing this automatically in
Hi, after some searching, the problem is related to Example:
There is a limit that doesn't let you create a buffer larger than 1024 MB.
I think it could be solved by separating the convolution weights and the attention weights into different buffers, although that implies quite a cumbersome change due to the way tensor memory is allocated.
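As a rough illustration of the "multiple buffers" idea discussed above, here is a minimal sketch that greedily packs tensors into buffers so that no buffer exceeds a given limit. It is not ggml code: `split_into_buffers` and its parameters are hypothetical names, the tensor sizes are assumed to be known up front, and alignment requirements are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper, not part of ggml or ggml-backend.
// Greedily assigns tensors (given by their sizes in bytes, in allocation order)
// to buffers so that no buffer grows beyond max_buffer_size.
// Returns, for each tensor, the index of the buffer it was placed in;
// buffer_sizes receives the total number of bytes used by each buffer.
std::vector<size_t> split_into_buffers(const std::vector<size_t>& tensor_sizes,
                                       size_t max_buffer_size,
                                       std::vector<size_t>& buffer_sizes) {
    assert(max_buffer_size > 0);
    std::vector<size_t> assignment;
    assignment.reserve(tensor_sizes.size());
    buffer_sizes.clear();
    for (size_t size : tensor_sizes) {
        if (size > max_buffer_size) {
            // A single tensor larger than the limit cannot be placed at all.
            throw std::runtime_error("a single tensor exceeds the buffer limit");
        }
        // Open a new buffer when the current one cannot hold this tensor.
        if (buffer_sizes.empty() || buffer_sizes.back() + size > max_buffer_size) {
            buffer_sizes.push_back(0);
        }
        buffer_sizes.back() += size;
        assignment.push_back(buffer_sizes.size() - 1);
    }
    return assignment;
}
```

With the 1024 MB limit reported in this thread and made-up tensor sizes of 600, 300, 200, 900 and 100 MB, this sketch places the tensors in three buffers of 900, 200 and 1000 MB. A real implementation would also have to respect alignment and the way tensor memory is currently allocated, which is the cumbersome part mentioned above.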
Just for my info: is this limit specific to the M1 Pro? For example, @slaren, is it a different limit on the M3? Edit: nvm, it's the A15 GPU that has the limit.
It's iOS, on the Apple A15 GPU. On M1 and M3 it seems to work, but it is much slower for some reason.
I haven't tested it, but I think it is mostly a limit on iOS devices. It should be possible to query the limit with
Ah yes, we already have most of the logic there.
I believe the bottleneck on M1 and M3 is matrix multiplication, as stable diffusion requires very large matrix multiplications. In CUDA, these are done in batches across the number of heads, making it quite fast. In Metal, I assume they are performed sequentially.
Doing some quick runs, there are a few cases that are not currently optimized in the Metal backend:
# example 1
ggml_metal_graph_compute_block_invoke: op - MUL_MAT
ggml_metal_graph_compute_block_invoke: src0 - f32 [ 40, 77, 8], 1, (reshaped) (permuted) (cont) (reshaped)
ggml_metal_graph_compute_block_invoke: src1 - f32 [ 40, 4096, 8], 1, (view) (reshaped) (permuted) (cont) (reshaped)
ggml_metal_graph_compute_block_invoke: dst - f32 [ 77, 4096, 8], 1, node_2274
# example 2
ggml_metal_graph_compute_block_invoke: op - MUL_MAT
ggml_metal_graph_compute_block_invoke: src0 - f32 [ 77, 40, 8], 1, (reshaped) (permuted) (cont) (reshaped)
ggml_metal_graph_compute_block_invoke: src1 - f32 [ 77, 4096, 8], 1, (view)
ggml_metal_graph_compute_block_invoke: dst - f32 [ 40, 4096, 8], 1, node_2276
# example 3
ggml_metal_graph_compute_block_invoke: op - MUL_MAT
ggml_metal_graph_compute_block_invoke: src0 - f32 [ 40, 4096, 8], 1, (reshaped) (permuted) (cont) (reshaped)
ggml_metal_graph_compute_block_invoke: src1 - f32 [ 40, 4096, 8], 1, (view) (reshaped) (permuted) (cont) (reshaped)
ggml_metal_graph_compute_block_invoke: dst - f32 [ 4096, 4096, 8], 1, node_2253
# example 1
ggml_metal_graph_compute_block_invoke: op - MUL_MAT
ggml_metal_graph_compute_block_invoke: src0 - f16 [ 2880, 4096, 1], 1, (reshaped)
ggml_metal_graph_compute_block_invoke: src1 - f16 [ 2880, 320, 1], 1, leaf_36 (reshaped)
ggml_metal_graph_compute_block_invoke: dst - f32 [ 4096, 320, 1], 1, node_2206
# example 2
ggml_metal_graph_compute_block_invoke: op - MUL_MAT
ggml_metal_graph_compute_block_invoke: src0 - f16 [ 960, 4096, 1], 1, (reshaped)
ggml_metal_graph_compute_block_invoke: src1 - f16 [ 960, 320, 1], 1, leaf_629 (reshaped)
ggml_metal_graph_compute_block_invoke: dst - f32 [ 4096, 320, 1], 1, node_2213
# example 3
ggml_metal_graph_compute_block_invoke: op - MUL_MAT
ggml_metal_graph_compute_block_invoke: src0 - f16 [ 320, 4096, 1], 1, (reshaped)
ggml_metal_graph_compute_block_invoke: src1 - f16 [ 320, 320, 1], 1, leaf_35 (reshaped)
ggml_metal_graph_compute_block_invoke: dst - f32 [ 4096, 320, 1], 1, node_2226
All of these are currently sub-optimal and can probably be improved significantly with better kernels.
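To make the shapes in these logs easier to read, here is an unoptimized reference sketch of a batched matrix multiplication that follows the same shape pattern: src0 is [K, M, B], src1 is [K, N, B] and dst is [M, N, B], with the first dimension contiguous in memory. The function name `batched_mul_mat` and the flat `std::vector<float>` layout are assumptions made for illustration; this is not the Metal kernel or ggml's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference (unoptimized) batched matrix multiplication, for illustration only.
// src0 holds B matrices of M rows with K contiguous elements each,
// src1 holds B matrices of N rows with K contiguous elements each,
// dst  holds B matrices of N rows with M contiguous elements each, where
// dst[b][n][m] = sum over k of src0[b][m][k] * src1[b][n][k].
std::vector<float> batched_mul_mat(const std::vector<float>& src0,
                                   const std::vector<float>& src1,
                                   size_t K, size_t M, size_t N, size_t B) {
    assert(src0.size() == K * M * B);
    assert(src1.size() == K * N * B);
    std::vector<float> dst(M * N * B, 0.0f);
    // Each batch (for example, one attention head) is an independent matmul,
    // so this outer loop is the part a batched kernel can run in parallel.
    for (size_t b = 0; b < B; ++b) {
        for (size_t n = 0; n < N; ++n) {
            for (size_t m = 0; m < M; ++m) {
                float acc = 0.0f;
                for (size_t k = 0; k < K; ++k) {
                    acc += src0[(b * M + m) * K + k] * src1[(b * N + n) * K + k];
                }
                dst[(b * N + n) * M + m] = acc;
            }
        }
    }
    return dst;
}
```

In the first logged case this corresponds to K = 40, M = 77, N = 4096 and B = 8, giving a dst of [77, 4096, 8]. Because the B multiplications do not depend on each other, running them one after another leaves parallelism unused, which is consistent with the slowdown discussed above.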
Fixed. Thanks.
Hi,
Can you add support for Metal?
Thanks.