[featured] Support for Metal #108

Closed
paulocoutinhox opened this issue Dec 8, 2023 · 14 comments

@paulocoutinhox

Hi,

Can you add support for Metal?

Thanks.

@FSSRepo
Contributor

FSSRepo commented Dec 8, 2023

In ggml, efforts are underway to add the missing Metal kernels. Once that pull request is merged, I may add Metal support to stable diffusion.

@paulocoutinhox
Author

Nice :) I'm just waiting for this so I can add it to my app https://apps.apple.com/us/app/aikit-ai-tools-easy/id6470067977

Anyone can test the other features there for free, and in the future SD as well.

@FSSRepo
Contributor

FSSRepo commented Dec 11, 2023

@paulocoutinhox you can try enabling the Metal backend with PR #104

@paulocoutinhox
Author

Hi,

I'm trying it, and it works on macOS with an M1.

[2023-12-12 17:52:15.113] [debug] [MappingStableDiffusion : callbackGenerate] Generating image...
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/paulo/Developer/workspaces/cpp/build-ai-kit-Desktop_arm_darwin_generic_mach_o_64bit-Debug/bin/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 11453.25 MB
ggml_metal_init: maxTransferRate               = built-in GPU
[INFO]  stable-diffusion.cpp:4996 - loading model from '/Users/paulo/Downloads/dreamshaper_8.safetensors'
[INFO]  model.cpp:627  - load /Users/paulo/Downloads/dreamshaper_8.safetensors using safetensors format
[INFO]  stable-diffusion.cpp:5019 - Stable Diffusion 1.x 
[INFO]  stable-diffusion.cpp:5025 - Stable Diffusion weight type: f16
[INFO]  stable-diffusion.cpp:5179 - total memory buffer size = 1972.80MB (clip 236.18MB, unet 1641.16MB, vae 95.47MB)
[INFO]  stable-diffusion.cpp:5181 - loading model from '/Users/paulo/Downloads/dreamshaper_8.safetensors' completed, taking 1.44s
[INFO]  stable-diffusion.cpp:5195 - running in eps-prediction mode
Option: 
    n_threads:         8
    mode:              0
    model_path:        /Users/paulo/Downloads/dreamshaper_8.safetensors
    wtype:             unspecified
    vae_path:          
    taesd_path:        
    output_path:       output.png
    init_img:          
    prompt:            a cat with blue eyes
    negative_prompt:   
    cfg_scale:         7.00
    width:             512
    height:            512
    sample_method:     0
    schedule:          0
    sample_steps:      20
    strength(img2img): 0.75
    rng:               1
    seed:              42
    batch_count:       1
[INFO]  stable-diffusion.cpp:6034 - apply_loras completed, taking 0.00s
[INFO]  stable-diffusion.cpp:6063 - get_learned_condition completed, taking 108 ms
[INFO]  stable-diffusion.cpp:6073 - sampling using Euler A method
[INFO]  stable-diffusion.cpp:6077 - generating image: 1/1 - seed 42
  |==================================================| 20/20 - 6.45s/it
[INFO]  stable-diffusion.cpp:6089 - sampling completed, taking 131.88s
[INFO]  stable-diffusion.cpp:6097 - generating 1 latent images completed, taking 131.89s
[INFO]  stable-diffusion.cpp:6099 - decoding 1 latents
[INFO]  stable-diffusion.cpp:6111 - latent 1 decoded, taking 10.99s
[INFO]  stable-diffusion.cpp:6115 - decode_first_stage completed, taking 10.99s
[INFO]  stable-diffusion.cpp:6122 - txt2img completed in 142.99s
[2023-12-12 17:54:41.942] [debug] Image generated
Returned Value: OK
Returned Image Size: 341542 bytes / 512x512

But on iOS, with exactly the same C++ code, I get:

ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple A15 GPU
ggml_metal_init: loading '/private/var/containers/Bundle/Application/2C05218B-7DEA-41E7-AC7B-34A067581165/AiKit.app/default.metallib'
[INFO]  stable-diffusion.cpp:4996 - loading model from '/private/var/mobile/Library/Mobile Documents/com~apple~CloudDocs/Documents/Testes/dreamshaper_8.safetensors'
[INFO]  model.cpp:627  - load /private/var/mobile/Library/Mobile Documents/com~apple~CloudDocs/Documents/Testes/dreamshaper_8.safetensors using safetensors format
[INFO]  stable-diffusion.cpp:5019 - Stable Diffusion 1.x 
[INFO]  stable-diffusion.cpp:5025 - Stable Diffusion weight type: f16
-[MTLDebugDevice newBufferWithBytesNoCopy:length:options:deallocator:]:700: failed assertion `Buffer Validation
newBufferWith*:length 0x6692c000 must not exceed 1024 MB.

@slaren

slaren commented Dec 13, 2023

It looks like Metal buffers cannot exceed 1024 MB on this device. It may be possible to work around the issue by splitting the weights into multiple buffers. From the ggml-backend side, we could add support for doing this automatically in ggml_backend_alloc_ctx_tensors.
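
A minimal sketch of that workaround (a hypothetical helper, not the actual ggml-backend code), assuming the weights live in one large page-aligned host region: map it as several Metal buffers, each no larger than device.maxBufferLength.

#import <Metal/Metal.h>

// Sketch only: map one large host region into multiple Metal buffers,
// each within device.maxBufferLength. newBufferWithBytesNoCopy requires
// page-aligned pointers and lengths, which this sketch glosses over.
static NSArray<id<MTLBuffer>> *map_in_chunks(id<MTLDevice> device,
                                             void *data, size_t total_size) {
    NSMutableArray<id<MTLBuffer>> *buffers = [NSMutableArray array];
    size_t max_len = (size_t) device.maxBufferLength;
    for (size_t offset = 0; offset < total_size; ) {
        size_t chunk = total_size - offset;
        if (chunk > max_len) chunk = max_len;
        id<MTLBuffer> buf =
            [device newBufferWithBytesNoCopy:(char *)data + offset
                                      length:chunk
                                     options:MTLResourceStorageModeShared
                                 deallocator:nil];
        if (buf == nil) return nil; // mapping failed
        [buffers addObject:buf];
        offset += chunk;
    }
    return buffers;
}

Tensors would then have to be mapped to the chunk containing their offset, and the allocator would need to avoid letting a tensor straddle a chunk boundary, which is why ggml_backend_alloc_ctx_tensors has to know about the split.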

@paulocoutinhox
Author

Hi,

After searching, the problem is related to newBufferWithBytesNoCopy.

Example:

ctx->buffers[ctx->n_buffers].size // <-- this cannot be more than 1024 MB
ctx->buffers[ctx->n_buffers].metal = [ctx->device newBufferWithBytesNoCopy [...]

There is a limit that doesn't let you create a buffer larger than 1024 MB.
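
For illustration, a hypothetical guard (not code that exists in ggml-metal.m) that would report the limit explicitly instead of tripping the Metal validation-layer assertion; device.maxBufferLength is the real Metal property for the per-buffer limit:

// hypothetical pre-check before the newBufferWithBytesNoCopy call above
if (ctx->buffers[ctx->n_buffers].size > (size_t) ctx->device.maxBufferLength) {
    fprintf(stderr, "%s: buffer size exceeds device.maxBufferLength\n", __func__);
    return false;
}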

@FSSRepo
Contributor

FSSRepo commented Dec 13, 2023

I think it could be solved by separating the convolution weights and the attention weights into different buffers, although that implies a fairly cumbersome change because of the way tensor memory is allocated.

@ggerganov
Contributor

ggerganov commented Dec 13, 2023

Just for my info - is this limit specific to the M1 Pro? For example, @slaren, is it a different limit on the M3?

Edit: nvm, it's the A15 GPU that has the limit.

@FSSRepo
Contributor

FSSRepo commented Dec 13, 2023

Just for my info - is this limit specific to the M1 Pro? For example, @slaren, is it a different limit on the M3?

It's iOS, the Apple A15 GPU. On the M1 and M3 it seems to work, but it's very slow for some reason.

@slaren

slaren commented Dec 13, 2023

I haven't tested it, but I think it is mostly a limit on iOS devices. It should be possible to query the limit with device.maxBufferLength. From what I understand, ggml_metal_add_buffer already does this, we just need to support this in ggml-backend.
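
For reference, a minimal standalone example (not ggml code) of querying that limit:

#import <Metal/Metal.h>

int main(void) {
    // maxBufferLength is the largest single MTLBuffer the device allows;
    // on the A15 above it is evidently 1024 MB
    id<MTLDevice> device = MTLCreateSystemDefaultDevice();
    NSLog(@"maxBufferLength = %lu bytes", (unsigned long)device.maxBufferLength);
    return 0;
}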

@ggerganov
Contributor

From what I understand, ggml_metal_add_buffer already does this, we just need to support this in ggml-backend.

Ah yes, we already have most of the logic there.

@FSSRepo
Contributor

FSSRepo commented Dec 13, 2023

I believe the bottleneck on the M1 and M3 is matrix multiplication, as stable diffusion requires very large matrix multiplications. In CUDA, these are done in batches across the number of heads, making it quite fast. In Metal, I assume they are performed sequentially.
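
Schematically (plain C, not the actual ggml/CUDA/Metal kernels; all names here are illustrative), the difference is one launch versus n_head launches:

#include <stddef.h>

// naive single matrix multiply: C[m x n] = A[m x k] * B[k x n]
static void matmul(float *C, const float *A, const float *B, int m, int n, int k) {
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int l = 0; l < k; l++)
                acc += A[i*k + l] * B[l*n + j];
            C[i*n + j] = acc;
        }
}

// the "sequential" strategy: one matmul (one kernel launch, on a GPU)
// per attention head; a batched GEMM instead covers all heads in a
// single launch, amortizing the launch overhead n_head times
static void matmul_per_head(float *C, const float *A, const float *B,
                            int n_head, int m, int n, int k) {
    for (int h = 0; h < n_head; h++)
        matmul(C + (size_t)h*m*n, A + (size_t)h*m*k, B + (size_t)h*k*n, m, n, k);
}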

@ggerganov
Contributor

Doing some quick runs, there are a few cases that are not currently optimized in the Metal backend:

  • ne00 % 32 != 0. In these cases, Metal will currently use the mat-vec kernel instead of the mat-mat kernel. These can probably be solved by padding.
# example 1
ggml_metal_graph_compute_block_invoke: op - MUL_MAT
ggml_metal_graph_compute_block_invoke: src0 -  f32 [   40,    77,     8], 1,  (reshaped) (permuted) (cont) (reshaped)
ggml_metal_graph_compute_block_invoke: src1 -  f32 [   40,  4096,     8], 1,  (view) (reshaped) (permuted) (cont) (reshaped)
ggml_metal_graph_compute_block_invoke: dst  -  f32 [   77,  4096,     8], 1, node_2274

# example 2
ggml_metal_graph_compute_block_invoke: op - MUL_MAT
ggml_metal_graph_compute_block_invoke: src0 -  f32 [   77,    40,     8], 1,  (reshaped) (permuted) (cont) (reshaped)
ggml_metal_graph_compute_block_invoke: src1 -  f32 [   77,  4096,     8], 1,  (view)
ggml_metal_graph_compute_block_invoke: dst  -  f32 [   40,  4096,     8], 1, node_2276

# example 3
ggml_metal_graph_compute_block_invoke: op - MUL_MAT
ggml_metal_graph_compute_block_invoke: src0 -  f32 [   40,  4096,     8], 1,  (reshaped) (permuted) (cont) (reshaped)
ggml_metal_graph_compute_block_invoke: src1 -  f32 [   40,  4096,     8], 1,  (view) (reshaped) (permuted) (cont) (reshaped)
ggml_metal_graph_compute_block_invoke: dst  -  f32 [ 4096,  4096,     8], 1, node_2253
  • F16 x F16. These again will fall back to the mat-vec kernel, because we currently support only * x F32 mat-mat multiplications (the dispatch rule is sketched after this list)
# example 1
ggml_metal_graph_compute_block_invoke: op - MUL_MAT
ggml_metal_graph_compute_block_invoke: src0 -  f16 [ 2880,  4096,     1], 1,  (reshaped)
ggml_metal_graph_compute_block_invoke: src1 -  f16 [ 2880,   320,     1], 1, leaf_36 (reshaped)
ggml_metal_graph_compute_block_invoke: dst  -  f32 [ 4096,   320,     1], 1, node_2206

# example 2
ggml_metal_graph_compute_block_invoke: op - MUL_MAT
ggml_metal_graph_compute_block_invoke: src0 -  f16 [  960,  4096,     1], 1,  (reshaped)
ggml_metal_graph_compute_block_invoke: src1 -  f16 [  960,   320,     1], 1, leaf_629 (reshaped)
ggml_metal_graph_compute_block_invoke: dst  -  f32 [ 4096,   320,     1], 1, node_2213

# example 3
ggml_metal_graph_compute_block_invoke: op - MUL_MAT
ggml_metal_graph_compute_block_invoke: src0 -  f16 [  320,  4096,     1], 1,  (reshaped)
ggml_metal_graph_compute_block_invoke: src1 -  f16 [  320,   320,     1], 1, leaf_35 (reshaped)
ggml_metal_graph_compute_block_invoke: dst  -  f32 [ 4096,   320,     1], 1, node_2226

All of these are currently sub-optimal and probably can be improved significantly with better kernels.
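
The fallback condition both bullets describe can be summarized as a sketch (not the actual ggml-metal.m dispatch code):

#include <stdbool.h>
#include <stdint.h>

enum src_type { TYPE_F16, TYPE_F32 }; // stand-in for ggml's tensor types

// the fast mat-mat kernel is taken only when the shared dimension ne00 is
// a multiple of 32 and src1 is F32; everything else falls back to the
// slower mat-vec kernel
static bool use_mat_mat_kernel(int64_t ne00, enum src_type src1_type) {
    return ne00 % 32 == 0 && src1_type == TYPE_F32;
}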

@paulocoutinhox
Author

Fixed. Thanks.
