
Problems with Metal and iOS #647

Closed

paulocoutinhox opened this issue Dec 13, 2023 · 16 comments

Comments

@paulocoutinhox

Hi,

We have a problem with Metal in iOS:

-[MTLDebugDevice newBufferWithBytesNoCopy:length:options:deallocator:]:700: failed assertion `Buffer Validation
newBufferWith*:length 0x6692c000 must not exceed 1024 MB.

Reference:
leejet/stable-diffusion.cpp#108

Can anyone help us with this?

Thanks.

@paulocoutinhox
Author

Hi @ggerganov, I know you must be busy with a lot of things. Is there any way to take a look at this problem? Thank you very much.

@kikivg

kikivg commented Jan 9, 2024

I have the same issue, and based on the discussion referenced above, it seems this is possible by using multiple buffers.
@ggerganov Is it possible to add that into the ggml_backend_metal_buffer_type_alloc_buffer?

@ggerganov
Owner

Somehow I've completely missed this issue - apologies

Does it work if you patch ggml-metal.m with something like this, in order to hardcode a 1 GiB limit:

diff --git a/src/ggml-metal.m b/src/ggml-metal.m
index 6c2a8d0..87ee5bd 100644
--- a/src/ggml-metal.m
+++ b/src/ggml-metal.m
@@ -741,8 +741,10 @@ bool ggml_metal_add_buffer(
             size_aligned += (size_page - (size_aligned % size_page));
         }
 
+        const size_t maxBufferLength = 1024 * 1024 * 1024; // 1 GiB
+
         // the buffer fits into the max buffer size allowed by the device
-        if (size_aligned <= ctx->device.maxBufferLength) {
+        if (size_aligned <= maxBufferLength) {
             ctx->buffers[ctx->n_buffers].name = name;
             ctx->buffers[ctx->n_buffers].data = data;
             ctx->buffers[ctx->n_buffers].size = size;
@@ -761,8 +763,8 @@ bool ggml_metal_add_buffer(
             // this overlap between the views will guarantee that the tensor with the maximum size will fully fit into
             // one of the views
             const size_t size_ovlp = ((max_size + size_page - 1) / size_page + 1) * size_page; // round-up 2 pages just in case
-            const size_t size_step = ctx->device.maxBufferLength - size_ovlp;
-            const size_t size_view = ctx->device.maxBufferLength;
+            const size_t size_step = maxBufferLength - size_ovlp;
+            const size_t size_view = maxBufferLength;
 
             for (size_t i = 0; i < size; i += size_step) {
                 const size_t size_step_aligned = (i + size_view <= size) ? size_view : (size_aligned - i);
@@ -2648,8 +2650,10 @@ ggml_backend_buffer_t ggml_backend_metal_buffer_from_ptr(void * data, size_t siz
 
     id<MTLDevice> device = ggml_backend_metal_get_device();
 
+    const size_t maxBufferLength = 1024 * 1024 * 1024; // 1 GiB
+
     // the buffer fits into the max buffer size allowed by the device
-    if (size_aligned <= device.maxBufferLength) {
+    if (size_aligned <= maxBufferLength) {
         ctx->buffers[ctx->n_buffers].data = data;
         ctx->buffers[ctx->n_buffers].size = size;
 
@@ -2667,8 +2671,8 @@ ggml_backend_buffer_t ggml_backend_metal_buffer_from_ptr(void * data, size_t siz
         // this overlap between the views will guarantee that the tensor with the maximum size will fully fit into
         // one of the views
         const size_t size_ovlp = ((max_size + size_page - 1) / size_page + 1) * size_page; // round-up 2 pages just in case
-        const size_t size_step = device.maxBufferLength - size_ovlp;
-        const size_t size_view = device.maxBufferLength;
+        const size_t size_step = maxBufferLength - size_ovlp;
+        const size_t size_view = maxBufferLength;
 
         for (size_t i = 0; i < size; i += size_step) {
             const size_t size_step_aligned = (i + size_view <= size) ? size_view : (size_aligned - i);

@slaren
Collaborator

slaren commented Jan 9, 2024

ggml-backend only attempts to split the buffer in ggml_backend_metal_buffer_from_ptr, which generally is meant to be used with mmap. I don't think we can do this for general buffers because the max tensor size is not known at the time the buffer is allocated. My suggestion is to add a function to ggml_backend_buffer_type to obtain the maximum buffer size, and modify ggml_backend_alloc_ctx_tensors to allocate multiple buffers if necessary.
leejet/stable-diffusion.cpp#108 (comment)

@paulocoutinhox
Author

This is more serious now, because it is affecting other libs, like whisper, since we are now using the same ggml repository:

whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head  = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1024
whisper_model_load: n_text_head   = 16
whisper_model_load: n_text_layer  = 24
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 4 (medium)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
Metal GPU Frame Capture Enabled (/System/Library/PrivateFrameworks/GPUToolsCapture.framework/GPUToolsCapture)
Metal API Validation Enabled
ggml_metal_init: picking default device: Apple A15 GPU
ggml_metal_init: loading '/private/var/containers/Bundle/Application/2CD7BEB0-98E5-44F1-908C-6F1D0D5FBFA9/AiKit.app/default.metallib'
-[MTLDebugDevice newBufferWithBytesNoCopy:length:options:deallocator:]:700: failed assertion `Buffer Validation
newBufferWith*:length 0x5b67c000 must not exceed 1024 MB.
'

@paulocoutinhox
Author

paulocoutinhox commented Jan 10, 2024

I applied the patch, but it doesn't solve it; the problem is still the same.

diff --git a/vendor/ggml/src/ggml-metal.m b/vendor/ggml/src/ggml-metal.m
index a17ef46..292dd44 100644
--- a/vendor/ggml/src/ggml-metal.m
+++ b/vendor/ggml/src/ggml-metal.m
@@ -742,7 +742,9 @@ bool ggml_metal_add_buffer(
         }
 
         // the buffer fits into the max buffer size allowed by the device
-        if (size_aligned <= ctx->device.maxBufferLength) {
+        const size_t maxBufferLength = 1024 * 1024 * 1024; // 1 GiB
+
+        if (size_aligned <= maxBufferLength) {
             ctx->buffers[ctx->n_buffers].name = name;
             ctx->buffers[ctx->n_buffers].data = data;
             ctx->buffers[ctx->n_buffers].size = size;
@@ -761,8 +763,8 @@ bool ggml_metal_add_buffer(
             // this overlap between the views will guarantee that the tensor with the maximum size will fully fit into
             // one of the views
             const size_t size_ovlp = ((max_size + size_page - 1) / size_page + 1) * size_page; // round-up 2 pages just in case
-            const size_t size_step = ctx->device.maxBufferLength - size_ovlp;
-            const size_t size_view = ctx->device.maxBufferLength;
+            const size_t size_step = maxBufferLength - size_ovlp;
+            const size_t size_view = maxBufferLength;
 
             for (size_t i = 0; i < size; i += size_step) {
                 const size_t size_step_aligned = (i + size_view <= size) ? size_view : (size_aligned - i);
@@ -2653,7 +2655,9 @@ ggml_backend_buffer_t ggml_backend_metal_buffer_from_ptr(void * data, size_t siz
     id<MTLDevice> device = ggml_backend_metal_get_device();
 
     // the buffer fits into the max buffer size allowed by the device
-    if (size_aligned <= device.maxBufferLength) {
+    const size_t maxBufferLength = 1024 * 1024 * 1024; // 1 GiB
+
+    if (size_aligned <= maxBufferLength) {
         ctx->buffers[ctx->n_buffers].data = data;
         ctx->buffers[ctx->n_buffers].size = size;
 
@@ -2671,8 +2675,8 @@ ggml_backend_buffer_t ggml_backend_metal_buffer_from_ptr(void * data, size_t siz
         // this overlap between the views will guarantee that the tensor with the maximum size will fully fit into
         // one of the views
         const size_t size_ovlp = ((max_size + size_page - 1) / size_page + 1) * size_page; // round-up 2 pages just in case
-        const size_t size_step = device.maxBufferLength - size_ovlp;
-        const size_t size_view = device.maxBufferLength;
+        const size_t size_step = maxBufferLength - size_ovlp;
+        const size_t size_view = maxBufferLength;
 
         for (size_t i = 0; i < size; i += size_step) {
             const size_t size_step_aligned = (i + size_view <= size) ? size_view : (size_aligned - i);

@kikivg

kikivg commented Jan 13, 2024

I'm also seeing this in whisper.cpp when using it on iOS. @ggerganov is it possible to fix this and to allocate multiple buffers like @slaren suggested?

@ggerganov
Owner

I'll take a look now

@ggerganov
Owner

Here is a fix for whisper.cpp: ggerganov/whisper.cpp#1763

The general purpose solution for automatically splitting buffers in ggml-backend will be implemented in the future. For now I think this is a relatively simple approach that can be applied in other projects such as sd.cpp.

Let me know if you give it a try and confirm that it works. Will merge it later today

@kikivg

kikivg commented Jan 13, 2024

I tested the fix for whisper.cpp and it works! Thanks.

@paulocoutinhox
Author

I opened an issue there to try to get some help with the fix:
leejet/stable-diffusion.cpp#149

@paulocoutinhox
Author

Hi @ggerganov,

Can you help us with a general solution for this problem?

Thanks.

@slaren
Collaborator

slaren commented Jan 29, 2024

A fix for this has been merged in llama.cpp (ggerganov/llama.cpp#5181). Applications that use ggml_backend_alloc_ctx_tensors will get the fix automatically without any changes, and applications that manage the allocations themselves can use ggml_backend_buft_get_max_size to obtain the maximum buffer size and split the model tensors into multiple buffers accordingly.

@paulocoutinhox
Author

paulocoutinhox commented Feb 24, 2024

Was it added to ggml and whisper.cpp too?

My project uses ggml, whisper.cpp and llama.cpp together.

@slaren
Collaborator

slaren commented Feb 24, 2024

Yes, ggml_backend_alloc_ctx_tensors is used both in llama.cpp and whisper.cpp.

This issue is solved in ggml, and all that we can do at this point is wait for 3rd party applications to update their model loading code to take advantage of this, so I am closing this issue.

slaren closed this as completed Feb 24, 2024
@paulocoutinhox
Author

Please @leejet, add support for this in the latest ggml version.
