
Problems with Metal and iOS #647

Closed

paulocoutinhox opened this issue Dec 13, 2023 · 16 comments

Comments

@paulocoutinhox

Hi,

We have a problem with Metal in iOS:

-[MTLDebugDevice newBufferWithBytesNoCopy:length:options:deallocator:]:700: failed assertion `Buffer Validation
newBufferWith*:length 0x6692c000 must not exceed 1024 MB.

Reference:
leejet/stable-diffusion.cpp#108

Can anyone help us with this?

Thanks.

@paulocoutinhox
Author

Hi @ggerganov, I know you must be busy with a lot of things. Is there any way to take a look at this problem? Thank you very much.

@kikivg

kikivg commented Jan 9, 2024

I have the same issue, and based on the discussion referenced above, it seems this is possible by using multiple buffers.
@ggerganov Is it possible to add that into the ggml_backend_metal_buffer_type_alloc_buffer?

@ggerganov
Owner

Somehow I've completely missed this issue - apologies

Does it work if you patch ggml-metal.m with something like this, in order to hardcode a 1 GiB limit:

diff --git a/src/ggml-metal.m b/src/ggml-metal.m
index 6c2a8d0..87ee5bd 100644
--- a/src/ggml-metal.m
+++ b/src/ggml-metal.m
@@ -741,8 +741,10 @@ bool ggml_metal_add_buffer(
             size_aligned += (size_page - (size_aligned % size_page));
         }
 
+        const size_t maxBufferLength = 1024 * 1024 * 1024; // 1 GiB
+
         // the buffer fits into the max buffer size allowed by the device
-        if (size_aligned <= ctx->device.maxBufferLength) {
+        if (size_aligned <= maxBufferLength) {
             ctx->buffers[ctx->n_buffers].name = name;
             ctx->buffers[ctx->n_buffers].data = data;
             ctx->buffers[ctx->n_buffers].size = size;
@@ -761,8 +763,8 @@ bool ggml_metal_add_buffer(
             // this overlap between the views will guarantee that the tensor with the maximum size will fully fit into
             // one of the views
             const size_t size_ovlp = ((max_size + size_page - 1) / size_page + 1) * size_page; // round-up 2 pages just in case
-            const size_t size_step = ctx->device.maxBufferLength - size_ovlp;
-            const size_t size_view = ctx->device.maxBufferLength;
+            const size_t size_step = maxBufferLength - size_ovlp;
+            const size_t size_view = maxBufferLength;
 
             for (size_t i = 0; i < size; i += size_step) {
                 const size_t size_step_aligned = (i + size_view <= size) ? size_view : (size_aligned - i);
@@ -2648,8 +2650,10 @@ ggml_backend_buffer_t ggml_backend_metal_buffer_from_ptr(void * data, size_t siz
 
     id<MTLDevice> device = ggml_backend_metal_get_device();
 
+    const size_t maxBufferLength = 1024 * 1024 * 1024; // 1 GiB
+
     // the buffer fits into the max buffer size allowed by the device
-    if (size_aligned <= device.maxBufferLength) {
+    if (size_aligned <= maxBufferLength) {
         ctx->buffers[ctx->n_buffers].data = data;
         ctx->buffers[ctx->n_buffers].size = size;
 
@@ -2667,8 +2671,8 @@ ggml_backend_buffer_t ggml_backend_metal_buffer_from_ptr(void * data, size_t siz
         // this overlap between the views will guarantee that the tensor with the maximum size will fully fit into
         // one of the views
         const size_t size_ovlp = ((max_size + size_page - 1) / size_page + 1) * size_page; // round-up 2 pages just in case
-        const size_t size_step = device.maxBufferLength - size_ovlp;
-        const size_t size_view = device.maxBufferLength;
+        const size_t size_step = maxBufferLength - size_ovlp;
+        const size_t size_view = maxBufferLength;
 
         for (size_t i = 0; i < size; i += size_step) {
             const size_t size_step_aligned = (i + size_view <= size) ? size_view : (size_aligned - i);

@slaren
Collaborator

slaren commented Jan 9, 2024

ggml-backend only attempts to split the buffer in ggml_backend_metal_buffer_from_ptr, which generally is meant to be used with mmap. I don't think we can do this for general buffers because the max tensor size is not known at the time the buffer is allocated. My suggestion is to add a function to ggml_backend_buffer_type to obtain the maximum buffer size, and modify ggml_backend_alloc_ctx_tensors to allocate multiple buffers if necessary.
leejet/stable-diffusion.cpp#108 (comment)

@paulocoutinhox
Author

This is more serious now, because it is affecting other libs, like whisper, since we are now using the same ggml repository:

whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head  = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1024
whisper_model_load: n_text_head   = 16
whisper_model_load: n_text_layer  = 24
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 4 (medium)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
Metal GPU Frame Capture Enabled (/System/Library/PrivateFrameworks/GPUToolsCapture.framework/GPUToolsCapture)
Metal API Validation Enabled
ggml_metal_init: picking default device: Apple A15 GPU
ggml_metal_init: loading '/private/var/containers/Bundle/Application/2CD7BEB0-98E5-44F1-908C-6F1D0D5FBFA9/AiKit.app/default.metallib'
-[MTLDebugDevice newBufferWithBytesNoCopy:length:options:deallocator:]:700: failed assertion `Buffer Validation
newBufferWith*:length 0x5b67c000 must not exceed 1024 MB.
'

@paulocoutinhox
Author

paulocoutinhox commented Jan 10, 2024

I applied the patch, but it doesn't solve it; the problem is still the same.

diff --git a/vendor/ggml/src/ggml-metal.m b/vendor/ggml/src/ggml-metal.m
index a17ef46..292dd44 100644
--- a/vendor/ggml/src/ggml-metal.m
+++ b/vendor/ggml/src/ggml-metal.m
@@ -742,7 +742,9 @@ bool ggml_metal_add_buffer(
         }
 
         // the buffer fits into the max buffer size allowed by the device
-        if (size_aligned <= ctx->device.maxBufferLength) {
+        const size_t maxBufferLength = 1024 * 1024 * 1024; // 1 GiB
+
+        if (size_aligned <= maxBufferLength) {
             ctx->buffers[ctx->n_buffers].name = name;
             ctx->buffers[ctx->n_buffers].data = data;
             ctx->buffers[ctx->n_buffers].size = size;
@@ -761,8 +763,8 @@ bool ggml_metal_add_buffer(
             // this overlap between the views will guarantee that the tensor with the maximum size will fully fit into
             // one of the views
             const size_t size_ovlp = ((max_size + size_page - 1) / size_page + 1) * size_page; // round-up 2 pages just in case
-            const size_t size_step = ctx->device.maxBufferLength - size_ovlp;
-            const size_t size_view = ctx->device.maxBufferLength;
+            const size_t size_step = maxBufferLength - size_ovlp;
+            const size_t size_view = maxBufferLength;
 
             for (size_t i = 0; i < size; i += size_step) {
                 const size_t size_step_aligned = (i + size_view <= size) ? size_view : (size_aligned - i);
@@ -2653,7 +2655,9 @@ ggml_backend_buffer_t ggml_backend_metal_buffer_from_ptr(void * data, size_t siz
     id<MTLDevice> device = ggml_backend_metal_get_device();
 
     // the buffer fits into the max buffer size allowed by the device
-    if (size_aligned <= device.maxBufferLength) {
+    const size_t maxBufferLength = 1024 * 1024 * 1024; // 1 GiB
+
+    if (size_aligned <= maxBufferLength) {
         ctx->buffers[ctx->n_buffers].data = data;
         ctx->buffers[ctx->n_buffers].size = size;
 
@@ -2671,8 +2675,8 @@ ggml_backend_buffer_t ggml_backend_metal_buffer_from_ptr(void * data, size_t siz
         // this overlap between the views will guarantee that the tensor with the maximum size will fully fit into
         // one of the views
         const size_t size_ovlp = ((max_size + size_page - 1) / size_page + 1) * size_page; // round-up 2 pages just in case
-        const size_t size_step = device.maxBufferLength - size_ovlp;
-        const size_t size_view = device.maxBufferLength;
+        const size_t size_step = maxBufferLength - size_ovlp;
+        const size_t size_view = maxBufferLength;
 
         for (size_t i = 0; i < size; i += size_step) {
             const size_t size_step_aligned = (i + size_view <= size) ? size_view : (size_aligned - i);

@kikivg

kikivg commented Jan 13, 2024

I'm also seeing this in whisper.cpp when using it on iOS. @ggerganov is it possible to fix this and to allocate multiple buffers like @slaren suggested?

@ggerganov
Owner

I'll take a look now

@ggerganov
Owner

Here is a fix for whisper.cpp: ggerganov/whisper.cpp#1763

The general purpose solution for automatically splitting buffers in ggml-backend will be implemented in the future. For now I think this is a relatively simple approach that can be applied in other projects such as sd.cpp.

Let me know if you give it a try and confirm that it works. Will merge it later today

@kikivg

kikivg commented Jan 13, 2024

I tested the fix for whisper.cpp and it works! Thanks.

@paulocoutinhox
Author

I opened an issue there to try to get some help with the fix:
leejet/stable-diffusion.cpp#149

@paulocoutinhox
Author

Hi @ggerganov,

Can you help us with a general solution for this problem?

Thanks.

@slaren
Collaborator

slaren commented Jan 29, 2024

A fix for this has been merged in llama.cpp (ggerganov/llama.cpp#5181). Applications that use ggml_backend_alloc_ctx_tensors will get the fix automatically without any changes, and applications that manage the allocations themselves can use ggml_backend_buft_get_max_size to obtain the maximum buffer size and split the model tensors into multiple buffers accordingly.

@paulocoutinhox
Author

paulocoutinhox commented Feb 24, 2024

Was it added to ggml and whisper.cpp too?

My project uses ggml, whisper.cpp and llama.cpp together.

@slaren
Collaborator

slaren commented Feb 24, 2024

Yes, ggml_backend_alloc_ctx_tensors is used both in llama.cpp and whisper.cpp.

This issue is solved in ggml, and all that we can do at this point is wait for 3rd party applications to update their model loading code to take advantage of this, so I am closing this issue.

slaren closed this as completed Feb 24, 2024
@paulocoutinhox
Author

Please @leejet, add support for this in the latest ggml version.
