
65B model eventually fails with "ggml_new_tensor_impl: not enough space in the scratch memory" #1152

Closed · logicchains opened this issue on Apr 24, 2023 · 2 comments

logicchains (Contributor):

I'm running the 65B model on a machine with 256 GB of (CPU) RAM, with the context size set to 2048. The same thing happens with both llama65b and alpaca65b, every single time I run it in interactive mode: it works fine for a while, but eventually fails with:

ggml_new_tensor_impl: not enough space in the scratch memory
Segmentation fault (core dumped)

Maybe it's using up more and more RAM over time, until it runs out?

The exact params:
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 4 (mostly Q4_1, some F16)
llama_model_load_internal: n_ff = 22016
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size = 146.86 KB
llama_model_load_internal: mem required = 41477.67 MB (+ 5120.00 MB per state)
llama_init_from_file: kv self size = 5120.00 MB

system_info: n_threads = 16 / 64 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
sampling: temp = 1.000000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.000000
generate: n_ctx = 2048, n_batch = 8, n_predict = -1, n_keep = 0

ggerganov added the bug label on Apr 24, 2023
ggerganov (Owner) commented on Apr 24, 2023:

When the context swap occurs and it has to re-evaluate the second half of the context (i.e. n_ctx/2 = 1024 tokens), one of the "scratch" buffers runs out of memory.
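For a sense of why this is the worst case, here is a minimal, self-contained sketch of the swap arithmetic (simplified, with hypothetical names; not the actual llama.cpp code): once the window fills up, roughly the last n_ctx/2 tokens are re-submitted as a single evaluation batch, and that whole batch hits the scratch buffers at once.

#include <cstdio>
#include <vector>

// Simplified illustration of the context swap in the main example (names hypothetical).
// When n_past reaches n_ctx, keep n_keep tokens and re-feed about half of the rest,
// so up to ~n_ctx/2 tokens are evaluated in one pass -- the peak scratch usage.
static std::vector<int> tokens_to_reevaluate(const std::vector<int> & last_tokens,
                                             int n_ctx, int n_keep) {
    const int n_left   = n_ctx - n_keep;
    const int n_refeed = n_left / 2;   // 1024 for n_ctx = 2048, n_keep = 0
    return std::vector<int>(last_tokens.end() - n_refeed, last_tokens.end());
}

int main() {
    std::vector<int> last_tokens(2048, 1);  // pretend the context window is full
    const auto batch = tokens_to_reevaluate(last_tokens, /*n_ctx =*/ 2048, /*n_keep =*/ 0);
    printf("re-evaluating %zu tokens after the swap\n", batch.size());  // prints 1024
}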

The solution is:

  • Apply this patch
diff --git a/llama.cpp b/llama.cpp
index 8c1d657..e860ea1 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -54,7 +54,7 @@ static const std::map<e_model, size_t> & MEM_REQ_SCRATCH0()
         { MODEL_7B,    512ull * MB },
         { MODEL_13B,   512ull * MB },
         { MODEL_30B,   512ull * MB },
-        { MODEL_65B,   512ull * MB },
+        { MODEL_65B,  2048ull * MB },
     };
     return _MEM_REQ_SCRATCH0;
 }
@@ -65,7 +65,7 @@ static const std::map<e_model, size_t> & MEM_REQ_SCRATCH1()
         { MODEL_7B,    512ull * MB },
         { MODEL_13B,   512ull * MB },
         { MODEL_30B,   512ull * MB },
-        { MODEL_65B,   512ull * MB },
+        { MODEL_65B,  2048ull * MB },
     };
     return _MEM_REQ_SCRATCH1;
 }
@@ -1290,7 +1290,7 @@ static bool llama_eval_internal(
         mem_per_token = ggml_used_mem(ctx0)/N;
     }
 
-#if 0
+#if 1
     printf("\n%s: used_mem = %.3f MB, scratch -- %.3f MB %.3f MB\n", __func__,
             ggml_used_mem(ctx0)/1024.0/1024.0,
             lctx.get_buf_max_mem(0)/1024.0/1024.0,
  • Run the main example with a prompt of 1024 tokens (i.e. this should correspond to the worst case scenario)
  • Write down the output. For example:
    llama_eval_internal: used_mem = 508.284 MB, scratch -- 136.000 MB 134.000 MB
    
  • The last two numbers are the needed scratch buffer sizes. Revert the patch from the first step and update the respective numbers for the 65B model, putting a bit of extra size on top of the reported number just in case (a hypothetical worked example follows after this list).
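As a hypothetical worked example of that last step (the MB values below are made up, not measured): if the instrumented run had printed scratch -- 1400.000 MB 1200.000 MB, the reverted tables might be updated along these lines, keeping the same structure as in the patch above:

// Hypothetical numbers only -- use the values reported by your own run.
static const std::map<e_model, size_t> & MEM_REQ_SCRATCH0()
{
    static std::map<e_model, size_t> _MEM_REQ_SCRATCH0 = {
        { MODEL_7B,    512ull * MB },
        { MODEL_13B,   512ull * MB },
        { MODEL_30B,   512ull * MB },
        { MODEL_65B,  1536ull * MB },  // ~1400 MB reported (hypothetical) + headroom
    };
    return _MEM_REQ_SCRATCH0;
}
// ...and likewise MEM_REQ_SCRATCH1, e.g. 1344ull * MB for a ~1200 MB reading.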

It's a very sloppy process for determining the necessary scratch buffer size. Will try to improve this in the future.
While doing this, you can also do the same process for the other models and adjust the numbers down, since we are now probably over-allocating some memory.

P.S. I just bumped the buffers to 1 GB for the 65B model to avoid this crash, but the correct solution from above has to be applied and the numbers re-adjusted.

ggerganov added a commit that referenced this issue Apr 24, 2023
logicchains (Contributor, Author) commented:

Thanks! Is there some way I can generate a prompt of exactly 1024 tokens? E.g. maybe some character sequence that I could repeat 1024 times?
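One hedged way to check a candidate prompt's length (a sketch only, assuming the llama.h API of this period -- llama_context_default_params, llama_init_from_file, llama_tokenize): load the model, tokenize the prompt file, and grow or trim the text until the reported count reaches 1024.

// Sketch: count the tokens in a candidate prompt file (assumed API of this era).
#include "llama.h"

#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

int main(int argc, char ** argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s <model.bin> <prompt.txt>\n", argv[0]);
        return 1;
    }

    // Read the candidate prompt from a file.
    std::ifstream file(argv[2]);
    std::stringstream ss;
    ss << file.rdbuf();
    const std::string prompt = ss.str();

    // Load the model with a 2048-token context, matching the report above.
    llama_context_params params = llama_context_default_params();
    params.n_ctx = 2048;
    llama_context * ctx = llama_init_from_file(argv[1], params);

    // Tokenize and report the count.
    std::vector<llama_token> tokens(params.n_ctx);
    const int n = llama_tokenize(ctx, prompt.c_str(), tokens.data(), (int) tokens.size(), /*add_bos =*/ true);
    printf("prompt is %d tokens\n", n);  // adjust the prompt text until this is ~1024

    llama_free(ctx);
    return 0;
}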
