A way to automatically calculate `mem_size` for creating ggml context #121
In llama.cpp there is an option to experimentally observe the memory usage.
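One way to read the "observe it experimentally" approach, sketched below: evaluate the graph once with a deliberately oversized pool, then read back the actual usage with `ggml_used_mem`. The pool size, the `build_graph` helper, and the thread count are placeholders I've assumed for illustration, and `ggml_graph_compute_with_ctx` is taken from newer ggml headers; this is a sketch of the measuring idea, not llama.cpp's actual code.

```c
#include <stdio.h>
#include "ggml.h"

// Hypothetical helper that constructs the model's compute graph in ctx.
extern struct ggml_cgraph * build_graph(struct ggml_context * ctx);

// Sketch: run the graph once with a deliberately oversized pool, then
// log how much of the pool was actually used.
void measure_usage(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ (size_t) 1024*1024*1024, // 1 GiB, intentionally too big
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_cgraph * gf = build_graph(ctx);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads =*/ 4);

    // ggml_used_mem reports how far the context's allocator actually got
    fprintf(stderr, "used mem: %zu bytes\n", ggml_used_mem(ctx));
    ggml_free(ctx);
}
```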
Looks related: ggerganov/llama.cpp#1152

@Green-Sky This may be useful, but after observing memory usage, this value needs to be hardcoded; and this needs to be done for each model size. Furthermore, the specific value may not be portable across machines/platforms.
It should be possible to utilize the code at line 361 in 9d7974c. For example, do a first pass with ...
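The comment above is cut off, but one plausible reading (an assumption on my part) is a first pass with `no_alloc = true` in `ggml_init_params`, so tensor metadata is recorded without allocating any data. A sketch under that assumption, using `ggml_tensor_overhead`, `ggml_nbytes`, and the `ggml_get_first_tensor`/`ggml_get_next_tensor` iterators from newer ggml headers:

```c
#include "ggml.h"

// Pass 1 (sketch): build the same graph in a metadata-only context and
// sum what each tensor would need. Pass 2 would call ggml_init again
// with the measured size plus a safety margin, since alignment padding
// and operator scratch space are NOT accounted for here.
size_t measure_graph_size(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024, // holds metadata only
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,         // do not allocate tensor data
    };
    struct ggml_context * ctx = ggml_init(params);

    // ... build the exact same graph as the real pass would ...
    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1024);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1024);
    ggml_add(ctx, a, b);

    // every tensor created above (including intermediates) lives in ctx
    size_t needed = 0;
    for (struct ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL;
         t = ggml_get_next_tensor(ctx, t)) {
        needed += ggml_tensor_overhead() + ggml_nbytes(t);
    }

    ggml_free(ctx);
    return needed;
}
```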
It's not a complete solution, but #123 could help a lot here. If failures in operations like tensor creation were recoverable, you could just free the current context and try again with a somewhat larger size when you run out of memory. Building the graph is pretty cheap, so it shouldn't be a huge issue if it's necessary to retry a couple of times. Some estimation/waste of memory is probably still necessary, but the status quo right now is "get it right the first time or die horribly".
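A sketch of that retry loop, under the assumption that #123-style recoverable failures exist; `try_build_graph` is a hypothetical helper that would return false instead of aborting when the context runs out of memory, which is not how ggml behaves today:

```c
#include <stdbool.h>
#include "ggml.h"

// Hypothetical: returns false instead of aborting when ctx runs out of
// memory. Relies on the recoverable-failure behavior discussed in #123.
extern bool try_build_graph(struct ggml_context * ctx);

struct ggml_context * init_with_retry(void) {
    size_t mem_size = 64*1024*1024; // initial guess
    for (;;) {
        struct ggml_init_params params = {
            /*.mem_size   =*/ mem_size,
            /*.mem_buffer =*/ NULL,
            /*.no_alloc   =*/ false,
        };
        struct ggml_context * ctx = ggml_init(params);
        if (ctx != NULL && try_build_graph(ctx)) {
            return ctx; // the graph fit in mem_size
        }
        if (ctx != NULL) {
            ggml_free(ctx);
        }
        mem_size += mem_size / 2; // grow ~1.5x and retry
    }
}
```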
If I'm understanding correctly what you're saying: I assumed it was related to preventing GGML from allocating its context memory itself, and allowing other code to pass in a buffer for it to use.
As far as I understand, before creating a ggml context, we need to know exactly how much memory is needed, and pass this value into `ggml_init` through `ggml_init_params.mem_size`.

On the one hand, it's great that this serves as a guarantee that ggml will not allocate any more memory: if we specified, say, 100 MB, we can be sure that no more than 100 MB would be allocated.
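For readers unfamiliar with the API, a minimal example of that contract (the 100 MB figure is just the illustration from above):

```c
#include "ggml.h"

int main(void) {
    // The caller must decide mem_size up front; everything ggml
    // allocates for this context comes out of this single pool.
    struct ggml_init_params params = {
        /*.mem_size   =*/ 100*1024*1024, // 100 MB, fixed at creation
        /*.mem_buffer =*/ NULL,          // let ggml allocate the pool itself
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // Tensors, intermediate results, and temporary compute space all
    // have to fit in that pool; exceeding it is a fatal error.
    struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 8);
    (void) x;

    ggml_free(ctx);
    return 0;
}
```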
On the other hand, computation of `mem_size` may not be trivial for large models that use a wide range of operations. These operations create intermediate tensors; and operations like matmul require some extra temporary space (for example, to quantize/dequantize arguments). This extra space depends on thread count, CPU features supported, libraries used, etc.

In `rwkv.cpp`, I used a hacky workaround which combines some heuristics with good old "just add more MBs until it starts working".

I would like to avoid:
- accounting by hand for all of `ggml`'s temporary space allocations
- hardcoding a `mem_size` value for every combination of model size and thread count

Ideally, there could be some function like `ggml_trace`, which would take a graph, traverse it, and tell precisely how much memory would be needed to evaluate it. The problem here is that we already need a context to create the graph.

Looks like in llama.cpp these values are hardcoded.
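To make the proposal concrete, here is one hypothetical shape such an API could take; nothing like `ggml_trace` exists in ggml, and the signature below is purely my illustration of how the chicken-and-egg problem might be sidestepped:

```c
// Hypothetical API sketch, not part of ggml. One way around the
// "need a context to build the graph" problem: accept a user callback
// that builds the graph inside a small metadata-only context supplied
// by ggml_trace itself, and return the memory an equivalent "real"
// context would need to evaluate it with n_threads threads.
typedef struct ggml_cgraph * (*ggml_graph_builder)(struct ggml_context * ctx,
                                                   void * user_data);

size_t ggml_trace(ggml_graph_builder build, void * user_data, int n_threads);
```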