Estimate memory requirements for graph #260
Comments
This is mostly possible if you don't mind reading the implementation of every function to figure out exactly what it does.
Yes, it is currently annoying that you have to pre-compute the necessary size. I'm thinking about ways to solve this.
The latest version of GGML trashed this so severely (WHY do
Was just wondering if there was any update on this - I can also start looking into this myself.
There is an implementation in llama.cpp that does this, among other things. It is not entirely automated as you are suggesting here: you have to avoid writing to the tensors while creating a dummy graph for measuring the memory requirements.
Well, rwkv.cpp has a new implementation, if you're interested, that uses "future tensors": basically predicting the number of objects and the amount of memory that each tensor operation will use. The prediction functions get quite a bit nicer. Other than that, I have nothing.
OK, perhaps I can try to backport ggerganov/llama.cpp#2411 to here?
Created #433. I want to try to implement this in ggml-gobject as well, just to test that it works correctly (no reason why it shouldn't, since the allocator parts are relatively standalone).
@ggerganov occasionally syncs the ggml code in ggml/whisper.cpp/llama.cpp; I suppose you just have to poke him and he will do it... sometime when he has time :)
This is in a similar vein to #214, but a bit more general.
It would be useful to be able to estimate the total context memory requirement given some computation graph or a list of tensor descriptions. This would make the implementation of newer models that much easier, since the implementer doesn't need to estimate all the memory usage manually.
For computation graphs, this wouldn't add any overhead as long as the computation graph size stays constant between invocations. In that case the context's memory buffer can be re-used (I've successfully done this for GPT2 in https://github.com/smspillaz/ggml-gobject).
I think that in order to implement this, you could have a flag on `ggml_context` such that when new tensors are created in that context, they don't actually allocate any memory for the data (the object overhead can either go into its own memory pool or onto the stack/heap). Writing to the tensors would be a no-op, as would `ggml_graph_compute`. Once the computation graph has been created, the library consumer can query the context's estimated memory usage, which could be done by walking all the objects in the `ggml_object` list and tallying up their sizes.

I haven't looked very closely at the details - maybe data allocations are needed in order to build the graph somehow, which would make this infeasible. But if not, I could try doing this myself and submitting a pull request, if it belongs in the library.