
Estimate memory requirements for graph #260

Open
smspillaz opened this issue Jun 15, 2023 · 9 comments

Comments

@smspillaz
Contributor

smspillaz commented Jun 15, 2023

This is similar in spirit to #214, but a bit more general.

It would be useful to be able to estimate the total context memory requirement given some computation graph or a list of tensor descriptions. This would make the implementation of newer models that much easier, since the implementer doesn't need to estimate all the memory usage manually.

For computation graphs, this wouldn't add any overhead as long as the graph size stays constant between invocations. In that case the context's memory buffer can be re-used (I've successfully done this for GPT-2 in https://github.com/smspillaz/ggml-gobject).
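
A minimal sketch of that buffer reuse pattern, assuming the required size has already been determined beforehand (`begin_eval`, `eval_buf` and `eval_buf_size` are hypothetical names, not ggml API):

```c
// Hypothetical sketch: back the ggml_context with the same pre-sized buffer
// on every invocation instead of letting ggml allocate a fresh one.
#include "ggml.h"
#include <stdlib.h>

static void  * eval_buf      = NULL;
static size_t  eval_buf_size = 0;   // assumed to be known from a measurement pass

struct ggml_context * begin_eval(void) {
    if (eval_buf == NULL) {
        eval_buf = malloc(eval_buf_size);
    }
    struct ggml_init_params params = {
        /*.mem_size   =*/ eval_buf_size,
        /*.mem_buffer =*/ eval_buf,   // tensors are allocated out of this buffer
        /*.no_alloc   =*/ false,
    };
    return ggml_init(params);         // free with ggml_free(); eval_buf is kept for reuse
}
```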

I think in order to implement this, you could have a flag on ggml_context such that when new tensors are created in that context, they don't actually allocate any memory for the data (the object overhead can either go into its own memory pool or onto the stack/heap). Writing to the tensors would be a no-op, as would ggml_graph_compute. Once the computation graph has been created, the library consumer can query the context's estimated memory usage, which could be done by walking all the objects in the ggml_object list and tallying up their sizes.
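
A rough sketch of what that query could look like, assuming the graph was built in a context created with `no_alloc = true` so tensors have shapes but no data; `ggml_nbytes()` and `ggml_tensor_overhead()` are existing ggml helpers, while `estimate_graph_mem()` itself is hypothetical:

```c
// Hypothetical estimator, not an existing ggml API: sum the bytes each
// tensor's data would need, plus the per-tensor bookkeeping overhead that
// goes into the context's memory pool.
#include "ggml.h"

size_t estimate_graph_mem(const struct ggml_cgraph * gf) {
    size_t total = 0;
    for (int i = 0; i < gf->n_leafs; ++i) {
        total += ggml_nbytes(gf->leafs[i]) + ggml_tensor_overhead();
    }
    for (int i = 0; i < gf->n_nodes; ++i) {
        total += ggml_nbytes(gf->nodes[i]) + ggml_tensor_overhead();
    }
    return total;
}
```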

I haven't looked very closely at the details - maybe data allocations are needed in order to build the graph somehow, which would make this infeasible. But if not, I could try doing this myself and submitting a pull request, if it belongs in the library.

@LoganDark
Contributor

LoganDark commented Jun 16, 2023

this is mostly possible if you don't mind reading the implementation of every function to figure out exactly what it does:

https://github.com/saharNooby/rwkv.cpp/blob/6b26e0db28b26f0fb2c73c5aa6ff490818fb1456/rwkv.cpp#L942-L958

https://github.com/saharNooby/rwkv.cpp/blob/6b26e0db28b26f0fb2c73c5aa6ff490818fb1456/rwkv.cpp#L505-L519
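
For reference, the manual accounting in code like that boils down to something along these lines (a hedged sketch; `estimate_tensor_bytes_2d` is a made-up helper, while `ggml_type_size()`, `ggml_blck_size()` and `ggml_tensor_overhead()` are real ggml functions):

```c
// Hypothetical helper in the spirit of the linked rwkv.cpp code: predict how
// many bytes a 2-D tensor of a given type will consume inside a ggml context
// (data plus the object/tensor struct overhead).
#include "ggml.h"

static size_t estimate_tensor_bytes_2d(enum ggml_type type, int64_t ne0, int64_t ne1) {
    // elements along dim 0 are stored in blocks of ggml_blck_size(type) elements
    const size_t row_bytes  = ggml_type_size(type) * (ne0 / ggml_blck_size(type));
    const size_t data_bytes = row_bytes * ne1;
    return data_bytes + ggml_tensor_overhead();
}
```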

@ggerganov
Owner

Yes, it is currently annoying that you have to pre-compute the necessary size. I'm thinking about ways to solve this.
The proposed solution is one way to do it. Will try to prioritize this feature soon.

@LoganDark
Contributor


the latest version of GGML trashed this approach so severely (WHY do ggml_views allocate ANOTHER extra tensor now??) that I'm going to have to redo the entire system, so that's fun

@smspillaz
Contributor Author

Was just wondering if there was any update on this - I can also start looking into this myself

@slaren
Collaborator

slaren commented Aug 3, 2023

There is an implementation in llama.cpp that does this, among other things. It is not entirely automated as you are suggesting here; you have to avoid writing to the tensors while creating a dummy graph for measuring the memory requirements.
ggerganov/llama.cpp#2411
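
For anyone landing here later, the pattern in that PR looks roughly like this (a hedged sketch; `build_graph()` is a placeholder for model code that constructs the graph without writing any tensor data, and the `ggml_allocr_*` calls come from the ggml-alloc.h introduced in that PR):

```c
// Hedged sketch of the measure-allocator pattern from ggml-alloc.
#include "ggml.h"
#include "ggml-alloc.h"
#include <stdlib.h>

// hypothetical: builds the model graph, allocating its input tensors via the allocator
struct ggml_cgraph * build_graph(struct ggml_allocr * alloc);

void * alloc_compute_buffer(size_t * out_size) {
    // 1. dummy pass with a measure allocator: no data is written, only sizes are tracked
    struct ggml_allocr * measure = ggml_allocr_new_measure(/*alignment =*/ 32);
    struct ggml_cgraph * gf      = build_graph(measure);
    const size_t mem_size        = ggml_allocr_alloc_graph(measure, gf);
    ggml_allocr_free(measure);

    // 2. real pass: back a new allocator with a buffer of the measured size
    void * buf = malloc(mem_size);
    *out_size  = mem_size;
    return buf;   // then: ggml_allocr_new(buf, mem_size, 32) and rebuild the graph
}
```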

@smspillaz
Contributor Author

smspillaz commented Aug 4, 2023

OK, perhaps I can try to backport ggerganov/llama.cpp#2411 here?

@smspillaz
Contributor Author

smspillaz commented Aug 4, 2023

Created #433. I want to try to implement this in ggml-gobject as well, just to test that it works correctly (no reason why it shouldn't, since the allocator parts are relatively standalone).

@Green-Sky
Contributor

@ggerganov occasionally syncs the ggml code between ggml/whisper.cpp/llama.cpp; I suppose you just have to poke him and he will do it... sometime when he has time :)
