
A way to automatically calculate mem_size for creating ggml context #121

Open

saharNooby opened this issue Apr 30, 2023 · 4 comments
@saharNooby

As far as I understand, before creating a ggml context, we need to know exactly how much memory it will need, and pass this value into ggml_init through ggml_init_params.mem_size.

On the one hand, it's great that this serves as a guarantee that ggml will not allocate any more memory -- if we specify, say, 100 MB, we can be sure that no more than 100 MB will be allocated.

On the other hand, computing mem_size may not be trivial for large models that use a wide range of operations. These operations create intermediate tensors, and operations like matmul require some extra temporary space (for example, to quantize/dequantize arguments). This extra space depends on thread count, supported CPU features, libraries used, etc.
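For context, a minimal sketch of the current usage pattern, where the caller must know the size up front (ggml_init and ggml_init_params are the real ggml API; the 100 MB figure is just an illustrative placeholder):

    #include "ggml.h"

    struct ggml_context * make_ctx(void) {
        struct ggml_init_params params = {
            /* .mem_size   = */ 100 * 1024 * 1024, // must cover all tensors + overhead
            /* .mem_buffer = */ NULL,              // let ggml allocate the buffer itself
            /* .no_alloc   = */ false,
        };
        // Returns NULL if the buffer could not be allocated; if the graph later
        // does not fit into mem_size, ggml reports an error and aborts.
        return ggml_init(params);
    }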

In rwkv.cpp, I used a hacky workaround which combines some heuristics and good old "just add more MBs until it starts working":

    size_t memory_required = file_size +
        // Intermediary vectors for calculation; there are around 100 calls to ggml
        size_t(100) * model->n_embed * sizeof(float) +
        // State, in and out
        size_t(2) * 5 * model->n_layer * model->n_embed * sizeof(float) +
        // Logits
        size_t(model->n_vocab) * sizeof(float) +
        // +256 MB just for any overhead
        // TODO This is too much for smaller models; need a more proper and robust way of measuring required memory
        size_t(256) * 1024 * 1024;

I would like to avoid:

  • having an extra 256 MB of unconditional padding (it just wastes memory for smaller models)
  • having to write complicated calculations that duplicate ggml's internal logic for temporary space allocation
  • having to hardcode a mem_size value for every combination of model size and thread count

Ideally, there could be some function like ggml_trace, which would take a graph, traverse it, and tell precisely how much memory would be needed to evaluate it. The problem here is that we already need a context to create the graph.
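Purely as an illustration of the idea (nothing like ggml_trace exists in ggml; the signature and usage below are hypothetical):

    // Hypothetical API, for illustration only: traverse a finished graph and
    // report how much context memory evaluating it would require.
    size_t ggml_trace(const struct ggml_cgraph * graph, int n_threads);

    // Intended usage -- note the chicken-and-egg problem: building the graph
    // already requires a context with some mem_size.
    // size_t needed = ggml_trace(&graph, n_threads);
    // struct ggml_init_params params = { /* mem_size */ needed, NULL, false };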

Looks like in llama.cpp these values are hardcoded.

@Green-Sky (Contributor)

In llama.cpp there is an option to experimentally observe the memory usage:
https://github.com/ggerganov/llama.cpp/blob/3e5aa8a1c44051153d6d7b3eeca2f4b4e5fb310c/examples/main/main.cpp#L138-L155
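The general idea, roughly sketched: run one evaluation with a deliberately generous context, then report how much of it was actually used (ggml_used_mem is a real ggml function; the surrounding lines are illustrative rather than the linked code, and ctx / mem_size come from the caller's setup):

    // After one full evaluation, report how much of the context was consumed.
    size_t used = ggml_used_mem(ctx);
    fprintf(stderr, "ggml context: used %.2f MB of %.2f MB\n",
            used / 1024.0 / 1024.0,
            mem_size / 1024.0 / 1024.0);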

@saharNooby (Author)

Looks related: ggerganov/llama.cpp#1152

@Green-Sky This may be useful, but after observing the memory usage, the observed value still needs to be hardcoded, and this needs to be done for each model size. Furthermore, the specific value may not be portable across machines/platforms.

@ggerganov (Owner)

It should be possible to utilize the no_alloc parameter of the ggml_context to achieve automatic size calculation:

bool no_alloc; // don't allocate memory for the tensor data

For example, do a first pass with no_alloc = true to determine the necessary size and then do a second pass with no_alloc = false. But maybe there is a better way - not sure yet.
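A rough sketch of that two-pass idea, assuming that with no_alloc = true tensor metadata still goes into the context while tensor data is not allocated (the 16 MB metadata budget and the per-tensor alignment padding below are guesses):

    // Pass 1: build the graph in a metadata-only context to measure the size.
    struct ggml_init_params meta_params = {
        /* .mem_size   = */ 16 * 1024 * 1024, // only has to hold tensor/object metadata (guess)
        /* .mem_buffer = */ NULL,
        /* .no_alloc   = */ true,
    };
    struct ggml_context * meta_ctx = ggml_init(meta_params);

    // ... build exactly the same graph here as in the real pass ...

    size_t needed = ggml_used_mem(meta_ctx); // objects + tensor metadata
    // For every tensor created above, add its data size, e.g.:
    //     needed += ggml_nbytes(t) + 16; // 16 = alignment padding, a guess
    ggml_free(meta_ctx);

    // Pass 2: the real context, sized from the measurement.
    struct ggml_init_params params = {
        /* .mem_size   = */ needed,
        /* .mem_buffer = */ NULL,
        /* .no_alloc   = */ false,
    };
    struct ggml_context * ctx = ggml_init(params);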

@KerfuffleV2 commented May 1, 2023

Ideally, there could be some function like ggml_trace, which would take a graph, traverse it, and tell precisely how much memory would be needed to evaluate it. The problem here is that we already need a context to create the graph.

It's not a complete solution, but #123 could help a lot here. If failures in operations like tensor creation were recoverable, you could just free the current context and try again with a somewhat larger size when you run out of memory. Building the graph is pretty cheap, so it shouldn't be a huge issue if it's necessary to retry a couple of times.

Some estimation/waste of memory is probably still necessary, but the status quo right now is "get it right the first time or die horribly".
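A sketch of that retry loop, assuming a hypothetical recoverable failure mode (today ggml aborts when the context runs out of space, which is what #123 is about changing; build_graph and initial_estimate below stand in for the caller's own code):

    size_t mem_size = initial_estimate; // start from some cheap heuristic
    struct ggml_context * ctx = NULL;

    for (;;) {
        struct ggml_init_params params = {
            /* .mem_size   = */ mem_size,
            /* .mem_buffer = */ NULL,
            /* .no_alloc   = */ false,
        };
        ctx = ggml_init(params);

        // Hypothetical: build_graph() returns false instead of aborting when
        // the context runs out of space.
        if (ctx != NULL && build_graph(ctx)) {
            break; // everything fit, keep this context
        }

        if (ctx != NULL) {
            ggml_free(ctx);
        }
        mem_size += mem_size / 2; // grow by 50% and try again
    }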


It should be possible to utilize the no_alloc parameter

If I'm understanding correctly, you're saying no_alloc means you can create the graph without allocating data for the tensors: only the objects? (Obviously it can't be evaluated that way...)

I assumed it was related to preventing GGML from allocating its context memory itself and allowing other code to pass in a buffer for it to use.
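For reference, a caller-provided buffer is a separate knob, mem_buffer, which is independent of no_alloc (a sketch; the buffer and its size are illustrative):

    // Passing a caller-owned buffer: ggml places its objects and tensor data
    // inside this buffer instead of allocating its own memory.
    size_t buf_size = 64 * 1024 * 1024; // illustrative size
    void * buf = malloc(buf_size);

    struct ggml_init_params params = {
        /* .mem_size   = */ buf_size,
        /* .mem_buffer = */ buf,   // caller owns this memory; ggml will not free it
        /* .no_alloc   = */ false, // tensor data still lives inside the buffer
    };
    struct ggml_context * ctx = ggml_init(params);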
