
A way to automatically calculate mem_size for creating ggml context #121

Open

saharNooby opened this issue Apr 30, 2023 · 4 comments
@saharNooby

As far as I understand, before creating a ggml context, we need to know exactly how much memory it will need, and pass this value into ggml_init through ggml_init_params.mem_size.

On the one hand, it's great that this serves as a guarantee that ggml will not allocate any more memory -- if we specify, say, 100 MB, we can be sure that no more than 100 MB will be allocated.

On the other hand, computing mem_size may not be trivial for large models that use a wide range of operations. These operations create intermediate tensors, and operations like matmul require some extra temporary space (for example, to quantize/dequantize arguments). This extra space depends on thread count, supported CPU features, libraries used, etc.
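For context, a minimal sketch of the current usage pattern, where the caller must know the size up front (ggml_init and ggml_init_params are the real ggml API; the 100 MB figure is just an illustrative placeholder):

    #include "ggml.h"

    struct ggml_context * make_ctx(void) {
        struct ggml_init_params params = {
            /* .mem_size   = */ 100 * 1024 * 1024, // must cover all tensors + overhead
            /* .mem_buffer = */ NULL,              // let ggml allocate the buffer itself
            /* .no_alloc   = */ false,
        };
        // Returns NULL if the buffer could not be allocated; if the graph later
        // does not fit into mem_size, ggml reports an error and aborts.
        return ggml_init(params);
    }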

In rwkv.cpp, I used a hacky workaround which combines some heuristics and good old "just add more MBs until it starts working":

    size_t memory_required = file_size +
        // Intermediary vectors for calculation; there are around 100 calls to ggml
        size_t(100) * model->n_embed * sizeof(float) +
        // State, in and out
        size_t(2) * 5 * model->n_layer * model->n_embed * sizeof(float) +
        // Logits
        size_t(model->n_vocab) * sizeof(float) +
        // +256 MB just for any overhead
        // TODO This is too much for smaller models; need a more proper and robust way of measuring required memory
        size_t(256) * 1024 * 1024;

I would like to avoid:

  • having an extra 256 MB of unconditional padding (it just wastes memory for smaller models)
  • having to write complicated calculations that duplicate ggml's internal logic for temporary space allocation
  • having to hardcode a mem_size value for every combination of model size and thread count

Ideally, there could be some function like ggml_trace, which would take a graph, traverse it, and tell precisely how much memory would be needed to evaluate it. The problem here is that we already need a context to create the graph.
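Purely as an illustration of the idea (nothing like ggml_trace exists in ggml; the signature and usage below are hypothetical):

    // Hypothetical API, for illustration only: traverse a finished graph and
    // report how much context memory evaluating it would require.
    size_t ggml_trace(const struct ggml_cgraph * graph, int n_threads);

    // Intended usage -- note the chicken-and-egg problem: building the graph
    // already requires a context with some mem_size.
    // size_t needed = ggml_trace(&graph, n_threads);
    // struct ggml_init_params params = { /* mem_size */ needed, NULL, false };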

Looks like in llama.cpp these values are hardcoded.

@Green-Sky (Contributor)

In llama.cpp there is an option to experimentally observe the memory usage:
https://github.com/ggerganov/llama.cpp/blob/3e5aa8a1c44051153d6d7b3eeca2f4b4e5fb310c/examples/main/main.cpp#L138-L155
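The general idea, roughly sketched: run one evaluation with a deliberately generous context, then report how much of it was actually used (ggml_used_mem is a real ggml function; the surrounding lines are illustrative rather than the linked code, and ctx / mem_size come from the caller's setup):

    // After one full evaluation, report how much of the context was consumed.
    size_t used = ggml_used_mem(ctx);
    fprintf(stderr, "ggml context: used %.2f MB of %.2f MB\n",
            used / 1024.0 / 1024.0,
            mem_size / 1024.0 / 1024.0);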

@saharNooby (Author)

Looks related: ggerganov/llama.cpp#1152

@Green-Sky This may be useful, but after observing the memory usage, the observed value still needs to be hardcoded, and this needs to be done for each model size. Furthermore, the specific value may not be portable across machines/platforms.

@ggerganov (Owner)

It should be possible to utilize the no_alloc parameter of the ggml_context to achieve automatic size calculation:

bool no_alloc; // don't allocate memory for the tensor data

For example, do a first pass with no_alloc = true to determine the necessary size and then do a second pass with no_alloc = false. But maybe there is a better way - not sure yet.
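A rough sketch of that two-pass idea, assuming that with no_alloc = true tensor metadata still goes into the context while tensor data is not allocated (the 16 MB metadata budget and the per-tensor alignment padding below are guesses):

    // Pass 1: build the graph in a metadata-only context to measure the size.
    struct ggml_init_params meta_params = {
        /* .mem_size   = */ 16 * 1024 * 1024, // only has to hold tensor/object metadata (guess)
        /* .mem_buffer = */ NULL,
        /* .no_alloc   = */ true,
    };
    struct ggml_context * meta_ctx = ggml_init(meta_params);

    // ... build exactly the same graph here as in the real pass ...

    size_t needed = ggml_used_mem(meta_ctx); // objects + tensor metadata
    // For every tensor created above, add its data size, e.g.:
    //     needed += ggml_nbytes(t) + 16; // 16 = alignment padding, a guess
    ggml_free(meta_ctx);

    // Pass 2: the real context, sized from the measurement.
    struct ggml_init_params params = {
        /* .mem_size   = */ needed,
        /* .mem_buffer = */ NULL,
        /* .no_alloc   = */ false,
    };
    struct ggml_context * ctx = ggml_init(params);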

@KerfuffleV2 commented May 1, 2023

Ideally, there could be some function like ggml_trace, which would take a graph, traverse it, and tell precisely how much memory would be needed to evaluate it. The problem here is that we already need a context to create the graph.

It's not a complete solution, but #123 could help a lot here. If failures in operations like tensor creation were recoverable, you could just free the current context and try again with a somewhat larger size when you run out of memory. Building the graph is pretty cheap, so it shouldn't be a huge issue if it's necessary to retry a couple of times.

Some estimation/waste of memory is probably still necessary, but the status quo right now is "get it right the first time or die horribly".
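A sketch of that retry loop, assuming a hypothetical recoverable failure mode (today ggml aborts when the context runs out of space, which is what #123 is about changing; build_graph and initial_estimate below stand in for the caller's own code):

    size_t mem_size = initial_estimate; // start from some cheap heuristic
    struct ggml_context * ctx = NULL;

    for (;;) {
        struct ggml_init_params params = {
            /* .mem_size   = */ mem_size,
            /* .mem_buffer = */ NULL,
            /* .no_alloc   = */ false,
        };
        ctx = ggml_init(params);

        // Hypothetical: build_graph() returns false instead of aborting when
        // the context runs out of space.
        if (ctx != NULL && build_graph(ctx)) {
            break; // everything fit, keep this context
        }

        if (ctx != NULL) {
            ggml_free(ctx);
        }
        mem_size += mem_size / 2; // grow by 50% and try again
    }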


It should be possible to utilize the no_alloc parameter

If I'm understanding correctly, you're saying no_alloc means you can create the graph without allocating data for the tensors: only the objects? (Obviously it can't be evaluated that way...)

I assumed it was related to preventing GGML from allocating its context memory itself and allowing other code to pass in a buffer for it to use.
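For reference, a caller-provided buffer is a separate knob, mem_buffer, which is independent of no_alloc (a sketch; the buffer and its size are illustrative):

    // Passing a caller-owned buffer: ggml places its objects and tensor data
    // inside this buffer instead of allocating its own memory.
    size_t buf_size = 64 * 1024 * 1024; // illustrative size
    void * buf = malloc(buf_size);

    struct ggml_init_params params = {
        /* .mem_size   = */ buf_size,
        /* .mem_buffer = */ buf,   // caller owns this memory; ggml will not free it
        /* .no_alloc   = */ false, // tensor data still lives inside the buffer
    };
    struct ggml_context * ctx = ggml_init(params);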
