
ggml : improve API to allow allocating compute graphs on the heap #299

Closed
ggerganov opened this issue Jun 25, 2023 · 9 comments · Fixed by ggerganov/llama.cpp#2392

@ggerganov
Owner

Currently, ggml forces the user to allocate the compute graphs on the stack. The ggml API should be extended to support using heap allocated graphs.

@ggerganov added the enhancement and good first issue labels on Jun 25, 2023
@xaedes

xaedes commented Jun 25, 2023

I am currently working on memory improvements for training and on testing with bigger models than before.
When training big models, the maximum node and parameter counts need to be increased.
At some point, stack overflows started happening because of the large size of the graph struct...

To solve this, I allocate heap memory for the graphs by using the data buffer of a new tensor, and then use only the _expand build functions instead of the regular build functions.
This required a new ggml_build_backward_expand function that takes the backward graph as a pointer parameter, just like the forward graph; ggml_build_backward simply calls it.

GGML_API void ggml_build_backward_expand(struct ggml_context * ctx, struct ggml_cgraph * gf, struct ggml_cgraph * gb, bool keep);
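
With that in place, ggml_build_backward becomes a thin wrapper. A rough sketch (assuming the existing by-value ggml_build_backward signature):

struct ggml_cgraph ggml_build_backward(struct ggml_context * ctx, struct ggml_cgraph * gf, bool keep) {
    struct ggml_cgraph result = *gf;
    ggml_build_backward_expand(ctx, gf, &result, keep);
    return result;
}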

Example of allocating a new graph:

// size the I32 tensor so its data buffer can hold a ggml_cgraph (element count rounded up)
struct ggml_tensor * gfbuf = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, sizeof(struct ggml_cgraph) / ggml_type_size(GGML_TYPE_I32) + (sizeof(struct ggml_cgraph) % ggml_type_size(GGML_TYPE_I32) ? 1 : 0));
memset(gfbuf->data, 0, ggml_nbytes(gfbuf));
// reuse the zero-initialized tensor data buffer as the heap-allocated graph
struct ggml_cgraph * gf = (struct ggml_cgraph *) gfbuf->data;
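
The graph is then filled with the _expand functions, for example (a sketch only; loss stands for the model output tensor, and gb is a backward graph allocated the same way as gf):

// build the forward pass into the heap-allocated graph
ggml_build_forward_expand(gf, loss);
// build the backward pass into a second heap-allocated graph
ggml_build_backward_expand(ctx0, gf, gb, true);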

This seems to be enough for solving the stack overflows in training.

In the llama.cpp inference code there are only a few locations where a cgraph is allocated on the stack.
Replacing these occurrences with heap-allocated graphs and then just using ggml_build_forward_expand should be straightforward.

A function to directly allocate a ggml object with enough bytes from the context would be nice!
Creating the new tensor and casting its tensor->data looks a bit hackish.

Maybe something like this:

GGML_API void * ggml_alloc(struct ggml_context * ctx, size_t nbytes);
...
struct ggml_cgraph * gf = ggml_alloc(ctx0, sizeof(struct ggml_cgraph));

I prefer allocating the memory from the context over just 'malloc', so that the machinery for freeing the context and all its related memory can be reused.
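
Until such an API exists, the tensor trick could at least be hidden behind a small helper. Just a sketch on top of the current public API (the name ggml_alloc_scratch is made up, not an existing function):

static void * ggml_alloc_scratch(struct ggml_context * ctx, size_t nbytes) {
    // round up to a whole number of I32 elements
    const size_t n = (nbytes + ggml_type_size(GGML_TYPE_I32) - 1) / ggml_type_size(GGML_TYPE_I32);
    struct ggml_tensor * buf = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, n);
    memset(buf->data, 0, ggml_nbytes(buf));
    return buf->data;
}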

@slaren
Collaborator

slaren commented Jun 25, 2023

I think allocating it from the context memory is a good idea, but it should be automatic. This could be done by passing a context to ggml_build_forward and either:

  • Counting the number of nodes in a first pass (this requires adding a visited flag to ggml_tensor, but that should be done for performance anyway) and allocating the exact amount of memory needed (see the sketch after this list)
  • Growing the buffer as needed from the ggml_context. Since it will be the last object in the ggml_context for the entire duration of the expansion, this can be done safely
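
A rough sketch of the counting pass for the first option (the visited flag and the src/GGML_MAX_SRC fields are assumptions for illustration; they are not part of ggml_tensor today):

// count the tensors reachable from `t` exactly once, assuming a hypothetical
// `visited` flag on ggml_tensor and a `src` array of GGML_MAX_SRC inputs
static size_t ggml_count_nodes(struct ggml_tensor * t) {
    if (t == NULL || t->visited) {
        return 0;
    }
    t->visited = true;
    size_t n = 1;
    for (int i = 0; i < GGML_MAX_SRC; ++i) {
        n += ggml_count_nodes(t->src[i]);
    }
    return n;
}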

To make this work, the graph must only be built in one step, but I think that ggml_build_forward_expand should be removed and instead replaced by making sure that all the dependencies are represented in the graph (see this for more details). Is there any case for which ggml_build_forward_expand is required that cannot be solved by adding a dependency to the graph?

@ggerganov
Owner Author

Is there any case for which ggml_build_forward_expand is required that cannot be solved by adding a dependency to the graph?

I don't think there is such a case. I agree we should aim to eliminate ggml_build_forward_expand() eventually.

I think it is a good idea allocating it from the context memory, but it should be automatic.

Yes, agreed. The only problem might be that after ggerganov/llama.cpp#1999 there will no longer be an "eval" ggml_context, as it will be replaced with ggml_cgraph_context - maybe we can instead pass ggml_cgraph_context to ggml_build_forward()

@slaren
Collaborator

slaren commented Jul 2, 2023

I don't think that's necessary. We can pass the same ggml_context used to allocate the tensors of the graph to ggml_build_forward. This still maintains a separation between building the graph and running it. This is what I am thinking:

ggml_tensor * output = ggml_mul(ctx, ...);
ggml_cgraph gf = ggml_build_forward(ctx, output); // gf will be allocated in ctx

// execute the graph on the CPU
ggml_cgraph_context plan = ggml_graph_compute_plan(&gf);
plan.work_data = malloc(plan.work_size);
ggml_graph_compute(plan);

// execute the same graph with CUDA
ggml_cuda_graph_context cuda_plan = ggml_cuda_compute_plan(&gf);
ggml_cuda_graph_compute(cuda_plan);

@ggerganov
Owner Author

Ah correct - I was a bit confused. Ignore my last comment

@slaren
Collaborator

slaren commented Jul 25, 2023

@ggerganov should we remove ggml_object? In the ggml, whisper.cpp and llama.cpp repositories:

  • ggml_print_objects is only used in a test
  • ggml_get_tensor, ggml_get_max_tensor_size aren't used at all
  • ggml_used_mem can be implemented without ggml_object

I guess that it was used for debugging early on, but currently it seems to be unnecessary.

If we want to keep ggml_object, it would have to be extended to contain the type of the object, i.e. tensor, graph, work buffer, and maybe also a "user object" if we implement a ggml_alloc as @xaedes suggested.
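
For illustration, the extension could look roughly like this (the layout and enum values are only a sketch, not the actual struct):

// sketch: tag each allocation in the context with its kind
enum ggml_object_type {
    GGML_OBJECT_TENSOR,
    GGML_OBJECT_GRAPH,
    GGML_OBJECT_WORK_BUFFER,
    GGML_OBJECT_USER, // for a future ggml_alloc
};

struct ggml_object {
    size_t offs;                // offset of the payload in the context buffer
    size_t size;                // size of the payload in bytes
    struct ggml_object * next;  // next object in the context
    enum ggml_object_type type; // new field
};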

@ggerganov
Owner Author

ggml_object was envisioned as a way to wrap the different types of objects that we allocate in the context, but indeed right now we only have tensors, so it seems a bit pointless. We can keep it and try to utilize it for graphs, buffers, etc. as you suggest, unless you have a better way of extending support for such new structures without using ggml_object.

@slaren
Collaborator

slaren commented Jul 25, 2023

I think it is only useful if we want to be able to enumerate all the objects allocated in a ggml_context, but I am not sure that we really need that. Otherwise, we would only need to keep the current offset of the buffer in ggml_context to know where to allocate new objects.

@ggerganov
Owner Author

I recently needed ggml_get_max_tensor_size for some Metal tricks, though they didn't end up being merged.
Maybe let's keep ggml_object for now - I think it is simple enough and opens up room for future improvements.
