
ggml : improve API to allow allocating compute graphs on the heap #299

Closed
ggerganov opened this issue Jun 25, 2023 · 9 comments · Fixed by ggerganov/llama.cpp#2392

@ggerganov
Owner

Currently, ggml forces the user to allocate the compute graphs on the stack. The ggml API should be extended to support using heap allocated graphs.

@ggerganov added the enhancement and good first issue labels on Jun 25, 2023
@xaedes

xaedes commented Jun 25, 2023

I am currently working on memory improvements for training and on testing with bigger models than before.
When training big models, the maximum node and parameter counts need to be increased.
At some point, stack overflows started happening because of the large size of the graph struct...

To solve this, I allocate heap memory for the graphs by using the data buffer of a new tensor, and then use only the _expand build functions instead of the regular build functions.
This required a new ggml_build_backward_expand function that takes the backward graph as a pointer parameter, just like the forward graph; ggml_build_backward simply calls it.

GGML_API void ggml_build_backward_expand(struct ggml_context * ctx, struct ggml_cgraph * gf, struct ggml_cgraph * gb, bool keep);
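
With that in place, ggml_build_backward becomes a thin wrapper. A rough sketch (assuming the existing by-value ggml_build_backward signature):

struct ggml_cgraph ggml_build_backward(struct ggml_context * ctx, struct ggml_cgraph * gf, bool keep) {
    struct ggml_cgraph result = *gf;
    ggml_build_backward_expand(ctx, gf, &result, keep);
    return result;
}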

Example of allocating a new graph:

// size the I32 tensor so its data buffer can hold a ggml_cgraph (element count rounded up)
struct ggml_tensor * gfbuf = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, sizeof(struct ggml_cgraph) / ggml_type_size(GGML_TYPE_I32) + (sizeof(struct ggml_cgraph) % ggml_type_size(GGML_TYPE_I32) ? 1 : 0));
memset(gfbuf->data, 0, ggml_nbytes(gfbuf));
// reuse the zero-initialized tensor data buffer as the heap-allocated graph
struct ggml_cgraph * gf = (struct ggml_cgraph *) gfbuf->data;
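
The graph is then filled with the _expand functions, for example (a sketch only; loss stands for the model output tensor, and gb is a backward graph allocated the same way as gf):

// build the forward pass into the heap-allocated graph
ggml_build_forward_expand(gf, loss);
// build the backward pass into a second heap-allocated graph
ggml_build_backward_expand(ctx0, gf, gb, true);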

This seems to be enough for solving the stack overflows in training.

In the llama.cpp inference code there are only a few locations where a cgraph is allocated on the stack.
Replacing these occurrences with heap-allocated graphs and then just using ggml_build_forward_expand should be straightforward.

A function to directly allocate a ggml object with enough bytes from the context would be nice!
Creating the new tensor and casting its tensor->data looks a bit hackish.

Maybe something like this:

GGML_API void * ggml_alloc(struct ggml_context * ctx, size_t nbytes);
...
struct ggml_cgraph * gf = ggml_alloc(ctx0, sizeof(struct ggml_cgraph));

I prefer allocating the memory from the context over just 'malloc', so that the machinery for freeing the context and all its related memory can be reused.
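
Until such an API exists, the tensor trick could at least be hidden behind a small helper. Just a sketch on top of the current public API (the name ggml_alloc_scratch is made up, not an existing function):

static void * ggml_alloc_scratch(struct ggml_context * ctx, size_t nbytes) {
    // round up to a whole number of I32 elements
    const size_t n = (nbytes + ggml_type_size(GGML_TYPE_I32) - 1) / ggml_type_size(GGML_TYPE_I32);
    struct ggml_tensor * buf = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, n);
    memset(buf->data, 0, ggml_nbytes(buf));
    return buf->data;
}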

@slaren
Collaborator

slaren commented Jun 25, 2023

I think allocating it from the context memory is a good idea, but it should be automatic. This could be done by passing a context to ggml_build_forward and either:

  • Counting the number of nodes in a first pass (this requires adding a visited flag to ggml_tensor, but that should be done for performance anyway) and allocating the exact amount of memory needed (see the sketch after this list)
  • Growing the buffer as needed from the ggml_context. Since it will be the last object in the ggml_context for the entire duration of the expansion, this can be done safely
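
A rough sketch of the counting pass for the first option (the visited flag and the src/GGML_MAX_SRC fields are assumptions for illustration; they are not part of ggml_tensor today):

// count the tensors reachable from `t` exactly once, assuming a hypothetical
// `visited` flag on ggml_tensor and a `src` array of GGML_MAX_SRC inputs
static size_t ggml_count_nodes(struct ggml_tensor * t) {
    if (t == NULL || t->visited) {
        return 0;
    }
    t->visited = true;
    size_t n = 1;
    for (int i = 0; i < GGML_MAX_SRC; ++i) {
        n += ggml_count_nodes(t->src[i]);
    }
    return n;
}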

To make this work, the graph must only be built in one step, but I think that ggml_build_forward_expand should be removed and instead replaced by making sure that all the dependencies are represented in the graph (see this for more details). Is there any case for which ggml_build_forward_expand is required that cannot be solved by adding a dependency to the graph?

@ggerganov
Owner Author

Is there any case for which ggml_build_forward_expand is required that cannot be solved by adding a dependency to the graph?

I don't think there is such a case. I agree we should aim to eliminate ggml_build_forward_expand() eventually.

I think it is a good idea allocating it from the context memory, but it should be automatic.

Yes, agreed. The only problem might be that after ggerganov/llama.cpp#1999 there will no longer be an "eval" ggml_context, as it will be replaced with ggml_cgraph_context - maybe we can instead pass ggml_cgraph_context to ggml_build_forward()

@slaren
Collaborator

slaren commented Jul 2, 2023

I don't think that's necessary. We can pass the same ggml_context used to allocate the tensors of the graph to ggml_build_forward. This still maintains a separation between building the graph and running it. This is what I am thinking:

ggml_tensor * output = ggml_mul(ctx, ...);
ggml_cgraph gf = ggml_build_forward(ctx, output); // gf will be allocated in ctx

// execute the graph on the CPU
ggml_cgraph_context plan = ggml_graph_compute_plan(&gf);
plan.work_data = malloc(plan.work_size);
ggml_graph_compute(plan);

// execute the same graph with CUDA
ggml_cuda_graph_context cuda_plan = ggml_cuda_compute_plan(&gf);
ggml_cuda_graph_compute(cuda_plan);

@ggerganov
Owner Author

Ah correct - I was a bit confused. Ignore my last comment

@slaren
Collaborator

slaren commented Jul 25, 2023

@ggerganov should we remove ggml_object? In the ggml, whisper.cpp and llama.cpp repositories:

  • ggml_print_objects is only used in a test
  • ggml_get_tensor, ggml_get_max_tensor_size aren't used at all
  • ggml_used_mem can be implemented without ggml_object

I guess that it was used for debugging early on, but currently it seems to be unnecessary.

If we want to keep ggml_object, it would have to be extended to contain the type of the object, i.e. tensor, graph, work buffer, and maybe also a "user object" if we implement a ggml_alloc as @xaedes suggested.
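
For illustration, the extension could look roughly like this (the layout and enum values are only a sketch, not the actual struct):

// sketch: tag each allocation in the context with its kind
enum ggml_object_type {
    GGML_OBJECT_TENSOR,
    GGML_OBJECT_GRAPH,
    GGML_OBJECT_WORK_BUFFER,
    GGML_OBJECT_USER, // for a future ggml_alloc
};

struct ggml_object {
    size_t offs;                // offset of the payload in the context buffer
    size_t size;                // size of the payload in bytes
    struct ggml_object * next;  // next object in the context
    enum ggml_object_type type; // new field
};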

@ggerganov
Owner Author

ggml_object was envisioned as a way to wrap the different types of objects that we allocate in the context, but indeed right now we only have tensors, so it seems a bit pointless. We can keep it and try to utilize it for graphs, buffers, etc. as you suggest, unless you have a better way of extending support for such new structures without using ggml_object.

@slaren
Collaborator

slaren commented Jul 25, 2023

I think it is only useful if we want to be able to enumerate all the objects allocated in a ggml_context, but I am not sure that we really need that. Otherwise, we would only need to keep the current offset of the buffer in ggml_context to know where to allocate new objects.

@ggerganov
Owner Author

I recently needed ggml_get_max_tensor_size for some Metal tricks, though they didn't end up being merged.
Maybe let's keep ggml_object for now - I think it is simple enough and opens up room for future improvements.
