
ggml : improve memory management #288

Closed
ggerganov opened this issue Jun 25, 2023 · 3 comments · Fixed by ggerganov/llama.cpp#2411
Labels
refactoring

Comments

@ggerganov
Owner

One of the biggest problems with ggml currently is that the user must manually pre-calculate the necessary sizes for all the ggml_context objects they create. This stems from the goal of performing as few memory allocations as possible at runtime, but it makes for an unpleasant experience and needs to be improved.

Additionally, the "scratch buffer" mechanism is very difficult to use and needs to be overhauled as well.

This will be quite a big change to the core library and there are many different ways to approach it, so for now I will keep the description of the issue short. Tagging @slaren, who has shared some nice ideas on this topic; we can discuss them further here and decide on a good strategy to implement them.

@ggerganov added the refactoring label on Jun 25, 2023
@slaren
Collaborator

slaren commented Jun 25, 2023

The main goal is to reduce the memory usage of intermediate results, and additionally to also simplify the overall memory management in ggml.

By default, tensors are allocated immediately in the context memory and are never freed until the context is deleted. This is great for constants such as the model weights and input parameters, but not so great for the intermediate results of computations. Scratch buffers can alleviate this somewhat, but they are not optimal and require a lot of tedious, error-prone work from the user.

The proposed solution would be to delay memory allocation of intermediate results until the graph has been completely built. Once we have the graph, we can determine the point at which each tensor is no longer needed and can be freed. In principle, we could use only strictly as much memory as is necessary in this way, but in practice it is likely that there will still be some memory lost to fragmentation.

Tensors such as the weights and input parameters would still need to be allocated immediately so that the user can modify their values. This could be done by having two types of ggml_context, an "immediate context" that would behave similarly to the way it works currently, and a different type for intermediate results (let's call it a "compute context"). Tensors created in a compute context wouldn't have their memory allocated until the graph is built, and the user wouldn't be able to access their data directly (1).

To determine how much memory has to be allocated for intermediate results, we would have a function that would take a graph and return the required size of the compute buffer. Initialization of the compute buffers could typically be done by creating a graph of maximum batch size and running it through this function. This is something that can be done very quickly, and I would expect that the entire process would take less than one millisecond (2).

To make this work, all the dependencies between tensors would need to be represented in the graph. Currently, sometimes this is avoided by calling ggml_build_forward_expand. This would not be allowed anymore, and instead a dependency would need to be introduced in the graph, either directly or through a new operation that could be called ggml_dependency. A call to ggml_dependency(a, b) would return a unmodified, but would add a dependency to b in the graph. It is essentially a view that also adds a dependency. For example, in llama.cpp instead of:

// store key and value to memory
// k and v are views of kv_self.k and kv_self.v
ggml_build_forward_expand(&gf, ggml_cpy(ctx0, Kcur, k));
ggml_build_forward_expand(&gf, ggml_cpy(ctx0, Vcur, v));
// code below uses kv_self.k and kv_self.v

it would be done such as:

k = ggml_dependency(ctx, kv_self.k, ggml_cpy(ctx0, Kcur, k));
v = ggml_dependency(ctx, kv_self.v, ggml_cpy(ctx0, Vcur, v));
// code below uses k and v

We could also avoid this and other copies entirely if we had a way to specify the output tensor of the operation. For example, we could add a function ggml_set_output that could be used to manually specify the address of tensors allocated in an intermediate buffer:

struct ggml_tensor * Kcur = ggml_rope(ctx, ...);
struct ggml_tensor * k = ggml_view_1d(ctx, kv_self.k, ...);
ggml_set_output(Kcur, k);
// Kcur will no longer be allocated in the intermediate buffer, instead,
// the result will be stored in the same memory as k, and the ggml_cpy is no longer necessary

(1): At least, not until the graph has been executed. Even then, it would not be guaranteed that their memory has not been overwritten unless they are the last operation in the graph. A better solution would be to allocate output tensors in an immediate context and copy the result to them, or use ggml_set_output as suggested here to avoid the copy entirely.
(2): ggml_build_forward is slow currently due to the way it avoids re-visiting the same nodes multiple times, but we have a fix for that already.

@thriverLio

I don't think reusing memory will give any performance improvement; on the contrary, it will make the memory harder to manage.
Modern operating systems already provide a memory management system, so that job should be left to the OS. ggml should focus on calculation performance, not memory.
Otherwise it will be a big obstacle for others using the library.

@slaren
Collaborator

slaren commented Jun 30, 2023

An alternative to the ggml_set_output described above would be to do an optimization pass on the graph and remove unnecessary ggml_cpy ops by setting the destination of the parent op to the destination of the copy. This should be fairly simple to implement, and would make the library easier to use.

Similarly, ops could be made in-place automatically when possible. The _inplace versions of ggml ops could then be removed.

I would also prefer if ggml_dependency wasn't necessary, but I think that would require deeper changes to ggml that would be better to avoid for now. We can revisit that in the future, but for now requiring ggml_dependency instead of ggml_build_forward_expand shouldn't be too bad.
