
why add 512 : ggml_backend_alloc_buffer(backend_kv, memory_size + 512*2); #637

Closed
EveningLin opened this issue Dec 6, 2023 · 4 comments

@EveningLin

[two screenshots of the code in question, including the call ggml_backend_alloc_buffer(backend_kv, memory_size + 512*2)]

I don't understand why 512 needs to be added here; it looks like a routine operation. May I ask what it is for?

@ggerganov
Owner

ggerganov commented Dec 6, 2023

Edit: ignore this comment, I was confused. See next comment by @slaren


The ggml_context contains the "meta" information for each tensor - this is the ggml_tensor struct. It can also optionally contain the tensor data, if the ggml_context is created with .no_alloc = false.

The 512 was the old way to account for the size overhead of the meta information.
Today, you should use the dedicated ggml_tensor_overhead() and ggml_graph_overhead() functions for that. Here is an example:

// since we are using ggml-alloc, this buffer only needs enough space to hold the ggml_tensor and ggml_cgraph structs, but not the tensor data
static size_t buf_size = ggml_tensor_overhead()*GPT2_MAX_NODES + ggml_graph_overhead_custom(GPT2_MAX_NODES, false);
static std::vector<uint8_t> buf(buf_size);

The code in the first screenshot that you posted should be rewritten like this:

// old
model.buffer_kv = ggml_backend_alloc_buffer(backend_kv, memory_size + 512*2);

// new
model.buffer_kv = ggml_backend_alloc_buffer(backend_kv, memory_size + 2*ggml_tensor_overhead());

The backend_kv context contains 2 tensors:

  • the total data size of the 2 tensors is memory_size
  • the overhead from their meta information is 2*ggml_tensor_overhead()

In some cases, you can create the ggml_context with .no_alloc = true. In that case, the ggml_context will contain only the meta information, without the tensor data, so the buffer size in the example above would be just:

model.buffer_kv = ggml_backend_alloc_buffer(backend_kv, 2*ggml_tensor_overhead());

@slaren
Collaborator

slaren commented Dec 6, 2023

I think there is some confusion here. The 512 is added to account for any possible padding and alignment overhead that the backend may require. The way to calculate the exact memory requirements can be seen in ggml_backend_alloc_ctx_tensors_from_buft:

ggml/src/ggml-alloc.c (lines 766 to 777 at commit 3f66942):

ggml_backend_buffer_t ggml_backend_alloc_ctx_tensors_from_buft(struct ggml_context * ctx, ggml_backend_buffer_type_t buft) {
    GGML_ASSERT(ggml_get_no_alloc(ctx) == true);

    size_t alignment = ggml_backend_buft_get_alignment(buft);

    size_t nbytes = 0;
    for (struct ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL; t = ggml_get_next_tensor(ctx, t)) {
        if (t->data == NULL && t->view_src == NULL) {
            nbytes += GGML_PAD(ggml_backend_buft_get_alloc_size(buft, t), alignment);
        }
    }
    // ... (the rest of the function allocates a buffer of nbytes and places the tensors into it)

It was added recently and not every example has been updated to use it, but generally, if you have a fixed list of tensors that you need to allocate into a backend buffer, using this function is the preferred way to allocate them (when using ggml-backend).

The backend buffers can only be used to store the tensor data, not the tensor metadata, so it is never necessary to add any ggml_tensor_overhead to the size of the buffers.

@EveningLin
Author

Thank all of you for answering my questions! I am also confused haha.
Is model.buffer_kv = ggml_backend_alloc_buffer(backend_kv, 2*ggml_tensor_overhead()); an incorrect way of expressing it?
Can 512 be replaced by ggml_tensor_overhead?
Is 512 just an estimate of the memory used to handle alignment and other issues?

@slaren
Collaborator

slaren commented Dec 7, 2023

You should only use ggml_tensor_overhead in the context buffers, which is where the tensor metadata is allocated, not in the backend buffers that are only used for the tensor data.

512 is just an estimate of the memory used to solve alignment and other issues??

Yes, this is correct. If possible, use ggml_backend_alloc_ctx_tensors_from_buft or ggml_backend_alloc_ctx_tensors instead of calculating the buffer size yourself.

@slaren slaren closed this as completed Jan 29, 2024