
Allow prechecking how much memory tensor creation will require #1349

Closed

Conversation

@KerfuffleV2 (Collaborator) commented May 7, 2023

Add ggml_tensor_required_memory function to calculate how much memory creating a tensor will require.

Also add ggml_used_scratch_mem function so applications can determine how much scratch memory was used. This trivial function will just return 0 if there's no current scratch buffer.

ggml_tensor_required_memory takes the same arguments as ggml_new_tensor_impl with the addition of two size_t pointers (required_ctx and required_scratch). The logic is basically a cut-and-paste from ggml_new_tensor_impl: it calculates how much scratch and context memory is required and increments the supplied size_t pointers by that amount, then returns true/false depending on whether those values fit in the context/scratch memory.

The reason it increments rather than just setting the required_ctx/_scratch arguments is to allow checking that multiple tensors will fit in advance of actually trying to create them. This could be considered a partial replacement for #1325 since memory limits are probably the most common reason for GGML to bail out during graph construction.
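For illustration, here is a minimal usage sketch. The prototype is only my reading of the description above (the same parameters as ggml_new_tensor_impl plus the two size_t pointers), and it assumes an existing ggml_context * ctx:

    size_t need_ctx     = 0;
    size_t need_scratch = 0;
    const int64_t ne_a[2] = { 1024, 1024 };
    const int64_t ne_b[1] = { 1024 };

    // Each call adds its requirement to need_ctx/need_scratch and reports
    // whether the running totals still fit, so several prospective tensors
    // can be checked before any of them is actually created.
    bool fits =
        ggml_tensor_required_memory(ctx, GGML_TYPE_F32, 2, ne_a, NULL,
                                    &need_ctx, &need_scratch) &&
        ggml_tensor_required_memory(ctx, GGML_TYPE_F32, 1, ne_b, NULL,
                                    &need_ctx, &need_scratch);

    if (!fits) {
        // report the failure to the caller instead of letting tensor
        // creation abort the whole process
    }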

It is also something that could help #1325 recover in the case of operations that create multiple tensors, like the map ops.

I converted ggml_new_tensor_impl to use the new function for its memory validation. Right now I think it's not taking full advantage of the change and recalculates some stuff.

Note: This isn't quite a draft, but it is lightly tested. It compiles without warnings for me and llama.cpp built on it can load/run inference on models. If this is something that would get merged, it would be a good idea to have better C programmers than me look at it closely.

@KerfuffleV2 added the enhancement label on May 7, 2023
@KerfuffleV2 (Collaborator, Author)

@ggerganov If you have a moment, I could use some direction here and in #1325.

Right now there are merge conflicts since quite a bit of time has passed, but it's not clear if these pulls are something you'd ever want to merge, whether you want me to change my approach, etc.

If I'm on the right track, of course I would be happy to fix them and get these pulls into a mergeable state. But right now it's not clear whether that would be wasted effort.

@ggerganov (Owner)

I don't see how the proposed function will be used in practice.
Let's say I have:

c = ggml_op(ctx, a, b);

How do I use ggml_tensor_required_memory() to calculate in advance the memory needed for this operation?

I still think there is a better solution by using the no_alloc parameter of the ggml_context:

bool no_alloc; // don't allocate memory for the tensor data

The idea is that you run the eval function once with no_alloc == true, without calling ggml_graph_compute() at the end, to record how much memory is needed. After that you run the same eval function, but this time with no_alloc == false.
I guess we might need to add some extra size counters to the ggml_context to keep track of the required memory during the first pass, but that should be a simple change.
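Roughly, the pattern would look like this (a sketch only: build_eval_graph() stands in for the application's graph-building code, and as noted the measurement pass would still need those extra counters, since ggml_used_mem() alone doesn't cover the skipped data allocations):

    // Pass 1: measure. With no_alloc == true only tensor metadata is
    // allocated, and ggml_graph_compute() is not called.
    struct ggml_init_params measure_params = {
        /*.mem_size   =*/ 16*1024*1024,   // small buffer, metadata only
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,
    };
    struct ggml_context * ctx0 = ggml_init(measure_params);
    build_eval_graph(ctx0);
    size_t mem_needed = ggml_used_mem(ctx0);  // plus the extra size counters
    ggml_free(ctx0);

    // Pass 2: run the same eval function for real, with a buffer sized
    // from the first pass and no_alloc == false.
    struct ggml_init_params eval_params = {
        /*.mem_size   =*/ mem_needed,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx1 = ggml_init(eval_params);
    struct ggml_cgraph * gf = build_eval_graph(ctx1);
    ggml_graph_compute(ctx1, gf);
    ggml_free(ctx1);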

@KerfuffleV2 (Collaborator, Author) commented May 21, 2023

Thanks for the reply!

I don't see how the proposed function will be used in practice.

It's not really something that's aimed at individual developers consuming the GGML library directly.

To use it, you have to know fairly internal details about what the dimensions of the tensor for that op will be, whether it creates other sub-tensors, etc.

The main use case I was thinking of is for #1325:

Right now there are two factors that make actually recovering consistently impractical:

  1. It's impossible to free a tensor after it's created without freeing the entire context.
  2. Some tensor ops create other tensors behind the scenes. For example, my own ggml_map_binary_f32 operation: it needs to create another tensor to hold the function pointer.

Because of point 2, if the second tensor creation fails there is just no way to recover without freeing the whole context and starting again.
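Roughly what that looks like inside such an op (an illustrative sketch, not the actual ggml source):

    // The map op needs a second, hidden tensor just to carry the callback
    // pointer. If the context runs out of memory on this second allocation,
    // `result` has already been carved out of the context and there is no
    // way to free it individually.
    struct ggml_tensor * result = ggml_dup_tensor(ctx, a);
    struct ggml_tensor * addr   = ggml_new_tensor_1d(ctx, GGML_TYPE_I32,
                                      sizeof(void *) / sizeof(int32_t));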

So the place where it would be useful immediately is for internal use, assuming the recoverable errors pull is something that would actually get merged.

It is also something that could be useful for larger projects like language bindings or wrappers for GGML. A binding can't require its consumers to go through the no_alloc song and dance. Crashing the entire application for a recoverable error that originates in some library is very unexpected behavior in that sort of situation, especially for languages where correctness/safety is a big selling point, like Rust: someone using a Rust crate is not going to expect a call into an API with a recoverable error to bring their entire application down.

That also would be the case for high level languages like Python. Imagine you run a Python script to parse some JSON file and because of a typo in the file it actually calls abort() and kills the whole program without the chance for recovery. Someone using a Python package wouldn't expect (or want) that behavior.

Anyway, that's my case for allowing errors to be recoverable in general. I want to be clear that I'm not talking about situations where one of GGML's internal invariants was violated (implying a bug in GGML). In that case, it makes perfect sense to abort() and force the problem to be dealt with promptly.

edit: In case it wasn't clear, although someone writing a binding wouldn't really want to have to dig through the GGML source code to find internal details like whether an op creates sub-tensors and then ensure that they fit, ggml_tensor_required_memory would make it possible.

@KerfuffleV2 (Collaborator, Author)

Closing this, at least for now, as it doesn't seem like there's interest.

@ggerganov (Owner)

Apologies for ignoring these - the proposed changes require more thought from my side and are mainly relevant for 3rd-party software integration. This is currently low priority, but will become important in the future.

Will eventually get back to this after finishing some pending core features.

@KerfuffleV2 (Collaborator, Author)

No problem (and no passive-aggressive motivation for closing this; it just doesn't make much sense for one person to keep pushing for it). I've also been looking at alternative ways to solve the problem in my own project without needing these changes.
