
ggml-alloc v3 #727

Merged — 6 commits merged into master from sl/alloc-v3 on Feb 11, 2024
Conversation

@slaren (Collaborator) commented Feb 9, 2024

Overview of the changes

Graph allocator

  • Measure allocators have been removed
  • The graph allocator works in two steps:
    • Reserve (ggml_gallocr_reserve): calculates the offsets within the buffer at which all the tensors in the graph will be allocated
    • Allocate (ggml_gallocr_alloc_graph): allocates the tensors using the list of offsets generated in the reserve step
  • The reserve step is done automatically when the graph topology changes or the tensor sizes increase
  • It is not necessary to call ggml_gallocr_reserve manually; however, doing so with a worst-case graph will avoid buffer reallocations later
  • Unlike the measure graphs in the previous version, the graphs used to reserve are not modified, and can be used directly with ggml_gallocr_alloc_graph. When only one graph needs to be evaluated, there is no need to create a different copy for measure.
  • Graph allocation can no longer fail due to running out of space in the buffers (though the buffer allocation itself may still fail)
  • The buffers are private to the graph allocator and cannot be accessed directly
  • It is no longer possible to allocate tensors manually. Instead, inputs must be flagged with ggml_set_input, and set after the graph has been allocated. Setting the input flag will ensure that the tensors are not overwritten before they are used in the graph.
  • It is possible to set a tensor as an output with ggml_set_output. This will ensure that the outputs are never overwritten, removing the need for hacks such as adding a dummy dependency at the end of the graph. A usage sketch of the new workflow follows this list.
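
A minimal end-to-end sketch of the two-step workflow described above, assuming a CPU backend. The tensor shapes, sizes, and variable names are illustrative and not taken from the PR, error checking is omitted, and the include paths may differ depending on how ggml is vendored:

```c
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

int main(void) {
    ggml_backend_t backend = ggml_backend_cpu_init();

    // metadata-only context: the tensor data is placed by the graph allocator
    struct ggml_init_params params = {
        /*.mem_size   =*/ ggml_tensor_overhead() * 8 + ggml_graph_overhead(),
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * w = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 4);
    struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    ggml_set_input(w);   // inputs are flagged so they are not overwritten before use
    ggml_set_input(x);

    struct ggml_tensor * y = ggml_mul_mat(ctx, w, x);
    ggml_set_output(y);  // outputs are flagged so they are never overwritten

    struct ggml_cgraph * graph = ggml_new_graph(ctx);
    ggml_build_forward_expand(graph, y);

    // reserve: compute the tensor offsets for this (worst-case) graph;
    // optional, ggml_gallocr_alloc_graph reserves on demand when needed
    ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
    ggml_gallocr_reserve(galloc, graph);

    // allocate: place the graph tensors at the precomputed offsets
    // inside the allocator's private buffer
    ggml_gallocr_alloc_graph(galloc, graph);

    // inputs are set only after the graph has been allocated
    float w_data[16] = {0};
    float x_data[4]  = {0};
    ggml_backend_tensor_set(w, w_data, 0, sizeof(w_data));
    ggml_backend_tensor_set(x, x_data, 0, sizeof(x_data));

    ggml_backend_graph_compute(backend, graph);

    ggml_gallocr_free(galloc);
    ggml_free(ctx);
    ggml_backend_free(backend);
    return 0;
}
```

Since the reserve step does not modify the graph, the same graph object can be reserved once with worst-case shapes and then reused for allocation on every evaluation.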

Tensor allocator

  • There is still a ggml_tallocr that can be used to allocate tensors, but it has been reworked
  • This is now a very lightweight allocator that cannot free tensors, and its only state is a buffer and the current offset within the buffer
  • Applications should use ggml_backend_alloc_ctx_tensors whenever possible, since it handles all the details of tensor allocation, including splitting the tensors across multiple buffers if necessary (see the sketch after this list); ggml_tallocr can still be used for more advanced cases
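
A minimal sketch of the recommended path using ggml_backend_alloc_ctx_tensors, again assuming a CPU backend; the tensor names and shapes are hypothetical:

```c
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

int main(void) {
    ggml_backend_t backend = ggml_backend_cpu_init();

    // metadata-only context: the actual data lives in the backend buffer below
    struct ggml_init_params params = {
        /*.mem_size   =*/ ggml_tensor_overhead() * 2,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * weight = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 1024, 1024);
    struct ggml_tensor * bias   = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1024);

    // allocates every tensor created in ctx into one (or more) backend buffers
    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);

    // ... load the weights with ggml_backend_tensor_set, build graphs, etc. ...
    (void) weight; (void) bias;

    ggml_backend_buffer_free(buf);
    ggml_free(ctx);
    ggml_backend_free(backend);
    return 0;
}
```

For more advanced cases, the reworked ggml_tallocr can still be used to place tensors into a caller-provided buffer manually, keeping only the buffer and the current offset as state.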

Other

  • Renamed gpt-2 ggml_backend_sched example target to gpt-2-sched (was gpt-2-backend2), source file to main-sched.cpp (was main.cpp).

@slaren (Collaborator, Author) commented Feb 9, 2024

@ggerganov There are some changes here from llama.cpp; I will rebase after the next sync

@ggerganov (Owner)

Ok, will sync tomorrow morning

@YavorGIvanov (Collaborator)

> It is possible to set a tensor as an output with ggml_set_output. This will ensure that the outputs are never overwritten, removing the need for hacks such as adding a dummy dependency at the end of the graph.

That will be very useful. Great. I have macro-guarded hacks in all backends in order to do this easily :D

@slaren force-pushed the sl/alloc-v3 branch 3 times, most recently from 6efa534 to 8005421 on February 10, 2024 01:09
@ggerganov (Owner)

Should be OK to rebase now

@slaren marked this pull request as ready for review on February 10, 2024 12:41
@slaren (Collaborator, Author) commented Feb 10, 2024

Thank you! Other than some cleanup and removing some prints, this should be good to review. I have also updated whisper.cpp and made a few more changes to it, such as using ggml_backend_alloc_ctx_tensors.

@slaren (Collaborator, Author) commented Feb 10, 2024

I am not sure why the mpt test in the ggml CI is failing; it works for me locally, and it shouldn't be affected by these changes. From the logs, I suspect that something is failing during the model conversion.

@ggerganov (Owner) commented Feb 10, 2024

It needs some Python module:

https://github.com/ggml-org/ci/blob/2e349ee53c4f858b48b8f8e222ad0e46f118928e/ggml/16/ac25c2fa308202c32927d131b33265f5588bc3/ggml-3-arm64-cpu/stdall#L3042C1-L3044C46

Nevermind, let's remove it #728

examples/whisper/whisper.cpp — review comment (outdated, resolved)
@ggerganov (Owner) left a review comment


Nice improvements and simpler API 👍

Merge at will

include/ggml/ggml.h — review comment (outdated, resolved)
@slaren (Collaborator, Author) commented Feb 11, 2024

Will merge after CI.

@ggerganov what would be the best way to sync these changes into llama.cpp? I am thinking that either you could open a sync PR and I would add the necessary llama.cpp changes there, or I could open a new PR that includes all the changes here.

@ggerganov merged commit 5070f07 into master on Feb 11, 2024 (10 checks passed)
@ggerganov deleted the sl/alloc-v3 branch on February 11, 2024 12:38
@ggerganov (Owner)

I'll open a sync PR in llama.cpp now
