[wip] ggml-backend v2 : add ggml_backend_sched #586
Conversation
Force-pushed from 0f80411 to 4b319cd
Currently, I can see two ways to address this:

The second option is probably the right one, but the current code would need to be updated to use it.

Let's discuss how to implement the second option - I've sketched some rough notes here: #567 (comment)
Force-pushed from 2257cf7 to 719f08d
add ggml_opt_params::graph_size; add ggml_new_graph_custom, ggml_graph_overhead_custom; add ggml_graph_clear
Force-pushed from 96e56b4 to 2105a49
```diff
+    // initialize the scheduler
+    sched = ggml_backend_sched_new(model.backends.data(), model.backends.size());
+
     // create the worst case graph for memory usage estimation
     int n_tokens = std::min(model.hparams.n_ctx, params.n_batch);
     int n_past = model.hparams.n_ctx - n_tokens;
-    struct ggml_cgraph * gf = gpt2_graph(model, allocr, n_past, std::vector<gpt_vocab::id>(n_tokens, 0));
+    struct ggml_cgraph * gf = gpt2_graph(model, n_past, std::vector<gpt_vocab::id>(n_tokens, 0));

-    // compute the required memory
-    size_t mem_size = ggml_allocr_alloc_graph(allocr, gf);
+    ggml_backend_sched_init_measure(sched, gf);
```
This also works, correct? Just for my understanding.
```diff
diff --git a/examples/gpt-2/main.cpp b/examples/gpt-2/main.cpp
index 1a24925..a62a2b9 100644
--- a/examples/gpt-2/main.cpp
+++ b/examples/gpt-2/main.cpp
@@ -956,13 +956,13 @@ int main(int argc, char ** argv) {
     ggml_backend_sched_t sched;
     {
         // initialize the scheduler
-        sched = ggml_backend_sched_new(model.backends.data(), model.backends.size());
-
         // create the worst case graph for memory usage estimation
         int n_tokens = std::min(model.hparams.n_ctx, params.n_batch);
         int n_past = model.hparams.n_ctx - n_tokens;
         struct ggml_cgraph * gf = gpt2_graph(model, n_past, std::vector<gpt_vocab::id>(n_tokens, 0));
+        sched = ggml_backend_sched_new(model.backends.data(), model.backends.size());
+
         ggml_backend_sched_init_measure(sched, gf);
```
Yes, that should work since it pre-allocates the inputs from a different buffer, so the gpt2_graph function doesn't need the sched object.

However, the previous pattern of allocating the inputs manually by calling ggml_allocr_alloc is still supported: use ggml_backend_sched_get_tallocr to obtain the allocators for each backend. In that case, the sched object would need to be created before the graph.
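For illustration only, here is a minimal sketch of that manual pattern, assuming (as the comment above suggests) that the allocator returned by ggml_backend_sched_get_tallocr can be passed to ggml_allocr_alloc, and that the input tensor was created with no_alloc enabled:

```cpp
// Hedged sketch, not code from this PR: allocate an input tensor manually from
// the scheduler's allocator for a given backend. The exact allocator type is
// assumed; the function names come from the comment above.
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

static void alloc_input_manually(ggml_backend_sched_t sched,
                                 ggml_backend_t       backend_cpu,
                                 struct ggml_tensor * embd /* input token ids */) {
    // with this pattern the sched object must be created before the graph is built
    auto alloc = ggml_backend_sched_get_tallocr(sched, backend_cpu);
    // place the input tensor in that backend's buffer, as before
    ggml_allocr_alloc(alloc, embd);
}
```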
```cpp
// The backend scheduler allows for multiple backends to be used together
// Handles compute buffer allocation, assignment of tensors to backends, and copying of tensors between backends
// The backends are selected based on:
// - the backend that supports the operation
```
> - the backend that supports the operation

Can you explain how this part works? How does the scheduler know which ops are supported?
This isn't implemented yet (this is the automatic fallback to the CPU backend), but it will do this by calling the supports_op function of the backend interface.
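As a hypothetical illustration of that idea (none of this is code from the PR, and the real supports_op hook may look different), the scheduler could probe backends in priority order and fall back to the CPU backend when no other backend reports support:

```cpp
// Hypothetical sketch of op-based backend selection with a CPU fallback.
// The types below are simplified stand-ins, not the real ggml interfaces.
#include <vector>

struct tensor;  // stand-in for ggml_tensor; only used through pointers here

struct backend_iface {
    const char * name;
    bool (*supports_op)(const tensor * op);  // assumed per-backend capability hook
};

// Pick the first backend (listed in priority order) that supports the op.
// The last entry is assumed to be the CPU backend, which supports every op.
static const backend_iface * pick_backend(const std::vector<backend_iface> & backends,
                                          const tensor * op) {
    for (const auto & b : backends) {
        if (b.supports_op(op)) {
            return &b;
        }
    }
    return &backends.back();  // automatic CPU fallback
}
```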
Wow, this looks very powerful. I'm still going through the code, but overall I really like what I see.
I understand the multi-device layer split, but how do you imagine tensor-level splits to work? I see that #549 #567 are resolved, which is great. What is the next thing that should be resolved for the backend interface to function better / be easier to use? Shall we look into merging and using this in llama.cpp?
I think that we are talking about the same thing here. What I mean by "tensor level" is that each tensor can be individually assigned to a different backend or device. What I think you mean here is what I would call splitting at the row level.
It shouldn't take long, but there are still a few steps missing before this can be used in llama.cpp.
I would also like to implement a backend registry so that applications can query the list of available backends in a generic way. Ideally, this would allow building llama.cpp with both CUDA and OpenCL, checking availability at run time, and even mixing CUDA and OpenCL devices.
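To make the idea concrete, here is a purely hypothetical sketch of what such a registry could look like; none of these names exist in ggml at this point, and the eventual design may differ:

```cpp
// Hypothetical backend registry: each backend registers a name and a runtime
// availability probe, and the application enumerates the entries generically.
#include <cstdio>
#include <vector>

struct backend_entry {            // hypothetical type
    const char * name;            // e.g. "CPU", "CUDA", "OpenCL"
    bool (*available)(void);      // runtime probe (e.g. device count > 0)
};

static bool cpu_available(void) { return true;  }  // CPU is always present
static bool gpu_available(void) { return false; }  // placeholder probe

static const std::vector<backend_entry> registry = {
    { "CPU",    cpu_available },
    { "CUDA",   gpu_available },
    { "OpenCL", gpu_available },
};

int main() {
    // the application could then pass every available backend to ggml_backend_sched_new,
    // mixing e.g. CUDA and OpenCL devices if both are usable at run time
    for (const auto & e : registry) {
        std::printf("%-6s : %s\n", e.name, e.available() ? "available" : "not available");
    }
}
```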
The refactoring in ggerganov/llama.cpp#3837 should still be helpful, correct? At least it moves the CUDA specifics outside the build functions, which is in line with this interface.
I think most of it will still be useful. Ideally, the graph splitting logic should be good enough that we won't need to assign tensors to a specific backend (and I think that's possible to implement, but it isn't quite there yet; the splits that it generates are not optimal). If necessary, it is still possible to manually assign tensors to a specific backend.
@slaren Planning to do one more sync between ggml and llama.cpp.
The training examples will need some changes.

It will also require some testing to make sure that the changes don't break anything.
I think only the graph copying in the training examples is left to update in the sync PR, but I'm not sure what it refers to.

To perform a deep copy of a graph, we now need to use https://github.com/ggerganov/llama.cpp/blob/16e819d53ce5bb7025a545bd45b6404c16f3d432/examples/finetune/finetune.cpp#L775
Adds support for computing graphs using multiple backends through ggml_backend_sched. Currently, this allows partially offloading models to the GPU, and establishes the framework necessary for more advanced uses.
Basic functionality should be working now, but still needs some work, especially in the graph splitting logic.
Fixes #549 #567
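For reference, a condensed sketch of the usage pattern from the gpt-2 example above; ggml_backend_sched_graph_compute and ggml_backend_sched_free are assumed names for the compute and cleanup entry points, since only the creation and measuring calls appear in the diffs in this conversation:

```cpp
// Rough usage sketch of ggml_backend_sched, following the gpt-2 example.
#include "ggml.h"
#include "ggml-backend.h"
#include <vector>

static void run(std::vector<ggml_backend_t> & backends,
                struct ggml_cgraph * measure_graph,  // worst-case graph
                struct ggml_cgraph * graph) {        // graph for the current batch
    // create the scheduler over all backends (e.g. GPU first, CPU last)
    ggml_backend_sched_t sched = ggml_backend_sched_new(backends.data(), backends.size());

    // measure step: sizes and allocates the compute buffers for the worst case
    ggml_backend_sched_init_measure(sched, measure_graph);

    // evaluation: split the graph across backends and compute it (assumed API name)
    ggml_backend_sched_graph_compute(sched, graph);

    ggml_backend_sched_free(sched);  // assumed cleanup call
}
```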