
[wip] ggml-backend v2 : add ggml_backend_sched #586

Merged: 21 commits merged into master on Oct 30, 2023

Conversation

slaren (Collaborator) commented Oct 18, 2023

Adds support for computing graphs using multiple backends through ggml_backend_sched.

Currently, this allows partially offloading models to the GPU, and establishes the framework necessary for more advanced uses, such as:

  • Automatic fallback to the CPU backend for ops unimplemented in the GPU backends
  • Using any combination of backends, for example some layers could be offloaded with CUDA and others with OpenCL
  • Using multiple CUDA or OpenCL devices by splitting the computation at a layer or tensor level

Basic functionality should be working now, but still needs some work, especially in the graph splitting logic.

Fixes #549 #567
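
For orientation, a minimal usage sketch pieced together from the examples touched in this PR; the compute entry point name (ggml_backend_sched_graph_compute) and the build_graph helper are assumptions for illustration, not code from this PR:

#include "ggml.h"
#include "ggml-backend.h"

// sketch only: build_graph() stands in for whatever builds the model's compute graph
extern struct ggml_cgraph * build_graph();

void run_with_sched(ggml_backend_t backend_gpu, ggml_backend_t backend_cpu) {
    // the CPU backend goes last so it can act as the fallback
    ggml_backend_t backends[2] = { backend_gpu, backend_cpu };
    ggml_backend_sched_t sched = ggml_backend_sched_new(backends, 2);

    // measure once with a worst-case graph so the scheduler can size its compute buffers
    ggml_backend_sched_init_measure(sched, build_graph());

    // per evaluation: the scheduler splits the graph, assigns each split to a backend,
    // and copies tensors between backends where the splits meet
    ggml_backend_sched_graph_compute(sched, build_graph()); // assumed entry point name
}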

@slaren slaren changed the title from "ggml-backend v2 : add ggml_backend_sched" to "[wip] ggml-backend v2 : add ggml_backend_sched" on Oct 18, 2023
@ggerganov ggerganov self-requested a review October 18, 2023 21:09
@slaren slaren force-pushed the ggml-backend-v2 branch 2 times, most recently from 0f80411 to 4b319cd on October 18, 2023 21:43
slaren (Collaborator, Author) commented Oct 19, 2023

Currently, ggml_backend_sched allocates a graph for every backend split and copies the range of operations corresponding to the split from the original graph. This is not only inefficient, but also wastes a large amount of memory due to the size of ggml_cgraph: ggml_backend_sched objects are 160 MB because of this.

I can see two ways to address this:

  • Modify the graph_compute functions to take the indices of a range of nodes to evaluate
  • Modify ggml_cgraph for dynamic allocation, then we could reuse the nodes pointer from the original graph

The second option is probably the right one, but current code would need to be updated to use ggml_new_graph or similar. We would also need to decide on the implementation of a dynamically sized ggml_cgraph, which is not very clear at the moment.
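
To make the second option concrete, an illustrative sketch (the names below are hypothetical, not this PR's code): a split would only reference a range of the parent graph's nodes pointer instead of copying it.

#include "ggml.h"

// illustrative only: a split that borrows the parent graph's node array
struct sched_split {
    struct ggml_cgraph * parent; // the original, dynamically allocated graph
    int i_start;                 // index of the first node in this split
    int i_end;                   // one past the last node in this split
};

// a backend would then evaluate parent->nodes[i_start .. i_end - 1] directly,
// so no per-split ggml_cgraph copy (and no 160 MB scheduler object) is needed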

ggerganov (Owner) commented:

Let's discuss how to implement the second option - I've sketched some rough notes here: #567 (comment)

include/ggml/ggml.h (outdated review comments, resolved)
src/ggml.c (outdated review comments, resolved)
@slaren slaren marked this pull request as ready for review October 21, 2023 23:36
@slaren slaren linked an issue Oct 21, 2023 that may be closed by this pull request
examples/gpt-2/main.cpp, comment on lines +958 to +966
// initialize the scheduler
sched = ggml_backend_sched_new(model.backends.data(), model.backends.size());

// create the worst case graph for memory usage estimation
int n_tokens = std::min(model.hparams.n_ctx, params.n_batch);
int n_past = model.hparams.n_ctx - n_tokens;
-struct ggml_cgraph * gf = gpt2_graph(model, allocr, n_past, std::vector<gpt_vocab::id>(n_tokens, 0));
+struct ggml_cgraph * gf = gpt2_graph(model, n_past, std::vector<gpt_vocab::id>(n_tokens, 0));

// compute the required memory
-size_t mem_size = ggml_allocr_alloc_graph(allocr, gf);
+ggml_backend_sched_init_measure(sched, gf);
ggerganov (Owner) commented:
This also works, correct? Just for my understanding.

diff --git a/examples/gpt-2/main.cpp b/examples/gpt-2/main.cpp
index 1a24925..a62a2b9 100644
--- a/examples/gpt-2/main.cpp
+++ b/examples/gpt-2/main.cpp
@@ -956,13 +956,13 @@ int main(int argc, char ** argv) {
     ggml_backend_sched_t sched;
     {
         // initialize the scheduler
-        sched = ggml_backend_sched_new(model.backends.data(), model.backends.size());
-
         // create the worst case graph for memory usage estimation
         int n_tokens = std::min(model.hparams.n_ctx, params.n_batch);
         int n_past = model.hparams.n_ctx - n_tokens;
         struct ggml_cgraph * gf = gpt2_graph(model, n_past, std::vector<gpt_vocab::id>(n_tokens, 0));
 
+        sched = ggml_backend_sched_new(model.backends.data(), model.backends.size());
+
         ggml_backend_sched_init_measure(sched, gf);
 

slaren (Collaborator, Author) commented Oct 30, 2023

Yes, that should work since it pre-allocates the inputs from a different buffer, so the gpt2_graph function doesn't need the sched object.

However, the previous pattern of allocating the inputs manually by calling ggml_allocr_alloc is still supported by using ggml_backend_sched_get_tallocr to obtain the allocators for each backend. In that case, the sched object would need to be created before the graph.
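
A rough sketch of that older pattern, under the assumption that ggml_backend_sched_get_tallocr returns an allocator that plugs directly into ggml_allocr_alloc; model, backend_gpu, ctx0 and n_tokens stand in for the surrounding gpt-2 example code:

// sketch only: create the scheduler first, then allocate graph inputs manually
sched = ggml_backend_sched_new(model.backends.data(), model.backends.size());

// assumed: the scheduler exposes one allocator per backend, keyed by the backend handle
auto alloc_gpu = ggml_backend_sched_get_tallocr(sched, backend_gpu);

// allocate an input tensor in that backend's buffer before building the rest of the graph
struct ggml_tensor * embd = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, n_tokens);
ggml_allocr_alloc(alloc_gpu, embd);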

// The backend scheduler allows for multiple backends to be used together
// Handles compute buffer allocation, assignment of tensors to backends, and copying of tensors between backends
// The backends are selected based on:
// - the backend that supports the operation
ggerganov (Owner) commented:

  • the backend that supports the operation

Can you explain how this part works? How does the scheduler know which ops are supported?

slaren (Collaborator, Author) commented:

This isn't implemented yet (this is the automatic fallback to the CPU backend), but it will do this by calling the supports_op function of the backend interface.
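
As a sketch of that idea (treating ggml_backend_supports_op as a public wrapper over the interface's supports_op is an assumption, and the real scheduler internals will look different):

#include "ggml-backend.h"

// sketch only: pick the first backend that reports support for a node, falling back
// to the last backend in the list (assumed to be the CPU backend, which supports every op)
static ggml_backend_t pick_backend(ggml_backend_t * backends, int n_backends,
                                   const struct ggml_tensor * node) {
    for (int i = 0; i < n_backends; i++) {
        if (ggml_backend_supports_op(backends[i], node)) { // assumed wrapper over supports_op
            return backends[i];
        }
    }
    return backends[n_backends - 1];
}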

ggerganov (Owner) commented:

Wow, this looks very powerful. I'm still going through the code, but overall I really like what I see.

Using multiple CUDA or OpenCL devices by splitting the computation at a layer or tensor level

I understand the multi-device layer split, but how do you imagine tensor-level splits would work?

I see that #549 and #567 are resolved, which is great. What is the next thing that should be resolved for the backend interface to function better / be easier to use?

Shall we look into merging and using this in llama.cpp soon?

slaren (Collaborator, Author) commented Oct 30, 2023

I understand the multi-device layer split, but how do you imagine tensor-level splits would work?

I think that we are talking about the same thing here. What I mean by "tensor level" is that each tensor can be individually assigned to a different backend or device. What I think you mean here is what I would call splitting at the row level, which is what ggml-cuda does currently. I don't think that this can be realistically done with a generic backend interface because the synchronization needs to be implemented at a lower level than what ggml-backend supports.

Shall we look into merging and using this in llama.cpp soon?

It shouldn't take long, but there are still a few steps missing before this can be used in llama.cpp, at least:

  • Automatic fallback to CPU support
  • ggml-opencl support
  • Improved graph splitting logic
  • Support for the current row-level multi GPU implementation of ggml-cuda in ggml-backend (note this will be specific to ggml-cuda, not generic for any backend)

I would also like to implement a backend registry so that applications can query the list of available backends in a generic way. Ideally, this would allow building llama.cpp with both CUDA and OpenCL, checking availability at run time, and even mixing CUDA and OpenCL devices.
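
Purely as an illustration of what such a registry could look like from the application side; this API does not exist yet, and every registry name below is hypothetical:

#include <cstdio>
#include "ggml-backend.h"

// hypothetical registry interface, sketched for illustration only
void list_backends() {
    int n = backend_registry_count();                       // number of backends compiled in
    for (int i = 0; i < n; i++) {
        const char * name      = backend_registry_name(i);  // e.g. "CPU", "CUDA0", "OpenCL"
        ggml_backend_t backend = backend_registry_init(i);  // NULL if the device is unavailable
        if (backend != NULL) {
            printf("backend available: %s\n", name);
        }
    }
}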

ggerganov (Owner) commented:

The refactoring in ggerganov/llama.cpp#3837 should still be helpful, correct? At least it moves the CUDA specifics outside the build functions, which is in line with this interface. The cb() stuff and name lookups can easily be removed when we no longer need them.

slaren (Collaborator, Author) commented Oct 30, 2023

I think most of it will still be useful. Ideally, the graph splitting logic should be good enough that we won't need to assign tensors to a specific backend (I think that's possible to implement, but it isn't quite there yet; the splits it generates are not optimal). If necessary, it is still possible to manually assign tensors to a specific backend with ggml_backend_sched_set_node_backend.
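
For example, something along these lines (the exact signature of ggml_backend_sched_set_node_backend is assumed here):

// sketch only: pin a node to a chosen backend instead of relying on the split logic;
// "cur" is some tensor in the graph, "backend_gpu" the backend it should run on
ggml_backend_sched_set_node_backend(sched, cur, backend_gpu);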

@ggerganov ggerganov merged commit 08d748b into master Oct 30, 2023
4 checks passed
@slaren slaren deleted the ggml-backend-v2 branch October 31, 2023 16:10
ggerganov (Owner) commented:

@slaren Planning to do one more sync between ggml and llama.cpp. We don't expect any issues with the new dynamic graphs, correct? I'll just apply the same changes as in this PR

slaren (Collaborator, Author) commented Nov 1, 2023

The training examples will need some changes, at least:

  • Using ggml_new_graph_custom to allocate larger than default graphs and to allocate grads
  • Using ggml_graph_cpy or ggml_graph_dup when copying graphs
  • Setting the graph size in ggml_opt_params

It will also require some testing to make sure that the changes to ggml-alloc don't break anything, but I don't expect any issues there.
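
A rough sketch of the first two items; ggml_new_graph_custom, ggml_graph_cpy and ggml_graph_dup are the names discussed in this PR, while the size multiplier and the destination graph gb are placeholders (the ggml_opt_params graph size field is not shown):

// sketch only: allocate a larger-than-default graph with gradients enabled
struct ggml_cgraph * gf = ggml_new_graph_custom(ctx, GGML_DEFAULT_GRAPH_SIZE * 4, /*grads =*/ true);

// copy into an existing graph of at least the same size, or duplicate into a context
ggml_graph_cpy(gf, gb);                               // gb: a previously created destination graph
struct ggml_cgraph * gb2 = ggml_graph_dup(ctx, gf);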

ggerganov (Owner) commented:

I think only the graph copying in the training examples is left to update in the sync PR, but I'm not sure what it refers to.
Can you point me to it?

slaren (Collaborator, Author) commented Nov 2, 2023


Successfully merging this pull request may close these issues.

  • ggml : remove GGML_MAX_NODES limit
  • ggml : expose hash table API from ggml.c and reuse in ggml-alloc