[wip] ggml-backend v2 : add ggml_backend_sched #586
Conversation
Force-pushed from 0f80411 to 4b319cd
Currently, I can see two ways to address this:

The second option is probably the right one, but the current code would need to be updated to use it.

Let's discuss how to implement the second option - I've sketched some rough notes here: #567 (comment)
Force-pushed from 2257cf7 to 719f08d
add ggml_opt_params::graph_size; add ggml_new_graph_custom, ggml_graph_overhead_custom; add ggml_graph_clear
Force-pushed from 96e56b4 to 2105a49
```diff
+    // initialize the scheduler
+    sched = ggml_backend_sched_new(model.backends.data(), model.backends.size());
+
     // create the worst case graph for memory usage estimation
     int n_tokens = std::min(model.hparams.n_ctx, params.n_batch);
     int n_past = model.hparams.n_ctx - n_tokens;
-    struct ggml_cgraph * gf = gpt2_graph(model, allocr, n_past, std::vector<gpt_vocab::id>(n_tokens, 0));
+    struct ggml_cgraph * gf = gpt2_graph(model, n_past, std::vector<gpt_vocab::id>(n_tokens, 0));

-    // compute the required memory
-    size_t mem_size = ggml_allocr_alloc_graph(allocr, gf);
+    ggml_backend_sched_init_measure(sched, gf);
```
This also works, correct? Just for my understanding.
```diff
diff --git a/examples/gpt-2/main.cpp b/examples/gpt-2/main.cpp
index 1a24925..a62a2b9 100644
--- a/examples/gpt-2/main.cpp
+++ b/examples/gpt-2/main.cpp
@@ -956,13 +956,13 @@ int main(int argc, char ** argv) {
     ggml_backend_sched_t sched;
     {
         // initialize the scheduler
-        sched = ggml_backend_sched_new(model.backends.data(), model.backends.size());
-
         // create the worst case graph for memory usage estimation
         int n_tokens = std::min(model.hparams.n_ctx, params.n_batch);
         int n_past = model.hparams.n_ctx - n_tokens;
         struct ggml_cgraph * gf = gpt2_graph(model, n_past, std::vector<gpt_vocab::id>(n_tokens, 0));
+        sched = ggml_backend_sched_new(model.backends.data(), model.backends.size());
+
         ggml_backend_sched_init_measure(sched, gf);
```
Yes, that should work since it pre-allocates the inputs from a different buffer, so the gpt2_graph function doesn't need the sched object.

However, the previous pattern of allocating the inputs manually by calling ggml_allocr_alloc is still supported: use ggml_backend_sched_get_tallocr to obtain the allocators for each backend. In that case, the sched object would need to be created before the graph.
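For illustration only, here is a minimal sketch of that manual pattern, assuming (as the comment above suggests) that the allocator returned by ggml_backend_sched_get_tallocr can be passed to ggml_allocr_alloc, and that the input tensor was created with no_alloc enabled:

```cpp
// Hedged sketch, not code from this PR: allocate an input tensor manually from
// the scheduler's allocator for a given backend. The exact allocator type is
// assumed; the function names come from the comment above.
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

static void alloc_input_manually(ggml_backend_sched_t sched,
                                 ggml_backend_t       backend_cpu,
                                 struct ggml_tensor * embd /* input token ids */) {
    // with this pattern the sched object must be created before the graph is built
    auto alloc = ggml_backend_sched_get_tallocr(sched, backend_cpu);
    // place the input tensor in that backend's buffer, as before
    ggml_allocr_alloc(alloc, embd);
}
```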
```cpp
// The backend scheduler allows for multiple backends to be used together
// Handles compute buffer allocation, assignment of tensors to backends, and copying of tensors between backends
// The backends are selected based on:
// - the backend that supports the operation
```
> - the backend that supports the operation

Can you explain how this part works? How does the scheduler know which ops are supported?
This isn't implemented yet (this is the automatic fallback to the CPU backend), but it will do this by calling the supports_op function of the backend interface.
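As a hypothetical illustration of that idea (none of this is code from the PR, and the real supports_op hook may look different), the scheduler could probe backends in priority order and fall back to the CPU backend when no other backend reports support:

```cpp
// Hypothetical sketch of op-based backend selection with a CPU fallback.
// The types below are simplified stand-ins, not the real ggml interfaces.
#include <vector>

struct tensor;  // stand-in for ggml_tensor; only used through pointers here

struct backend_iface {
    const char * name;
    bool (*supports_op)(const tensor * op);  // assumed per-backend capability hook
};

// Pick the first backend (listed in priority order) that supports the op.
// The last entry is assumed to be the CPU backend, which supports every op.
static const backend_iface * pick_backend(const std::vector<backend_iface> & backends,
                                          const tensor * op) {
    for (const auto & b : backends) {
        if (b.supports_op(op)) {
            return &b;
        }
    }
    return &backends.back();  // automatic CPU fallback
}
```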
Wow, this looks very powerful. I'm still going through the code, but overall I really like what I see.
I understand the multi-device layer split, but how do you imagine tensor-level splits to work? I see that #549 #567 are resolved, which is great. What is the next thing that should be resolved for the backend interface to function better / be easier to use? Shall we look into merging and using this in llama.cpp?
I think that we are talking about the same thing here. What I mean by "tensor level" is that each tensor can be individually assigned to a different backend or device. What I think you mean here is what I would call splitting at the row level.
It shouldn't take long, but there are still a few steps missing before this can be used in llama.cpp.
I would also like to implement a backend registry so that applications can query the list of available backends in a generic way. Ideally, this would allow building llama.cpp with both CUDA and OpenCL, checking availability at run time, and even mixing CUDA and OpenCL devices.
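To make the idea concrete, here is a purely hypothetical sketch of what such a registry could look like; none of these names exist in ggml at this point, and the eventual design may differ:

```cpp
// Hypothetical backend registry: each backend registers a name and a runtime
// availability probe, and the application enumerates the entries generically.
#include <cstdio>
#include <vector>

struct backend_entry {            // hypothetical type
    const char * name;            // e.g. "CPU", "CUDA", "OpenCL"
    bool (*available)(void);      // runtime probe (e.g. device count > 0)
};

static bool cpu_available(void) { return true;  }  // CPU is always present
static bool gpu_available(void) { return false; }  // placeholder probe

static const std::vector<backend_entry> registry = {
    { "CPU",    cpu_available },
    { "CUDA",   gpu_available },
    { "OpenCL", gpu_available },
};

int main() {
    // the application could then pass every available backend to ggml_backend_sched_new,
    // mixing e.g. CUDA and OpenCL devices if both are usable at run time
    for (const auto & e : registry) {
        std::printf("%-6s : %s\n", e.name, e.available() ? "available" : "not available");
    }
}
```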
The refactoring in ggerganov/llama.cpp#3837 should still be helpful, correct? At least it moves the CUDA specifics outside the build functions, which is in line with this interface.
I think most of it will still be useful. Ideally, the graph splitting logic should be good enough that we won't need to assign tensors to a specific backend (and I think that's possible to implement, but it isn't quite there yet; the splits that it generates are not optimal). If necessary, it is still possible to manually assign tensors to a specific backend.
@slaren Planning to do one more sync between ggml and llama.cpp.
The training examples will need some changes.

It will also require some testing to make sure that the changes don't break anything.
I think only the graph copying in the training examples is left to update in the sync PR, but I'm not sure what it refers to.

To perform a deep copy of a graph, we now need to use https://github.com/ggerganov/llama.cpp/blob/16e819d53ce5bb7025a545bd45b6404c16f3d432/examples/finetune/finetune.cpp#L775
Adds support for computing graphs using multiple backends through ggml_backend_sched. Currently, this allows partially offloading models to the GPU, and establishes the framework necessary for more advanced uses.
Basic functionality should be working now, but still needs some work, especially in the graph splitting logic.
Fixes #549 #567
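For reference, a condensed sketch of the usage pattern from the gpt-2 example above; ggml_backend_sched_graph_compute and ggml_backend_sched_free are assumed names for the compute and cleanup entry points, since only the creation and measuring calls appear in the diffs in this conversation:

```cpp
// Rough usage sketch of ggml_backend_sched, following the gpt-2 example.
#include "ggml.h"
#include "ggml-backend.h"
#include <vector>

static void run(std::vector<ggml_backend_t> & backends,
                struct ggml_cgraph * measure_graph,  // worst-case graph
                struct ggml_cgraph * graph) {        // graph for the current batch
    // create the scheduler over all backends (e.g. GPU first, CPU last)
    ggml_backend_sched_t sched = ggml_backend_sched_new(backends.data(), backends.size());

    // measure step: sizes and allocates the compute buffers for the worst case
    ggml_backend_sched_init_measure(sched, measure_graph);

    // evaluation: split the graph across backends and compute it (assumed API name)
    ggml_backend_sched_graph_compute(sched, graph);

    ggml_backend_sched_free(sched);  // assumed cleanup call
}
```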