
ggml : change ggml_graph_compute() API to not require context #1999

Merged: 20 commits, Jul 7, 2023

Conversation

@mqy (Collaborator) commented Jun 26, 2023

Attempts to resolve ggerganov/ggml#287.

EDIT: see the latest update at the end.

### Intro

The design is a bit different from the suggested one: the buffer type is named more generally as ggml_cgraph_context.

struct ggml_cgraph_context {
    size_t work_size;
    void * work_data;
    bool   planned; // true means ready to compute graph nodes.
};

I'll explain planned later; first, let's focus on the APIs.

The first parameter ctx of ggml_graph_compute() is deprecated (passing NULL is OK).
Removed wsize and work from ggml_cgraph; this is unlikely to break external users, because there is no reason to use them directly.

To avoid breaking external users, we cannot simply change the signature of ggml_graph_compute(), so I had to add ggml_graph_compute_v2(). The name looks weird, but this is the reality :/ Now ggml_graph_compute() is a thin wrapper around ggml_graph_compute_v2().

Extracted code into a new function ggml_graph_compute_plan(), which sets node->n_tasks and calculates work_size.

Usage

struct ggml_cgraph_context ctx = {0};

// case 1: plan explicitly, allocate the work buffer ourselves, then compute
ggml_graph_compute_plan(&ctx, cgraph);
if (ctx.work_size > 0) {
    ctx.work_data = malloc(ctx.work_size);
}
ggml_graph_compute_v2(&ctx, cgraph);

// case 2: reuse the already-planned context for another compute
ggml_graph_compute_v2(&ctx, cgraph);

// case 3: pass NULL; planning (and any allocation) happens internally
ggml_graph_compute_v2(NULL, cgraph);

// case 4: legacy wrapper, given a struct ggml_context * named ggml_ctx
ggml_graph_compute(ggml_ctx, cgraph);

// case 5: legacy wrapper with a NULL context
ggml_graph_compute(NULL, cgraph);

Why did I add the field planned?

Because:

  • The ctx is allowed to be NULL, empty, or already initialized by ggml_graph_compute_plan().
  • One corner case: work_size and work_data may still hold their default values even after the plan has run, so we cannot decide whether to call ggml_graph_compute_plan() simply by checking for default values.
  • ggml_graph_compute_plan() MUST be called because it also sets node->n_tasks, and work_size depends on n_tasks.

The planned flag makes the plan-compute sequence stateful, which is not great. Any ideas?
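
For illustration, a minimal sketch of how the planned flag gates the sequence (hypothetical internals based on the description above, not the actual patch):

void ggml_graph_compute_v2(struct ggml_cgraph_context * ctx, struct ggml_cgraph * cgraph) {
    struct ggml_cgraph_context local = {0};
    if (ctx == NULL) {
        ctx = &local; // case 3: caller passed NULL, use a temporary context
    }
    if (!ctx->planned) {
        // planning sets node->n_tasks and work_size, then marks the context as planned
        ggml_graph_compute_plan(ctx, cgraph);
        if (ctx->work_size > 0 && ctx->work_data == NULL) {
            ctx->work_data = malloc(ctx->work_size); // ownership of this buffer is part of the problem
        }
    }
    // ... compute the graph nodes using ctx->work_data ...
}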

Update on Jul 3

  1. Backward compatibility is no longer a consideration; this means the plan phase MUST be executed before compute, and ggml_graph_compute() is no longer responsible for creating any kind of buffer. @ggerganov
  2. Removed n_tasks from ggml_tensor; removed n_threads from ggml_cgraph. @slaren
  3. The struct name ggml_graph_compute_plan implies that it should be initialized or created by some procedure. @howard0su

Usage:

// plan first: this sets per-node n_tasks and computes the required work_size
struct ggml_graph_compute_plan plan = ggml_graph_compute_make_plan(gb, params.n_threads);
if (plan.work_size > 0) {
    plan.work_data = malloc(plan.work_size);
    GGML_ASSERT(plan.work_data);
}
ggml_graph_compute(&plan, gb);
if (plan.work_data) {
    free(plan.work_data);
}

I tested main and perplexity.

Update on Jul 6

🤖 Generated by Copilot at 551ed08

Summary

🛠️📚🚀

This pull request improves the performance and usability of the ggml_graph_compute function by introducing a new ggml_cplan structure and a helper function ggml_graph_compute_helper. It also adds support for specifying the number of command buffers for the Metal context, and updates the examples, tests, and workflows to use the new API and settings.

This pull request makes some changes
To ggml_graph_compute and friends
It adds a new plan
And some Metal commands
And fixes some warnings and ends

Walkthrough

  • Change the ggml_graph_compute function to require a ggml_cplan argument, and add new functions ggml_graph_plan and ggml_graph_compute_with_ctx to support the new API (see the sketch after this list)
  • Add a helper function ggml_graph_compute_helper to wrap the logic of creating and using a ggml_cplan structure, and update the example code in ggml.h to use the new API
  • Remove the fields n_threads, n_tasks, and work from the ggml_cgraph and ggml_tensor structures and add them to the ggml_cplan structure
  • Update the llama_eval_internal function to use the new API and adjust the number of threads for the BLAS settings
  • Add a parameter n_cb to the ggml_metal_init function and a function ggml_metal_set_n_cb to control the number of command buffers for the Metal context, and update the ggml_metal_graph_compute function to use the n_cb field instead of the n_threads field
  • Update the metal.cpp example to pass the number of threads to the ggml_metal_init function
  • Update the baby-llama.cpp, benchmark-matmult.cpp, and train-text-from-scratch.cpp examples to use the new API and remove the assignments of gf.n_threads
  • Update the llama.cpp file to use the new API and pass the number 1 to the ggml_metal_init function
  • Enable the test-grad0.c test and update it to use the new API, ignoring the double-promotion warning
  • Update the test-opt.c test to use the new API, ignoring the double-promotion warning
  • Add a timeout option to the ctest command for the GitHub workflow jobs to prevent the tests from hanging or running too long, and add a new environment variable GGML_NLOOP to control the number of iterations for the benchmark tests
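
For reference, a minimal sketch of how the merged API fits together (signatures paraphrased from this PR; details are approximate):

struct ggml_cplan plan = ggml_graph_plan(gf, n_threads); // fills n_tasks[] and work_size
if (plan.work_size > 0) {
    plan.work_data = (uint8_t *) malloc(plan.work_size); // the caller owns the work buffer
}
ggml_graph_compute(gf, &plan);
free(plan.work_data); // free(NULL) is a no-op

// or, when a ggml_context is available, let it own the work buffer:
ggml_graph_compute_with_ctx(ctx, gf, n_threads);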

@slaren (Collaborator) commented Jun 26, 2023

> ggml_graph_compute_plan() MUST be called because it also sets node->n_tasks. The work_size depends on n_tasks.

I think that n_tasks should be removed from ggml_tensor. For now, the easiest way to address this may be to move it into the new ggml_cgraph_context, but eventually I think this is something that ggml_graph_compute should be able to determine automatically when executing a node.

Good job!

@mqy (Collaborator, Author) commented Jun 26, 2023

> I think that n_tasks should be removed from ggml_tensor.

Of course. I think n_tasks should belong to the compute facility; it would be ideal to migrate it somewhere else.

So: create another data structure, for example tensor_compute_config, with a 1:1 mapping to graph nodes; place them inside ggml_cgraph_context; and while computing the nodes, set params.nth from the n_tasks of the corresponding config object. Is that right?

@slaren (Collaborator) commented Jun 26, 2023

> Of course. I think n_tasks should belong to the compute facility; it would be ideal to migrate it somewhere else.

Yes, precisely! That's what I was thinking as well.

I am not sure that we need a new struct for this; is there any other data from the tensor that would belong there? I am thinking that adding a simple int n_tasks[GGML_MAX_NODES] to ggml_cgraph_context, with a 1:1 mapping to the graph nodes, should do it. It is not great, but for now it will solve the problem, and eventually I think we can remove it and have ggml_graph_compute determine this automatically instead.
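
A sketch of the suggested layout (field names as used in this thread):

struct ggml_cgraph_context {
    size_t work_size;
    void * work_data;
    int    n_tasks[GGML_MAX_NODES]; // 1:1 with cgraph->nodes, filled during planning
};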

@mqy (Collaborator, Author) commented Jun 26, 2023

> is there any other data from the tensor that would belong there?

No, nothing else comes to mind.

> adding a simple int n_tasks[GGML_MAX_NODES]

Great! The n_tasks array is good enough.

> eventually I think that we can remove it and have ggml_graph_compute determine this automatically instead.

That will be a long story. I once tried adding another ggml_task_type enum value named GGML_TASK_CONFIG: ggml_graph_compute() would run the planning before computing by calling ggml_graph_forward() for every node, similar to the planning idea in this PR. At the time I preferred that kind of design because the computing code knows exactly what it can provide (parallelism?) and what it requires (wsize?). That is easy to maintain: just glance at the actual code, with fewer chances to misconfigure.

In the long run, I think ggml_graph_compute() may not need to hard-code everything; it would be better to build some kind of protocol/contract between the scheduler and the actual computing code. To achieve this, we could construct a slightly more abstract data structure on top of {node, params}, with methods such as config() and compute(). With this design, we could create and register several compute vendors (with an id/name, for example) for every kind of node, which would provide great flexibility for evaluating experimental algorithms and for benchmarking. With a compute vendor id, we would hopefully be able to define a compute plan offline (and if that proves unnecessary, at least we could specify certain vendors); this is similar to the dumped cgraph. A rough sketch of this idea follows below.

Just some random thoughts :D
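
For concreteness, a rough sketch of what such a vendor interface might look like (all names here are hypothetical; nothing like this exists in ggml):

// one registered implementation of one op, identified by name/id
struct ggml_compute_vendor {
    const char * name; // e.g. "mul_mat.blas"
    // planning phase: report the parallelism and work size this node needs
    void (*config)(const struct ggml_tensor * node, int * n_tasks, size_t * wsize);
    // compute phase: execute the node with the scheduler-provided params
    void (*compute)(struct ggml_compute_params * params, struct ggml_tensor * node);
};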

@slaren (Collaborator) commented Jun 26, 2023

That's very similar to what I have been thinking. I am working on a CUDA implementation that can execute ggml_cgraphs directly, and what it needs to do is very similar to this: a method to "prepare" or "plan" the graph, and another to execute the "prepared" graph. Eventually, we will want a common interface to all compute backends so that users can use any of them interchangeably, without having to add specific code for each one.

While doing this, I think we should also remove n_threads from ggml_cgraph, since this parameter is only relevant to the CPU compute backend and not applicable to the other compute backends.

@mqy (Collaborator, Author) commented Jun 26, 2023

> Eventually, we will want a common interface to all compute backends so that users can use any of them interchangeably, without having to add specific code for each one.

Yes, I have seen the changes over the last two months -- getting better and better. Both the CUDA and CL implementations impose potential decoupling requirements.

> While doing this, I think that we should also remove n_threads from ggml_cgraph, since this parameter is only relevant to the CPU compute backend and not applicable to the other compute backends.

Sure, similar to n_tasks in the tensor as well, but I have no idea where to put it at compute time -- hello, ggml_cgraph_context?

At present I'm not sure, but it looks like n_threads is still required for cross-cgraph computing, especially when a node runner is allowed to hijack the computing flow or fall back to the default vendor -- the CUDA way. In a cross-cgraph context, should n_threads or a thread_ctx be owned by something like a compute session?

I think it's OK to leave n_threads there for a while; GPU code can simply ignore it, unless the GPU implementation has its own complete cgraph that is not shared with the CPU -- but that's yet another long story.

@slaren (Collaborator) commented Jun 26, 2023

What I was thinking is that n_threads could be a parameter to ggml_graph_compute_plan, and it would also be stored in ggml_cgraph_context for use by ggml_graph_compute.
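
That would look roughly like this (a hypothetical signature, not code from the PR):

void ggml_graph_compute_plan(struct ggml_cgraph_context * ctx,
                             const struct ggml_cgraph   * cgraph,
                             int                          n_threads); // stored in ctx for ggml_graph_compute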

For now, the CUDA runner can still operate the same way, unaffected by this; but in the long run this method of hijacking the compute flow should be removed. Mixed CPU/GPU execution can be achieved in other ways, such as splitting the computation into multiple graphs, each executed by a different backend.

@mqy force-pushed the ggml_graph_compute_context branch 2 times, most recently from d9876af to 2a773ce, June 27, 2023 05:38
@howard0su (Collaborator) commented Jun 29, 2023

I like the idea. However, I am not in favor of the implementation:

  1. We should not rely on the consumer's behavior (i.e., "don't touch plan"). We should either not expose cgraph_compute_context at all, or maintain the state elsewhere.
  2. A common API pattern for an externally provided buffer is: pass NULL as buf together with the buffer size; the API returns the size it expects without doing the real computation. If a valid pointer is passed but the size is less than expected, the function fails in the same way. (A sketch of this pattern follows after this list.)
  3. plan is not clear enough to indicate that this function needs to be called before the compute API; having compute_v2 take the result of plan would make the API explicit. Also, create_compute_context may be a better API name -- plan is too general.
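
For illustration, point 2 adapted to this API might look like the following (a hypothetical function, not part of the PR):

// query: a NULL buffer returns the required size and performs no computation
size_t needed = ggml_graph_compute_sized(cgraph, NULL, 0);

// fill: a large-enough buffer triggers the real computation;
// a smaller buffer fails in the same way as the NULL call
void * buf = malloc(needed);
ggml_graph_compute_sized(cgraph, buf, needed);
free(buf);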

@mqy (Collaborator, Author) commented Jun 29, 2023

> I am not in favor of the implementation

Yes, you make good points. A general context is meant for future extensibility, but it can be seen as over-design. If you have read the previous comments, you will have reached the conclusion that a general context tends to be "extended" into a catch-all basket.

So, if we can't tell what we want beyond the current requirements, a specialized API is the best choice.
Anyway, I don't think plan is a bad idea; the actual problem is that we haven't decoupled calculating wsize from setting n_tasks. Perhaps we should reconsider this design after splitting out that part of the code.

@ggerganov added the high priority (Very important issue) and refactoring labels, Jul 1, 2023
@ggerganov (Owner) commented

> I think that n_tasks should be removed from ggml_tensor

Agreed.

Good points by @howard0su.

Overall, I think this is on the right track, and we should finalize the PR and merge it.

There is no need to worry about backwards compatibility (i.e., the "_v2" stuff is not necessary). Just assume that we have only now started developing the API and that nobody has been using it or expecting it to remain compatible. We will start to worry about this kind of thing after the v1.0 release sometime in the future.

@mqy (Collaborator, Author) commented Jul 3, 2023

> There is no need to worry about backwards compatibility

This is great. I'll rewrite this PR and show you later.

@mqy force-pushed the ggml_graph_compute_context branch from 2a773ce to e052bc4, July 3, 2023 08:01
@mqy changed the title from "ggml_graph_compute: deprecate using ggml_context, try resolve issue #287" to "ggml: breaking change to ggml_graph_compute(): planning is required before computing", Jul 3, 2023
@mqy marked this pull request as draft, July 3, 2023 10:11
@mqy (Collaborator, Author) commented Jul 3, 2023

The PR description has been updated; you may want to have a look. Apart from main and perplexity, I also tested:

  • tests/test-grad0
  • tests/test-opt
  • examples/benchmark
  • examples/baby-llama
  • examples/train-text-from-scratch (force-stopped at Example 2, opt iter 29)

No crashes; all of them output reasonable text.

So this PR is ready for review again.

@mqy marked this pull request as ready for review, July 3, 2023 12:59
@slaren (Collaborator) commented Jul 4, 2023

Looks good, I only have a few minor nits:

  • In llama.cpp, to avoid allocations on every eval, the work buffer could be stored as a std::vector in llama_context. Just resize it to work_size; if it is already big enough, it won't reallocate memory. (See the sketch after this list.)
  • More generally, in all the C++ code, I would suggest always using std::vector instead of malloc/free.
  • I am not sure about ggml_graph_compute_sugar
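
A minimal sketch of that suggestion, using the API names as later merged (buf_work as a member of llama_context is assumed for illustration):

// hypothetically, inside struct llama_context:
//     std::vector<uint8_t> buf_work;

struct ggml_cplan plan = ggml_graph_plan(gf, n_threads);
if (plan.work_size > 0) {
    lctx.buf_work.resize(plan.work_size); // reallocates only if capacity is insufficient
    plan.work_data = lctx.buf_work.data();
}
ggml_graph_compute(gf, &plan);
// no free needed: the vector persists across evals and is reused next time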

@mqy (Collaborator, Author) commented Jul 4, 2023

Thanks @slaren

> In llama.cpp, to avoid allocations on every eval, the work buffer could be stored as a std::vector in llama_context. Just resize it to work_size; if it is already big enough, it won't reallocate memory.
> More generally, in all the C++ code, I would suggest always using std::vector instead of malloc/free.

Makes sense; this is also one of my concerns. I'll try the std::vector way.

> I am not sure about ggml_graph_compute_sugar

The sugar is used by baby-llama, which calls ggml_graph_compute again and again. I suggest we delay it to the next PR.

@mqy force-pushed the ggml_graph_compute_context branch from 62ec4b8 to bf63002, July 4, 2023 19:10
@mqy (Collaborator, Author) commented Jul 4, 2023

bf63002 is the latest commit, per @slaren's suggestions.

TODO: the sugar is used by baby-llama, which calls ggml_graph_compute again and again. I suggest we delay it to the next PR.

Rebased onto master.

@mqy force-pushed the ggml_graph_compute_context branch from bf63002 to b1331d7, July 6, 2023 03:43
@mqy (Collaborator, Author) commented Jul 6, 2023

Rebased.

@ggerganov (Owner) commented

I've refactored the changes:

  • add ggml_graph_compute_with_ctx() - similar to the old way of graph computation
  • rename struct ggml_graph_compute_plan to struct ggml_cplan
  • rename ggml_graph_compute_make_plan() to ggml_graph_plan()
  • factored the common "plan + compute" code into a helper function ggml_graph_compute_helper()
  • enabled the test-grad0 test
  • fixed Metal build

Overall the change is good.
A positive side-effect is that the user can now control the number of tasks for each op. This can also be utilized when creating custom operators, if we expose struct ggml_compute_params in the custom function signatures.

@mqy @slaren Please take a look

@monatis (Collaborator) commented Jul 6, 2023

> rename ggml_graph_compute_make_plan() to ggml_graph_plan()

I would suggest ggml_cplan_make() -- both short, as intended, and consistent with the struct naming.

@slaren (Collaborator) commented Jul 6, 2023

I think this looks good.

> A positive side-effect is that the user can now control the number of tasks for each op. This can be utilized also when creating custom operators, if we expose the struct ggml_compute_params to the custom function signatures.

Eventually we should support custom ops on the GPU backends too. It will be slow, since it will require copying data back and forth from the device, but that is still a lot better than not supporting them at all. So whenever we do this, I think we should do it in a backend-agnostic way, rather than tying it to the CPU backend.

@mqy (Collaborator, Author) commented Jul 7, 2023

> I've refactored the changes:

Looks good. Verified main and test-grad0.

> A positive side-effect is that the user can now control the number of tasks for each op.

  • Increasing n_tasks[i] after planning may cause a work-buffer overflow.
  • For ops that can only run with n_tasks == 1, running with n_tasks > 1 is UB.

Anyway, better than nothing.

@ggerganov changed the title from "ggml: breaking change to ggml_graph_compute(): planning is required before computing" to "ggml : change ggml_graph_compute() API", Jul 7, 2023
@ggerganov (Owner) commented Jul 7, 2023

> I would suggest ggml_cplan_make() -- both short, as intended, and consistent with the struct naming.

Good suggestion.
I like grouping names with common prefixes, and currently ggml_graph_plan() fits neatly with the rest: ggml_graph_compute(), ggml_graph_reset(). So I prefer not to isolate it under a new prefix just yet. If we start doing more things with plans, like ggml_cplan_optimize(), ggml_cplan_print(), etc., we can rename it.

Btw, this reminds me that ggml_graph_xxx should probably have been ggml_cgraph_xxx. Shall we change it?

> Increasing n_tasks[i] after planning may cause a work-buffer overflow.

Ah, good point. So the developer has to be careful and modify n_tasks only for their own custom operators, or else be familiar with the inner workings of the ggml ops.

@ggerganov changed the title from "ggml : change ggml_graph_compute() API" to "ggml : change ggml_graph_compute() API to not require context", Jul 7, 2023
@ggerganov merged commit 1d656d6 into ggerganov:master, Jul 7, 2023 -- 9 of 21 checks passed
@mqy deleted the ggml_graph_compute_context branch, July 7, 2023 22:09
Labels: high priority (Very important issue), refactoring
Projects: none yet
Development: successfully merging this pull request may close the issue "ggml : ggml_graph_compute should not require ggml_context"
5 participants