
ggml backends interface v1 #547

Merged 23 commits into master on Oct 6, 2023

Conversation

@slaren (Collaborator) commented on Oct 3, 2023:

Initial version of the common backends interface.

Supports CPU and CUDA with full offloading, but not partial offloading or fallback to the CPU backend for unimplemented ops.

Modifies the gpt-2 example to use the backends interface. If built with CUDA support and executed with -ngl 1, the CUDA backend is used.

Support for Metal added in #552
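
As an illustration of the description above, here is a minimal sketch of how backend selection in the gpt-2 example might look. This is an editor's sketch, not the actual example code: the n_gpu_layers parameter, the GGML_USE_CUBLAS guard, and the use of ggml_backend_cpu_init() are assumptions based on the PR description.

// Editor's illustrative sketch (not part of the PR): picking a backend the way
// the gpt-2 example is described to, with -ngl selecting the CUDA backend.
#include "ggml-backend.h"
#ifdef GGML_USE_CUBLAS
#include "ggml-cuda.h"
#endif

static ggml_backend_t pick_backend(int n_gpu_layers) {
    ggml_backend_t backend = NULL;
#ifdef GGML_USE_CUBLAS
    if (n_gpu_layers > 0) {
        // full offloading: the whole graph runs on the CUDA backend
        backend = ggml_backend_cuda_init();
    }
#endif
    if (backend == NULL) {
        // no GPU requested (or no CUDA build): use the CPU backend
        backend = ggml_backend_cpu_init();
    }
    return backend;
}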

@ggerganov (Owner) commented:

Many of the changes in this PR are due to a sync with llama.cpp.

I will do a sync of ggml with llama.cpp later tonight so we can focus on the relevant changes.

@ggerganov (Owner) commented:

@slaren Let's rebase on master so we can see the relevant diff.

@iboB (Contributor) left a review comment:

Doing something similar to the comment in ggml-cuda.h for all the other backends should be the goal, IMO.

Introducing ggml into existing codebases would be easiest from the inside out, as a plugin: first porting some leaves of the computation, then working more and more towards the root.

@@ -41,6 +43,10 @@ GGML_API bool ggml_cuda_compute_forward(struct ggml_compute_params * params, s
GGML_API int ggml_cuda_get_device_count(void);
GGML_API void ggml_cuda_get_device_description(int device, char * description, size_t description_size);

// backend API
GGML_API ggml_backend_t ggml_backend_cuda_init(void); // TODO: take a list of devices to use
@iboB (Contributor) commented on this diff:

Hey, it would be nice if, besides devices, one could also provide stream handles and cuBLAS handles here. This would allow plugging ggml into existing code.

I have currently hacked ggml_init_cublas as it is in master to allow exactly this. With that, I created a PyTorch plugin which uses the CUDA instance from PyTorch, and now I directly assign PyTorch tensor pointers to ggml tensors.

I was planning to make my hack more presentable, but this PR is basically the way to go.
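
For illustration only, here is a rough sketch of the kind of entry point being requested in this thread. The struct and function below are hypothetical, invented for this sketch; nothing like them exists in ggml or in this PR.

// Hypothetical API sketch (editor's illustration, NOT part of this PR or of
// ggml): an init that accepts CUDA resources owned by the host application,
// so ggml work is queued on the host's stream and shares its cuBLAS handle.
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include "ggml-backend.h"

struct ggml_cuda_external_handles {
    int            device;  // CUDA device already selected by the host app
    cudaStream_t   stream;  // stream the host app submits work on
    cublasHandle_t cublas;  // cuBLAS handle to reuse instead of creating one
};

// Hypothetical: build a CUDA backend around externally owned handles.
GGML_API ggml_backend_t ggml_backend_cuda_init_external(
        const struct ggml_cuda_external_handles * handles);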

@slaren (Collaborator, Author) replied:

In the future, support for the old API will be removed, and then all the globals will be moved into the backend context. At that point we could consider how to add this functionality, but I think it is too early for that.

@iboB (Contributor) replied:

What's wrong with using the globals?

If one wants to add a plugin, it is very unlikely that they will need multiple backend instances. And even if they do, this can be marked as "not supported yet".

If one wants the old behavior, they simply won't provide handles and it will work just the same.

No existing functionality is going to break if this feature is added with the PR. On the contrary: plugins like mine will become possible with vanilla ggml.

@slaren (Collaborator, Author) replied:

There are a few problems with this:

  • The backends interface is still not finished; many things are going to change in the next few weeks.
  • I don't really like the idea of exposing internals of the CUDA backend, especially at a point where things are going to change significantly.
  • This looks like an extremely niche application, so I am not convinced that it needs to be merged. The patch can be maintained in the relevant repository.
  • I do not understand what your requirements are, so I cannot really do this myself in the first place.

@iboB (Contributor) replied on Oct 4, 2023:

I wouldn't say that it's a niche application. Consider an existing inference codebase which doesn't use ggml, be it PyTorch-based, pure CUDA (FT, cuDNN, TRT...), TensorFlow, or something else. If one wants to migrate it to ggml for some reason, what can they do?

They can completely rewrite the inference in ggml, which is quite error-prone. Or they can implement parts of it in ggml: for example, port parts of the computation graph and leave the rest as is, gradually moving to entirely ggml-based inference from the inside out while keeping a working version at each step.

This is exactly my use case: I'm moving a PyTorch-based application to ggml.

I think this could be quite beneficial for the library and for people who want to evaluate it without having to go all the way.

@slaren (Collaborator, Author) replied on Oct 5, 2023:

Would a function to change the main stream work? That would allow you to queue the ggml-cuda computations on the PyTorch stream, which may improve performance due to reduced synchronization. Is there any reason to reuse the PyTorch cuBLAS handle? Does each cuBLAS handle use a large amount of GPU resources?

@slaren slaren marked this pull request as ready for review October 4, 2023 13:26
@ggerganov (Owner) commented:

Need to log off earlier today - will review tomorrow.
Is this branch in a state where I can attempt to add a Metal backend?

@slaren (Collaborator, Author) commented on Oct 4, 2023:

The branch should be stable; I think this is already good enough for a v1.

It should be easier to implement than last time. There is a pointer to the buffer in the tensors, and that should help with backends that cannot abstract a buffer+offset into a pointer, like Metal. Views inherit this buffer from the parent automatically. But there are very likely going to be some rough edges, so I would say not to do it yet unless you are willing to look into the internals and change whatever may be required.

v2 will introduce partial offloading. Fallback to CPU will come at the same time, or a little later. After fallback to CPU is implemented, I will implement support for the OpenCL backend. I think that Metal is closer to OpenCL than to CUDA, so the code may be in a better shape for adding Metal support after that point.
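
A simplified illustration of the buffer/view point above. The struct and field names are stand-ins chosen for this sketch, not the actual ggml internals: the idea is only that a tensor records which backend buffer it lives in, and a view reuses its parent's buffer.

// Editor's simplified sketch of the idea described above (illustrative names,
// not the real ggml structs): a backend like Metal can then resolve a
// (buffer, offset) pair instead of relying on a raw pointer.
#include <stddef.h>

struct example_buffer;               // stands in for a backend buffer

struct example_tensor {
    struct example_buffer * buffer;  // backend buffer that owns the data
    void                  * data;    // address within that buffer
};

static struct example_tensor make_view(const struct example_tensor * parent, size_t offset) {
    struct example_tensor view;
    view.buffer = parent->buffer;                  // views inherit the parent's buffer
    view.data   = (char *) parent->data + offset;  // same buffer, different offset
    return view;
}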

Two review threads on examples/gpt-2/main.cpp (outdated, resolved).
ggerganov and others added 3 commits October 5, 2023 15:50
* ggml-backend : code style suggestions

* ggml-backend : move ggml_backend and ggml_backend_buffer in the source file

* ggml-backend : move structs back to header + rename type

* ggml-backend : remove obsolete comment

* fix leak in ggml_backend_buffer_free

* ggml-backend : re-introduce typedefs as a declaration of intent

---------

Co-authored-by: slaren <[email protected]>
Review thread on examples/gpt-2/main.cpp (outdated, resolved).
* ggml-backend : metal (WIP)

* ggml-backend : metal (adapt CPU backend)

* ggml-backend : working metal

* ggml-backend : clean-up metal implementation

* ggml-backend : add ggml_backend_is_metal()

@ggerganov (Owner) left a review comment:

Let's merge this - I think it is a good addition to the framework

Review threads on examples/gpt-2/main.cpp (two) and src/ggml-cuda.cu (outdated, resolved).
@slaren slaren merged commit fc9e955 into master Oct 6, 2023
4 checks passed
@slaren slaren deleted the ggml-backend branch October 6, 2023 16:51
CCLDArjun pushed a commit to CCLDArjun/ggml that referenced this pull request Dec 18, 2023
…ions correctly with `vocab_only` setting. Also confirmed that the code works as expected after running with reduced memory usage due to deletion of no-longer-needed variable. (ggerganov#547)