
ggml backends interface v1 #547

Merged 23 commits into master on Oct 6, 2023

Conversation

@slaren (Collaborator) commented on Oct 3, 2023:

Initial version of the common backends interface.

Supports CPU and CUDA with full offloading, but not partial offloading or fallback to the CPU backend for unimplemented ops.

Modifies the gpt-2 example to use the backends interface. If built with CUDA support and executed with -ngl 1, the CUDA backend is used.

Support for Metal added in #552
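
As an illustration of the description above, here is a minimal sketch of how backend selection in the gpt-2 example might look. This is an editor's sketch, not the actual example code: the n_gpu_layers parameter, the GGML_USE_CUBLAS guard, and the use of ggml_backend_cpu_init() are assumptions based on the PR description.

// Editor's illustrative sketch (not part of the PR): picking a backend the way
// the gpt-2 example is described to, with -ngl selecting the CUDA backend.
#include "ggml-backend.h"
#ifdef GGML_USE_CUBLAS
#include "ggml-cuda.h"
#endif

static ggml_backend_t pick_backend(int n_gpu_layers) {
    ggml_backend_t backend = NULL;
#ifdef GGML_USE_CUBLAS
    if (n_gpu_layers > 0) {
        // full offloading: the whole graph runs on the CUDA backend
        backend = ggml_backend_cuda_init();
    }
#endif
    if (backend == NULL) {
        // no GPU requested (or no CUDA build): use the CPU backend
        backend = ggml_backend_cpu_init();
    }
    return backend;
}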

@ggerganov (Owner) commented:

Many of the changes in this PR are due to a sync with llama.cpp.

I will do a sync of ggml with llama.cpp later tonight so we can focus on the relevant changes.

@ggerganov (Owner) commented:

@slaren Let's rebase on master so we can see the relevant diff.

@iboB (Contributor) left a review comment:

Doing something similar to the comment in ggml-cuda.h for all the other backends should be the goal, IMO.

Introducing ggml into existing codebases would be easiest from the inside out, as a plugin: first porting some leaves of the computation, then working more and more towards the root.

@@ -41,6 +43,10 @@ GGML_API bool ggml_cuda_compute_forward(struct ggml_compute_params * params, s
GGML_API int ggml_cuda_get_device_count(void);
GGML_API void ggml_cuda_get_device_description(int device, char * description, size_t description_size);

// backend API
GGML_API ggml_backend_t ggml_backend_cuda_init(void); // TODO: take a list of devices to use
@iboB (Contributor) commented on this diff:

Hey, it would be nice if, besides devices, one could also provide stream handles and cuBLAS handles here. This would allow plugging ggml into existing code.

I have currently hacked ggml_init_cublas as it is in master to allow exactly this. With that, I created a PyTorch plugin which uses the CUDA instance from PyTorch, and now I directly assign PyTorch tensor pointers to ggml tensors.

I was planning to make my hack more presentable, but this PR is basically the way to go.
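
For illustration only, here is a rough sketch of the kind of entry point being requested in this thread. The struct and function below are hypothetical, invented for this sketch; nothing like them exists in ggml or in this PR.

// Hypothetical API sketch (editor's illustration, NOT part of this PR or of
// ggml): an init that accepts CUDA resources owned by the host application,
// so ggml work is queued on the host's stream and shares its cuBLAS handle.
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include "ggml-backend.h"

struct ggml_cuda_external_handles {
    int            device;  // CUDA device already selected by the host app
    cudaStream_t   stream;  // stream the host app submits work on
    cublasHandle_t cublas;  // cuBLAS handle to reuse instead of creating one
};

// Hypothetical: build a CUDA backend around externally owned handles.
GGML_API ggml_backend_t ggml_backend_cuda_init_external(
        const struct ggml_cuda_external_handles * handles);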

@slaren (Collaborator, Author) replied:

In the future, support for the old API will be removed, and then all the globals will be moved into the backend context. At that point we could consider how to add this functionality, but I think it is too early for that.

@iboB (Contributor) replied:

What's wrong with using the globals?

If one wants to add a plugin, it is very unlikely that they will need multiple backend instances. And even if they do, this can be marked as "not supported yet".

If one wants the old behavior, they simply won't provide handles and it will work just the same.

No existing functionality is going to break if this feature is added with the PR. On the contrary: plugins like mine will become possible with vanilla ggml.

@slaren (Collaborator, Author) replied:

There are a few problems with this:

  • The backends interface is still not finished; many things are going to change in the next few weeks.
  • I don't really like the idea of exposing internals of the CUDA backend, especially at a point where things are going to change significantly.
  • This looks like an extremely niche application, so I am not convinced that it needs to be merged. The patch can be maintained in the relevant repository.
  • I do not understand what your requirements are, so I cannot really do this myself in the first place.

@iboB (Contributor) replied on Oct 4, 2023:

I wouldn't say that it's a niche application. Consider an existing inference codebase which doesn't use ggml, be it PyTorch-based, pure CUDA (FT, cuDNN, TRT...), TensorFlow, or something else. If one wants to migrate it to ggml for some reason, what can they do?

They can completely rewrite the inference in ggml, which is quite error-prone. Or they can implement parts of it in ggml: for example, port parts of the computation graph and leave the rest as is, gradually moving to entirely ggml-based inference from the inside out while keeping a working version at each step.

This is exactly my use case: I'm moving a PyTorch-based application to ggml.

I think this could be quite beneficial for the library and for people who want to evaluate it without having to go all the way.

@slaren (Collaborator, Author) replied on Oct 5, 2023:

Would a function to change the main stream work? That would allow you to queue the ggml-cuda computations on the PyTorch stream, which may improve performance due to reduced synchronization. Is there any reason to reuse the PyTorch cuBLAS handle? Does each cuBLAS handle use a large amount of GPU resources?

@slaren slaren marked this pull request as ready for review October 4, 2023 13:26
@ggerganov (Owner) commented:

Need to log off earlier today - will review tomorrow.
Is this branch in a state where I can attempt to add a Metal backend?

@slaren (Collaborator, Author) commented on Oct 4, 2023:

The branch should be stable; I think this is already good enough for a v1.

It should be easier to implement than last time. There is a pointer to the buffer in the tensors, and that should help with backends that cannot abstract a buffer+offset into a pointer, like Metal. Views inherit this buffer from the parent automatically. But there are very likely going to be some rough edges, so I would say not to do it yet unless you are willing to look into the internals and change whatever may be required.

v2 will introduce partial offloading. Fallback to CPU will come at the same time, or a little later. After fallback to CPU is implemented, I will implement support for the OpenCL backend. I think that Metal is closer to OpenCL than to CUDA, so the code may be in a better shape for adding Metal support after that point.
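
A simplified illustration of the buffer/view point above. The struct and field names are stand-ins chosen for this sketch, not the actual ggml internals: the idea is only that a tensor records which backend buffer it lives in, and a view reuses its parent's buffer.

// Editor's simplified sketch of the idea described above (illustrative names,
// not the real ggml structs): a backend like Metal can then resolve a
// (buffer, offset) pair instead of relying on a raw pointer.
#include <stddef.h>

struct example_buffer;               // stands in for a backend buffer

struct example_tensor {
    struct example_buffer * buffer;  // backend buffer that owns the data
    void                  * data;    // address within that buffer
};

static struct example_tensor make_view(const struct example_tensor * parent, size_t offset) {
    struct example_tensor view;
    view.buffer = parent->buffer;                  // views inherit the parent's buffer
    view.data   = (char *) parent->data + offset;  // same buffer, different offset
    return view;
}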

Two review threads on examples/gpt-2/main.cpp (outdated, resolved).
ggerganov and others added 3 commits October 5, 2023 15:50
* ggml-backend : code style suggestions

* ggml-backend : move ggml_backend and ggml_backend_buffer in the source file

* ggml-backend : move structs back to header + rename type

* ggml-backend : remove obsolete comment

* fix leak in ggml_backend_buffer_free

* ggml-backend : re-introduce typedefs as a declaration of intent

---------

Co-authored-by: slaren <[email protected]>
Review thread on examples/gpt-2/main.cpp (outdated, resolved).
* ggml-backend : metal (WIP)

* ggml-backend : metal (adapt CPU backend)

* ggml-backend : working metal

* ggml-backend : clean-up metal implementation

* ggml-backend : add ggml_backend_is_metal()

@ggerganov (Owner) left a review comment:

Let's merge this - I think it is a good addition to the framework

Review threads on examples/gpt-2/main.cpp (two) and src/ggml-cuda.cu (outdated, resolved).
@slaren slaren merged commit fc9e955 into master Oct 6, 2023
4 checks passed
@slaren slaren deleted the ggml-backend branch October 6, 2023 16:51
CCLDArjun pushed a commit to CCLDArjun/ggml that referenced this pull request Dec 18, 2023
…ions correctly with `vocab_only` setting. Also confirmed that the code works as expected after running with reduced memory usage due to deletion of no-longer-needed variable. (ggerganov#547)