
sync : llama.cpp (ggml-backend) #696

Merged · 7 commits merged into master from sync on Jan 12, 2024

Conversation

ggerganov (Owner) commented on Jan 12, 2024

We should fix the build so that we can merge the latest changes from llama.cpp here.

ikawrakow and others added 4 commits on January 12, 2024 at 21:14
* imatrix: 1st version

* imatrix: WIP

* Cleanup

* Update examples/imatrix/imatrix.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
* llama : ggml-backend integration

* ggml-backend : add names to buffers

* fix unmap after loading

* batched-bench : add tensor_split param

* llama : check for null tensor_split

* ggml-backend : increase GGML_MAX_BACKENDS

* improve graph splitting, partial fix for --no-kv-offload

* cuda : add ggml-backend split buffer support

* cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available)

* ggml : fix null backend dereference (llama/4807)

* ggml : fix null backend dereference

* ggml : also check ggml_backend_is_cpu

* test-backend-ops : check buffer allocation failures

* llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row)

* ggml : fix mul_mat_id work size

* llama : rewrite session kv load/set without graphs

* minor

* llama : only initialize used backends, free backends on context free

* llama : abort ctx if cuda backend init fails

* llama : rewrite lora with ggml-backend and compute on CPU

ggml-ci

* llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer

* opencl : add ggml-backend buffer type

* cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf)

* llama : on Metal, by default offload the full model

ggml-ci

* metal : page align the data ptr (llama/4854)

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <[email protected]>

* cuda : fix split buffer free

* address review comments

* llama-bench : add split-mode parameter

* fix whitespace

* opencl : fix double initialization

* server : add --split-mode parameter

* use async copy and compute to improve multi-gpu performance

ggml-ci

* use async memcpys to copy the graph outputs to the CPU

* fix opencl

* use a host buffer for the cpu compute buffer for faster copies to the gpu

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>
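
The commits above move llama.cpp's weights and compute onto the ggml-backend interface. For context, below is a minimal sketch of how user code drives that interface: it builds a tiny graph and runs it on the CPU backend. The API names used (ggml_backend_cpu_init, ggml_backend_alloc_ctx_tensors, ggml_backend_graph_compute, ggml_backend_tensor_set/get) are assumed from the current ggml headers and are not part of this PR's diff; treat it as an illustration, not code from this sync.

```c
// Minimal ggml-backend sketch (assumes current ggml.h / ggml-alloc.h / ggml-backend.h APIs).
#include <stdio.h>
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

int main(void) {
    // The context holds only tensor/graph metadata; the actual data lives in a backend buffer.
    struct ggml_init_params params = {
        /*.mem_size   =*/ ggml_tensor_overhead() * 16 + ggml_graph_overhead(),
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,
    };
    struct ggml_context * ctx = ggml_init(params);

    // Define a tiny graph: c = a + b
    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    struct ggml_tensor * c = ggml_add(ctx, a, b);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);

    // Initialize a backend; a GPU backend would be driven the same way through this interface.
    ggml_backend_t backend = ggml_backend_cpu_init();

    // Allocate all tensors of the context in a buffer owned by the backend.
    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);

    // Upload inputs, compute, and read back the result through the backend API.
    const float av[4] = {1, 2, 3, 4};
    const float bv[4] = {10, 20, 30, 40};
    ggml_backend_tensor_set(a, av, 0, sizeof(av));
    ggml_backend_tensor_set(b, bv, 0, sizeof(bv));

    ggml_backend_graph_compute(backend, gf);

    float cv[4];
    ggml_backend_tensor_get(c, cv, 0, sizeof(cv));
    printf("c = [%.1f, %.1f, %.1f, %.1f]\n", cv[0], cv[1], cv[2], cv[3]);

    ggml_backend_buffer_free(buf);
    ggml_backend_free(backend);
    ggml_free(ctx);
    return 0;
}
```
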
ggerganov requested a review from slaren on January 12, 2024 at 19:16
ggerganov changed the title from "sync : llam.cpp (ggml-backend)" to "sync : llama.cpp (ggml-backend)" on Jan 12, 2024
slaren (Collaborator) commented on Jan 12, 2024

I have fixed the issue with gpt-2, but the starcoder-mmap.cpp example uses the old ggml-cuda API, which is no longer supported.

ggerganov (Owner, Author) commented
I think it's time to remove this example - will do so now

ggerganov merged commit 8fb376b into master on Jan 12, 2024 (10 checks passed)
ggerganov deleted the sync branch on January 12, 2024 at 19:53