sync : llama.cpp #856

ggerganov · 2024-06-15T17:17:33Z

No description provided.

* ggml : fix loongson compile warnings ggml-ci
* Fix loongarch quantize test fail. Fix unexpected error introduced during rebase code.
* tests : disable json test due to lack of python on the CI node ggml-ci

---------

Co-authored-by: junchao-loongson <[email protected]>

* CUDA: quantized KV support for FA vec
* try CI fix
* fix commented-out kernel variants
* add q8_0 q4_0 tests
* fix nwarps > batch size
* split fattn compile via extern templates
* fix flake8
* fix metal tests
* fix cmake
* make generate_cu_files.py executable
* add autogenerated .cu files
* fix AMD
* error if type_v != FP16 and not flash_attn
* remove obsolete code

* Finish Vulkan mul_mat_id implementation
* Add Vulkan sum_rows and div ops
* Fix MUL_MAT_ID matrix matrix shader
* Fix MUL_MAT_ID matrix vector shader dispatch size
* Fix MUL_MAT_ID matrix vector shader and dispatch code
* Update Vulkan CPU offload for MUL_MAT_ID
* Fix crash when using split mode none and setting a main GPU

* ggml: Added OpenMP for multi-threads processing
* ggml : Limit the number of threads used to avoid deadlock
* update shared state n_threads in parallel region
* clear numa affinity for main thread even with openmp
* enable openmp by default
* fix msvc build
* disable openmp on macos
* ci : disable openmp with thread sanitizer
* Update ggml.c
  Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: slaren <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>

* llama : offload to RPC in addition to other backends
* - fix copy_tensor being called on the src buffer instead of the dst buffer
  - always initialize views in the view_src buffer
  - add RPC backend to Makefile build
  - add endpoint to all RPC object names
* add rpc-server to Makefile
* Update llama.cpp
  Co-authored-by: slaren <[email protected]>

---------

Co-authored-by: slaren <[email protected]>

@JohannesGaessler

This enforces a check that -fno-finite-math-only is set, i.e. that the compiler is not operating in finite-math mode. During the rewrite of SiLU and softmax for the CPU in #7154, @JohannesGaessler found that results were nondeterministic when more than one slot was in use. @LostRuins narrowed the problem down to -ffinite-math-only; the theory is that, under that flag, SiLU returns NaN or some other garbage instead of flushing small values to 0. @jart proposed a fix, which @ggerganov then implemented here. Ref: ggerganov/llama.cpp#7154 (comment)
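A minimal sketch of what such a compile-time guard can look like, not the code from the commit. It assumes a GCC- or Clang-style compiler, which defines `__FINITE_MATH_ONLY__` to 1 under `-ffinite-math-only`. The helper names `silu_from_exp` and `silu_tail_selfcheck` are hypothetical, and the exponential is passed in as an argument only to keep the sketch free of a libm dependency.

```c
#include <assert.h>
#include <math.h>

/* Refuse to build in finite-math mode: the code below relies on inf/NaN semantics. */
#if defined(__FINITE_MATH_ONLY__) && __FINITE_MATH_ONLY__
#error "compile with -fno-finite-math-only: this code relies on inf/NaN semantics"
#endif

/* SiLU written as x / (1 + exp(-x)); the caller supplies exp(-x) (hypothetical helper). */
static float silu_from_exp(float x, float exp_neg_x) {
    return x / (1.0f + exp_neg_x);
}

/* For a very negative x, exp(-x) overflows to +inf and the result must flush to zero. */
static void silu_tail_selfcheck(void) {
    const float y = silu_from_exp(-100.0f, INFINITY);
    assert(!isnan(y));
    assert(y == 0.0f); /* -0.0f compares equal to 0.0f */
}
```

With the guard in place, a build that passes `-ffinite-math-only` (or `-ffast-math`, which implies it) stops at the `#error` instead of silently producing wrong values at run time.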

* ggml : unify rope norm/neox (CPU)
* ggml : fix compile warning
* ggml : remove GLM rope mode ggml-ci
* metal : better rope implementation ggml-ci
* cuda : better rope implementation ggml-ci
* naming : n_orig_ctx -> n_ctx_orig ggml-ci
* dev : add reminders to update backends ggml-ci
* vulkan : fix ggml_rope_ext() usage
* cuda : fix array size + indents ggml-ci

* Update Vulkan RoPE implementation
* Return nullptr on alloc_buffer when allocation fails, instead of throwing an exception
  Minor fixes
* Fix segfault when running out of VRAM
  Co-authored-by: slaren <[email protected]>

---------

Co-authored-by: slaren <[email protected]>

* move BLAS to a separate backend
* rename GGML_USE_OPENBLAS to GGML_USE_BLAS
* alloc : reuse same buffer when the same buffer type is used multiple times
* set number of threads automatically for openblas and blis
* sched : print assignments when GGML_SCHED_DEBUG env variable is set
* sched : allow ops with weights on an incompatible buffer type
  This will cause the weight to be copied to a backend that supports the op, which is very costly. The weight should have been stored in a buffer of a backend that can run the op, but llama.cpp cannot do this automatically at the moment.

---------

Co-authored-by: Georgi Gerganov <[email protected]>

* separate DPCT helpers outside
* replace global variables with context
* remove useless extra
* update mul_mat condition
* remove duplicate buft initialization
* remove duplicate extra and global work group size
* remove useless backend check
* remove duplicated extras
* use macro for group_size and remove cuda-related

slaren and others added 30 commits June 15, 2024 20:10

ggml : use atomic_flag for critical section (llama/7598)

1230387

* ggml : use atomic_flag for critical section
* add windows shims
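As a rough illustration of the technique named in this commit, the sketch below guards a critical section with a C11 `atomic_flag` used as a spinlock. It is not the code from the commit: `g_lock`, `g_counter`, `critical_section_start`, `critical_section_end` and `increment_counter` are hypothetical names, and the Windows shims mentioned above are not shown.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

/* Hypothetical names: a sketch of the technique, not ggml's actual code. */
static atomic_flag g_lock    = ATOMIC_FLAG_INIT;
static int         g_counter = 0; /* state protected by the lock */

static void critical_section_start(void) {
    /* test_and_set returns the previous value, so spin while another thread holds the flag */
    while (atomic_flag_test_and_set_explicit(&g_lock, memory_order_acquire)) {
        /* busy-wait */
    }
}

static void critical_section_end(void) {
    /* release pairs with the acquire above, publishing writes made inside the section */
    atomic_flag_clear_explicit(&g_lock, memory_order_release);
}

static int increment_counter(void) {
    critical_section_start();
    assert(g_counter < INT_MAX); /* guard against signed overflow */
    const int v = ++g_counter;
    critical_section_end();
    return v;
}
```

`atomic_flag` is the one atomic type the C standard guarantees to be lock-free, which makes it a natural fit for a short critical section like this one.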

llama-bench : add support for the RPC backend (llama/7435)

8fcca7c

cuda : non-cont concat support (llama/7610)

1b7ff70

* tests : add non-cont concat tests
* cuda : non-cont concat support ggml-ci

ggml : fix YARN + add tests + add asserts (llama/7617)

f2703f7

* tests : add rope tests ggml-ci
* ggml : fixes (hopefully) ggml-ci
* tests : add non-cont tests ggml-ci
* cuda : add asserts for rope/norm + fix DS2 ggml-ci
* ggml : assert contiguousness
* tests : reduce RoPE tests ggml-ci

metal : add missing asserts (llama/7617)

79751ef

metal : remove invalid asserts (llama/7617)

0269773

ggml : fix loongarch build (O2 issue) (llama/7636)

ef948cb

faster avx512 exp implementation (llama/7551)

a539f93

* faster avx512 exp implementation
* x->r
* improve accuracy, handle special cases
* remove `e`

CUDA: fix Pascal FA, deq. KV to FP16 for batch > 8 (llama/7681)

2a2e184

Fix FlashAttention debug test, FP32 assert (llama/7684)

030c282

fix bug introduced in using calloc (llama/7701)

6ec5cfc

compilade pointed this out on the previous MR

kompute : implement op_getrows_f32 (llama/6403)

9f8d074

op_getrows_f32 is required since ggerganov/llama.cpp#6122 for the Vulkan w/ Kompute backend to be functional. As such, implement this op to make this backend functional again.
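For readers unfamiliar with the op, a get-rows operation gathers selected rows of a matrix into a destination buffer. The sketch below is a plain CPU reference of that semantics for f32 data, under the assumption of a row-major layout. It is not the Kompute shader from the commit, and the function name and signature are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical CPU reference: dst[i, :] = src[rows[i], :] for row-major f32 data. */
static void get_rows_f32(const float * src, size_t n_src_rows, size_t n_cols,
                         const int32_t * rows, size_t n_rows, float * dst) {
    for (size_t i = 0; i < n_rows; i++) {
        /* every index must select an existing source row */
        assert(rows[i] >= 0 && (size_t) rows[i] < n_src_rows);
        memcpy(dst + i * n_cols, src + (size_t) rows[i] * n_cols, n_cols * sizeof(float));
    }
}
```

A typical use is an embedding lookup, where `rows` holds token ids and `src` is the embedding table.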

ggml : remove OpenCL (llama/7735)

ad2ed7f

ggml-ci

Allow number of nodes in CUDA graph to change (llama/7738)

6eb6783

Previously the code would have failed to cope in the case that the number of nodes changes in an existing CUDA graph. This fixes the issue by removing an unnecessary conditional.

CUDA: refactor mmq, dmmv, mmvq (llama/7716)

024c5bc

* CUDA: refactor mmq, dmmv, mmvq
* fix out-of-bounds write
* struct for qk, qr, qi
* fix cmake build
* mmq_type_traits

fix softmax r2r result wrong issue (llama/7811)

b89a6ff

vulkan : reuse parent extra for views (llama/7806)

c7b818b

* vulkan : reuse parent extra for views
* Fix validation error when multiple compute contexts are used in a graph

---------

Co-authored-by: 0cc4m <[email protected]>

CUDA: revise q8_1 data layout for mul_mat_q (llama/7824)

ea4c21b

use the correct SYCL context for host USM allocations (llama/7777)

2552787

Signed-off-by: Ben Ashbaugh <[email protected]>

CUDA: use tensor cores for MMQ (llama/7676)

52d4a6d

* CUDA: int8 tensor cores for MMQ (legacy quants)
* fix out-of-bounds writes
* __builtin_assume -> GGML_CUDA_ASSUME
* fix writeback returning too early

CUDA: int8 tensor cores for MMQ (q4_K, q5_K, q6_K) (llama/7860)

2238cd2

vulkan: select only one device for single gpu with multiple drivers (llama/7582)

5de7ab4

ggerganov and others added 13 commits June 15, 2024 20:10

ggml : improve ggml_is_contiguous logic (llama/7856)

47968ff

* ggml : improve ggml_is_contiguous logic ggml-ci
* ggml : support more contiguous cases ggml-ci
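To make the idea of a contiguity check concrete, here is a small sketch over a strided tensor description. It is an assumption-laden illustration rather than the logic from this commit: `struct tensor_view`, `make_view` and `is_contiguous` are hypothetical, block-quantized types are ignored, and skipping the stride of size-1 dimensions is shown only as one example of a case that a strict stride comparison would reject.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DIMS 4

/* Hypothetical tensor description: sizes in elements, strides in bytes. */
struct tensor_view {
    int64_t ne[MAX_DIMS];
    size_t  nb[MAX_DIMS];
    size_t  elem_size;
};

static bool is_contiguous(const struct tensor_view * t) {
    size_t expected = t->elem_size;
    for (int i = 0; i < MAX_DIMS; i++) {
        /* a dimension of size 1 is never stepped over, so its stride cannot matter */
        if (t->ne[i] != 1) {
            if (t->nb[i] != expected) {
                return false;
            }
            expected *= (size_t) t->ne[i];
        }
    }
    return true;
}

/* Build a densely packed view with dimension 0 varying fastest. */
static struct tensor_view make_view(int64_t n0, int64_t n1, int64_t n2, int64_t n3,
                                    size_t elem_size) {
    struct tensor_view t = { { n0, n1, n2, n3 }, { 0 }, elem_size };
    t.nb[0] = elem_size;
    for (int i = 1; i < MAX_DIMS; i++) {
        t.nb[i] = t.nb[i - 1] * (size_t) t.ne[i - 1];
    }
    assert(is_contiguous(&t));
    return t;
}
```

A transposed view (sizes and strides of the first two dimensions swapped) fails the check, while a view whose only unusual stride belongs to a size-1 dimension still passes.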

tests : add non-cont unary tests (llama/7857)

c29e392

* tests : add non-cont unary tests
* ggml : update unary asserts and "supports_op" ggml-ci

CUDA: fix broken oob check for FA vec f32 kernel (llama/7904)

228a35f

rpc : fix ggml_backend_rpc_supports_buft() (llama/7918)

d13c89f

metal : utilize max shared memory for mul_mat_id (llama/7935)

ca9e524

CUDA: faster q2_K, q3_K MMQ + int8 tensor cores (llama/7921)

65d8379

* CUDA: faster q2_K, q3_K MMQ + int8 tensor cores
* try CI fix
* try CI fix
* try CI fix
* fix data race
* revert q2_K precision related changes

tests : adapt to changes (#0)

77ea030

sync : llama.cpp

872e074

ggml-ci

cuda : update build (#0)

1a9eb9c

ggml-ci

ggml : remove opencl (#0)

8714ee5

ggml-ci

ci : add GG_BUILD_NO_DOWNLOAD

e2b8b50

ggml-ci

ggerganov force-pushed the sync branch from ada4ec4 to e2b8b50 Compare June 15, 2024 18:28

ggerganov merged commit dee0d41 into master Jun 15, 2024
10 checks passed

ggerganov deleted the sync branch June 16, 2024 10:59
