sync : llama.cpp #838

ggerganov · 2024-05-26T15:36:07Z

No description provided.

As discussed in PR #6766, CUDA graphs were being disabled in the presence of long prompts. This fixes the issue by preventing the consecutive update counter from incrementing unnecessarily for tokens in which CUDA graphs are disabled due to batch size > 1.
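
A minimal sketch of the counter behaviour described above, for illustration only. The struct, field names, function name and the threshold value are assumptions and are not taken from the ggml CUDA backend; the point is simply that evaluations with batch size > 1 return early and leave the consecutive-update counter untouched, so a long prompt no longer disables graphs for the single-token evaluations that follow.

```cpp
#include <cassert>

// Hypothetical per-context state (names are illustrative, not the real ggml fields).
struct GraphState {
    int  consecutive_updates = 0;        // graph updates seen in a row
    bool disabled_too_many_updates = false;
};

// Assumed threshold, chosen only for this example.
constexpr int kMaxConsecutiveUpdates = 4;

// Returns true if a CUDA graph may be used for this evaluation.
bool should_use_graph(GraphState & st, int batch_size, bool graph_needs_update) {
    assert(batch_size >= 1);

    if (st.disabled_too_many_updates) {
        return false;
    }
    if (batch_size > 1) {
        // Graphs are skipped for prompt processing. Returning here, before the
        // counter logic, is the fix: these evaluations must not count as updates.
        return false;
    }
    if (graph_needs_update) {
        st.consecutive_updates++;
    } else {
        st.consecutive_updates = 0;
    }
    if (st.consecutive_updates >= kMaxConsecutiveUpdates) {
        // Too many updates in a row: graphs cost more than they save.
        st.disabled_too_many_updates = true;
        return false;
    }
    return true;
}
```

With this ordering, any number of batched prompt evaluations leaves `consecutive_updates` at its previous value, and only repeated single-token updates can trip the threshold.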

…/6915)

* Just reordering some structs.
* Adding in the calls to mm_pause
* Passing around the state
* Renaming and moving a bunch of variables around.
* Extracting the logic to its own function.
* Moving some variable definitions into the chunk function.
* Moving some variables around
* moving src1_cont inside
* Moving row_size
* adding the current_chunk
* Reorg the code.
* Formatting to match the orig patch
* starting to setup the chunking variables
* Starting the buildup of the loop
* The yield shouldn't be necessary.
* adding the looping structure based on the chunk configuration.
* Add in the re-chunking code.
* Making it much more likely to rechunk.
* disable resizing if numa is enabled.
* Updating comments with what we've learned.
* Fix formatting
* Couple more formatting fixes.
* More style fixes.
* Fix Warnings
* Going with unused because there's conditional logic that needs it.
* Update ggml.c
* Update ggml.c

… MSVC (llama/7191)

* logging: add proper checks for clang to avoid errors and warnings with VA_ARGS
* build: add CMake Presets and toolchain files for Windows ARM64
* matmul-int8: enable matmul-int8 with MSVC and fix Clang warnings
* ci: add support for optimized Windows ARM64 builds with MSVC and LLVM
* matmul-int8: fixed typos in q8_0_q8_0 matmuls
  Co-authored-by: Georgi Gerganov
* matmul-int8: remove unnecessary casts in q8_0_q8_0

---------

Co-authored-by: Georgi Gerganov

This change upstreams llamafile's vectorized expf() functions. This lets us compute softmax and silu more accurately than the short[65536] lookup table that GGML previously used to make this operation go faster. We can support aarch64 and SSE2+ with a worst-case rounding error of 2 ULP. It makes `make -j8 tests && ./tests/test-backend-ops -o SOFT_MAX -b CPU perf` go 1.5x faster for SSE2+FMA, 1.9x faster for AVX2+FMA and 2.1x faster on AVX512.
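
For reference, the two operations mentioned above can be sketched in scalar form by calling the exponential directly instead of indexing a precomputed table. This is not the vectorized kernel from the change: `std::exp` stands in for the SIMD expf(), and the function names are assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// SiLU computed directly: x * sigmoid(x) = x / (1 + e^-x).
inline float silu(float x) {
    return x / (1.0f + std::exp(-x));
}

// In-place softmax over n values.
// Subtracting the maximum first keeps exp() from overflowing for large inputs.
void softmax_inplace(float * x, std::size_t n) {
    assert(n > 0);

    float max_val = x[0];
    for (std::size_t i = 1; i < n; ++i) {
        if (x[i] > max_val) {
            max_val = x[i];
        }
    }

    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        x[i] = std::exp(x[i] - max_val);  // direct evaluation, no lookup table
        sum += x[i];
    }

    for (std::size_t i = 0; i < n; ++i) {
        x[i] /= sum;
    }
}
```

A table indexed by 16-bit values can only return results at that precision, which is why evaluating the exponential in full float precision can be more accurate; the speedups quoted above come from the SIMD implementation, which this scalar sketch does not reproduce.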

* logging: output capture in cuda module
* fix compile error
* fix: vsnprintf terminates with 0, string use not correct
* post review
* Update llama.cpp
  Co-authored-by: slaren
* Update llama.cpp
  Co-authored-by: slaren

---------

Co-authored-by: slaren

* Fix empty Vulkan host buffers
  Add fp32 fp16 matmul shader
  Fix matmul shader alignment
* Remove deprecated tensor->backend uses
* Fix Vulkan validation errors on embedding models with no offloaded layers
* Fix Vulkan llava segfault when not offloading layers

* add loongarch lsx and lasx optimize code
* Add loongarch compilation support to makefile
* revert stb_image.h
* opt bytes_from_nibbles_32 and sum_i16_pairs_float
* fix undeclared
* format code
* update
* update 2

---------

Co-authored-by: Jinyang He

* add phi3 128k support in convert-hf-to-gguf
* add phi3 128k support in cuda
* address build warnings on llama.cpp
* adjust index value in cuda long rope freq factors
* add long rope support in ggml cpu backend
* make freq factors only depend on ctx size
* remove unused rope scaling type 'su' from gguf converter
* fix flint warnings on convert-hf-to-gguf.py
* set to the short freq factor when context size is smaller than trained context size
* add one line of comments
* metal : support rope freq_factors
* ggml : update ggml_rope_ext API to support freq. factors
* backends : add dev messages to support rope freq. factors
* minor : style
* tests : update to use new rope API
* backends : fix pragma semicolons
* minor : cleanup
* llama : move rope factors from KV header to tensors
* llama : remove tmp assert
* cuda : fix compile warning
* convert : read/write n_head_kv
* llama : fix uninitialized tensors

---------

Co-authored-by: Georgi Gerganov
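
As a rough illustration of what per-dimension frequency factors do in a long-rope setup, the sketch below divides each rotation angle by its factor and picks the short or long factor set from the context size, as the commit notes above describe. Function names, parameter names and the exact boundary condition are assumptions for this example; it is not the ggml_rope_ext implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Rotation angles for one token position (hypothetical helper).
// freq_factors may be empty, in which case every factor is treated as 1.
std::vector<float> rope_angles(int pos, int n_dims, float freq_base,
                               const std::vector<float> & freq_factors) {
    assert(n_dims > 0 && n_dims % 2 == 0);
    assert(freq_factors.empty() ||
           freq_factors.size() == static_cast<std::size_t>(n_dims / 2));

    std::vector<float> theta(n_dims / 2);
    for (int i = 0; i < n_dims / 2; ++i) {
        // Standard RoPE angle: pos * base^(-2i/d).
        const float base_angle = pos * std::pow(freq_base, -2.0f * i / n_dims);
        // Assumption: the factor divides the angle, stretching the wavelength.
        const float ff = freq_factors.empty() ? 1.0f : freq_factors[i];
        theta[i] = base_angle / ff;
    }
    return theta;
}

// Use the short factors while the context fits in the trained context size,
// and the long factors otherwise (the <= boundary is an assumption).
const std::vector<float> & select_freq_factors(int n_ctx, int n_ctx_train,
                                               const std::vector<float> & short_factors,
                                               const std::vector<float> & long_factors) {
    return n_ctx > n_ctx_train ? long_factors : short_factors;
}
```

Because the choice depends only on the context size, the selected factors can be fixed once per context rather than recomputed per token.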

AidanBeltonS and others added 30 commits May 26, 2024 18:00

Add missing " (llama/7303)

559f2ac

ggml : tag ggml_tensor::backend as deprecated (llama/7290)

c398ed5

rpc : add command line arg for specifying backend memory

60ce953

ref: #7293

ggml-quants, llama : removed excess checks (llama/7274)

5c18324

rpc : set SO_REUSEADDR for the server socket (llama/7320)

91e23fe

ref: #7293

CUDA: faster large batch FA without tensor cores (llama/7314)

e77703c

ggml : fix quants nans when all the group weights are very close to zero (llama/7313)

45a8ce4

Update and fix Vulkan soft_max and argsort implementations (llama/7237)

ebf665e

* Update and fix Vulkan softmax implementation
* Update and fix Vulkan argsort implementation

cuda : add half2 __shfl_xor() for ROCm 5.5 (llama/7263)

470037b

CUDA: deduplicate FlashAttention code (llama/7352)

9023137

android : use "ci-android" branch for CI (llama/7341)

a1cd16e

* android : use "ci-android" branch for CI
* ggml : disable SIMD exp and silu for 32-bit ARM ggml-ci
* android : do not fetch, use add_subdirectory instead
* cmake : provide binary dir

cuda : clear error after buffer allocation failure (llama/7376)

27a7aff

ggml: implement quantized KV cache for FA (llama/7372)

0645b36

ggml : fix another case of quants nans (llama/7387)

4a92959

Add provisions for windows support for BF16 code including CMake provision for enabling AVX512_BF16 (llama/7258)

36ee9fc

ggml-opencl, llama: using reserve() if count already known (llama/7272)

0f1f4cf

Update SYCL upscale operation (llama/7321)

037b549

* Update SYCL upscale operation
* Formatting
* Remove messages

rpc : track allocated buffers (llama/7411)

8bfc612

* rpc : track allocated buffers ref: #7407
* rpc : pack rpc_tensor tightly

CUDA: deduplicate mmq code (llama/7397)

27cc251

CUDA: fix unused warning in mmq.cu (llama/7442)

7f87624

metal : handle F16 inf values, fix FA partial offload (llama/7434)

c23318a

ggml-ci

cuda : fix rope + add tests (llama/7452)

2d2eae0

* cuda : fix rope pos data ggml-ci
* ggml : drop mode & 1 == 1 support for ggml_rope ggml-ci
* ggml : support freq_factors for f16 rope (CPU) ggml-ci
* tests : add rope tests using frequency factors ggml-ci

JohannesGaessler and others added 11 commits May 26, 2024 18:00

CUDA: remove incorrect precision check (llama/7454)

dfb71ab

cuda : fix compile warning (llama/7454)

d7f4773

CUDA: fix FA out-of-bounds writes (llama/7465)

a16b243

CUDA: fix FA out-of-bounds reads (llama/7479)

16e642e

Update vulkan rope implementation to support frequency factors (llama/7475)

48adf49

ggml : drop support for QK_K=64 (llama/7473)

1dca0f7

* ggml : drop support for QK_K=64 ggml-ci
* opencl : restore QK_K=256 define

ggml : remove ggml_flash_attn and ggml_flash_ff (llama/7463)

9096f30

ggml-ci

ggml : silence UB sanitizer error during iq2_xxs quantization (llama/0)

72e6664

ggml: aarch64: SVE kernels for q8_0_q8_0, q4_0_q8_0 vector dot (llama/7433)

0dc8f50

* Add SVE support for q4_0_q8_0 q8_0_q8_0
* remove ifdef

sync : llama.cpp

30490eb

ggml-ci

ggml : restore ggml_rope_xpos_inplace (#0)

578eed8

ggml-ci

ggerganov merged commit ff1c61c into master May 28, 2024
10 of 11 checks passed

ggerganov deleted the sync branch May 28, 2024 11:41
