Releases: ggerganov/llama.cpp
b3488
ggml : bugfix: fix the inactive elements is agnostic for risc-v vector… (#8748)

In this code we want inactive elements to retain the value they previously held when mask[i] is false, so the undisturbed policy must be used. With the default agnostic policy of the RVV intrinsics, inactive elements may either keep their value or be overwritten with all 1s.

Co-authored-by: carter.li <[email protected]>
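The distinction between the two policies can be illustrated in plain Python (a conceptual sketch of the semantics only, not the RVV intrinsics themselves; the helper names are hypothetical):

```python
def masked_op_undisturbed(dest, src, mask, op):
    """Undisturbed semantics: inactive elements (mask[i] is False)
    keep the value dest previously held."""
    return [op(s) if m else d for d, s, m in zip(dest, src, mask)]

def masked_op_agnostic(dest, src, mask, op):
    """Agnostic semantics: inactive elements may keep their value OR
    be overwritten with all 1s -- the hardware is free to choose.
    Here we model the all-1s case to show why it breaks code that
    expects the old values to survive."""
    ALL_ONES = 0xFFFFFFFF
    return [op(s) if m else ALL_ONES for _, s, m in zip(dest, src, mask)]

dest = [10, 20, 30, 40]
src = [1, 2, 3, 4]
mask = [True, False, True, False]
double = lambda x: 2 * x

print(masked_op_undisturbed(dest, src, mask, double))  # [2, 20, 6, 40]
print(masked_op_agnostic(dest, src, mask, double))     # inactive lanes clobbered
```

The fix in this release selects the undisturbed variant of the intrinsics so the masked-out lanes are guaranteed to survive.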
b3487
cuda : organize vendor-specific headers into vendors directory (#8746) Signed-off-by: Xiaodong Ye <[email protected]>
b3486
[SYCL] add conv support (#8688)
b3485
cmake: use 1 more thread for non-ggml in CI (#8740)
b3484
chore : Fix vulkan related compiler warnings, add help text, improve CLI options (#8477)

* chore: Fix compiler warnings, add help text, improve CLI options
* Add prototypes for function definitions
* Invert logic of --no-clean option to be more intuitive
* Provide a new help prompt with clear instructions
* chore : Add ignore rule for vulkan shader generator
* Update ggml/src/vulkan-shaders/vulkan-shaders-gen.cpp
* chore : Remove void and apply C++ style empty parameters

Signed-off-by: teleprint-me <[email protected]>
Co-authored-by: 0cc4m <[email protected]>
b3483
llama : refactor session file management (#8699)

* llama : refactor session file management
* llama : saving and restoring state checks for overflow
  The size of the buffers is now given to the functions working with them; otherwise a truncated file could cause out-of-bounds reads.
* llama : stream from session file instead of copying into a big buffer
  Loading session files should no longer cause a memory usage spike.
* llama : llama_state_get_size returns the actual size instead of max
  This is a breaking change, but it makes that function *much* easier to keep up to date, and it makes it reflect the behavior of llama_state_seq_get_size.
* llama : share code between whole and seq_id-specific state saving
  Both session file types now use a more similar format.
* llama : no longer store all hparams in session files
  Instead, the model arch name is stored. The layer count and the embedding dimensions of the KV cache are still verified when loading; storing all the hparams is not necessary.
* llama : fix uint64_t format type
* llama : various integer type cast and format string fixes
  Some platforms use "%lu" and others "%llu" for uint64_t. Not sure how to handle that, so casting to size_t when displaying errors.
* llama : remove _context suffix for llama_data_context
* llama : fix session file loading
  llama_state_get_size cannot be used to get the max size anymore.
* llama : more graceful error handling of invalid session files
* llama : remove LLAMA_MAX_RNG_STATE
  It's no longer necessary to limit the size of the RNG state, because the max size of session files is not estimated anymore.
* llama : cast seq_id in comparison with unsigned n_seq_max
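The overflow-checking idea behind this refactor can be sketched in a few lines (an illustrative Python sketch with hypothetical names, not the actual C++ implementation): a reader that knows the total size up front refuses to read past it, so a truncated session file fails cleanly instead of causing out-of-bounds reads.

```python
class SessionReader:
    """Bounds-checked sequential reader over a session file's bytes."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0

    def read(self, n: int) -> bytes:
        # Reject reads past the end instead of silently truncating:
        # a damaged or truncated file raises rather than corrupting state.
        if self.pos + n > len(self.data):
            raise ValueError("truncated or invalid session file")
        chunk = self.data[self.pos:self.pos + n]
        self.pos += n
        return chunk

r = SessionReader(b"\x01\x02\x03")
print(r.read(2))  # b'\x01\x02'
# r.read(2) would now raise ValueError: only 1 byte remains
```

Streaming through the file in small bounded reads like this, instead of copying everything into one big buffer first, is also what removes the memory usage spike mentioned above.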
b3482
feat: Support Moore Threads GPU (#8383)

* Update doc for MUSA
* Add GGML_MUSA in Makefile
* Add GGML_MUSA in CMake
* CUDA => MUSA
* MUSA adds support for __vsubss4
* Fix CI build failure

Signed-off-by: Xiaodong Ye <[email protected]>
b3479
ggml : add missing semicolon (#0) ggml-ci
b3472
llama : add support for llama 3.1 rope scaling factors (#8676)

* Add llama 3.1 rope scaling factors to llama conversion and inference
  This commit generates the rope factors during conversion and stores them in the resulting model as a tensor. At inference time, these factors are passed to the `ggml_rope_ext` rope operation, improving results for context windows above 8192.
* Update convert_hf_to_gguf.py
* address comments
* address comments
* Update src/llama.cpp
* Update convert_hf_to_gguf.py

Co-authored-by: compilade <[email protected]>
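The per-dimension factors can be sketched as follows (a Python sketch based on the published Llama 3.1 rope-scaling scheme; the default hyperparameters shown are assumptions taken from the Llama 3.1 config and should be verified against convert_hf_to_gguf.py). Each rotary frequency gets a divisor: high-frequency dimensions are left unscaled, low-frequency dimensions are divided by the full scale factor, and dimensions in between are smoothly interpolated.

```python
import math

def llama31_rope_factors(head_dim: int, base: float = 500000.0,
                         scale: float = 8.0,
                         low_freq_factor: float = 1.0,
                         high_freq_factor: float = 4.0,
                         orig_ctx: int = 8192) -> list[float]:
    """Compute one frequency divisor per rotary dimension pair."""
    low_wavelen = orig_ctx / low_freq_factor    # longest unspecial wavelength
    high_wavelen = orig_ctx / high_freq_factor  # shortest special wavelength
    factors = []
    for i in range(0, head_dim, 2):
        freq = base ** (-i / head_dim)
        wavelen = 2 * math.pi / freq
        if wavelen < high_wavelen:
            factors.append(1.0)       # high frequency: leave unscaled
        elif wavelen > low_wavelen:
            factors.append(scale)     # low frequency: full scaling
        else:
            # smooth transition between the two regimes
            smooth = (orig_ctx / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor)
            factors.append(1.0 / ((1.0 - smooth) / scale + smooth))
    return factors

f = llama31_rope_factors(128)
print(f[0], f[-1])  # 1.0 8.0 -- unscaled at high freq, fully scaled at low freq
```

The resulting vector is what gets stored in the model file as a tensor and handed to `ggml_rope_ext` at inference time.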
b3471
llama : add function for model-based max number of graph nodes (#8622)

* llama : model-based max number of graph nodes (ggml-ci)
* llama : disable 405B max_nodes path due to lack of complaints (ggml-ci)
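The idea of a model-based node budget can be sketched like this (a hypothetical illustration; the multiplier and floor below are invented for the example, not the values llama.cpp uses): scale the compute-graph node cap with the number of model tensors, so larger models get a bigger graph instead of hitting a single hard-coded constant.

```python
def model_max_nodes(n_tensors: int) -> int:
    """Hypothetical heuristic: a fixed floor for small models, and a
    per-tensor multiple for large ones (more layers => more tensors
    => more graph nodes needed)."""
    return max(8192, 5 * n_tensors)

print(model_max_nodes(291))    # small model: the floor applies -> 8192
print(model_max_nodes(10000))  # very large model: scaled -> 50000
```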