
sync : llama.cpp #856

Merged: 43 commits, Jun 15, 2024
Changes from 1 commit
1230387
ggml : use atomic_flag for critical section (llama/7598)
slaren May 29, 2024
8fcca7c
llama-bench : add support for the RPC backend (llama/7435)
rgerganov May 29, 2024
1b7ff70
cuda : non-cont concat support (llama/7610)
ggerganov May 29, 2024
f2703f7
ggml : fix YARN + add tests + add asserts (llama/7617)
ggerganov May 29, 2024
79751ef
metal : add missing asserts (llama/7617)
ggerganov May 29, 2024
0269773
metal : remove invalid asserts (llama/7617)
ggerganov May 29, 2024
ef948cb
ggml : fix loongarch build (O2 issue) (llama/7636)
junchao-loongson May 30, 2024
a539f93
faster avx512 exp implementation (llama/7551)
chriselrod May 30, 2024
80d21d4
ggml : fix loongson compile warnings (llama/7537)
ggerganov May 31, 2024
5e6eeed
CUDA: quantized KV support for FA vec (llama/7527)
JohannesGaessler Jun 1, 2024
2a2e184
CUDA: fix Pascal FA, deq. KV to FP16 for batch > 8 (llama/7681)
JohannesGaessler Jun 1, 2024
030c282
Fix FlashAttention debug test, FP32 assert (llama/7684)
JohannesGaessler Jun 1, 2024
6ec5cfc
fix bug introduced in using calloc (llama/7701)
airlied Jun 2, 2024
9f8d074
kompute : implement op_getrows_f32 (llama/6403)
woachk Jun 3, 2024
225883c
Vulkan Mixture of Experts (MoE) support (llama/7628)
0cc4m Jun 3, 2024
597c758
ggml : use OpenMP as a thread pool (llama/7606)
msy-kato Jun 3, 2024
cac02b4
llama : offload to RPC in addition to other backends (llama/7640)
rgerganov Jun 3, 2024
a96df45
ggml : prevent builds with -ffinite-math-only (llama/7726)
ggerganov Jun 4, 2024
ad2ed7f
ggml : remove OpenCL (llama/7735)
ggerganov Jun 4, 2024
6eb6783
Allow number of nodes in CUDA graph to change (llama/7738)
agray3 Jun 4, 2024
c943c8e
ggml : refactor rope norm/neox (llama/7634)
ggerganov Jun 5, 2024
024c5bc
CUDA: refactor mmq, dmmv, mmvq (llama/7716)
JohannesGaessler Jun 5, 2024
b89a6ff
fix softmax r2r result wrong issue (llama/7811)
pengxin99 Jun 7, 2024
c7b818b
vulkan : reuse parent extra for views (llama/7806)
slaren Jun 7, 2024
ea4c21b
CUDA: revise q8_1 data layout for mul_mat_q (llama/7824)
JohannesGaessler Jun 9, 2024
2552787
use the correct SYCL context for host USM allocations (llama/7777)
bashbaug Jun 10, 2024
52d4a6d
CUDA: use tensor cores for MMQ (llama/7676)
JohannesGaessler Jun 10, 2024
2238cd2
CUDA: int8 tensor cores for MMQ (q4_K, q5_K, q6_K) (llama/7860)
JohannesGaessler Jun 11, 2024
9279216
Update Vulkan RoPE implementation (llama/7818)
0cc4m Jun 11, 2024
5de7ab4
vulkan: select only one device for single gpu with multiple drivers (…
Adriankhl Jun 11, 2024
47968ff
ggml : improve ggml_is_contiguous logic (llama/7856)
ggerganov Jun 12, 2024
c29e392
tests : add non-cont unary tests (llama/7857)
ggerganov Jun 12, 2024
228a35f
CUDA: fix broken oob check for FA vec f32 kernel (llama/7904)
JohannesGaessler Jun 12, 2024
5a8910e
move BLAS to a separate backend (llama/6210)
slaren Jun 13, 2024
d13c89f
rpc : fix ggml_backend_rpc_supports_buft() (llama/7918)
rgerganov Jun 13, 2024
ca9e524
metal : utilize max shared memory for mul_mat_id (llama/7935)
ggerganov Jun 14, 2024
65d8379
CUDA: faster q2_K, q3_K MMQ + int8 tensor cores (llama/7921)
JohannesGaessler Jun 14, 2024
f00648a
remove global variables (llama/7710)
airMeng Jun 15, 2024
77ea030
tests : adapt to changes (#0)
ggerganov Jun 15, 2024
872e074
sync : llama.cpp
ggerganov Jun 15, 2024
1a9eb9c
cuda : update build (#0)
ggerganov Jun 15, 2024
8714ee5
ggml : remove opencl (#0)
ggerganov Jun 15, 2024
e2b8b50
ci : add GG_BUILD_NO_DOWNLOAD
ggerganov Jun 15, 2024
metal : utilize max shared memory for mul_mat_id (llama/7935)
ggerganov committed Jun 15, 2024
commit ca9e5242c9a7c773882a3137cb742bf495a45129
3 changes: 2 additions & 1 deletion src/ggml-metal.m
@@ -1862,9 +1862,10 @@ static enum ggml_status ggml_metal_graph_compute(
             // ne21 = n_rows
             const int dst_rows = ne20*ne21;
             const int dst_rows_min = n_as;
+            const int dst_rows_max = (ctx->device.maxThreadgroupMemoryLength - 32 - 8192)/4;

             // max size of the rowids array in the kernel shared buffer
-            GGML_ASSERT(dst_rows <= 2048);
+            GGML_ASSERT(dst_rows <= dst_rows_max);

             // for now the matrix-matrix multiplication kernel only works on A14+/M1+ SoCs
             // AMD GPU and older A-chips will reuse matrix-vector multiplication kernel
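
For reference, a minimal sketch of what the new bound works out to, assuming a device that reports 32 KB of threadgroup memory (the variable names and the 32 KB figure are illustrative assumptions; the 32-byte and 8192-byte reservations and the 4-byte per-entry size are the constants visible in the diff above):

    // Hypothetical stand-alone sketch; the real code reads maxThreadgroupMemoryLength
    // from the MTLDevice at graph-compute time instead of hard-coding it.
    #include <stdio.h>

    int main(void) {
        const int max_tg_mem = 32768; // assumed: device reports 32 KB of threadgroup (shared) memory
        const int reserved   = 32;    // bytes the kernel keeps aside (constant taken from the diff)
        const int mm_tile    = 8192;  // bytes used by the matmul tile (constant taken from the diff)
        const int rowid_size = 4;     // bytes per entry of the rowids array in shared memory

        const int dst_rows_max = (max_tg_mem - reserved - mm_tile) / rowid_size;
        printf("dst_rows_max = %d\n", dst_rows_max); // prints 6136 under these assumptions
        return 0;
    }

With these numbers the assert allows up to 6136 destination rows, roughly three times the previous hard-coded limit of 2048, and the bound now scales with whatever maxThreadgroupMemoryLength the device actually reports rather than assuming a fixed shared-memory budget.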