Flash attention implementations do not handle case where value vectors have different dimension from query vectors #7343

Open · fairydreaming opened this issue May 17, 2024 · 6 comments · Labels: enhancement, stale

@fairydreaming (Collaborator)

For example, in ggml.c the implementations of the ops related to flash attention declare a variable D and use it both as the dimension of the value vectors and as the dimension of the key/query vectors. This will fail for models where query and value vectors have different lengths (for example DeepSeek-V2).
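To see why the two lengths are independent, recall the shape of scaled dot-product attention (the standard definition, added here for context):

$$
\operatorname{Attn}(q, K, V) = \operatorname{softmax}\!\left(\frac{q K^{\top}}{\sqrt{d_k}}\right) V,
\qquad q \in \mathbb{R}^{d_k},\ K \in \mathbb{R}^{n \times d_k},\ V \in \mathbb{R}^{n \times d_v}.
$$

The attention scores only require the query and key lengths to agree ($d_k$), while the output is a weighted sum of value rows and therefore lives in $\mathbb{R}^{d_v}$; nothing forces $d_v = d_k$.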

Below are selected fragments of the GGML_OP_FLASH_ATTN_EXT op implementation that illustrate the problem.

Creation of the result tensor (llama.cpp/ggml.c, lines 6792 to 6793 in 51e9d02):

```c
int64_t ne[4] = { q->ne[0], q->ne[2], q->ne[1], q->ne[3] };
struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne);
```

(Note that the query tensor dimensions are used everywhere, while in reality ne[0] should be equal to ne[0] of the value tensor, because the attention output is a linear combination of value vectors.)
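As a minimal sketch of the corrected shape computation (assuming `v` is the value tensor in scope of `ggml_flash_attn_ext`; an illustration of the idea, not the merged code):

```c
// take ne[0] from the value tensor: each output row is a linear
// combination of value rows, so it has the value head size;
// the remaining dimensions still follow the query tensor
int64_t ne[4] = { v->ne[0], q->ne[2], q->ne[1], q->ne[3] };
struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne);
```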

Definition of the variable D (llama.cpp/ggml.c, line 15879 in 51e9d02):

```c
const int64_t D = neq0;
```

Assertions all expecting the same length (llama.cpp/ggml.c, lines 15889 to 15891 in 51e9d02):

```c
GGML_ASSERT(neq0 == D);
GGML_ASSERT(nek0 == D);
GGML_ASSERT(nev0 == D);
```

Usage of D as the dimension of a value vector (llama.cpp/ggml.c, line 15958 in 51e9d02):

```c
memset(V16, 0, D*sizeof(ggml_fp16_t));
```

Usage of D as the dimension of a query vector (llama.cpp/ggml.c, lines 15985 to 15987 in 51e9d02):

```c
for (int64_t d = 0; d < D; ++d) {
    Q16[d] = GGML_FP32_TO_FP16(pq[d]);
}
```

Suggested solution: introduce two variables, Dq (length of the query vector) and Dv (length of the value vector), and use Dq as the query/key vector length and Dv as the value vector length. I fixed ggml_compute_forward_flash_attn_ext_f16() this way and it produces correct results (confirmed by running DeepSeek-V2 with the -fa option).
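A minimal sketch of that split, stitched together from the fragments quoted above (abbreviated and illustrative only, not the exact patch):

```c
const int64_t Dq = neq0;   // query/key head size
const int64_t Dv = nev0;   // value head size

GGML_ASSERT(nek0 == Dq);   // keys must match queries for the dot product
GGML_ASSERT(nev0 == Dv);   // values are allowed a different length

// the output accumulator holds one value-sized row
memset(V16, 0, Dv*sizeof(ggml_fp16_t));

// converting the query row still iterates over the query head size
for (int64_t d = 0; d < Dq; ++d) {
    Q16[d] = GGML_FP32_TO_FP16(pq[d]);
}
```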

I'm not 100% sure whether the CUDA and Metal implementations are also affected, but it's likely - I found the same variable D used in their code, along with comments like "K and V have same shape".

@ggerganov (Owner)

Thanks for reporting that - the CUDA and Metal kernels should also be affected. We should fix this, but maybe after DS2 support is merged, so we have something to test with.

@oldgithubman

Possibly relevant - #2445 (comment)

@bartowski1182 (Contributor)

@JohannesGaessler don't wanna bother you, but I assume you'd be best suited to handle this, or at least to shed some light on how to handle it.

@JohannesGaessler (Collaborator)

I don't think you need any special considerations in terms of program correctness - you would just have to implement it. The bigger challenge will be to implement this in such a way that the performance is good.

ggerganov added the enhancement (New feature or request) label and removed the bug label on Jun 18, 2024
@ggerganov (Owner)

Btw, supporting different K/V head sizes might dramatically increase the number of FA kernels that we have to compile, so it's probably not really worth it.
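(Added for scale, assuming one kernel instantiation per supported head size: with $n$ supported sizes, per-size kernels need $n$ variants, while kernels specialized per $(D_q, D_v)$ pair need up to $n^2$; for example, 6 sizes would grow from 6 to as many as 36 instantiations per kernel family.)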

github-actions bot added the stale label on Jul 21, 2024
@oldgithubman

@ggerganov reopen? FA would be very useful for DeepSeek-V2.

github-actions bot removed the stale label on Jul 22, 2024
github-actions bot added the stale label on Aug 21, 2024