
GPT Benchmarks #2

Open
mallorbc opened this issue Oct 13, 2022 · 2 comments
Labels
question Further information is requested

Comments

@mallorbc

GPT models without a KV cache have to recalculate attention over all previous tokens for every new token, so compute time grows quadratically with sequence length.

Thus, for your benchmarks, how many tokens were generated, and how many tokens were in the context in total (prompt included)? Does your implementation support KV caching?

@ggerganov
Owner

In all benchmarks I generated 200 tokens, starting with a prompt consisting of a single token.

My implementation does support KV caching - I used the term "memory":

// key + value memory
{
    const auto & hparams = model.hparams;

    const int n_embd  = hparams.n_embd;
    const int n_layer = hparams.n_layer;
    const int n_ctx   = hparams.n_ctx;

    const int n_mem      = n_layer*n_ctx;
    const int n_elements = n_embd*n_mem;

    model.memory_k = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_elements);
    model.memory_v = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_elements);

    const size_t memory_size = ggml_nbytes(model.memory_k) + ggml_nbytes(model.memory_v);

    printf("%s: memory size = %8.2f MB, n_mem = %d\n", __func__, memory_size/1024.0/1024.0, n_mem);
}
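
For a sense of scale: with GPT-2 124M hyperparameters (n_embd = 768, n_layer = 12, n_ctx = 1024 - assumed here for illustration, not read from the benchmark run), the two F32 tensors come out to about 72 MB. A minimal standalone sketch of the same arithmetic:

#include <stdio.h>

int main(void) {
    // illustrative GPT-2 124M shapes (assumed), mirroring the allocation above
    const long n_embd  = 768;
    const long n_layer = 12;
    const long n_ctx   = 1024;

    const long n_mem      = n_layer*n_ctx;   // 12288 rows of keys/values across all layers
    const long n_elements = n_embd*n_mem;    // 9437184 floats per tensor

    // two F32 tensors (keys + values), 4 bytes per element
    const double memory_size = 2.0*n_elements*sizeof(float);

    printf("memory size = %8.2f MB, n_mem = %ld\n", memory_size/1024.0/1024.0, n_mem);
    return 0;
}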

Here we store new values into the memory:

// store key and value to memory
if (N >= 1) {
    struct ggml_tensor * k = ggml_view_1d(ctx0, model.memory_k, N*n_embd, (ggml_element_size(model.memory_k)*n_embd)*(il*n_ctx + n_past));
    struct ggml_tensor * v = ggml_view_1d(ctx0, model.memory_v, N*n_embd, (ggml_element_size(model.memory_v)*n_embd)*(il*n_ctx + n_past));

    ggml_build_forward_expand(&gf, ggml_cpy(ctx0, Kcur, k));
    ggml_build_forward_expand(&gf, ggml_cpy(ctx0, Vcur, v));
}
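
To make the indexing concrete: the 1D memory tensor is addressed as if it were laid out [n_layer][n_ctx][n_embd], so the byte offset above points at row n_past of layer il's slab. A minimal sketch of the same offset arithmetic with made-up values:

#include <stdio.h>
#include <stddef.h>

int main(void) {
    // assumed shapes and positions, for illustration only
    const size_t n_embd = 768, n_ctx = 1024;   // GPT-2 124M
    const size_t elem   = sizeof(float);       // what ggml_element_size() returns for an F32 tensor
    const size_t il = 5, n_past = 37, N = 1;   // layer 5, 37 cached tokens, 1 new token

    const size_t offset = elem*n_embd*(il*n_ctx + n_past); // start of row n_past in layer il's slab
    const size_t nbytes = elem*n_embd*N;                   // bytes written for the N new rows

    printf("write %zu bytes at byte offset %zu\n", nbytes, offset);
    return 0;
}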

And here we use the cached data:

// K = Kmem.view(n_embd/n_head, n_head, n_past + N).permute(0, 2, 1, 3)
// [64, n_past + N, 12]
struct ggml_tensor * K =
    ggml_permute(ctx0,
            ggml_reshape_3d(ctx0,
                ggml_view_1d(ctx0, model.memory_k, (n_past + N)*n_embd, il*n_ctx*ggml_element_size(model.memory_k)*n_embd),
                n_embd/n_head, n_head, n_past + N),
            0, 2, 1, 3);

// V_trans = Vmem.view(n_embd/n_head, n_head, n_past + N).permute(1, 2, 0, 3).contiguous()
// [n_past + N, 64, 12]
struct ggml_tensor * V_trans =
    ggml_permute(ctx0,
            ggml_reshape_3d(ctx0,
                ggml_view_1d(ctx0, model.memory_v, (n_past + N)*n_embd, il*n_ctx*ggml_element_size(model.memory_v)*n_embd),
                n_embd/n_head, n_head, n_past + N),
            1, 2, 0, 3);

Even with caching, the processing time increases as the number of generated tokens grows. The benchmark numbers are the average time per token over the 200 generated tokens.
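
As a rough model of why that is (a back-of-the-envelope sketch, not code from the benchmark): even with the cache, each new token attends over all n_past stored positions, so the attention part of step t does O(t) work per layer, while the matmuls against the fixed weights stay constant per token.

#include <stdio.h>

int main(void) {
    // assumed GPT-2 124M shapes and the 200-token run described above
    const long n_layer = 12, n_embd = 768, T = 200;

    for (long t = 1; t <= T; t += 50) {
        // multiply-accumulates spent on QK^T and attn*V for one new token at position t
        const long attn_macs = n_layer*2L*t*n_embd;
        printf("token %3ld: ~%ld attention MACs\n", t, attn_macs);
    }
    return 0;
}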

ggerganov added the question label Oct 13, 2022
@mallorbc
Author

Thanks for the insight. Also, very impressive work.

CCLDArjun pushed a commit to CCLDArjun/ggml that referenced this issue Dec 18, 2023
* vvhg-code-infill (ggerganov#1)

* infill in separate example (ggerganov#2)

* reverted changes to main and added infill example

* cleanup

* naming improvement

* make : add missing blank line

* fix missing semicolon

* brought infill up to current main code

* cleanup

---------

Co-authored-by: Cebtenzzre <[email protected]>