Bug: Embedding endpoint takes exponential time to process a long unknown token #8029

Closed
skoulik opened this issue Jun 20, 2024 · 3 comments · Fixed by #8034
Labels: bug (Something isn't working) · good first issue (Good for newcomers) · medium severity (Used to report medium severity bugs in llama.cpp, e.g. Malfunctioning Features but still useable)

Comments


skoulik commented Jun 20, 2024

What happened?

I am feeding the server's embeddings endpoint a long sequence of "a" characters (a single unknown token), and the time the server takes to respond grows exponentially with the input length.

Len, Time (sec)
1, 2.1065430641174316
1001, 0.25893115997314453
2001, 1.0023913383483887
3001, 2.5023365020751953
4001, 5.819644927978516
5001, 10.880044221878052
6001, 18.553367137908936
7001, 28.890690565109253
8001, 42.91253876686096
9001, 60.954835414886475
10001, 83.25964713096619
11001, 110.93159890174866
...
llama-server --n-gpu-layers 13 --model models\embedding\nomic-embed-text-v1.5.f16.gguf --ctx-size 8192 --batch-size 8192 --ubatch-size 8192 --rope-scaling yarn --rope-freq-scale .75 --port 8081 --embeddings

The embeddings model downloaded from: https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF

Judging by the load pattern (CPU at only ~10%, GPU idle), I suspect the problem is in the tokenizer.

Test script:

import httpx
import time

# Single client pointed at the local llama-server instance started above.
http_client = httpx.Client(
    base_url = "http://localhost:8081",
    timeout  = 600,
    headers  = {'Accept': "application/json"}
)

print("Len, Time (sec)")
for n in range(1, 16000, 1000):
    # A run of "a" characters that the tokenizer treats as a single unknown token.
    text = "a" * n

    # Time one /embeddings request end to end.
    start = time.time()
    request_json = {'input': [text], 'model': "", 'encoding_format': "float"}
    response = http_client.post(
        url  = "/embeddings",
        json = request_json
    )

    end = time.time()
    print(f"{n}, {end - start}")

Name and Version

b3187 (2075a66)

What operating system are you seeing the problem on?

Windows

Relevant log output

INFO [                    main] build info | tid="39092" timestamp=1718867632 build=3187 commit="2075a66a"
INFO [                    main] system info | tid="39092" timestamp=1718867632 n_threads=16 n_threads_batch=-1 total_threads=32 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: loaded meta data with 22 key-value pairs and 112 tensors from D:\code\test_llm\models\embedding\nomic-embed-text-v1.5.f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = nomic-bert
llama_model_loader: - kv   1:                               general.name str              = nomic-embed-text-v1.5
llama_model_loader: - kv   2:                     nomic-bert.block_count u32              = 12
llama_model_loader: - kv   3:                  nomic-bert.context_length u32              = 2048
llama_model_loader: - kv   4:                nomic-bert.embedding_length u32              = 768
llama_model_loader: - kv   5:             nomic-bert.feed_forward_length u32              = 3072
llama_model_loader: - kv   6:            nomic-bert.attention.head_count u32              = 12
llama_model_loader: - kv   7:    nomic-bert.attention.layer_norm_epsilon f32              = 0.000000
llama_model_loader: - kv   8:                          general.file_type u32              = 1
llama_model_loader: - kv   9:                nomic-bert.attention.causal bool             = false
llama_model_loader: - kv  10:                    nomic-bert.pooling_type u32              = 1
llama_model_loader: - kv  11:                  nomic-bert.rope.freq_base f32              = 1000.000000
llama_model_loader: - kv  12:            tokenizer.ggml.token_type_count u32              = 2
llama_model_loader: - kv  13:                tokenizer.ggml.bos_token_id u32              = 101
llama_model_loader: - kv  14:                tokenizer.ggml.eos_token_id u32              = 102
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = bert
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 100
llama_model_loader: - kv  20:          tokenizer.ggml.seperator_token_id u32              = 102
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - type  f32:   51 tensors
llama_model_loader: - type  f16:   61 tensors
llm_load_vocab: special tokens cache size = 5
llm_load_vocab: token to piece cache size = 0.2032 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = nomic-bert
llm_load_print_meta: vocab type       = WPM
llm_load_print_meta: n_vocab          = 30522
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 768
llm_load_print_meta: n_head           = 12
llm_load_print_meta: n_head_kv        = 12
llm_load_print_meta: n_layer          = 12
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 768
llm_load_print_meta: n_embd_v_gqa     = 768
llm_load_print_meta: f_norm_eps       = 1.0e-12
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 3072
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 0
llm_load_print_meta: pooling type     = 1
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 137M
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 136.73 M
llm_load_print_meta: model size       = 260.86 MiB (16.00 BPW)
llm_load_print_meta: general.name     = nomic-embed-text-v1.5
llm_load_print_meta: BOS token        = 101 '[CLS]'
llm_load_print_meta: EOS token        = 102 '[SEP]'
llm_load_print_meta: UNK token        = 100 '[UNK]'
llm_load_print_meta: SEP token        = 102 '[SEP]'
llm_load_print_meta: PAD token        = 0 '[PAD]'
llm_load_print_meta: CLS token        = 101 '[CLS]'
llm_load_print_meta: MASK token       = 103 '[MASK]'
llm_load_print_meta: LF token         = 0 '[PAD]'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors: offloading 12 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 13/13 layers to GPU
llm_load_tensors:        CPU buffer size =    44.72 MiB
llm_load_tensors:      CUDA0 buffer size =   216.15 MiB
.......................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 8192
llama_new_context_with_model: n_ubatch   = 8192
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000.0
llama_new_context_with_model: freq_scale = 0.75
llama_kv_cache_init:      CUDA0 KV buffer size =   288.00 MiB
llama_new_context_with_model: KV self size  =  288.00 MiB, K (f16):  144.00 MiB, V (f16):  144.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  3728.03 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   536.06 MiB
llama_new_context_with_model: graph nodes  = 453
llama_new_context_with_model: graph splits = 2
INFO [                    init] initializing slots | tid="39092" timestamp=1718867633 n_slots=1
INFO [                    init] new slot | tid="39092" timestamp=1718867633 id_slot=0 n_ctx_slot=8192
INFO [                    main] model loaded | tid="39092" timestamp=1718867633
INFO [                    main] chat template | tid="39092" timestamp=1718867633 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true
INFO [                    main] HTTP server listening | tid="39092" timestamp=1718867633 hostname="127.0.0.1" port="8081" n_threads_http="31"
INFO [            update_slots] all slots are idle | tid="39092" timestamp=1718867633

...

INFO [   launch_slot_with_task] slot is processing task | tid="39092" timestamp=1718868973 id_slot=0 id_task=110
INFO [            update_slots] kv cache rm [p0, end) | tid="39092" timestamp=1718869015 id_slot=0 id_task=110 p0=0
INFO [            update_slots] slot released | tid="39092" timestamp=1718869016 id_slot=0 id_task=110 n_ctx=8192 n_past=4002 n_system_tokens=0 n_cache_tokens=0 truncated=false
INFO [            update_slots] all slots are idle | tid="39092" timestamp=1718869016
INFO [      log_server_request] request | tid="37020" timestamp=1718869016 remote_addr="127.0.0.1" remote_port=54681 status=200 method="POST" path="/embeddings" params={}
INFO [   launch_slot_with_task] slot is processing task | tid="39092" timestamp=1718869016 id_slot=0 id_task=112
INFO [            update_slots] kv cache rm [p0, end) | tid="39092" timestamp=1718869076 id_slot=0 id_task=112 p0=0
INFO [            update_slots] slot released | tid="39092" timestamp=1718869076 id_slot=0 id_task=112 n_ctx=8192 n_past=4502 n_system_tokens=0 n_cache_tokens=0 truncated=false
INFO [            update_slots] all slots are idle | tid="39092" timestamp=1718869076
INFO [      log_server_request] request | tid="37020" timestamp=1718869076 remote_addr="127.0.0.1" remote_port=54681 status=200 method="POST" path="/embeddings" params={}
INFO [   launch_slot_with_task] slot is processing task | tid="39092" timestamp=1718869076 id_slot=0 id_task=114
INFO [            update_slots] kv cache rm [p0, end) | tid="39092" timestamp=1718869159 id_slot=0 id_task=114 p0=0
INFO [            update_slots] slot released | tid="39092" timestamp=1718869160 id_slot=0 id_task=114 n_ctx=8192 n_past=5002 n_system_tokens=0 n_cache_tokens=0 truncated=false
INFO [            update_slots] all slots are idle | tid="39092" timestamp=1718869160
INFO [      log_server_request] request | tid="37020" timestamp=1718869160 remote_addr="127.0.0.1" remote_port=54681 status=200 method="POST" path="/embeddings" params={}
INFO [   launch_slot_with_task] slot is processing task | tid="39092" timestamp=1718869160 id_slot=0 id_task=116
INFO [            update_slots] kv cache rm [p0, end) | tid="39092" timestamp=1718869270 id_slot=0 id_task=116 p0=0
INFO [            update_slots] slot released | tid="39092" timestamp=1718869271 id_slot=0 id_task=116 n_ctx=8192 n_past=5502 n_system_tokens=0 n_cache_tokens=0 truncated=false
INFO [            update_slots] all slots are idle | tid="39092" timestamp=1718869271
INFO [      log_server_request] request | tid="37020" timestamp=1718869271 remote_addr="127.0.0.1" remote_port=54681 status=200 method="POST" path="/embeddings" params={}
INFO [   launch_slot_with_task] slot is processing task | tid="39092" timestamp=1718869271 id_slot=0 id_task=118
skoulik added the bug-unconfirmed and medium severity labels on Jun 20, 2024
ggerganov (Owner) commented:

It looks like the WPM tokenizer is very slow in this case. Needs some profiling and optimizations

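For context, below is a minimal sketch of the greedy longest-match strategy that WordPiece/WPM-style tokenizers typically use. This is not the llama.cpp implementation, and the tiny vocabulary is a made-up placeholder; it only illustrates how a single long word that never matches can trigger on the order of n substring probes, each costing O(n) to build and hash, before the word falls back to a single [UNK] token.

# Illustrative sketch only: a naive greedy longest-match WordPiece tokenizer.
# This is NOT the llama.cpp WPM code; the vocabulary below is a toy stand-in.
def wordpiece_tokenize(word: str, vocab: set, unk: str = "[UNK]") -> list:
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate substring until it is found in the vocabulary.
        # Each probe builds an O(n) substring; up to n probes can be made per
        # start position, so a long word with no matches costs roughly O(n^2)
        # before the whole word is emitted as [UNK].
        while start < end:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]   # whole word falls back to a single unknown token
        tokens.append(match)
        start = end
    return tokens

if __name__ == "__main__":
    import time
    toy_vocab = {"hello", "##world"}   # deliberately tiny: "aaa..." never matches
    for n in (1000, 2000, 4000, 8000):
        t0 = time.time()
        wordpiece_tokenize("a" * n, toy_vocab)
        print(n, round(time.time() - t0, 3))

Real tokenizers usually avoid this worst case by capping the maximum word length or by trie-based matching; the timings reported above are consistent with an unbounded greedy scan dominating the request time.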
ggerganov added the bug and good first issue labels and removed the bug-unconfirmed label on Jun 20, 2024
ggerganov (Owner) commented:

#8034 should fix this

skoulik (Author) commented Jun 21, 2024

> #8034 should fix this

Yup, much faster now:

Len, Time (sec)
1, 2.0950117111206055
1001, 0.1460411548614502
2001, 0.0400087833404541
3001, 0.06301617622375488
4001, 0.08401727676391602
5001, 0.10601592063903809
6001, 0.13503074645996094
7001, 0.1870429515838623
8001, 0.21605181694030762
9001, 0.2813396453857422
10001, 0.32607340812683105
11001, 0.4120943546295166
12001, 0.48211002349853516
13001, 0.5802309513092041
14001, 0.6383733749389648
15001, 0.7357058525085449
