"gpt_tokenize: unknown token" running RedPajama #163

Closed
markdjwilliams opened this issue May 17, 2023 · 2 comments


markdjwilliams commented May 17, 2023

I'm hitting an error while running RedPajama. It's likely the result of a misunderstanding on my part, so I'm hoping somebody can shed some light on what I'm doing wrong.

To begin with, I've cloned ggml from commit 74705055853f7922e9622bdd0a1ebde2b8f57431. I build with gcc 9.4.0 on Linux x86:

mkdir build; cd build; cmake ..; make -j 12

This completes without error. I've already cloned https://huggingface.co/togethercomputer/RedPajama-INCITE-Base-3B-v1, so I proceed to the ggml conversion:

$ python examples/gpt-neox/convert-h5-to-ggml.py /tmp/RedPajama-INCITE-Base-3B-v1-HEAD/ 0
gpt_neox.embed_in.weight torch.Size([50432, 2560]) torch.float32
gpt_neox.layers.0.input_layernorm.weight torch.Size([2560]) torch.float32
gpt_neox.layers.0.input_layernorm.bias torch.Size([2560]) torch.float32
gpt_neox.layers.0.post_attention_layernorm.weight torch.Size([2560]) torch.float32
gpt_neox.layers.0.post_attention_layernorm.bias torch.Size([2560]) torch.float32
gpt_neox.layers.0.attention.bias torch.Size([1, 1, 2048, 2048]) torch.bool
gpt_neox.layers.0.attention.masked_bias torch.Size([]) torch.float32
gpt_neox.layers.0.attention.rotary_emb.inv_freq torch.Size([40]) torch.float32
gpt_neox.layers.0.attention.query_key_value.weight torch.Size([7680, 2560]) torch.float32
..... snip .....
gpt_neox.layers.31.attention.query_key_value.weight torch.Size([7680, 2560]) torch.float32
gpt_neox.layers.31.attention.query_key_value.bias torch.Size([7680]) torch.float32
gpt_neox.layers.31.attention.dense.weight torch.Size([2560, 2560]) torch.float32
gpt_neox.layers.31.attention.dense.bias torch.Size([2560]) torch.float32
gpt_neox.layers.31.mlp.dense_h_to_4h.weight torch.Size([10240, 2560]) torch.float32
gpt_neox.layers.31.mlp.dense_h_to_4h.bias torch.Size([10240]) torch.float32
gpt_neox.layers.31.mlp.dense_4h_to_h.weight torch.Size([2560, 10240]) torch.float32
gpt_neox.layers.31.mlp.dense_4h_to_h.bias torch.Size([2560]) torch.float32
gpt_neox.final_layer_norm.weight torch.Size([2560]) torch.float32
gpt_neox.final_layer_norm.bias torch.Size([2560]) torch.float32
embed_out.weight torch.Size([50432, 2560]) torch.float32
{'_name_or_path': 'rp_3b_800b', 'architectures': ['GPTNeoXForCausalLM'], 'bos_token_id': 0, 'eos_token_id': 0, 'hidden_act': 'gelu', 'hidden_size': 2560, 'initializer_range': 0.02, 'intermediate_size': 10240, 'layer_norm_eps': 1e-05, 'max_position_embeddings': 2048, 'model_type': 'gpt_neox', 'num_attention_heads': 32, 'num_hidden_layers': 32, 'rotary_emb_base': 10000, 'rotary_pct': 1.0, 'tie_word_embeddings': False, 'torch_dtype': 'float16', 'transformers_version': '4.28.1', 'use_cache': True, 'use_parallel_residual': False, 'vocab_size': 50432}
Processing variable: gpt_neox.embed_in.weight with shape:  (50432, 2560)
Processing variable: gpt_neox.layers.0.input_layernorm.weight with shape:  (2560,)
Processing variable: gpt_neox.layers.0.input_layernorm.bias with shape:  (2560,)
Processing variable: gpt_neox.layers.0.post_attention_layernorm.weight with shape:  (2560,)
Processing variable: gpt_neox.layers.0.post_attention_layernorm.bias with shape:  (2560,)
Processing variable: gpt_neox.layers.0.attention.bias with shape:  (2048, 2048)
  Skipping variable: gpt_neox.layers.0.attention.bias
Processing variable: gpt_neox.layers.0.attention.masked_bias with shape:  ()
  Skipping variable: gpt_neox.layers.0.attention.masked_bias
Processing variable: gpt_neox.layers.0.attention.rotary_emb.inv_freq with shape:  (40,)
  Skipping variable: gpt_neox.layers.0.attention.rotary_emb.inv_freq
Processing variable: gpt_neox.layers.0.attention.query_key_value.weight with shape:  (7680, 2560)
Processing variable: gpt_neox.layers.0.attention.query_key_value.bias with shape:  (7680,)
Processing variable: gpt_neox.layers.0.attention.dense.weight with shape:  (2560, 2560)
Processing variable: gpt_neox.layers.0.attention.dense.bias with shape:  (2560,)
.... snip ....
Processing variable: gpt_neox.layers.31.attention.rotary_emb.inv_freq with shape:  (40,)
  Skipping variable: gpt_neox.layers.31.attention.rotary_emb.inv_freq
Processing variable: gpt_neox.layers.31.attention.query_key_value.weight with shape:  (7680, 2560)
Processing variable: gpt_neox.layers.31.attention.query_key_value.bias with shape:  (7680,)
Processing variable: gpt_neox.layers.31.attention.dense.weight with shape:  (2560, 2560)
Processing variable: gpt_neox.layers.31.attention.dense.bias with shape:  (2560,)
Processing variable: gpt_neox.layers.31.mlp.dense_h_to_4h.weight with shape:  (10240, 2560)
Processing variable: gpt_neox.layers.31.mlp.dense_h_to_4h.bias with shape:  (10240,)
Processing variable: gpt_neox.layers.31.mlp.dense_4h_to_h.weight with shape:  (2560, 10240)
Processing variable: gpt_neox.layers.31.mlp.dense_4h_to_h.bias with shape:  (2560,)
Processing variable: gpt_neox.final_layer_norm.weight with shape:  (2560,)
Processing variable: gpt_neox.final_layer_norm.bias with shape:  (2560,)
Processing variable: embed_out.weight with shape:  (50432, 2560)
Done. Output file: /tmp/ggml-model-f32.bin

Next, I quantize the model:

$ gpt-neox-quantize /tmp/RedPajama-INCITE-Base-3B-v1-HEAD/ggml-model-f32.bin /tmp/q4_0.bin "q4_0"
gpt_neox_model_quantize: loading model from '/tmp/ggml-model-f32.bin'
gpt_neox_model_quantize: n_vocab     = 50432
gpt_neox_model_quantize: n_ctx       = 2048
gpt_neox_model_quantize: n_embd      = 2560
gpt_neox_model_quantize: n_head      = 32
gpt_neox_model_quantize: n_layer     = 32
gpt_neox_model_quantize: par_res     = 0
gpt_neox_model_quantize: ftype (src) = 0
gpt_neox_model_quantize: qntvr (src) = 0
gpt_neox_model_quantize: ftype (dst) = 1002
gpt_neox_model_quantize: qntvr (dst) = 1
                                        gpt_neox.embed_in.weight - [ 2560, 50432,     1], type =    f32 size =   492.50 MB ->    76.95 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.026 0.021 
                        gpt_neox.layers.0.input_layernorm.weight - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                          gpt_neox.layers.0.input_layernorm.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
               gpt_neox.layers.0.post_attention_layernorm.weight - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                 gpt_neox.layers.0.post_attention_layernorm.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
              gpt_neox.layers.0.attention.query_key_value.weight - [ 2560,  7680,     1], type =    f32 size =    75.00 MB ->    11.72 MB | hist: 0.036 0.015 0.024 0.038 0.055 0.076 0.097 0.114 0.121 0.114 0.097 0.076 0.055 0.038 0.024 0.020 
                gpt_neox.layers.0.attention.query_key_value.bias - [ 7680,     1,     1], type =    f32 size =    0.029 MB
                        gpt_neox.layers.0.attention.dense.weight - [ 2560,  2560,     1], type =    f32 size =    25.00 MB ->     3.91 MB | hist: 0.036 0.013 0.021 0.033 0.051 0.074 0.099 0.122 0.132 0.122 0.099 0.074 0.051 0.033 0.021 0.017 
                          gpt_neox.layers.0.attention.dense.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                      gpt_neox.layers.0.mlp.dense_h_to_4h.weight - [ 2560, 10240,     1], type =    f32 size =   100.00 MB ->    15.62 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
                        gpt_neox.layers.0.mlp.dense_h_to_4h.bias - [10240,     1,     1], type =    f32 size =    0.039 MB
                      gpt_neox.layers.0.mlp.dense_4h_to_h.weight - [10240,  2560,     1], type =    f32 size =   100.00 MB ->    15.62 MB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.097 0.111 0.117 0.111 0.097 0.077 0.056 0.039 0.025 0.021 
                        gpt_neox.layers.0.mlp.dense_4h_to_h.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                        gpt_neox.layers.1.input_layernorm.weight - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                          gpt_neox.layers.1.input_layernorm.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
               gpt_neox.layers.1.post_attention_layernorm.weight - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                 gpt_neox.layers.1.post_attention_layernorm.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
              gpt_neox.layers.1.attention.query_key_value.weight - [ 2560,  7680,     1], type =    f32 size =    75.00 MB ->    11.72 MB | hist: 0.037 0.016 0.025 0.039 0.056 0.077 0.097 0.111 0.117 0.111 0.097 0.077 0.057 0.039 0.025 0.021 
                gpt_neox.layers.1.attention.query_key_value.bias - [ 7680,     1,     1], type =    f32 size =    0.029 MB
                        gpt_neox.layers.1.attention.dense.weight - [ 2560,  2560,     1], type =    f32 size =    25.00 MB ->     3.91 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
                          gpt_neox.layers.1.attention.dense.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                      gpt_neox.layers.1.mlp.dense_h_to_4h.weight - [ 2560, 10240,     1], type =    f32 size =   100.00 MB ->    15.62 MB | hist: 0.037 0.016 0.025 0.039 0.056 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
                        gpt_neox.layers.1.mlp.dense_h_to_4h.bias - [10240,     1,     1], type =    f32 size =    0.039 MB
                      gpt_neox.layers.1.mlp.dense_4h_to_h.weight - [10240,  2560,     1], type =    f32 size =   100.00 MB ->    15.62 MB | hist: 0.036 0.015 0.025 0.039 0.056 0.077 0.097 0.112 0.117 0.111 0.097 0.077 0.056 0.039 0.025 0.021 
                        gpt_neox.layers.1.mlp.dense_4h_to_h.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                        gpt_neox.layers.2.input_layernorm.weight - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                          gpt_neox.layers.2.input_layernorm.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
.... snip ....
              gpt_neox.layers.30.post_attention_layernorm.weight - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                gpt_neox.layers.30.post_attention_layernorm.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
             gpt_neox.layers.30.attention.query_key_value.weight - [ 2560,  7680,     1], type =    f32 size =    75.00 MB ->    11.72 MB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.097 0.112 0.118 0.112 0.096 0.077 0.056 0.039 0.025 0.021 
               gpt_neox.layers.30.attention.query_key_value.bias - [ 7680,     1,     1], type =    f32 size =    0.029 MB
                       gpt_neox.layers.30.attention.dense.weight - [ 2560,  2560,     1], type =    f32 size =    25.00 MB ->     3.91 MB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.097 0.112 0.117 0.112 0.097 0.076 0.056 0.039 0.025 0.021 
                         gpt_neox.layers.30.attention.dense.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                     gpt_neox.layers.30.mlp.dense_h_to_4h.weight - [ 2560, 10240,     1], type =    f32 size =   100.00 MB ->    15.62 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.097 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
                       gpt_neox.layers.30.mlp.dense_h_to_4h.bias - [10240,     1,     1], type =    f32 size =    0.039 MB
                     gpt_neox.layers.30.mlp.dense_4h_to_h.weight - [10240,  2560,     1], type =    f32 size =   100.00 MB ->    15.62 MB | hist: 0.036 0.014 0.022 0.035 0.053 0.075 0.099 0.118 0.126 0.118 0.099 0.075 0.053 0.035 0.022 0.018 
                       gpt_neox.layers.30.mlp.dense_4h_to_h.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                       gpt_neox.layers.31.input_layernorm.weight - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                         gpt_neox.layers.31.input_layernorm.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
              gpt_neox.layers.31.post_attention_layernorm.weight - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                gpt_neox.layers.31.post_attention_layernorm.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
             gpt_neox.layers.31.attention.query_key_value.weight - [ 2560,  7680,     1], type =    f32 size =    75.00 MB ->    11.72 MB | hist: 0.036 0.015 0.024 0.038 0.055 0.076 0.097 0.113 0.120 0.113 0.097 0.076 0.055 0.038 0.025 0.020 
               gpt_neox.layers.31.attention.query_key_value.bias - [ 7680,     1,     1], type =    f32 size =    0.029 MB
                       gpt_neox.layers.31.attention.dense.weight - [ 2560,  2560,     1], type =    f32 size =    25.00 MB ->     3.91 MB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.097 0.113 0.119 0.112 0.097 0.076 0.056 0.038 0.025 0.021 
                         gpt_neox.layers.31.attention.dense.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                     gpt_neox.layers.31.mlp.dense_h_to_4h.weight - [ 2560, 10240,     1], type =    f32 size =   100.00 MB ->    15.62 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.097 0.111 0.117 0.111 0.097 0.077 0.056 0.039 0.025 0.021 
                       gpt_neox.layers.31.mlp.dense_h_to_4h.bias - [10240,     1,     1], type =    f32 size =    0.039 MB
                     gpt_neox.layers.31.mlp.dense_4h_to_h.weight - [10240,  2560,     1], type =    f32 size =   100.00 MB ->    15.62 MB | hist: 0.036 0.014 0.022 0.035 0.052 0.074 0.099 0.120 0.129 0.120 0.099 0.074 0.052 0.035 0.022 0.018 
                       gpt_neox.layers.31.mlp.dense_4h_to_h.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                                gpt_neox.final_layer_norm.weight - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                                  gpt_neox.final_layer_norm.bias - [ 2560,     1,     1], type =    f32 size =    0.010 MB
                                                embed_out.weight - [ 2560, 50432,     1], type =    f32 size =   492.50 MB ->    76.95 MB | hist: 0.037 0.016 0.026 0.040 0.057 0.077 0.097 0.111 0.116 0.110 0.096 0.077 0.057 0.039 0.025 0.021 
ggml_common_quantize_0: model size  = 10589.08 MB
ggml_common_quantize_0: quant size  =  1657.99 MB | ftype = 2 (q4_0)
ggml_common_quantize_0: hist: 0.036 0.015 0.025 0.039 0.056 0.077 0.097 0.112 0.118 0.112 0.097 0.077 0.056 0.039 0.025 0.021 

main: quantize time = 122311.53 ms
main:    total time = 122311.53 ms

And finally attempt inference:

$ gpt-neox -m /tmp/q4_0.bin -p "I believe the meaning of life is"
main: seed = 1684347948
gpt_neox_model_load: loading model from '/tmp/q4_0.bin' - please wait ...
gpt_neox_model_load: n_vocab = 50432
gpt_neox_model_load: n_ctx   = 2048
gpt_neox_model_load: n_embd  = 2560
gpt_neox_model_load: n_head  = 32
gpt_neox_model_load: n_layer = 32
gpt_neox_model_load: n_rot   = 80
gpt_neox_model_load: par_res = 0
gpt_neox_model_load: ftype   = 1002
gpt_neox_model_load: qntvr   = 1
gpt_neox_model_load: ggml ctx size = 3737.93 MB
gpt_neox_model_load: memory_size =   640.00 MB, n_mem = 65536
gpt_neox_model_load: ................................................ done
gpt_neox_model_load: model size =  1657.99 MB / num tensors = 388
gpt_tokenize: unknown token 'I'
gpt_tokenize: unknown token ' '
gpt_tokenize: unknown token ' '
gpt_tokenize: unknown token 'e'
gpt_tokenize: unknown token ' '
gpt_tokenize: unknown token 'i'
gpt_tokenize: unknown token 's'
main: number of tokens in prompt = 5
main: token[0] =   2868,  believe
main: token[1] =    783, the
main: token[2] =   4495,  meaning
main: token[3] =   1171, of
main: token[4] =   5243,  lif

 believethe meaningof lif bovember Cl~ 2017ase New Testament teaches us that weially be born us

As you can see, errors of the form gpt_tokenize: unknown token 'I' appear, and the output text is nonsensical. I seem to get the same problem whether I use the 32-bit, 16-bit, or 4-bit model.

Does anything look amiss in the steps I've performed, or in the logs generated during conversion/quantization? Any help at all would be appreciated!


markdjwilliams commented May 18, 2023

The same failure occurs for the Mosaic model.

However, I think I've found the problem. The highlighted line here defines std::string word; outside of the vocab-loading loop, so the same string object is reused as each word in the vocabulary is read. Simply moving the definition of word inside the loop seems to allow correct tokenization and inference, at least on my platform/compiler.
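For context, here is a minimal sketch of the pattern in question (paraphrased from memory, not the exact loader source; names like fin, n_vocab, and vocab follow the surrounding code):

```cpp
std::string word; // declared once, outside the loop, and reused for every token

for (int32_t i = 0; i < n_vocab; i++) {
    uint32_t len;
    fin.read((char *) &len, sizeof(len));

    // Read the token's bytes directly into the reused string's storage.
    word.resize(len);
    fin.read((char *) word.data(), len);

    vocab.token_to_id[word] = i;
    vocab.id_to_token[i]    = word;
}
```

Moving the std::string word; declaration to the top of the loop body gives each iteration a fresh string, which is the change that fixed tokenization for me.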

markdjwilliams commented

So std::string::data() returns a const char * under C++ standards prior to C++17; a non-const overload was only added in C++17.

This line casts away that constness before writing to the underlying storage, which the pre-C++17 standard treats as undefined behavior, so on my compiler replacing (char *)word.data() with &word[0] also fixed the issue.
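To illustrate, a before/after sketch of that read, assuming a pre-C++17 compiler where only the const overload of data() exists:

```cpp
word.resize(len);

// Before: data() returns const char * prior to C++17, and the standard says
// modifying the array it points to is undefined behavior, cast or no cast.
fin.read((char *) word.data(), len);

// After: operator[] returns a mutable char &, so &word[0] is a writable
// pointer into the string's own storage (and read() writes nothing if len is 0).
fin.read(&word[0], len);
```

Since C++11, std::string storage is guaranteed contiguous, so &word[0] addresses the same bytes that data() points to.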
