gpt_tokenize : bug in tokenization and incapable of double-byte languages #170

Closed
jaeminSon opened this issue May 19, 2023 · 11 comments · Fixed by #186

Comments

@jaeminSon
Contributor

There is a bug in the gpt_tokenize function in examples/common.cpp.

The following is a comparison with a Hugging Face tokenizer:

# huggingface
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
>>> tokenizer.tokenize("ableable")
['able', 'able']
# ggml
main: number of tokens in prompt = 5
main: token[0] =  21762, able
main: token[1] =     68, a
main: token[2] =     69, b
main: token[3] =     79, l
main: token[4] =     72, e

The example above is caused by this line:

if (j == i) {

Furthermore, the current implementation does not handle double-byte characters such as Korean, Japanese, and Chinese. I'm wondering whether it would be better to modify the gpt_tokenize function so it can handle double-byte characters, or whether the repo aims only at languages with single-byte characters, since English and most Roman-alphabet languages are single-byte.
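
For illustration, here is a minimal standalone snippet (not code from this repo) showing why byte-indexed slicing cannot be applied blindly to such text: a Korean syllable takes three bytes in UTF-8, so falling back to word.substr(i, 1) emits incomplete byte fragments rather than characters.

// Minimal standalone illustration (not repo code): byte-wise slicing of UTF-8 text.
#include <cstdio>
#include <string>

int main() {
    const std::string text = "안녕";      // 2 characters, 6 bytes in UTF-8
    printf("bytes = %zu\n", text.size()); // prints 6, not 2

    // A tokenizer that falls back to word.substr(i, 1) produces a single byte
    // (0xEC here), which is not a complete character and is unlikely to have a
    // direct entry in the vocabulary.
    const std::string first = text.substr(0, 1);
    printf("first byte = 0x%02X\n", (unsigned char) first[0]);
    return 0;
}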

@jaeminSon jaeminSon changed the title gpt_tokenize : bug in (tokenization & double-byte languages) gpt_tokenize : bug in tokenization and incapable of double-byte languages May 19, 2023
@klosax
Contributor

klosax commented May 19, 2023

I found a quick fix:

Replace line:

break;

with:

j = n;
continue;
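
For context, here is a simplified, self-contained reconstruction of the greedy matching loop (not the exact code in examples/common.cpp; names are assumptions), with the changed lines marked. Without the change, a successful longest match sets i = j and then breaks out of the inner loop with j == i, so the single-character fallback below fires even though nothing went wrong; restarting the inner loop with j = n instead continues the longest match from the new position.

// Simplified reconstruction of the matching loop in gpt_tokenize (not the exact repo code).
#include <map>
#include <string>
#include <vector>

std::vector<int> tokenize_word(const std::string & word,
                               const std::map<std::string, int> & token_to_id) {
    std::vector<int> tokens;
    int i = 0;
    const int n = (int) word.size();
    while (i < n) {
        int j = n;
        // try the longest vocabulary entry that starts at byte i
        while (j > i) {
            auto it = token_to_id.find(word.substr(i, j - i));
            if (it != token_to_id.end()) {
                tokens.push_back(it->second);
                i = j;
                j = n;      // the fix: previously a plain break here left j == i,
                continue;   // which wrongly triggered the single-character fallback below
            }
            --j;
        }
        if (i == n) {
            break;
        }
        if (j == i) {
            // no vocabulary entry starts at byte i: skip a single unknown byte
            ++i;
        }
    }
    return tokens;
}

With a vocabulary containing "able" plus the single letters, this now maps "ableable" to two "able" tokens, matching the Hugging Face output above.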

Perplexity tests:

model = mpt-7b-base-ggml-f16.bin
ctx = 512
batch size = 512
prompt = wiki.test.raw

Without this fix:
tokens in prompt = 326191
perplexity of first chunk = 30.60130927

With this fix:
tokens in prompt = 306246
perplexity of first chunk = 11.00999002

ggerganov added a commit that referenced this issue May 20, 2023
@ggerganov
Owner

@klosax Thanks! Btw, would be nice to add a simple perplexity tool as a ggml example

@jaeminSon Do you still observe issues with latest master?

@klosax
Contributor

klosax commented May 20, 2023

@klosax Thanks! Btw, would be nice to add a simple perplexity tool as a ggml example

I am currently working on a perplexity tool. The tool should be able to measure perplexity correctly on wiki.test.raw, but it seems that the gpt tokenizer does not work correctly with unicode characters.
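
For reference, perplexity over a token stream is just the exponential of the average negative log-likelihood that the model assigns to each target token; a minimal sketch of that reduction (assuming the per-token probabilities are already computed) is:

// Minimal sketch (not the actual tool): perplexity = exp(mean negative log-likelihood).
#include <cmath>
#include <vector>

double perplexity(const std::vector<double> & token_probs) {
    // token_probs[i] is the probability the model assigned to the i-th target token
    double nll = 0.0;
    for (const double p : token_probs) {
        nll -= std::log(p);
    }
    return std::exp(nll / (double) token_probs.size());
}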

@jaeminSon
Contributor Author

@ggerganov The current version fixes the above issue! LoL

It would be great if perplexities were compared between the Hugging Face model and the converted ggml model!

Also, it would be nice to have test code for the tokenizers, perhaps something like this code (https://github.com/ggerganov/llama.cpp/blob/master/tests/test-tokenizer-0.cpp).

@ggerganov
Owner

Yup, all great suggestions! PRs welcome

@jaeminSon
Contributor Author

Each model has its own distinct tokenizer, even when the transformer architecture is identical. Conducting general tokenization tests would mean evaluating all currently available models, which is untenable.

In the ggml repo, each architecture is converted and run separately under the 'examples' directory, so test approaches may differ from architecture to architecture.

In that context, I added a function that checks the correctness of tokenization for a given model, in a somewhat limited sense, to main.cpp under gpt-neox, which is the model I'm currently most eager to use with ggml.

@ggerganov
Owner

Conducting general tokenization tests would mean evaluating all currently available models, which is untenable.

How so? It does not sound too difficult

Here is one possible approach:

  • Create a file examples/prompts.txt with a large variety of prompts (English, Chinese, emojis, unicode, etc.)
  • Create a Python script that includes all the transformers tokenizers we currently use, tokenizes every prompt from prompts.txt with each tokenizer, and stores the results in separate files (for example: examples/prompts-gpt-2.txt, examples/prompts-mpt.txt, examples/prompts-starcoder.txt, etc.)
  • In each model's main.cpp, add a test_tokenizer() function that loads examples/prompts.txt, tokenizes all prompts, and compares the results with the corresponding reference file. For example, in examples/gpt-2/main.cpp we compare with examples/prompts-gpt-2.txt, etc. (a rough sketch of such a function follows below)
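
A rough sketch of what such a test_tokenizer() could look like (the file format, paths, and the gpt_tokenize_ids helper are assumptions for illustration, not the final implementation):

// Rough sketch of a per-model tokenizer test (names and file format are assumptions):
// prompts.txt holds one prompt per line; the expected file holds the matching
// comma-separated token ids produced by the corresponding Hugging Face tokenizer.
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// placeholder for the example's own tokenizer call (e.g. gpt_tokenize with the loaded vocab)
std::vector<int> gpt_tokenize_ids(const std::string & text);

bool test_tokenizer(const std::string & prompts_path, const std::string & expected_path) {
    std::ifstream prompts(prompts_path);
    std::ifstream expected(expected_path);
    std::string prompt, expected_line;
    int n_failed = 0, n_total = 0;

    while (std::getline(prompts, prompt) && std::getline(expected, expected_line)) {
        // parse "40, 300, 15" into a vector of reference ids
        std::vector<int> ref;
        std::stringstream ss(expected_line);
        for (std::string field; std::getline(ss, field, ','); ) {
            ref.push_back(std::stoi(field));
        }

        n_total++;
        if (gpt_tokenize_ids(prompt) != ref) {
            n_failed++;
            fprintf(stderr, "test_tokenizer : failed test: '%s'\n", prompt.c_str());
        }
    }

    fprintf(stderr, "test_tokenizer : %d tests failed out of %d tests.\n", n_failed, n_total);
    return n_failed == 0;
}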

@jaeminSon
Contributor Author

Create a Python script that includes all the transformers tokenizers we currently use, tokenizes every prompt from prompts.txt with each tokenizer, and stores the results in separate files (for example: examples/prompts-gpt-2.txt, examples/prompts-mpt.txt, examples/prompts-starcoder.txt, etc.)

I also considered that option.

But I worried that there could be multiple models for different languages even within the same architecture. Also, models with different parameter sizes may have different vocabularies, and thus different tokenizers. When I checked how llama.cpp handles this, they test with minimal cases. Still, since their primary target is just LLaMA, testing the tokenizer is easier there than in ggml. The Hugging Face tokenizers library also tests its tokenizers quite minimally (e.g. the unigram case: https://github.com/huggingface/tokenizers/blob/main/tokenizers/tests/unigram.rs).

Perhaps I can try to cover as many models as I can with English-language tokens.

@jaeminSon
Contributor Author

As suggested here, I listed several prompts for testing tokenization (under examples/prompts/test-cases.txt) and saved the tokenization results produced by the Hugging Face tokenizers. But there are a few modifications from your suggestion.

  1. A 'prompts' folder is newly created under the 'examples' folder -- instead of having many text files scattered loosely under examples, I thought putting them in a folder looks better.
  2. I moved the 'test_tokenizer' function to common.cpp, since it applies to all models, and added a line calling 'test_tokenizer' after the ggml model is loaded. The new line is added to all architectures under the 'examples' folder except for 'whisper', whose flow I could not follow yet (I'll add it later).

In terms of testing, I checked the modified code with polyglot-ko-1.3b. I will check other models sooner or later.

@jaeminSon
Contributor Author

replit uses its own tokenization method and struct in its main.cpp, so I will put its test_tokenizer in that main.cpp.

@jaeminSon
Contributor Author

Cerebras-GPT-111M produces slightly different tokenization from Hugging Face on random English texts (93 matched out of 100):

test_gpt_tokenizer : failed test: 'I l0ve t0 tr@vel @r0und the w0rld.'
test_gpt_tokenizer : tokens in huggingface: I(40),  l(300), 0(15), ve(303),  t(256), 0(15),  tr(491), @(31), vel(626),  @(2488), r(81), 0(15), und(917),  the(262),  w(266), 0(15), r(81), ld(335), .(13), 
test_gpt_tokenizer : tokens in ggml: I(48682432),  l(48650176), 0(48679232), ve(48650560),  t(48644544), 0(48679232),  tr(48641856), @(48681280), vel(48757440),  @(48373184), r(48687680), 0(48679232), und(48696384),  the(48645312),  w(48645824), 0(48679232), rl(39633216), d(48685888), .(48678976), 
test_gpt_tokenizer : failed test: 'She danced gracefully on the stage.'
test_gpt_tokenizer : tokens in huggingface: She(3347),  danced(39480),  grace(11542), fully(2759),  on(319),  the(262),  stage(3800), .(13), 
test_gpt_tokenizer : tokens in ggml: She(48256192),  danced(36421568),  graceful(38916096), ly(48650944),  on(48652608),  the(48645312),  stage(48309376), .(48678976), 
test_gpt_tokenizer : failed test: 'She dances gracefully to the music.'
test_gpt_tokenizer : tokens in huggingface: She(3347),  dances(38207),  grace(11542), fully(2759),  to(284),  the(262),  music(2647), .(13), 
test_gpt_tokenizer : tokens in ggml: She(48256192),  dances(36553536),  graceful(38916096), ly(48650944),  to(48648128),  the(48645312),  music(48491840), .(48678976), 
test_gpt_tokenizer : failed test: 'The birds are chirping in the trees.'
test_gpt_tokenizer : tokens in huggingface: The(464),  birds(10087),  are(389),  ch(442), ir(343), ping(13886),  in(287),  the(262),  trees(7150), .(13), 
test_gpt_tokenizer : tokens in ggml: The(48638400),  birds(34796736),  are(48628800),  chi(36165888), r(48687680), ping(35478976),  in(48648512),  the(48645312),  trees(48805952), .(48678976), 
test_gpt_tokenizer : failed test: 'The flowers are blooming in the garden.'
test_gpt_tokenizer : tokens in huggingface: The(464),  flowers(12734),  are(389),  blo(24924), oming(3383),  in(287),  the(262),  garden(11376), .(13), 
test_gpt_tokenizer : tokens in ggml: The(48638400),  flowers(35626432),  are(48628800),  bloom(37501888), ing(48647360),  in(48648512),  the(48645312),  garden(34633408), .(48678976), 
test_gpt_tokenizer : failed test: 'The flowers in the garden are blooming.'
test_gpt_tokenizer : tokens in huggingface: The(464),  flowers(12734),  in(287),  the(262),  garden(11376),  are(389),  blo(24924), oming(3383), .(13), 
test_gpt_tokenizer : tokens in ggml: The(48638400),  flowers(35626432),  in(48648512),  the(48645312),  garden(34633408),  are(48628800),  bloom(37501888), ing(48647360), .(48678976), 
test_gpt_tokenizer : failed test: 'Wh@t's y0ur f@v0rite m0vie?'
test_gpt_tokenizer : tokens in huggingface: Wh(1199), @(31), t(83), 's(338),  y(331), 0(15), ur(333),  f(277), @(31), v(85), 0(15), rite(6525),  m(285), 0(15), v(85), ie(494), ?(30), 
test_gpt_tokenizer : tokens in ggml: Wh(48535872), @(48681280), t(48687936), 's(48655040),  y(48654144), 0(48679232), ur(48654400),  f(48647232), @(48681280), v(48688192), 0(48679232), rite(49020864),  m(48648256), 0(48679232), vi(35071168), e(48686016), ?(48681152), 
test_gpt_tokenizer : 7 tests failed out of 100 tests.
