gpt_tokenize : bug in tokenization and incapable of double-byte languages #170

Closed
jaeminSon opened this issue May 19, 2023 · 11 comments · Fixed by #186

Comments

@jaeminSon
Contributor

There is a bug in the gpt_tokenize function in examples/common.cpp.

The following is a comparison with a Hugging Face tokenizer:

# huggingface
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
>>> tokenizer.tokenize("ableable")
['able', 'able']
# ggml
main: number of tokens in prompt = 5
main: token[0] =  21762, able
main: token[1] =     68, a
main: token[2] =     69, b
main: token[3] =     79, l
main: token[4] =     72, e

The example above is caused by this line:

if (j == i) {

Furthermore, the current implementation does not handle double-byte characters such as Korean, Japanese, and Chinese. I'm wondering whether it would be better to modify the gpt_tokenize function so it can handle double-byte characters, or whether the repo aims only at languages with single-byte characters, since English and most Roman-alphabet languages are single-byte.
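
For illustration, here is a minimal standalone snippet (not code from this repo) showing why byte-indexed slicing cannot be applied blindly to such text: a Korean syllable takes three bytes in UTF-8, so falling back to word.substr(i, 1) emits incomplete byte fragments rather than characters.

// Minimal standalone illustration (not repo code): byte-wise slicing of UTF-8 text.
#include <cstdio>
#include <string>

int main() {
    const std::string text = "안녕";      // 2 characters, 6 bytes in UTF-8
    printf("bytes = %zu\n", text.size()); // prints 6, not 2

    // A tokenizer that falls back to word.substr(i, 1) produces a single byte
    // (0xEC here), which is not a complete character and is unlikely to have a
    // direct entry in the vocabulary.
    const std::string first = text.substr(0, 1);
    printf("first byte = 0x%02X\n", (unsigned char) first[0]);
    return 0;
}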

@jaeminSon jaeminSon changed the title gpt_tokenize : bug in (tokenization & double-byte languages) gpt_tokenize : bug in tokenization and incapable of double-byte languages May 19, 2023
@klosax
Contributor

klosax commented May 19, 2023

I found a quick fix:

Replace line:

break;

with:

j = n;
continue;
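
For context, here is a simplified, self-contained reconstruction of the greedy matching loop (not the exact code in examples/common.cpp; names are assumptions), with the changed lines marked. Without the change, a successful longest match sets i = j and then breaks out of the inner loop with j == i, so the single-character fallback below fires even though nothing went wrong; restarting the inner loop with j = n instead continues the longest match from the new position.

// Simplified reconstruction of the matching loop in gpt_tokenize (not the exact repo code).
#include <map>
#include <string>
#include <vector>

std::vector<int> tokenize_word(const std::string & word,
                               const std::map<std::string, int> & token_to_id) {
    std::vector<int> tokens;
    int i = 0;
    const int n = (int) word.size();
    while (i < n) {
        int j = n;
        // try the longest vocabulary entry that starts at byte i
        while (j > i) {
            auto it = token_to_id.find(word.substr(i, j - i));
            if (it != token_to_id.end()) {
                tokens.push_back(it->second);
                i = j;
                j = n;      // the fix: previously a plain break here left j == i,
                continue;   // which wrongly triggered the single-character fallback below
            }
            --j;
        }
        if (i == n) {
            break;
        }
        if (j == i) {
            // no vocabulary entry starts at byte i: skip a single unknown byte
            ++i;
        }
    }
    return tokens;
}

With a vocabulary containing "able" plus the single letters, this now maps "ableable" to two "able" tokens, matching the Hugging Face output above.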

Perplexity tests:

model = mpt-7b-base-ggml-f16.bin
ctx = 512
batch size = 512
prompt = wiki.test.raw

Without this fix:
tokens in prompt = 326191
perplexity of first chunk = 30.60130927

With this fix:
tokens in prompt = 306246
perplexity of first chunk = 11.00999002

ggerganov added a commit that referenced this issue May 20, 2023
@ggerganov
Owner

@klosax Thanks! Btw, would be nice to add a simple perplexity tool as a ggml example

@jaeminSon Do you still observe issues with latest master?

@klosax
Contributor

klosax commented May 20, 2023

@klosax Thanks! Btw, would be nice to add a simple perplexity tool as a ggml example

I am currently working on a perplexity tool. The tool should be able to measure perplexity correctly on wiki.test.raw, but it seems that the gpt tokenizer does not work correctly with unicode characters.
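
For reference, perplexity over a token stream is just the exponential of the average negative log-likelihood that the model assigns to each target token; a minimal sketch of that reduction (assuming the per-token probabilities are already computed) is:

// Minimal sketch (not the actual tool): perplexity = exp(mean negative log-likelihood).
#include <cmath>
#include <vector>

double perplexity(const std::vector<double> & token_probs) {
    // token_probs[i] is the probability the model assigned to the i-th target token
    double nll = 0.0;
    for (const double p : token_probs) {
        nll -= std::log(p);
    }
    return std::exp(nll / (double) token_probs.size());
}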

@jaeminSon
Contributor Author

@ggerganov The current version fixes the above issue! LoL

It would be great if perplexities were compared between the Hugging Face model and the converted ggml model!

Also, it would be nice to have test code for the tokenizers, perhaps something like this code (https://github.com/ggerganov/llama.cpp/blob/master/tests/test-tokenizer-0.cpp).

@ggerganov
Owner

Yup, all great suggestions! PRs welcome

@jaeminSon
Contributor Author

Each model has its own distinct tokenizer, even when the transformer architecture is identical. Conducting general tokenization tests would mean evaluating all currently available models, which is untenable.

In the ggml repo, each architecture is converted and run separately under the 'examples' directory, so test approaches may differ from architecture to architecture.

In that context, I added a function that checks the correctness of tokenization for a given model, in a somewhat limited sense, to main.cpp under gpt-neox, which is the model I'm currently most eager to use with ggml.

@ggerganov
Owner

Conducting general tokenization tests would mean evaluating all currently available models, which is untenable.

How so? It does not sound too difficult

Here is one possible approach:

  • Create a file examples/prompts.txt with a large variety of prompts (English, Chinese, emojis, unicode, etc.)
  • Create a Python script that includes all the transformers tokenizers we currently use, tokenizes every prompt from prompts.txt with each tokenizer, and stores the results in separate files (for example: examples/prompts-gpt-2.txt, examples/prompts-mpt.txt, examples/prompts-starcoder.txt, etc.)
  • In each model's main.cpp, add a test_tokenizer() function that loads examples/prompts.txt, tokenizes all prompts, and compares the results with the corresponding reference file. For example, in examples/gpt-2/main.cpp we compare with examples/prompts-gpt-2.txt, etc. (a rough sketch of such a function follows below)
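
A rough sketch of what such a test_tokenizer() could look like (the file format, paths, and the gpt_tokenize_ids helper are assumptions for illustration, not the final implementation):

// Rough sketch of a per-model tokenizer test (names and file format are assumptions):
// prompts.txt holds one prompt per line; the expected file holds the matching
// comma-separated token ids produced by the corresponding Hugging Face tokenizer.
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// placeholder for the example's own tokenizer call (e.g. gpt_tokenize with the loaded vocab)
std::vector<int> gpt_tokenize_ids(const std::string & text);

bool test_tokenizer(const std::string & prompts_path, const std::string & expected_path) {
    std::ifstream prompts(prompts_path);
    std::ifstream expected(expected_path);
    std::string prompt, expected_line;
    int n_failed = 0, n_total = 0;

    while (std::getline(prompts, prompt) && std::getline(expected, expected_line)) {
        // parse "40, 300, 15" into a vector of reference ids
        std::vector<int> ref;
        std::stringstream ss(expected_line);
        for (std::string field; std::getline(ss, field, ','); ) {
            ref.push_back(std::stoi(field));
        }

        n_total++;
        if (gpt_tokenize_ids(prompt) != ref) {
            n_failed++;
            fprintf(stderr, "test_tokenizer : failed test: '%s'\n", prompt.c_str());
        }
    }

    fprintf(stderr, "test_tokenizer : %d tests failed out of %d tests.\n", n_failed, n_total);
    return n_failed == 0;
}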

@jaeminSon
Contributor Author

Create a Python script that includes all the transformers tokenizers we currently use, tokenizes every prompt from prompts.txt with each tokenizer, and stores the results in separate files (for example: examples/prompts-gpt-2.txt, examples/prompts-mpt.txt, examples/prompts-starcoder.txt, etc.)

I also considered that option.

But I worried that there could be multiple models for different languages even within the same architecture. Also, models with different parameter sizes may have different vocabularies, and thus different tokenizers. When I checked how llama.cpp handles this, they test with minimal cases. Still, since their primary target is just LLaMA, testing the tokenizer is easier there than in ggml. The Hugging Face tokenizers library also tests its tokenizers quite minimally (e.g. the unigram case: https://github.com/huggingface/tokenizers/blob/main/tokenizers/tests/unigram.rs).

Perhaps I can try to cover as many models as I can with English-language tokens.

@jaeminSon
Contributor Author

As suggested here, I listed several prompts for testing tokenization (under examples/prompts/test-cases.txt) and saved the tokenization results produced by the Hugging Face tokenizers. But there are a few modifications from your suggestion.

  1. A 'prompts' folder is newly created under the 'examples' folder -- instead of having many text files scattered loosely under examples, I thought putting them in a folder looks better.
  2. I moved the 'test_tokenizer' function to common.cpp, since it applies to all models, and added a line calling 'test_tokenizer' after the ggml model is loaded. The new line is added to all architectures under the 'examples' folder except for 'whisper', whose flow I could not follow yet (I'll add it later).

In terms of testing, I checked the modified code with polyglot-ko-1.3b. I will check other models sooner or later.

@jaeminSon
Contributor Author

replit uses its own tokenization method and struct in its main.cpp, so I will put its test_tokenizer in that main.cpp.

@jaeminSon
Contributor Author

Cerebras-GPT-111M produces slightly different tokenization from Hugging Face on random English texts (93 matched out of 100):

test_gpt_tokenizer : failed test: 'I l0ve t0 tr@vel @r0und the w0rld.'
test_gpt_tokenizer : tokens in huggingface: I(40),  l(300), 0(15), ve(303),  t(256), 0(15),  tr(491), @(31), vel(626),  @(2488), r(81), 0(15), und(917),  the(262),  w(266), 0(15), r(81), ld(335), .(13), 
test_gpt_tokenizer : tokens in ggml: I(48682432),  l(48650176), 0(48679232), ve(48650560),  t(48644544), 0(48679232),  tr(48641856), @(48681280), vel(48757440),  @(48373184), r(48687680), 0(48679232), und(48696384),  the(48645312),  w(48645824), 0(48679232), rl(39633216), d(48685888), .(48678976), 
test_gpt_tokenizer : failed test: 'She danced gracefully on the stage.'
test_gpt_tokenizer : tokens in huggingface: She(3347),  danced(39480),  grace(11542), fully(2759),  on(319),  the(262),  stage(3800), .(13), 
test_gpt_tokenizer : tokens in ggml: She(48256192),  danced(36421568),  graceful(38916096), ly(48650944),  on(48652608),  the(48645312),  stage(48309376), .(48678976), 
test_gpt_tokenizer : failed test: 'She dances gracefully to the music.'
test_gpt_tokenizer : tokens in huggingface: She(3347),  dances(38207),  grace(11542), fully(2759),  to(284),  the(262),  music(2647), .(13), 
test_gpt_tokenizer : tokens in ggml: She(48256192),  dances(36553536),  graceful(38916096), ly(48650944),  to(48648128),  the(48645312),  music(48491840), .(48678976), 
test_gpt_tokenizer : failed test: 'The birds are chirping in the trees.'
test_gpt_tokenizer : tokens in huggingface: The(464),  birds(10087),  are(389),  ch(442), ir(343), ping(13886),  in(287),  the(262),  trees(7150), .(13), 
test_gpt_tokenizer : tokens in ggml: The(48638400),  birds(34796736),  are(48628800),  chi(36165888), r(48687680), ping(35478976),  in(48648512),  the(48645312),  trees(48805952), .(48678976), 
test_gpt_tokenizer : failed test: 'The flowers are blooming in the garden.'
test_gpt_tokenizer : tokens in huggingface: The(464),  flowers(12734),  are(389),  blo(24924), oming(3383),  in(287),  the(262),  garden(11376), .(13), 
test_gpt_tokenizer : tokens in ggml: The(48638400),  flowers(35626432),  are(48628800),  bloom(37501888), ing(48647360),  in(48648512),  the(48645312),  garden(34633408), .(48678976), 
test_gpt_tokenizer : failed test: 'The flowers in the garden are blooming.'
test_gpt_tokenizer : tokens in huggingface: The(464),  flowers(12734),  in(287),  the(262),  garden(11376),  are(389),  blo(24924), oming(3383), .(13), 
test_gpt_tokenizer : tokens in ggml: The(48638400),  flowers(35626432),  in(48648512),  the(48645312),  garden(34633408),  are(48628800),  bloom(37501888), ing(48647360), .(48678976), 
test_gpt_tokenizer : failed test: 'Wh@t's y0ur f@v0rite m0vie?'
test_gpt_tokenizer : tokens in huggingface: Wh(1199), @(31), t(83), 's(338),  y(331), 0(15), ur(333),  f(277), @(31), v(85), 0(15), rite(6525),  m(285), 0(15), v(85), ie(494), ?(30), 
test_gpt_tokenizer : tokens in ggml: Wh(48535872), @(48681280), t(48687936), 's(48655040),  y(48654144), 0(48679232), ur(48654400),  f(48647232), @(48681280), v(48688192), 0(48679232), rite(49020864),  m(48648256), 0(48679232), vi(35071168), e(48686016), ?(48681152), 
test_gpt_tokenizer : 7 tests failed out of 100 tests.
