gpt_tokenize : bug in tokenization and incapable of double-byte languages #170
I found a quick fix: replace line 262 in 2a75bd4 with:
Perplexity tests: model = mpt-7b-base-ggml-f16.bin
Without this fix:
With this fix:
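Since the patch itself is not shown above, here is a minimal sketch of the greedy longest-match tokenization that gpt_tokenize is meant to perform; the single-byte fallback for unmatched input is an assumption for illustration, not the actual fix:

```cpp
// Minimal sketch of greedy longest-match tokenization (not the actual patch).
// token_to_id mirrors the vocab map used by gpt_tokenize in examples/common.cpp.
#include <cstdio>
#include <map>
#include <string>
#include <vector>

std::vector<int> greedy_tokenize(const std::map<std::string, int> & token_to_id,
                                 const std::string & text) {
    std::vector<int> tokens;
    size_t i = 0;
    while (i < text.size()) {
        // try the longest substring starting at i first, then shrink
        size_t len = text.size() - i;
        for (; len > 0; --len) {
            const auto it = token_to_id.find(text.substr(i, len));
            if (it != token_to_id.end()) {
                tokens.push_back(it->second);
                break;
            }
        }
        if (len == 0) {
            // no vocab entry matched: skip one byte instead of looping forever
            fprintf(stderr, "tokenizer sketch: unknown byte at position %zu\n", i);
            i += 1;
        } else {
            i += len;
        }
    }
    return tokens;
}
```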
@klosax Thanks! Btw, it would be nice to add a simple perplexity tool. @jaeminSon Do you still observe issues with the latest version?
I am currently working on a perplexity tool. The tool should be able to measure perplexity correctly on wiki.test.raw, but it seems that the gpt tokenizer does not work correctly with Unicode characters.
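For reference, the core of such a tool is small: perplexity is the exponential of the mean negative log-likelihood over the evaluated tokens. A minimal sketch, assuming the model already provides the log-probability of each target token given its context:

```cpp
// Perplexity from per-token log-probabilities: PPL = exp(-(1/N) * sum log p).
#include <cmath>
#include <vector>

double perplexity(const std::vector<double> & token_logprobs) {
    if (token_logprobs.empty()) {
        return 0.0; // undefined; caller should pass at least one token
    }
    double nll = 0.0;
    for (const double lp : token_logprobs) {
        nll -= lp; // accumulate negative log-likelihood
    }
    return std::exp(nll / token_logprobs.size());
}
```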
@ggerganov The current version fixes the above issue! LoL. It would be great if perplexities were compared between the Hugging Face model and the converted ggml model. Also, it would be nice to have test code for the tokenizers, perhaps something like https://github.com/ggerganov/llama.cpp/blob/master/tests/test-tokenizer-0.cpp.
Yup, all great suggestions! PRs welcome.
Each individual model has its own distinct tokenizer, even though the transformer architecture is identical. Conducting general tokenization tests would mean evaluating all currently available models, which is untenable. In the ggml repo, each architecture is converted and run separately under the 'examples' directory, so test approaches may differ from architecture to architecture. In that context, I added a function to main.cpp under gpt-neox that checks the correctness of tokenization for a given model in a somewhat limited sense; gpt-neox is the architecture I'm currently eager to use with ggml.
How so? It does not sound too difficult. Here is one possible approach:
I also considered that option, but I worried that there could be multiple models for different languages even within the same architecture. Models with different parameter sizes may also have different vocabularies, and thus different tokenizers. When I checked how llama.cpp handles this, they test with minimal cases; still, since their primary target is just llama, testing the tokenizer is easier there than in ggml. Hugging Face tokenizers also tests quite minimally (for the unigram case, see https://github.com/huggingface/tokenizers/blob/main/tokenizers/tests/unigram.rs). Perhaps I can try to cover as many models as I can with English-language tokens.
As suggested here, I listed several prompts to test tokenization (under examples/prompts/test-cases.txt) and saved the tokenization results using the Hugging Face tokenizers. But there are several modifications from your suggestions.
In terms of testing the modified code, I tested it with polyglot-ko-1.3b. I will check other models sooner or later.
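For readers following along, a sketch of what such a comparison could look like; the file format here (alternating lines: a prompt, then its Hugging Face token ids as space-separated integers) is an assumption and may differ from the actual layout of test-cases.txt:

```cpp
// Sketch: compare a tokenizer under test against saved reference token ids.
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// tokenize: the tokenizer under test, e.g. a wrapper that binds a vocab and
// calls gpt_tokenize from examples/common.cpp
int run_tokenizer_tests(const char * path,
                        std::vector<int> (*tokenize)(const std::string &)) {
    std::ifstream in(path);
    std::string prompt, id_line;
    int failures = 0;
    while (std::getline(in, prompt) && std::getline(in, id_line)) {
        std::vector<int> expected;
        std::istringstream iss(id_line);
        for (int id; iss >> id; ) {
            expected.push_back(id);
        }
        if (tokenize(prompt) != expected) {
            fprintf(stderr, "tokenization mismatch on prompt: %s\n", prompt.c_str());
            failures++;
        }
    }
    return failures; // 0 means every prompt matched the reference
}
```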
replit uses its own method and struct for tokenization in its main.cpp. I will put test_tokenizer in main.cpp.
Cerebras-GPT-111M has slightly different tokenization compared with Hugging Face on random English texts (93 matched out of 100).
There is a bug in the gpt_tokenize function in examples/common.cpp. The following is a comparison with a Hugging Face tokenizer:

The example above is due to this line:
ggml/examples/common.cpp, line 269 in 2a75bd4
Furthermore, the current implementation does not handle multi-byte characters, such as those used in Korean, Japanese, and Chinese. I'm wondering whether it would be ideal to modify the gpt_tokenize function to handle multi-byte characters, or whether the repo targets single-byte languages only, since English and most Latin-script languages are single-byte.
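For illustration, a minimal sketch (not from the repo) of the kind of helper that would let the tokenizer advance over whole UTF-8 codepoints instead of splitting them at byte boundaries; in UTF-8, Korean, Japanese, and Chinese characters typically occupy three bytes:

```cpp
// Length in bytes of the UTF-8 codepoint that starts at the given lead byte.
#include <cstddef>
#include <string>

size_t utf8_len(unsigned char lead) {
    if (lead < 0x80)           return 1; // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx: most CJK characters
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 1;                            // invalid lead byte: skip it
}

// Example: extract the first whole codepoint instead of the first byte.
std::string first_codepoint(const std::string & s) {
    return s.empty() ? s : s.substr(0, utf8_len((unsigned char) s[0]));
}
```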