examples : add tokenization tests and refactor codes #186

Merged (26 commits) on May 27, 2023

Conversation

jaeminSon
Contributor

(An attempt to fix #170)

examples : test tokenization + support utf-8 and multi-byte encodings + refactor gpt_tokenize + refactor convert-h5-to-ggml.py

Comment on lines 239 to 240
std::wstring text_multibytes = convert_to_wstring(text);
std::string utf8conv = convert_to_utf8(text_multibytes);
Contributor

I reverted this in PR #184 because it breaks other models; see the comment in PR #179. I think each model will have to handle the encoding of its own tokens.
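
For readers unfamiliar with the round trip under discussion: the PR's `convert_to_wstring` and `convert_to_utf8` are not shown in this thread, so the following is only a minimal sketch of what a UTF-8 decode/encode pair can look like. The names `utf8_to_codepoints` and `codepoints_to_utf8` are hypothetical, the sketch uses `std::u32string` instead of `std::wstring` to avoid the platform-dependent width of `wchar_t`, and it does not fully validate malformed input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the PR): decode UTF-8 bytes into code points.
// Invalid lead bytes and truncated sequences become U+FFFD; continuation
// bytes are not otherwise validated in this sketch.
std::u32string utf8_to_codepoints(const std::string & s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        char32_t cp;
        std::size_t len;
        if (c < 0x80)              { cp = c;        len = 1; } // ASCII
        else if ((c >> 5) == 0x06) { cp = c & 0x1F; len = 2; } // 110xxxxx
        else if ((c >> 4) == 0x0E) { cp = c & 0x0F; len = 3; } // 1110xxxx
        else if ((c >> 3) == 0x1E) { cp = c & 0x07; len = 4; } // 11110xxx
        else                       { cp = 0xFFFD;   len = 1; } // invalid lead byte
        if (i + len > s.size()) {
            out.push_back(0xFFFD); // sequence cut off at end of input
            break;
        }
        // Fold in the low 6 bits of each continuation byte.
        for (std::size_t k = 1; k < len; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Hypothetical helper (not from the PR): encode code points back to UTF-8.
std::string codepoints_to_utf8(const std::u32string & cps) {
    std::string out;
    for (const char32_t cp : cps) {
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

For well-formed UTF-8 input the round trip returns the original bytes. Input in any other encoding is altered, which is one way a blanket conversion in shared code could affect models whose token text is not UTF-8. That is an illustration only; the thread itself does not spell out the failure.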

Resolved review threads (outdated): examples/common.cpp (3), examples/prompts/dolly-v2.txt (1)
Owner

@ggerganov left a comment

Great job 👍

After resolving the merge conflicts, we can merge it

@jaeminSon changed the title from "examples : add tokenization tests for gpt-neox and refactor codes" to "examples : add tokenization tests and refactor codes" on May 25, 2023
@ggerganov merged commit 765c9bc into ggerganov:master on May 27, 2023
@klosax mentioned this pull request on May 29, 2023
Development

Successfully merging this pull request may close these issues.

gpt_tokenize : bug in tokenization and incapable of double-byte languages
3 participants