Load sentencepiece tokenizer for evaluation #1904
Hi! I think this may be a HF tokenizers issue -- have you tried asking them, or in the HF Discord? We also support passing in an already-initialized HF tokenizer object, which may help -- but that still requires loading the tokenizer into a HF Tokenizer class. So you could use the …
Thanks Hailey for the reply. We have resolved the issue now; documenting the solution here. The issue is with this line, which was treating BPE as a unigram tokenizer. We modified the SPMConverter class to ensure that the BPE tokenizer goes down the proper conditional branch.
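The fix described above amounts to branching on sentencepiece's stored model type instead of assuming unigram. A minimal stdlib-only sketch of that dispatch, using a hypothetical `pick_converter` helper (not actual lm-eval or HF code); the numeric values follow sentencepiece's `trainer_spec.model_type` enum, where 1 is UNIGRAM and 2 is BPE:

```python
# Sketch of the conditional the fix restores: branch on sentencepiece's
# trainer_spec.model_type rather than always treating the model as unigram.
# Enum values per sentencepiece_model.proto: 1 = UNIGRAM, 2 = BPE.
UNIGRAM, BPE = 1, 2

def pick_converter(model_type: int) -> str:
    """Return which HF `tokenizers` model a .model file should map to."""
    if model_type == UNIGRAM:
        return "Unigram"
    if model_type == BPE:
        return "BPE"
    raise ValueError(f"unsupported sentencepiece model_type: {model_type}")
```

With this dispatch in place, a BPE-trained `.model` file is no longer routed through the unigram conversion path.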
Hi,

My tokenizer was trained using Google's sentencepiece library, which produces a `.model` and a `.vocab` file. lm-eval requires a `.json` file for the `transformers.AutoTokenizer.from_pretrained` call in this line. A few discussions suggested converting the spm tokenizer to HF format and using the `save_pretrained` function; however, these suggestions are not working.

I get the following error when running without passing the json file:

`OSError: trained-tokenizer does not appear to have a file named config.json`
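For context on that error: when `AutoTokenizer.from_pretrained` is pointed at a local directory, it first looks for tokenizer files such as `tokenizer.json` or `tokenizer_config.json`; if neither is found, it falls back to reading a model `config.json`, which yields the `OSError` shown. A stdlib-only sketch of that sanity check (the helper name is hypothetical, not a transformers API):

```python
import pathlib

def tokenizer_files_present(dirpath: str) -> bool:
    """Check whether a directory contains the files AutoTokenizer looks
    for before it falls back to requiring a model config.json."""
    d = pathlib.Path(dirpath)
    return any((d / name).is_file()
               for name in ("tokenizer.json", "tokenizer_config.json"))
```

Running a check like this on the `trained-tokenizer` directory before calling `from_pretrained` makes the failure mode explicit: the directory produced from the raw `.model`/`.vocab` pair simply lacks the files transformers expects.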