
Load sentencepiece tokenizer for evaluation #1904

Closed
ayushsml opened this issue May 30, 2024 · 2 comments


@ayushsml

Hi,

My tokenizer was trained with Google's sentencepiece library, which produces a .model and a .vocab file. lm-eval requires a .json file for transformers.AutoTokenizer.from_pretrained in this line.
A few discussions suggested converting the spm tokenizer to HF format and using the save_pretrained function.
However, these suggestions are not working.

I get the following error when running without passing a json file:

OSError: trained-tokenizer does not appear to have a file named config.json
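For context, the usual way to attempt this kind of spm-to-HF conversion is via transformers' slow-to-fast converter. A minimal sketch, assuming a LLaMA-style tokenizer wrapper and hypothetical file paths (neither is from this thread):

```python
def convert_spm_to_fast(spm_model_path: str, out_json_path: str):
    """Wrap a sentencepiece .model in a slow HF tokenizer, convert it to a
    fast tokenizer, and save it as a tokenizer.json file.

    Sketch only: LlamaTokenizer is an assumed wrapper; any slow tokenizer
    class backed by sentencepiece could be substituted.
    """
    # Imports kept inside the function so the sketch documents its
    # dependencies without requiring them at module load time.
    from transformers import LlamaTokenizer
    from transformers.convert_slow_tokenizer import convert_slow_tokenizer

    slow = LlamaTokenizer(vocab_file=spm_model_path)
    fast = convert_slow_tokenizer(slow)  # returns a tokenizers.Tokenizer
    fast.save(out_json_path)             # writes tokenizer.json
```

The resulting tokenizer.json can then be loaded with transformers.PreTrainedTokenizerFast(tokenizer_file=...), which is what AutoTokenizer expects to find.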

@haileyschoelkopf
Contributor

Hi! I think this may be a HF tokenizers issue. Have you tried asking them, or in the HF Discord?

We also support passing in an already-initialized HF tokenizer object, which may help, although this still requires loading the tokenizer into a HF Tokenizer class. So you could use the from_spm suggestion and pass that tokenizer in via HFLM(tokenizer=my_spm_initialized_tok). Hope this helps.
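Passing a pre-built tokenizer into HFLM could look roughly like this. A hedged sketch: the model path and tokenizer.json path are placeholders, not values from this thread:

```python
def build_hflm(model_path: str, tokenizer_json_path: str):
    """Construct an lm-eval HFLM with an already-initialized HF tokenizer,
    bypassing AutoTokenizer.from_pretrained's file-layout expectations.

    Sketch only: both path arguments are hypothetical placeholders.
    """
    # Imports kept inside the function so the sketch documents its
    # dependencies without requiring them at module load time.
    from transformers import PreTrainedTokenizerFast
    from lm_eval.models.huggingface import HFLM

    my_spm_initialized_tok = PreTrainedTokenizerFast(
        tokenizer_file=tokenizer_json_path
    )
    return HFLM(pretrained=model_path, tokenizer=my_spm_initialized_tok)
```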

@ayushsml
Author

Thanks, Hailey, for the reply. We have resolved the issue; documenting the solution here.

The issue is with this line, which treated our BPE tokenizer as a unigram tokenizer.

We modified the SPMConverter class so that a BPE tokenizer reaches the correct conditional branch.
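The fix amounts to dispatching on the tokenizer type recorded in the sentencepiece model rather than assuming unigram. An illustrative pure-Python sketch (not the actual transformers code): the sentencepiece proto's TrainerSpec.ModelType enum defines UNIGRAM=1, BPE=2, WORD=3, CHAR=4, and the converter should branch on that value:

```python
def select_tokenizer_kind(model_type: int) -> str:
    """Dispatch on sentencepiece's TrainerSpec.model_type value.

    Per the sentencepiece proto: UNIGRAM=1, BPE=2, WORD=3, CHAR=4.
    An SPMConverter-style class should route BPE models (model_type == 2)
    to the BPE branch instead of falling through to unigram.
    """
    if model_type == 1:
        return "unigram"
    if model_type == 2:
        return "bpe"
    raise ValueError(f"Unsupported sentencepiece model_type: {model_type}")
```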
