
Load sentencepiece tokenizer for evaluation #1904

Closed
ayushsml opened this issue May 30, 2024 · 2 comments


@ayushsml

Hi,

My tokenizer was trained with Google's sentencepiece library, which produces a .model and a .vocab file. lm-eval requires a .json file for transformers.AutoTokenizer.from_pretrained in this line.
A few discussions suggested converting the spm tokenizer to HF format and using the save_pretrained function.
However, these suggestions are not working.

I get the following error when running without passing a json file:

OSError: trained-tokenizer does not appear to have a file named config.json
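For context, the usual way to attempt this kind of spm-to-HF conversion is via transformers' slow-to-fast converter. A minimal sketch, assuming a LLaMA-style tokenizer wrapper and hypothetical file paths (neither is from this thread):

```python
def convert_spm_to_fast(spm_model_path: str, out_json_path: str):
    """Wrap a sentencepiece .model in a slow HF tokenizer, convert it to a
    fast tokenizer, and save it as a tokenizer.json file.

    Sketch only: LlamaTokenizer is an assumed wrapper; any slow tokenizer
    class backed by sentencepiece could be substituted.
    """
    # Imports kept inside the function so the sketch documents its
    # dependencies without requiring them at module load time.
    from transformers import LlamaTokenizer
    from transformers.convert_slow_tokenizer import convert_slow_tokenizer

    slow = LlamaTokenizer(vocab_file=spm_model_path)
    fast = convert_slow_tokenizer(slow)  # returns a tokenizers.Tokenizer
    fast.save(out_json_path)             # writes tokenizer.json
```

The resulting tokenizer.json can then be loaded with transformers.PreTrainedTokenizerFast(tokenizer_file=...), which is what AutoTokenizer expects to find.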

@haileyschoelkopf
Contributor

Hi! I think this may be a HF tokenizers issue. Have you tried asking them, or in the HF Discord?

We also support passing in an already-initialized HF tokenizer object, which may help, although this still requires loading the tokenizer into a HF Tokenizer class. So you could use the from_spm suggestion and pass that tokenizer in via HFLM(tokenizer=my_spm_initialized_tok). Hope this helps.
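Passing a pre-built tokenizer into HFLM could look roughly like this. A hedged sketch: the model path and tokenizer.json path are placeholders, not values from this thread:

```python
def build_hflm(model_path: str, tokenizer_json_path: str):
    """Construct an lm-eval HFLM with an already-initialized HF tokenizer,
    bypassing AutoTokenizer.from_pretrained's file-layout expectations.

    Sketch only: both path arguments are hypothetical placeholders.
    """
    # Imports kept inside the function so the sketch documents its
    # dependencies without requiring them at module load time.
    from transformers import PreTrainedTokenizerFast
    from lm_eval.models.huggingface import HFLM

    my_spm_initialized_tok = PreTrainedTokenizerFast(
        tokenizer_file=tokenizer_json_path
    )
    return HFLM(pretrained=model_path, tokenizer=my_spm_initialized_tok)
```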

@ayushsml
Author

Thanks, Hailey, for the reply. We have resolved the issue; documenting the solution here.

The issue is with this line, which treated our BPE tokenizer as a unigram tokenizer.

We modified the SPMConverter class so that a BPE tokenizer reaches the correct conditional branch.
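The fix amounts to dispatching on the tokenizer type recorded in the sentencepiece model rather than assuming unigram. An illustrative pure-Python sketch (not the actual transformers code): the sentencepiece proto's TrainerSpec.ModelType enum defines UNIGRAM=1, BPE=2, WORD=3, CHAR=4, and the converter should branch on that value:

```python
def select_tokenizer_kind(model_type: int) -> str:
    """Dispatch on sentencepiece's TrainerSpec.model_type value.

    Per the sentencepiece proto: UNIGRAM=1, BPE=2, WORD=3, CHAR=4.
    An SPMConverter-style class should route BPE models (model_type == 2)
    to the BPE branch instead of falling through to unigram.
    """
    if model_type == 1:
        return "unigram"
    if model_type == 2:
        return "bpe"
    raise ValueError(f"Unsupported sentencepiece model_type: {model_type}")
```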
