Weird inconsistency in Tokenizer vocabulary #151

javirandor · 2024-03-04T09:44:50Z

Hello everyone!

I found a weird inconsistency in the tokenizer vocabulary. I wanted to ask why this could be happening.

I have loaded a tokenizer from HF:

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")

If I run

tokenizer.encode("\u200b")

The output is [12882]. However, taking a look at the vocabulary used for training (here), I cannot find the token \u200b and the token id corresponds to a different string

"\u00e2\u0122\u012d": 12882,

This seems to generally happen with unicode characters.

Why could this be happening?? I just want to make sure that the tokenizer I use for training is equivalent to the HF tokenizers since my training (as anticipated in your README) results in a weird tokenizer.

Thanks a lot :)

The text was updated successfully, but these errors were encountered:

haileyschoelkopf · 2024-03-07T17:27:08Z

I don't know exactly what's going on here yet, but I can confirm this file at utils/20B_tokenizer.json is precisely the one used for vocab_file during Pythia training.

also the following snippet shows the result upon loading the two tokenizers and encoding \u200b:

>>> tok1 = transformers.PretrainedTokenizerFast.from_file("utils/20B_tokenizer.json")
>>> tok2 = transformers.AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")

>>> tok1("\u200b")
{'input_ids': [12882], 'token_type_ids': [0], 'attention_mask': [1]}
>>> tok2("\u200b")
{'input_ids': [12882], 'attention_mask': [1]}

haileyschoelkopf self-assigned this Mar 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Weird inconsistency in Tokenizer vocabulary #151

Weird inconsistency in Tokenizer vocabulary #151

javirandor commented Mar 4, 2024 •

edited

Loading

haileyschoelkopf commented Mar 7, 2024

Weird inconsistency in Tokenizer vocabulary #151

Weird inconsistency in Tokenizer vocabulary #151

Comments

javirandor commented Mar 4, 2024 • edited Loading

haileyschoelkopf commented Mar 7, 2024

javirandor commented Mar 4, 2024 •

edited

Loading