EleutherAI/pythia-800m tokenizer adds unusual kwargs, causes a ValueError when evaluating model #36

Closed
ejmichaud opened this issue Dec 19, 2022 · 2 comments

@ejmichaud

It seems like the EleutherAI/pythia-800m tokenizer includes 'token_type_ids' in its output, which leads to a ValueError when running the following code:

from transformers import GPTNeoXForCausalLM, AutoTokenizer

model = GPTNeoXForCausalLM.from_pretrained(
  "EleutherAI/pythia-800m",
  revision="step143000",
  cache_dir=".pythia-800m/step143000",
)

tokenizer = AutoTokenizer.from_pretrained(
  "EleutherAI/pythia-800m",
  revision="step143000",
  cache_dir="./pythia-800m/step143000",
)

inputs = tokenizer("Hello, I am", return_tensors="pt")  # encoding includes 'token_type_ids'
model.generate(**inputs)  # raises the ValueError below

Here is the stack trace:

Traceback (most recent call last):
  File "eval.py", line 76, in <module>
    outputs = model.generate(**inputs, temperature=0.0, max_new_tokens=40)
  File "/om2/user/ericjm/miniconda3/envs/phase-changes/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/om2/user/ericjm/miniconda3/envs/phase-changes/lib/python3.8/site-packages/transformers/generation/utils.py", line 1296, in generate
    self._validate_model_kwargs(model_kwargs.copy())
  File "/om2/user/ericjm/miniconda3/envs/phase-changes/lib/python3.8/site-packages/transformers/generation/utils.py", line 993, in _validate_model_kwargs
    raise ValueError(
ValueError: The following `model_kwargs` are not used by the model: ['token_type_ids'] (note: typos in the generate arguments will also show up in this list)
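
(The error comes from `generate`'s kwarg validation: `_validate_model_kwargs` rejects any key that the model's forward pass does not accept, and GPT-NeoX models take no `token_type_ids` input, so the extra key from the tokenizer trips the check.)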

I can get around this error by using the tokenizer from another one of the models. This tokenizer, for instance, works:

tokenizer = AutoTokenizer.from_pretrained(
  "EleutherAI/pythia-19m",
  revision="step143000",
  cache_dir="./pythia-19m/step143000",
)

It seems like the tokenizers are the same for all the models, so this issue is pretty easy to get around, but I just thought I'd report it.
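
An alternative workaround (a sketch on my part, not verified against every checkpoint) is to drop the offending key from the encoding before generating, reusing the model and tokenizer loaded above:

inputs = tokenizer("Hello, I am", return_tensors="pt")
inputs.pop("token_type_ids", None)  # remove the key the model does not accept
outputs = model.generate(**inputs)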

@haileyschoelkopf
Collaborator

Darn, I thought I'd caught all of these. I'll fix this tomorrow, thanks for raising the issue! :)

Yup, any other tokenizer would work!

@haileyschoelkopf
Collaborator

Patched for all 800m intermediate ckpts!
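
A quick way to confirm the fix (a sketch, assuming the patch simply removed the extra key from the tokenizer's configuration; `model_input_names` is the tokenizer attribute that controls which keys it returns):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
  "EleutherAI/pythia-800m",
  revision="step143000",
)
print(tokenizer.model_input_names)  # expect ['input_ids', 'attention_mask']
inputs = tokenizer("Hello, I am", return_tensors="pt")
print("token_type_ids" in inputs)   # expect False after the patch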
