"Please select a token to use as pad_token" error for alpaca-lora-7b model #434

Open

oshev opened this issue Apr 24, 2023 · 6 comments

@oshev
oshev commented Apr 24, 2023

When I run it on the alpaca-lora-7b model like this: `python main.py --model hf-causal-experimental --model_args pretrained=chainyo/alpaca-lora-7b --tasks qasper --device cuda:4`, I get an error:

Traceback (most recent call last):
  File "main.py", line 108, in <module>
    main()
  File "main.py", line 79, in main
    results = evaluator.simple_evaluate(
  File "/home/user/code/lm-evaluation-harness/lm_eval/utils.py", line 182, in _wrapper
    return fn(*args, **kwargs)
  File "/home/user/code/lm-evaluation-harness/lm_eval/evaluator.py", line 86, in simple_evaluate
    results = evaluate(
  File "/home/user/code/lm-evaluation-harness/lm_eval/utils.py", line 182, in _wrapper
    return fn(*args, **kwargs)
  File "/home/user/code/lm-evaluation-harness/lm_eval/evaluator.py", line 247, in evaluate
    resps = getattr(lm, reqtype)([req.args for req in reqs])
  File "/home/user/code/lm-evaluation-harness/lm_eval/base.py", line 820, in fn
    rem_res = getattr(self.lm, attr)(remaining_reqs)
  File "/home/user/code/lm-evaluation-harness/lm_eval/models/huggingface.py", line 395, in greedy_until
    token_context = self.tok_encode_batch(context)
  File "/home/user/code/lm-evaluation-harness/lm_eval/models/huggingface.py", line 354, in tok_encode_batch
    return self.tokenizer(
  File "/home/user/code/lm-evaluation-harness/venv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2538, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/home/user/code/lm-evaluation-harness/venv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2624, in _call_one
    return self.batch_encode_plus(
  File "/home/user/code/lm-evaluation-harness/venv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2806, in batch_encode_plus
    padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
  File "/home/user/code/lm-evaluation-harness/venv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2443, in _get_padding_truncation_strategies
    raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.

The same problem occurs when I try Alpaca with the squad2 dataset.

Note that this dataset works fine with the Dolly model. I tested it with dolly-v2-12b (command: `python main.py --model hf-causal-experimental --model_args pretrained=databricks/dolly-v2-12b --tasks qasper --device cuda:4`).

It gives tons of repeated messages like this:

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
 46%|████████████████▋         | 165/355 [07:02<07:25,  2.34s/it]
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
 47%|████████████████▉         | 166/355 [07:05<08:45,  2.78s/it]
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
 47%|█████████████████         | 167/355 [07:06<06:53,  2.20s/it]
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
 47%|█████████████████▏        | 168/355 [07:07<05:35,  1.79s/it]
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
 48%|█████████████████▍

But I do get the metrics at the end in the table:

hf-causal-experimental (pretrained=databricks/dolly-v2-12b), limit: None, provide_description: False, num_fewshot: 0, batch_size: None
| Task |Version|    Metric    |Value |   |Stderr|
|------|------:|--------------|-----:|---|-----:|
|qasper|      0|f1_yesno      |0.5403|±  |0.0416|
|      |       |f1_abstractive|0.1351|±  |0.0064|
@haileyschoelkopf
Contributor

Thanks for opening an issue!!

Re: Alpaca, you can fix this error by setting the following right after the tokenizer is initialized:

tokenizer.pad_token = tokenizer.eos_token

I can consider how we want to allow users to pass this via the command line.

For the latter logging message, I believe the same line as above will silence that warning? You'll get the same results either way, though.
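
For reference, a minimal stand-alone sketch of that fix (assuming the tokenizer is loaded via transformers.AutoTokenizer; this is not the harness's exact code):

from transformers import AutoTokenizer

# Reuse the EOS token for padding right after the tokenizer is created,
# so batched encoding with padding works.
tokenizer = AutoTokenizer.from_pretrained("chainyo/alpaca-lora-7b")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token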

@oshev
Author

oshev commented Apr 26, 2023

Thanks for looking into this!

I tried adding the code you suggested to huggingface.py at line 176, but this didn't solve the problem.

In fact, I found this code is already in the file, in the _create_auto_tokenizer function at line 283.

@oshev
Author

oshev commented Apr 26, 2023

I did a bit of digging. The problem is that for the Alpaca model, tokenizer.eos_token is empty.

Instead of doing what you suggested, I replaced line 283 with the following:

tokenizer.pad_token = '</s>'

(to my knowledge, </s> is the eos token for Alpaca, but in any case using [PAD], <s>, or </s> doesn't change the evaluation results)

This change caused another error at line 407.

I changed this

                for term in until:
                    response = response.split(term)[0]

with this:

                for term in until:
                    if term:
                        response = response.split(term)[0]
                    else:
                        response = response.strip()

Now, I can run evaluation on Alpaca (and Dolly) and get results. I'm not sure if they are correct, though. On qasper, I get better results for Dolly than for Alpaca, which seems a bit suspicious.

The issue of the Alpaca model having no pad token might be related to these problems:

huggingface/transformers#22312
https://github.com/huggingface/transformers/pull/22402/files

In any case, it would be good if lm-evaluation-harness could handle this.

@haileyschoelkopf
Contributor

Thanks for looking into this, I will fix this issue at line 407!

I'll need to do some more digging on the Alpaca issue, though. If only this particular Alpaca upload fails with the tokenizer.pad_token = tokenizer.eos_token fix, then we probably don't want to handle it separately in the code.

@oshev
Author

oshev commented May 2, 2023

I don't think the failure of tokenizer.pad_token = tokenizer.eos_token is exclusive to Alpaca. It's rather a problem for any model where eos_token isn't set. It would be good to detect this situation and have some fallback.
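
Something along these lines could serve as a fallback (a rough sketch with illustrative names, not the harness's actual code):

def set_pad_token(tokenizer):
    # Prefer reusing the EOS token for padding; only add a new [PAD] token
    # when the tokenizer has no eos_token at all.
    if tokenizer.pad_token is not None:
        return
    if tokenizer.eos_token:
        tokenizer.pad_token = tokenizer.eos_token
    else:
        tokenizer.add_special_tokens({"pad_token": "[PAD]"})
        # Adding a token grows the vocabulary, so the model's embeddings
        # would also need resizing (model.resize_token_embeddings(len(tokenizer))).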

@ylwangy

ylwangy commented May 22, 2023

I have the same issue.

StellaAthena self-assigned this Nov 8, 2023