"Please select a token to use as pad_token" error for alpaca-lora-7b model #434

Open

oshev opened this issue Apr 24, 2023 · 6 comments

@oshev
oshev commented Apr 24, 2023

When I run it on the alpaca-lora-7b model like this: `python main.py --model hf-causal-experimental --model_args pretrained=chainyo/alpaca-lora-7b --tasks qasper --device cuda:4`, I get an error:

Traceback (most recent call last):
  File "main.py", line 108, in <module>
    main()
  File "main.py", line 79, in main
    results = evaluator.simple_evaluate(
  File "/home/user/code/lm-evaluation-harness/lm_eval/utils.py", line 182, in _wrapper
    return fn(*args, **kwargs)
  File "/home/user/code/lm-evaluation-harness/lm_eval/evaluator.py", line 86, in simple_evaluate
    results = evaluate(
  File "/home/user/code/lm-evaluation-harness/lm_eval/utils.py", line 182, in _wrapper
    return fn(*args, **kwargs)
  File "/home/user/code/lm-evaluation-harness/lm_eval/evaluator.py", line 247, in evaluate
    resps = getattr(lm, reqtype)([req.args for req in reqs])
  File "/home/user/code/lm-evaluation-harness/lm_eval/base.py", line 820, in fn
    rem_res = getattr(self.lm, attr)(remaining_reqs)
  File "/home/user/code/lm-evaluation-harness/lm_eval/models/huggingface.py", line 395, in greedy_until
    token_context = self.tok_encode_batch(context)
  File "/home/user/code/lm-evaluation-harness/lm_eval/models/huggingface.py", line 354, in tok_encode_batch
    return self.tokenizer(
  File "/home/user/code/lm-evaluation-harness/venv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2538, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/home/user/code/lm-evaluation-harness/venv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2624, in _call_one
    return self.batch_encode_plus(
  File "/home/user/code/lm-evaluation-harness/venv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2806, in batch_encode_plus
    padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
  File "/home/user/code/lm-evaluation-harness/venv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2443, in _get_padding_truncation_strategies
    raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.

The same problem occurs when I try Alpaca with the squad2 dataset.

Note that this dataset works fine with the Dolly model. I tested it with dolly-v2-12b (command: `python main.py --model hf-causal-experimental --model_args pretrained=databricks/dolly-v2-12b --tasks qasper --device cuda:4`).

It gives tons of repeated messages like this:

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
 46%|████████████████▋         | 165/355 [07:02<07:25,  2.34s/it]
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
 47%|████████████████▉         | 166/355 [07:05<08:45,  2.78s/it]
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
 47%|█████████████████         | 167/355 [07:06<06:53,  2.20s/it]
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
 47%|█████████████████▏        | 168/355 [07:07<05:35,  1.79s/it]
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
 48%|█████████████████▍

But I do get the metrics at the end in the table:

hf-causal-experimental (pretrained=databricks/dolly-v2-12b), limit: None, provide_description: False, num_fewshot: 0, batch_size: None
| Task |Version|    Metric    |Value |   |Stderr|
|------|------:|--------------|-----:|---|-----:|
|qasper|      0|f1_yesno      |0.5403|±  |0.0416|
|      |       |f1_abstractive|0.1351|±  |0.0064|
@haileyschoelkopf
Contributor

Thanks for opening an issue!!

Re: Alpaca, you can fix this error by setting the following right after the tokenizer is initialized:

tokenizer.pad_token = tokenizer.eos_token

I can consider how we want to allow users to pass this via the command line.

For the latter logging message, I believe the same line as above will silence that warning? You'll get the same results either way, though.
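
For reference, a minimal stand-alone sketch of that fix (assuming the tokenizer is loaded via transformers.AutoTokenizer; this is not the harness's exact code):

from transformers import AutoTokenizer

# Reuse the EOS token for padding right after the tokenizer is created,
# so batched encoding with padding works.
tokenizer = AutoTokenizer.from_pretrained("chainyo/alpaca-lora-7b")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token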

@oshev
Author

oshev commented Apr 26, 2023

Thanks for looking into this!

I tried adding the code you suggested to huggingface.py at line 176, but this didn't solve the problem.

In fact, I found this code is already in the file, in the _create_auto_tokenizer function at line 283.

@oshev
Author

oshev commented Apr 26, 2023

I did a bit of digging. The problem is that for the Alpaca model, tokenizer.eos_token is empty.

Instead of doing what you suggested, I replaced line 283 with the following:

tokenizer.pad_token = '</s>'

(to my knowledge, </s> is the eos token for Alpaca, but in any case using [PAD], <s>, or </s> doesn't change the evaluation results)

This change caused another error at line 407.

I changed this

                for term in until:
                    response = response.split(term)[0]

with this:

                for term in until:
                    if term:
                        response = response.split(term)[0]
                    else:
                        response = response.strip()

Now, I can run evaluation on Alpaca (and Dolly) and get results. I'm not sure if they are correct, though. On qasper, I get better results for Dolly than for Alpaca, which seems a bit suspicious.

The issue of the Alpaca model having no pad token might be related to these problems:

huggingface/transformers#22312
https://github.com/huggingface/transformers/pull/22402/files

In any case, it would be good if lm-evaluation-harness could handle this.

@haileyschoelkopf
Contributor

Thanks for looking into this, I will fix this issue at line 407!

I'll need to do some more digging on the Alpaca issue, though. If only this particular Alpaca upload fails with the tokenizer.pad_token = tokenizer.eos_token fix, then we probably don't want to handle it separately in the code.

@oshev
Author

oshev commented May 2, 2023

I don't think the failure of tokenizer.pad_token = tokenizer.eos_token is exclusive to Alpaca. It's rather a problem for any model where eos_token isn't set. It would be good to detect this situation and have some fallback.
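
Something along these lines could serve as a fallback (a rough sketch with illustrative names, not the harness's actual code):

def set_pad_token(tokenizer):
    # Prefer reusing the EOS token for padding; only add a new [PAD] token
    # when the tokenizer has no eos_token at all.
    if tokenizer.pad_token is not None:
        return
    if tokenizer.eos_token:
        tokenizer.pad_token = tokenizer.eos_token
    else:
        tokenizer.add_special_tokens({"pad_token": "[PAD]"})
        # Adding a token grows the vocabulary, so the model's embeddings
        # would also need resizing (model.resize_token_embeddings(len(tokenizer))).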

@ylwangy

ylwangy commented May 22, 2023

I have the same issue.

StellaAthena self-assigned this Nov 8, 2023