
SQuAD preprocessing not working for roberta (wrong p_mask) #2788

Closed · tholor opened this issue Feb 9, 2020 · 11 comments · Fixed by #4049
Labels: Core: Pipeline · Should Fix · Usage

tholor (Contributor) commented Feb 9, 2020

Description
The pipeline for QA crashes for roberta models.
It loads the model and tokenizer correctly, but the SQuAD preprocessing produces a wrong p_mask, which leaves no admissible answer span and triggers the error message below.

The observed p_mask for a roberta model is
[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]

while it should mask only the question tokens, like this:
[0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, ...]

I think the deeper root cause here is that roberta's token_type_ids returned from encode_plus are now all zeros (introduced in #2432), while the creation of p_mask in squad_convert_example_to_features relies on exactly this information:

# p_mask: mask with 1 for token than cannot be in the answer (0 for token which can be in an answer)
# Original TF implem also keep the classification token (set to 0) (not sure why...)
p_mask = np.array(span["token_type_ids"])
p_mask = np.minimum(p_mask, 1)

if tokenizer.padding_side == "right":
    # Limit positive values to one
    p_mask = 1 - p_mask

p_mask[np.where(np.array(span["input_ids"]) == tokenizer.sep_token_id)[0]] = 1

# Set the CLS index to '0'
p_mask[cls_index] = 0

I haven't checked yet, but this might also affect training/eval if p_mask is used there.
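For illustration, here is a minimal sketch of how a p_mask could be built without relying on token_type_ids, by masking everything up to the first SEP instead. This is a hypothetical helper, not the library's implementation, and it assumes the usual <cls> question <sep> ... context ... layout with the question encoded first:

import numpy as np

def build_p_mask(input_ids, sep_token_id, cls_index=0):
    """Sketch: 1 = token cannot be part of the answer, 0 = token can be."""
    input_ids = np.asarray(input_ids)
    first_sep = int(np.where(input_ids == sep_token_id)[0][0])
    p_mask = np.zeros_like(input_ids)
    p_mask[: first_sep + 1] = 1            # question tokens and the first SEP
    p_mask[input_ids == sep_token_id] = 1  # any remaining SEP tokens
    p_mask[cls_index] = 0                  # keep CLS for the no-answer score
    return p_mask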

How to reproduce?

model_name = "deepset/roberta-base-squad2"
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
res = nlp({
    'question': 'What is roberta?',
    'context': 'Roberta is a language model that was trained for a longer time, on more data, without NSP'
}) 

results in

  File "/home/mp/deepset/dev/transformers/src/transformers/pipelines.py", line 847, in __call__
    for s, e, score in zip(starts, ends, scores)
  File "/home/mp/deepset/dev/transformers/src/transformers/pipelines.py", line 847, in <listcomp>
    for s, e, score in zip(starts, ends, scores)
KeyError: 0
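For reference, the wrong mask can also be inspected directly, without going through the pipeline. This is a hedged sketch using the public squad_convert_examples_to_features API as it exists in transformers 2.x (the argument values are just illustrative):

from transformers import AutoTokenizer
from transformers.data.processors.squad import SquadExample, squad_convert_examples_to_features

tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")
example = SquadExample(
    qas_id="0",
    question_text="What is roberta?",
    context_text="Roberta is a language model that was trained for a longer time, on more data, without NSP",
    answer_text=None,
    start_position_character=None,
    title="",
)
features = squad_convert_examples_to_features(
    examples=[example],
    tokenizer=tokenizer,
    max_seq_length=384,
    doc_stride=128,
    max_query_length=64,
    is_training=False,
)
# On affected versions this shows the all-ones mask described above
print(features[0].p_mask[:20])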

Environment

  • Ubuntu 18.04
  • Python 3.7.6
  • PyTorch 1.3.1
LysandreJik added the Core: Pipeline and Usage labels Feb 10, 2020
LysandreJik self-assigned this Feb 10, 2020
chinisan commented

I think I have a related problem regarding training/evaluation with run_squad.py.

I wanted to train a roberta model on my own Q&A dataset mixed with the SQuAD dataset by running:

python ./examples/run_squad.py \
    --output_dir=/home/jupyter/sec_roberta/roberta-base-mixed-quad \
    --model_type=roberta \
    --model_name_or_path=roberta-large \
    --do_train \
    --train_file=../sec_roberta/financial_and_squad2_train.json \
    --do_eval \
    --predict_file=../sec_roberta/financial_and_squad2_dev.json \
    --learning_rate=1.5e-5 \
    --num_train_epochs=2 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --overwrite_output_dir \
    --per_gpu_train_batch_size=6 \
    --per_gpu_eval_batch_size=6 \
    --warmup_steps 500 \
    --weight_decay 0.01 \
    --version_2_with_negative

I ran into this error:

02/12/2020 08:22:38 - INFO - __main__ -   Creating features from dataset file at .
  0%|          | 0/542 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "./examples/run_squad.py", line 853, in <module>
    main()
  File "./examples/run_squad.py", line 791, in main
    train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False)
  File "./examples/run_squad.py", line 474, in load_and_cache_examples
    examples = processor.get_train_examples(args.data_dir, filename=args.train_file)
  File "/opt/anaconda3/lib/python3.7/site-packages/transformers/data/processors/squad.py", line 501, in get_train_examples
    return self._create_examples(input_data, "train")
  File "/opt/anaconda3/lib/python3.7/site-packages/transformers/data/processors/squad.py", line 559, in _create_examples
    answers=answers,
  File "/opt/anaconda3/lib/python3.7/site-packages/transformers/data/processors/squad.py", line 633, in __init__
    self.start_position = char_to_word_offset[start_position_character]
IndexError: list index out of range

I tested my dataset on roberta-base and it works, so I don't necessarily think my dataset is the issue.

Also, I ran the same code using the SQuAD 2.0 dataset on roberta-large, and on an LM-finetuned version of roberta-large, and both work, so this is all very mysterious to me.

I thought it could be related.
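(As a side note, and purely as a hedged sanity check: the IndexError above is raised when an answer's answer_start points past the end of its context, so a quick scan of the training file can reveal whether any examples are malformed. The file name below is taken from the command above; adjust the path as needed.)

import json

# Flag answers whose answer_start falls outside the context string; such
# entries make char_to_word_offset[start_position_character] raise IndexError.
with open("../sec_roberta/financial_and_squad2_train.json") as f:
    data = json.load(f)["data"]

for article in data:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            for answer in qa.get("answers", []):
                if answer["answer_start"] >= len(context):
                    print(qa["id"], answer["answer_start"], len(context))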

LysandreJik added the Should Fix label Feb 24, 2020
joshuawagner93 commented Mar 26, 2020

Update: a fresh install of transformers fixed it for me.

I ran into a similar error when trying to use the run_squad.py example to train roberta-large on SQuAD 2.0. When I run

export DATA_DIR=./data
python ./transformers/examples/run_squad.py \
    --model_type roberta \
    --model_name_or_path roberta-large \
    --do_train \
    --do_eval \
    --version_2_with_negative \
    --train_file $DATA_DIR/squad2/train-v2.0.json \
    --predict_file $DATA_DIR/squad2/dev-v2.0.json \
    --per_gpu_eval_batch_size=6 \
    --per_gpu_train_batch_size=6 \
    --learning_rate 3e-5 \
    --num_train_epochs 2.0 \
    --overwrite_output_dir \
    --overwrite_cache \
    --max_seq_length 384 \
    --doc_stride 128 \
    --save_steps 100000 \
    --output_dir ./roberta_squad/

I get the following error:

Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/opt/anaconda3/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/joshua_wagner/.local/lib/python3.7/site-packages/transformers/data/processors/squad.py", line 198, in squad_convert_example_to_features
    p_mask = np.array(span["token_type_ids"])
KeyError: 'token_type_ids'
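For what it's worth, a minimal sketch of a guard around the failing line (this is not the actual patch from #3439; it only falls back to zeros when the tokenizer returns no token_type_ids, which avoids the KeyError but still gives roberta no meaningful question mask):

import numpy as np

def token_type_p_mask(span):
    # Tokenizers such as roberta's may not return token_type_ids; fall back
    # to an all-zero vector of the right length instead of raising KeyError.
    token_type_ids = span.get("token_type_ids", [0] * len(span["input_ids"]))
    return np.minimum(np.array(token_type_ids), 1)

# Example: a roberta-style encoding without token_type_ids
print(token_type_p_mask({"input_ids": [0, 100, 2, 2, 200, 300, 2]}))  # -> all zeros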

Environment:

  • Debian GNU/Linux 9.11
  • Python 3.7
  • PyTorch 1.4.0

borhenryk (Contributor) commented

Same error as @joshuawagner93.

LysandreJik (Member) commented

@joshuawagner93 @HenrykBorzymowski, this issue should have been patched with #3439. Could you install the latest release and let me know if it fixes your issue?

borhenryk (Contributor) commented

@LysandreJik works perfectly fine! Thx

joshuawagner93 commented

@LysandreJik A reinstall fixed the issue, thank you.

tholor (Contributor, Author) commented Apr 27, 2020

@LysandreJik Unfortunately, we still face the same issue when we try to use roberta in the pipeline for inference. #3439 didn't seem to help for this.

LysandreJik (Member) commented Apr 28, 2020

Hi @tholor, indeed, I thought this issue was resolved when it really wasn't. I just opened #4049, which should fix the issue.

tholor (Contributor, Author) commented Apr 28, 2020

Awesome, thanks for working on this @LysandreJik!

LysandreJik (Member) commented

@tholor, the PR should be merged soon, thank you for your patience!

tholor (Contributor, Author) commented May 7, 2020

Great, thank you! Looking forward to it :)
