
SQuAD preprocessing not working for roberta (wrong p_mask) #2788

Closed · tholor opened this issue Feb 9, 2020 · 11 comments · Fixed by #4049
Labels: Core: Pipeline · Should Fix · Usage

tholor (Contributor) commented Feb 9, 2020

Description
The pipeline for QA crashes for roberta models.
It loads the model and tokenizer correctly, but the SQuAD preprocessing produces a wrong p_mask, which leaves no admissible answer span and triggers the error message below.

The observed p_mask for a roberta model is
[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]

while it should mask only the question tokens, like this:
[0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, ...]

I think the deeper root cause here is that roberta's token_type_ids returned from encode_plus are now all zeros (introduced in #2432), while the creation of p_mask in squad_convert_example_to_features relies on exactly this information:

# p_mask: mask with 1 for token than cannot be in the answer (0 for token which can be in an answer)
# Original TF implem also keep the classification token (set to 0) (not sure why...)
p_mask = np.array(span["token_type_ids"])
p_mask = np.minimum(p_mask, 1)

if tokenizer.padding_side == "right":
    # Limit positive values to one
    p_mask = 1 - p_mask

p_mask[np.where(np.array(span["input_ids"]) == tokenizer.sep_token_id)[0]] = 1

# Set the CLS index to '0'
p_mask[cls_index] = 0

I haven't checked yet, but this might also affect training/eval if p_mask is used there.
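For illustration, here is a minimal sketch of how a p_mask could be built without relying on token_type_ids, by masking everything up to the first SEP instead. This is a hypothetical helper, not the library's implementation, and it assumes the usual <cls> question <sep> ... context ... layout with the question encoded first:

import numpy as np

def build_p_mask(input_ids, sep_token_id, cls_index=0):
    """Sketch: 1 = token cannot be part of the answer, 0 = token can be."""
    input_ids = np.asarray(input_ids)
    first_sep = int(np.where(input_ids == sep_token_id)[0][0])
    p_mask = np.zeros_like(input_ids)
    p_mask[: first_sep + 1] = 1            # question tokens and the first SEP
    p_mask[input_ids == sep_token_id] = 1  # any remaining SEP tokens
    p_mask[cls_index] = 0                  # keep CLS for the no-answer score
    return p_mask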

How to reproduce?

model_name = "deepset/roberta-base-squad2"
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
res = nlp({
    'question': 'What is roberta?',
    'context': 'Roberta is a language model that was trained for a longer time, on more data, without NSP'
}) 

results in

  File "/home/mp/deepset/dev/transformers/src/transformers/pipelines.py", line 847, in __call__
    for s, e, score in zip(starts, ends, scores)
  File "/home/mp/deepset/dev/transformers/src/transformers/pipelines.py", line 847, in <listcomp>
    for s, e, score in zip(starts, ends, scores)
KeyError: 0
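For reference, the wrong mask can also be inspected directly, without going through the pipeline. This is a hedged sketch using the public squad_convert_examples_to_features API as it exists in transformers 2.x (the argument values are just illustrative):

from transformers import AutoTokenizer
from transformers.data.processors.squad import SquadExample, squad_convert_examples_to_features

tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")
example = SquadExample(
    qas_id="0",
    question_text="What is roberta?",
    context_text="Roberta is a language model that was trained for a longer time, on more data, without NSP",
    answer_text=None,
    start_position_character=None,
    title="",
)
features = squad_convert_examples_to_features(
    examples=[example],
    tokenizer=tokenizer,
    max_seq_length=384,
    doc_stride=128,
    max_query_length=64,
    is_training=False,
)
# On affected versions this shows the all-ones mask described above
print(features[0].p_mask[:20])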

Environment

  • Ubuntu 18.04
  • Python 3.7.6
  • PyTorch 1.3.1
LysandreJik added the Core: Pipeline and Usage labels Feb 10, 2020
LysandreJik self-assigned this Feb 10, 2020
chinisan commented

I think I have a related problem regarding training/evaluation with run_squad.py.

I wanted to train a roberta model on my own Q&A dataset mixed with the SQuAD dataset by running:

python ./examples/run_squad.py \
    --output_dir=/home/jupyter/sec_roberta/roberta-base-mixed-quad \
    --model_type=roberta \
    --model_name_or_path=roberta-large \
    --do_train \
    --train_file=../sec_roberta/financial_and_squad2_train.json \
    --do_eval \
    --predict_file=../sec_roberta/financial_and_squad2_dev.json \
    --learning_rate=1.5e-5 \
    --num_train_epochs=2 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --overwrite_output_dir \
    --per_gpu_train_batch_size=6 \
    --per_gpu_eval_batch_size=6 \
    --warmup_steps 500 \
    --weight_decay 0.01 \
    --version_2_with_negative

I ran into this error:

02/12/2020 08:22:38 - INFO - __main__ -   Creating features from dataset file at .
  0%|          | 0/542 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "./examples/run_squad.py", line 853, in <module>
    main()
  File "./examples/run_squad.py", line 791, in main
    train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False, output_examples=False)
  File "./examples/run_squad.py", line 474, in load_and_cache_examples
    examples = processor.get_train_examples(args.data_dir, filename=args.train_file)
  File "/opt/anaconda3/lib/python3.7/site-packages/transformers/data/processors/squad.py", line 501, in get_train_examples
    return self._create_examples(input_data, "train")
  File "/opt/anaconda3/lib/python3.7/site-packages/transformers/data/processors/squad.py", line 559, in _create_examples
    answers=answers,
  File "/opt/anaconda3/lib/python3.7/site-packages/transformers/data/processors/squad.py", line 633, in __init__
    self.start_position = char_to_word_offset[start_position_character]
IndexError: list index out of range

I tested my dataset on roberta-base and it works, so I don't necessarily think my dataset is the issue.

Also, I ran the same code using the SQuAD 2.0 dataset on roberta-large, and on an LM-finetuned version of roberta-large, and both work, so this is all very mysterious to me.

I thought it could be related.
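(As a side note, and purely as a hedged sanity check: the IndexError above is raised when an answer's answer_start points past the end of its context, so a quick scan of the training file can reveal whether any examples are malformed. The file name below is taken from the command above; adjust the path as needed.)

import json

# Flag answers whose answer_start falls outside the context string; such
# entries make char_to_word_offset[start_position_character] raise IndexError.
with open("../sec_roberta/financial_and_squad2_train.json") as f:
    data = json.load(f)["data"]

for article in data:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            for answer in qa.get("answers", []):
                if answer["answer_start"] >= len(context):
                    print(qa["id"], answer["answer_start"], len(context))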

LysandreJik added the Should Fix label Feb 24, 2020
joshuawagner93 commented Mar 26, 2020

Update: a fresh install of transformers fixed it for me.

I ran into a similar error when trying to use the run_squad.py example to train roberta-large on SQuAD 2.0. When I run

export DATA_DIR=./data
python ./transformers/examples/run_squad.py \
    --model_type roberta \
    --model_name_or_path roberta-large \
    --do_train \
    --do_eval \
    --version_2_with_negative \
    --train_file $DATA_DIR/squad2/train-v2.0.json \
    --predict_file $DATA_DIR/squad2/dev-v2.0.json \
    --per_gpu_eval_batch_size=6 \
    --per_gpu_train_batch_size=6 \
    --learning_rate 3e-5 \
    --num_train_epochs 2.0 \
    --overwrite_output_dir \
    --overwrite_cache \
    --max_seq_length 384 \
    --doc_stride 128 \
    --save_steps 100000 \
    --output_dir ./roberta_squad/

I get the following error:

Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/opt/anaconda3/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/joshua_wagner/.local/lib/python3.7/site-packages/transformers/data/processors/squad.py", line 198, in squad_convert_example_to_features
    p_mask = np.array(span["token_type_ids"])
KeyError: 'token_type_ids'
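For what it's worth, a minimal sketch of a guard around the failing line (this is not the actual patch from #3439; it only falls back to zeros when the tokenizer returns no token_type_ids, which avoids the KeyError but still gives roberta no meaningful question mask):

import numpy as np

def token_type_p_mask(span):
    # Tokenizers such as roberta's may not return token_type_ids; fall back
    # to an all-zero vector of the right length instead of raising KeyError.
    token_type_ids = span.get("token_type_ids", [0] * len(span["input_ids"]))
    return np.minimum(np.array(token_type_ids), 1)

# Example: a roberta-style encoding without token_type_ids
print(token_type_p_mask({"input_ids": [0, 100, 2, 2, 200, 300, 2]}))  # -> all zeros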

Environment:

  • Debian GNU/Linux 9.11
  • Python 3.7
  • PyTorch 1.4.0

borhenryk (Contributor) commented

Same error as @joshuawagner93.

LysandreJik (Member) commented

@joshuawagner93 @HenrykBorzymowski, this issue should have been patched with #3439. Could you install the latest release and let me know if it fixes your issue?

borhenryk (Contributor) commented

@LysandreJik works perfectly fine! Thx

joshuawagner93 commented

@LysandreJik A reinstall fixed the issue, thank you.

tholor (Contributor, Author) commented Apr 27, 2020

@LysandreJik Unfortunately, we still face the same issue when we try to use roberta in the pipeline for inference. #3439 didn't seem to help for this.

LysandreJik (Member) commented Apr 28, 2020

Hi @tholor, indeed, I thought this issue was resolved when it really wasn't. I just opened #4049, which should fix the issue.

tholor (Contributor, Author) commented Apr 28, 2020

Awesome, thanks for working on this @LysandreJik!

LysandreJik (Member) commented

@tholor, the PR should be merged soon, thank you for your patience!

tholor (Contributor, Author) commented May 7, 2020

Great, thank you! Looking forward to it :)
