SQuAD preprocessing not working for roberta (wrong p_mask) #2788
Comments
I think I have a related problem with training/evaluation using run_squad.py. I wanted to train a roberta model on my own Q&A dataset mixed with the SQuAD dataset by running:
I ran into this error:
I tested my dataset on roberta-base and it works, so I don't think my dataset is the issue. Also, I ran the same code with the SQuAD 2.0 dataset on roberta-large, and on an lm-finetuned version of roberta-large, and both work, so this is all very mysterious to me. I thought it could be related.
Update: a fresh install of transformers fixed it for me... I get the following error:
Environment:
Same error as @joshuawagner93.
@joshuawagner93 @HenrykBorzymowski, this issue should have been patched with #3439. Could you install the latest release and let me know if it fixes your issue?
@LysandreJik works perfectly fine! Thanks!
@LysandreJik reinstall fixed the issue, thank you.
@LysandreJik Unfortunately, we still face the same issue when we try to use roberta in the pipeline for inference. #3439 didn't seem to help for this.
Awesome, thanks for working on this @LysandreJik!
@tholor, the PR should be merged soon, thank you for your patience!
Great, thank you! Looking forward to it :)
Description
The pipeline for QA crashes for roberta models.
It's loading the model and tokenizer correctly, but the SQuAD preprocessing produces a wrong `p_mask`, leading to no possible prediction and the error message below. The observed `p_mask` for a roberta model is
`[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]`
while it should only mask the question tokens, like this:
`[0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, ...]`
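As an aside, a mask with the intended shape can be computed without `token_type_ids` at all, e.g. from per-token sequence membership. This is only a minimal sketch, not the library's actual code or the eventual patch; the helper name is made up (fast tokenizers do expose sequence membership via `BatchEncoding.sequence_ids()`):

```python
def p_mask_from_sequence_ids(sequence_ids, cls_index=0):
    """Build a p_mask from sequence membership instead of token_type_ids.

    sequence_ids: list with None for special tokens, 0 for question
    tokens, 1 for context tokens (the convention used by fast
    tokenizers' sequence_ids()). Context tokens stay available (0);
    question and special tokens are masked (1).
    """
    p_mask = [0 if s == 1 else 1 for s in sequence_ids]
    p_mask[cls_index] = 0  # keep CLS available for "no answer"
    return p_mask

# <s> question tokens </s> </s> context tokens </s>  (roberta-style)
ids = [None, 0, 0, 0, None, None, 1, 1, 1, 1, None]
print(p_mask_from_sequence_ids(ids))
# [0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1]
```

Because this derivation never looks at `token_type_ids`, it gives the same (correct) mask for roberta as for BERT-style models.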
I think the deeper root cause here is that roberta's `token_type_ids` returned from `encode_plus` are now all zeros (introduced in #2432), and the creation of `p_mask` in `squad_convert_example_to_features` relies on this information:
transformers/src/transformers/data/processors/squad.py
Lines 189 to 202 in 520e7f2
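The referenced snippet is not reproduced here, but the failure mode can be illustrated with a small standalone sketch. The helper below only mimics masking that treats `token_type_ids == 0` as "question" (not usable for answers); it is not the library code:

```python
def p_mask_from_token_type_ids(token_type_ids, cls_index=0):
    """Mimic p_mask creation that trusts token_type_ids:
    segment 0 (question) is masked out (1), segment 1 (context)
    stays available (0), and CLS is kept for "no answer"."""
    p_mask = [1 if t == 0 else 0 for t in token_type_ids]
    p_mask[cls_index] = 0
    return p_mask

# BERT-style ids: question segment is 0, context segment is 1.
bert_ids = [0, 0, 0, 0, 1, 1, 1, 1]
print(p_mask_from_token_type_ids(bert_ids))     # [0, 1, 1, 1, 0, 0, 0, 0]

# RoBERTa after #2432: encode_plus returns all zeros, so every context
# token gets masked and only CLS survives -> no span can be predicted.
roberta_ids = [0, 0, 0, 0, 0, 0, 0, 0]
print(p_mask_from_token_type_ids(roberta_ids))  # [0, 1, 1, 1, 1, 1, 1, 1]
```

The second output has exactly the shape of the observed mask above: everything except position 0 is forbidden, so the pipeline has no valid answer span left.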
Haven't checked yet, but this might also affect training/eval if `p_mask` is used there.

How to reproduce?
results in
Environment