Add QASPER task #264
Conversation
lm_eval/tasks/qasper.py
Outdated
        A dictionary where keys are the names of submetrics and values are
        whether a higher value of the submetric is better
        """
        return {"f1_un": True, "f1_yn": True, "f1_ab": True, "f1_ex": True}
nit: can you make these the full names? like f1_unanswerable, f1_yesno, etc
Done
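For context, the renamed mapping plausibly reads as follows. This is a sketch, not necessarily the PR's final code: the full names `f1_unanswerable` and `f1_yesno` come from the review comment above, and `f1_abstractive`/`f1_extractive` from the discussion further down.

```python
# Sketch of the higher_is_better mapping after the suggested rename.
# All four keys map to True because a higher F1 is always better.
HIGHER_IS_BETTER = {
    "f1_unanswerable": True,
    "f1_yesno": True,
    "f1_abstractive": True,
    "f1_extractive": True,
}
```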
lm_eval/tasks/qasper.py
Outdated
        return res_dict

    def aggregation(self):
        return {"f1_un": f1_score, "f1_yn": f1_score, "f1_ab": mean, "f1_ex": mean}
double checking, are the latter 2 supposed to be mean, and is yesno supposed to be f1?
f1_abstractive and f1_extractive are themselves already F1 scores, so the original evaluation calculates the mean over F1 scores.
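To make the distinction concrete, here is a minimal sketch of the two aggregation styles. The token-overlap F1 below follows the standard QASPER/SQuAD-style answer metric and is an assumption for illustration, not this PR's exact code.

```python
# Sketch: why abstractive/extractive aggregate by mean while yes/no
# uses a classification F1. The token_f1 helper is an assumed stand-in
# for the standard QASPER answer-overlap metric.
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Abstractive/extractive: every example already yields its own F1
# score, so the corpus-level number is simply the mean of those scores.
scores = [token_f1(p, g) for p, g in [("the cat", "a cat"), ("dog", "dog")]]
mean_f1 = sum(scores) / len(scores)

# Yes/no: every example instead yields a binary (gold, prediction)
# pair, and a single classification F1 is computed over the whole set
# of pairs at aggregation time.
```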
lm_eval/tasks/qasper.py
Outdated
        )

    def doc_to_target(self, doc):
        # this method is invoked by tests only
this is used for fewshot
Got rid of this
lm_eval/tasks/qasper.py
Outdated
    DATASET_NAME = None

    def doc_to_text(self, doc):
        # this method is invoked by tests only
this method is used to create the doc that gets passed to construct_requests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got rid of this
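For readers unfamiliar with the harness's Task interface, the two methods discussed above fit together roughly like this. The roles of `doc_to_text`, `doc_to_target`, and `construct_requests` come from the review comments; the class body and the `title`/`question`/`answer` field names are assumptions for illustration, not the PR's actual implementation.

```python
# Rough sketch of how the lm-evaluation-harness uses these methods,
# per the review thread above. Field names are hypothetical.
class QasperTaskSketch:
    def doc_to_text(self, doc):
        # Builds the prompt for a document; the harness passes this
        # text (plus any few-shot examples) to construct_requests.
        return f"TITLE: {doc['title']}\nQUESTION: {doc['question']}\nANSWER:"

    def doc_to_target(self, doc):
        # Supplies the gold continuation; the harness appends it to
        # doc_to_text's output when assembling few-shot examples.
        return " " + doc["answer"]


# Example: assembling one few-shot demonstration from a document.
doc = {"title": "Some Paper", "question": "Is X evaluated?", "answer": "Yes"}
task = QasperTaskSketch()
demonstration = task.doc_to_text(doc) + task.doc_to_target(doc)
```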
Codecov Report
@@            Coverage Diff             @@
##           master     #264      +/-   ##
==========================================
- Coverage   95.52%   95.37%   -0.16%
==========================================
  Files          46       47       +1
  Lines        3758     3873     +115
==========================================
+ Hits         3590     3694     +104
- Misses        168      179      +11
Continue to review full report at Codecov.
I ran it on gpt3 and got all zeros except for f1_abstractive, which is 0.13. The zeroes don't seem correct.
Needs a double check on why the tasks get an F1 score of 0
Add QASPER task
Closes #184