Add QASPER task #264
Conversation
lm_eval/tasks/qasper.py
Outdated
        A dictionary where keys are the names of submetrics and values are
        whether a higher value of the submetric is better
        """
        return {"f1_un": True, "f1_yn": True, "f1_ab": True, "f1_ex": True}
nit: can you make these the full names? like f1_unanswerable, f1_yesno, etc
Done
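For context, the renamed mapping plausibly reads as follows. This is a sketch, not necessarily the PR's final code: the full names `f1_unanswerable` and `f1_yesno` come from the review comment above, and `f1_abstractive`/`f1_extractive` from the discussion further down.

```python
# Sketch of the higher_is_better mapping after the suggested rename.
# All four keys map to True because a higher F1 is always better.
HIGHER_IS_BETTER = {
    "f1_unanswerable": True,
    "f1_yesno": True,
    "f1_abstractive": True,
    "f1_extractive": True,
}
```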
lm_eval/tasks/qasper.py
Outdated
        return res_dict

    def aggregation(self):
        return {"f1_un": f1_score, "f1_yn": f1_score, "f1_ab": mean, "f1_ex": mean}
double checking, are the latter 2 supposed to be mean, and is yesno supposed to be f1?
f1_abstractive and f1_extractive are themselves already F1 scores, so the original evaluation calculates the mean over F1 scores.
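To make the distinction concrete, here is a minimal sketch of the two aggregation styles. The token-overlap F1 below follows the standard QASPER/SQuAD-style answer metric and is an assumption for illustration, not this PR's exact code.

```python
# Sketch: why abstractive/extractive aggregate by mean while yes/no
# uses a classification F1. The token_f1 helper is an assumed stand-in
# for the standard QASPER answer-overlap metric.
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Abstractive/extractive: every example already yields its own F1
# score, so the corpus-level number is simply the mean of those scores.
scores = [token_f1(p, g) for p, g in [("the cat", "a cat"), ("dog", "dog")]]
mean_f1 = sum(scores) / len(scores)

# Yes/no: every example instead yields a binary (gold, prediction)
# pair, and a single classification F1 is computed over the whole set
# of pairs at aggregation time.
```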
lm_eval/tasks/qasper.py
Outdated
        )

    def doc_to_target(self, doc):
        # this method is invoked by tests only
this is used for fewshot
Got rid of this
lm_eval/tasks/qasper.py
Outdated
    DATASET_NAME = None

    def doc_to_text(self, doc):
        # this method is invoked by tests only
this method is used to create the doc that gets passed to construct_requests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got rid of this
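For readers unfamiliar with the harness's Task interface, the two methods discussed above fit together roughly like this. The roles of `doc_to_text`, `doc_to_target`, and `construct_requests` come from the review comments; the class body and the `title`/`question`/`answer` field names are assumptions for illustration, not the PR's actual implementation.

```python
# Rough sketch of how the lm-evaluation-harness uses these methods,
# per the review thread above. Field names are hypothetical.
class QasperTaskSketch:
    def doc_to_text(self, doc):
        # Builds the prompt for a document; the harness passes this
        # text (plus any few-shot examples) to construct_requests.
        return f"TITLE: {doc['title']}\nQUESTION: {doc['question']}\nANSWER:"

    def doc_to_target(self, doc):
        # Supplies the gold continuation; the harness appends it to
        # doc_to_text's output when assembling few-shot examples.
        return " " + doc["answer"]


# Example: assembling one few-shot demonstration from a document.
doc = {"title": "Some Paper", "question": "Is X evaluated?", "answer": "Yes"}
task = QasperTaskSketch()
demonstration = task.doc_to_text(doc) + task.doc_to_target(doc)
```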
Codecov Report
@@            Coverage Diff             @@
##           master     #264      +/-   ##
==========================================
- Coverage   95.52%   95.37%   -0.16%
==========================================
  Files          46       47       +1
  Lines        3758     3873     +115
==========================================
+ Hits         3590     3694     +104
- Misses        168      179      +11
Continue to review full report at Codecov.
I ran it on gpt3 and got all zeros except for f1_abstractive, which is 0.13. The zeroes don't seem correct.
Needs a double check on why the tasks get an F1 score of 0
Add QASPER task
Closes #184