
Add QASPER task #264

Merged: 11 commits into EleutherAI:master on Feb 22, 2022
Conversation

StephenHogg

Closes #184 (Implement the QASPER evaluation)

```python
def higher_is_better(self):
    """
    A dictionary where keys are the names of submetrics and values are
    whether a higher value of the submetric is better
    """
    return {"f1_un": True, "f1_yn": True, "f1_ab": True, "f1_ex": True}
```
Contributor:

nit: can you make these the full names, like f1_unanswerable, f1_yesno, etc.?

Author:

Done
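For reference, the renamed mapping presumably ended up looking like the sketch below. The post-rename diff isn't shown in this thread, so the exact spellings are assumptions pieced together from the comments (f1_unanswerable and f1_yesno from the nit above, f1_abstractive and f1_extractive from the discussion further down):

```python
def higher_is_better(self):
    """
    A dictionary where keys are the names of submetrics and values are
    whether a higher value of the submetric is better.
    """
    # Assumed final spellings after the rename suggested above
    return {
        "f1_unanswerable": True,
        "f1_yesno": True,
        "f1_abstractive": True,
        "f1_extractive": True,
    }
```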

```python
    return res_dict

def aggregation(self):
    return {"f1_un": f1_score, "f1_yn": f1_score, "f1_ab": mean, "f1_ex": mean}
```
Contributor:

Double-checking: are the latter two supposed to be mean, and is yesno supposed to be F1?

Author:

f1_abstractive and f1_extractive are themselves already F1 scores, so the original evaluation calculates the mean over F1 scores.
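A minimal sketch of why the two aggregation functions differ (the harness's actual implementations live in lm_eval/metrics.py and may differ in detail): mean averages values that are already scores, while f1_score consumes raw (gold, prediction) pairs.

```python
import sklearn.metrics

def mean(arr):
    # Averages values that are already per-document scores
    # (e.g. a per-question token-overlap F1 for abstractive answers).
    return sum(arr) / len(arr)

def f1_score(items):
    # Receives (gold, prediction) label pairs accumulated over the whole
    # eval set and computes a single corpus-level F1 across them.
    golds, preds = zip(*items)
    return sklearn.metrics.f1_score(golds, preds)
```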

```python
    )

def doc_to_target(self, doc):
    # this method is invoked by tests only
```
Contributor:

This is used for fewshot prompting, not only by tests.

Author:

Got rid of this
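For context on the fewshot point: when the harness builds a k-shot prompt, each in-context example is the concatenation of doc_to_text(doc) and doc_to_target(doc). A rough sketch of that assembly, using a hypothetical helper name (the real logic lives in the base Task's fewshot_context):

```python
def fewshot_context_sketch(task, doc, fewshot_docs):
    # Each in-context example pairs the rendered prompt with its gold
    # target, mirroring how the harness assembles k-shot contexts.
    shots = "\n\n".join(
        task.doc_to_text(d) + task.doc_to_target(d) for d in fewshot_docs
    )
    # The evaluated doc contributes only its prompt text; the model is
    # asked to generate the target itself.
    return shots + "\n\n" + task.doc_to_text(doc)
```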

```python
DATASET_NAME = None

def doc_to_text(self, doc):
    # this method is invoked by tests only
```
Contributor:

This method is used to create the doc that gets passed to construct_requests.

Author:

Got rid of this
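To make the contract concrete: doc_to_text renders a dataset record into the prompt string that construct_requests later receives as context. A hypothetical QASPER-style rendering; the field names here are illustrative, not the PR's actual keys:

```python
def doc_to_text(self, doc):
    # Hypothetical fields; the real QASPER record layout may differ.
    return (
        f"TITLE: {doc['title']}\n"
        f"ABSTRACT: {doc['abstract']}\n\n"
        f"Q: {doc['question']}\n\n"
        "A:"
    )
```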


codecov bot commented Feb 12, 2022

Codecov Report

Merging #264 (96f3e5b) into master (05590e1) will decrease coverage by 0.15%.
The diff coverage is 90.43%.

❗ Current head 96f3e5b differs from pull request most recent head 815f165. Consider uploading reports for the commit 815f165 to get more accurate results.


```diff
@@            Coverage Diff             @@
##           master     #264      +/-   ##
==========================================
- Coverage   95.52%   95.37%   -0.16%
==========================================
  Files          46       47       +1
  Lines        3758     3873     +115
==========================================
+ Hits         3590     3694     +104
- Misses        168      179      +11
```

Impacted Files             Coverage Δ
lm_eval/tasks/qasper.py    90.35% <90.35%> (ø)
lm_eval/tasks/__init__.py  88.57% <100.00%> (+0.16%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 05590e1...815f165.

@leogao2 (Contributor) commented Feb 13, 2022

I ran it on GPT-3 and got all zeros except for f1_abstractive, which is 0.13. The zeros don't seem correct.
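Worth noting while debugging: QASPER's answer metric is a SQuAD-style token-overlap F1, so an all-zero score typically means the generated continuations share no tokens with the gold answers, which often points to a prompt-format or stop-sequence problem rather than a genuinely incapable model. A sketch of that metric, assuming whitespace tokenization:

```python
from collections import Counter

def token_f1(prediction, gold):
    # SQuAD-style overlap F1 between whitespace-tokenized strings.
    pred_toks, gold_toks = prediction.split(), gold.split()
    common = Counter(pred_toks) & Counter(gold_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0  # no shared tokens -> F1 of exactly zero
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```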

@leogao2 (Contributor) left a review comment:

Needs a double-check on why the tasks have an F1 score of 0.

@leogao2 leogao2 merged commit 3c37ea9 into EleutherAI:master Feb 22, 2022
qmdnls pushed a commit to qmdnls/lm-evaluation-harness that referenced this pull request Aug 17, 2023
LZY-the-boys pushed a commit to LZY-the-boys/lm-evaluation-harness-fast that referenced this pull request Sep 12, 2023