Implement PTB evaluation #135

Closed
wants to merge 6 commits into from

EricHallahan

Implement the Penn Treebank evaluation as described in #5

@codecov

codecov bot commented Feb 8, 2021

Codecov Report

Merging #135 (69ec7a8) into master (2b8956b) will decrease coverage by 8.37%.
The diff coverage is 100.00%.

❗ Current head 69ec7a8 differs from pull request most recent head b667da9. Consider uploading reports for the commit b667da9 to get more accurate results

@@            Coverage Diff             @@
##           master     #135      +/-   ##
==========================================
- Coverage   83.85%   75.47%   -8.38%     
==========================================
  Files          43       34       -9     
  Lines        3048     2035    -1013     
==========================================
- Hits         2556     1536    -1020     
- Misses        492      499       +7     
| Impacted Files | Coverage Δ | |
|---|---|---|
| lm_eval/tasks/__init__.py | 100.00% <100.00%> (+8.16%) | ⬆️ |
| lm_eval/tasks/ptb.py | 100.00% <100.00%> (ø) | |
| lm_eval/tasks/drop.py | 0.00% <0.00%> (-91.61%) | ⬇️ |
| lm_eval/tasks/coqa.py | 0.00% <0.00%> (-88.51%) | ⬇️ |
| lm_eval/tasks/openbookqa.py | 42.85% <0.00%> (-57.15%) | ⬇️ |
| lm_eval/tasks/squad.py | 52.63% <0.00%> (-47.37%) | ⬇️ |
| lm_eval/utils.py | 59.25% <0.00%> (-22.71%) | ⬇️ |
| lm_eval/models/dummy.py | 60.00% <0.00%> (-13.69%) | ⬇️ |
| lm_eval/tasks/superglue.py | 88.18% <0.00%> (-11.82%) | ⬇️ |
| lm_eval/tasks/pubmedqa.py | 90.47% <0.00%> (-6.83%) | ⬇️ |
| ... and 34 more | | |

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9c4967b...b667da9. Read the comment docs.

EricHallahan and others added 3 commits February 8, 2021 20:57
It now takes loglikelihood of full sentence rather than just last word.
Length normalizing is done using the *original*, pre-detokenization word count.
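
The two commit messages above describe scoring the whole sentence and dividing by the original word count. A minimal sketch of that arithmetic, assuming per-token log-probabilities have already been obtained from some model. The helper names `word_perplexity` and `count_original_words` are hypothetical and are not taken from `lm_eval`, and the sample numbers are made up for illustration:

```python
import math
from typing import Sequence


def count_original_words(raw_line: str) -> int:
    """Count whitespace-separated words in the raw (pre-detokenization) line.

    Hypothetical helper: assumes the raw PTB line is space-delimited.
    """
    return len(raw_line.split())


def word_perplexity(token_logprobs: Sequence[float], original_word_count: int) -> float:
    """Word-level perplexity from a full-sentence log-likelihood.

    The log-likelihood is the sum over *all* tokens of the sentence (not only
    the last word), and it is normalized by the original word count rather
    than by the number of model tokens.
    """
    if original_word_count <= 0:
        raise ValueError("original_word_count must be positive")
    total_loglikelihood = sum(token_logprobs)
    return math.exp(-total_loglikelihood / original_word_count)


# Made-up example: 6 original words that a subword tokenizer split into 8 tokens.
raw_line = "the cat sat on the mat"
token_logprobs = [-2.0, -3.5, -1.0, -0.5, -4.0, -2.5, -1.5, -3.0]  # illustrative values only

ppl = word_perplexity(token_logprobs, count_original_words(raw_line))
print(f"word-level perplexity: {ppl:.2f}")
```

Dividing by the number of model tokens instead of the original word count would give a different (typically lower) figure, so the choice of denominator matters when comparing against published word-level results.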
@leogao2
Contributor

leogao2 commented Feb 12, 2021

Blocked on figuring out why the heck our score is an order of magnitude worse than the value for the 117M model in the GPT-2 paper.

@StellaAthena StellaAthena linked an issue Feb 18, 2021 that may be closed by this pull request
@leogao2 leogao2 closed this Jun 10, 2021
Successfully merging this pull request may close these issues.

Implement the Penn Tree Bank evaluation