
Implement the SAT evaluation #27

Closed
2 tasks done
StellaAthena opened this issue Sep 16, 2020 · 4 comments · Fixed by #57, #80 or #82
Assignees
cfoster0 · leogao2
Labels
feature request (A feature that isn't implemented yet.) · good first issue (Good for newcomers)

Comments

StellaAthena (Member) commented Sep 16, 2020

From the GPT-3 paper:

To test GPT-3 on another task that is somewhat unusual relative to the typical distribution of text, we collected a set of 374 “SAT analogy” problems [TLBS03]. Analogies are a style of multiple choice question that constituted a section of the SAT college entrance exam before 2005. A typical example is “audacious is to boldness as (a) sanctimonious is to hypocrisy, (b) anonymous is to identity, (c) remorseful is to misdeed, (d) deleterious is to result, (e) impressionable is to temptation”. The student is expected to choose which of the five word pairs has the same relationship as the original word pair; in this example the answer is “sanctimonious is to hypocrisy”. On this task GPT-3 achieves 65.2% in the few-shot setting, 59.1% in the one-shot setting, and 53.7% in the zero-shot setting, whereas the average score among college applicants was 57% [TL05] (random guessing yields 20%). As shown in Figure 3.12, the results improve with scale, with the full 175 billion model improving by over 10% compared to the 13 billion parameter model.
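For reference, a minimal sketch of how this kind of multiple-choice question is typically scored in a zero-shot LM evaluation: format the stem pair as a context, score each candidate pair as a continuation, and pick the most likely one. The `loglikelihood` helper below is hypothetical, a stand-in for whatever scoring call the harness exposes.

```python
# Minimal sketch of zero-shot analogy scoring. `loglikelihood` is a
# hypothetical stand-in for the harness's model-scoring interface.
def loglikelihood(context: str, continuation: str) -> float:
    """Stand-in: total log-probability of `continuation` given `context`."""
    raise NotImplementedError

def score_analogy(stem, choices):
    """stem: a pair like ("audacious", "boldness");
    choices: five (word, word) pairs. Returns the index of the
    choice whose completion the model finds most likely."""
    context = f"{stem[0]} is to {stem[1]} as"
    scores = [loglikelihood(context, f" {a} is to {b}") for a, b in choices]
    return max(range(len(choices)), key=scores.__getitem__)
```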

- [x] Data processing code implemented
- [x] Evaluation implemented

The evaluation code should be modeled after the interface in lm_eval/base.py and the example of the BoolQ task in lm_eval/tasks/superglue.py.
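A rough skeleton of what that might look like, following the BoolQ pattern. The hook names below assume a `doc_to_text`/`doc_to_target` style interface, so check lm_eval/base.py for the exact methods; the data loader and document schema are left hypothetical.

```python
from lm_eval.base import Task  # the interface the issue points to


class SATAnalogies(Task):
    # Sketch only: hook names assume the base.py interface resembles
    # the BoolQ example in lm_eval/tasks/superglue.py.

    def has_training_docs(self):
        # Turney's set is evaluation-only; no training split.
        return False

    def test_docs(self):
        # Hypothetical loader. Each doc is assumed to look like:
        # {"stem": ("audacious", "boldness"),
        #  "choices": [("sanctimonious", "hypocrisy"), ...],  # 5 pairs
        #  "answer": 0}
        raise NotImplementedError("parse the 374 problems here")

    def doc_to_text(self, doc):
        a, b = doc["stem"]
        return f"{a} is to {b} as"

    def doc_to_target(self, doc):
        c, d = doc["choices"][doc["answer"]]
        return f" {c} is to {d}"
```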

@StellaAthena added the feature request label Sep 16, 2020
cfoster0 (Contributor) commented Oct 1, 2020

Should be available on request from Peter Turney.

https://www.apperceptual.com/

@StellaAthena added the Eval Set label and removed the feature request label Oct 23, 2020
cfoster0 (Contributor) commented

Will post here if/when we get a response.

cfoster0 (Contributor) commented

Got a response. PM me on the Discord if you need access.

@StellaAthena linked a pull request Oct 26, 2020 that will close this issue
@StellaAthena reopened this Nov 18, 2020
@StellaAthena reopened this Jan 5, 2021
@StellaAthena added the feature request and good first issue labels Jan 5, 2021
nicholaskross (Contributor) commented

I could do the eval here.

@StellaAthena linked a pull request Jan 6, 2021 that will close this issue
@StellaAthena linked a pull request Jan 9, 2021 that will close this issue
@leogao2 assigned cfoster0 and unassigned nicholaskross Feb 3, 2021
@leogao2 self-assigned this Feb 3, 2021
StellaAthena added a commit that referenced this issue Apr 29, 2022
qmdnls pushed a commit to qmdnls/lm-evaluation-harness that referenced this issue Aug 17, 2023
LZY-the-boys pushed a commit to LZY-the-boys/lm-evaluation-harness-fast that referenced this issue Sep 12, 2023
lintangsutawika pushed a commit that referenced this issue Jul 8, 2024