
Implement arithmetic evaluations #25

Closed · 2 tasks done
StellaAthena opened this issue Sep 16, 2020 · 7 comments
Labels: feature request (A feature that isn't implemented yet.) · good first issue (Good for newcomers)
@StellaAthena (Member) commented Sep 16, 2020

From the GPT-3 paper:

To test GPT-3’s ability to perform simple arithmetic operations without task-specific training, we developed a small battery of 10 tests that involve asking GPT-3 a simple arithmetic problem in natural language:

- 2 digit addition (2D+) – The model is asked to add two integers sampled uniformly from [0, 100), phrased in the form of a question, e.g. “Q: What is 48 plus 76? A: 124.”
- 2 digit subtraction (2D-) – The model is asked to subtract two integers sampled uniformly from [0, 100); the answer may be negative. Example: “Q: What is 34 minus 53? A: -19”.
- 3 digit addition (3D+) – Same as 2 digit addition, except numbers are uniformly sampled from [0, 1000).
- 3 digit subtraction (3D-) – Same as 2 digit subtraction, except numbers are uniformly sampled from [0, 1000).
- 4 digit addition (4D+) – Same as 3 digit addition, except uniformly sampled from [0, 10000).
- 4 digit subtraction (4D-) – Same as 3 digit subtraction, except uniformly sampled from [0, 10000).
- 5 digit addition (5D+) – Same as 3 digit addition, except uniformly sampled from [0, 100000).
- 5 digit subtraction (5D-) – Same as 3 digit subtraction, except uniformly sampled from [0, 100000).
- 2 digit multiplication (2Dx) – The model is asked to multiply two integers sampled uniformly from [0, 100), e.g. “Q: What is 24 times 42? A: 1008”.
- One-digit composite (1DC) – The model is asked to perform a composite operation on three 1 digit numbers, with parentheses around the last two. For example, “Q: What is 6+(4*8)? A: 38”. The three 1 digit numbers are selected uniformly on [0, 10) and the operations are selected uniformly from {+,-,*}.

In all 10 tasks the model must generate the correct answer exactly. For each task we generate a dataset of 2,000 random instances of the task and evaluate all models on those instances. To spot-check whether the model is simply memorizing specific arithmetic problems, we took the 3-digit arithmetic problems in our test set and searched for them in our training data in both the forms "<NUM1> + <NUM2> =" and "<NUM1> plus <NUM2>". Out of 2,000 addition problems we found only 17 matches (0.8%) and out of 2,000 subtraction problems we found only 2 matches (0.1%), suggesting that only a trivial fraction of the correct answers could have been memorized. In addition, inspection of incorrect answers reveals that the model often makes mistakes such as not carrying a “1”, suggesting it is actually attempting to perform the relevant computation rather than memorizing a table.
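Since every problem is drawn from a simple uniform distribution, the data can be generated on the fly. Here is a minimal sketch of what that generation could look like, matching the prompt format quoted above; the function names (`make_problem`, `make_composite`) and the `context`/`completion` field names are illustrative, not part of the harness:

```python
import random

# Hypothetical generator sketch: one (context, completion) pair per problem,
# mirroring the GPT-3 prompt format "Q: What is X plus Y? A: Z".
OPS = {
    "plus": lambda a, b: a + b,
    "minus": lambda a, b: a - b,
    "times": lambda a, b: a * b,
}

def make_problem(op_name, upper, rng):
    """Sample one problem with both operands uniform on [0, upper)."""
    a, b = rng.randrange(upper), rng.randrange(upper)
    answer = OPS[op_name](a, b)
    return {
        "context": f"Q: What is {a} {op_name} {b}?\nA:",
        "completion": f" {answer}",
    }

def make_composite(rng):
    """1DC: composite op on three 1-digit numbers, parens around the last two."""
    a, b, c = rng.randrange(10), rng.randrange(10), rng.randrange(10)
    op1, op2 = rng.choice("+-*"), rng.choice("+-*")
    expr = f"{a}{op1}({b}{op2}{c})"
    # eval() is safe here: expr only ever contains digits, + - *, and parens.
    return {"context": f"Q: What is {expr}?\nA:", "completion": f" {eval(expr)}"}

rng = random.Random(42)  # fixed seed so the 2,000 instances are reproducible
two_digit_addition = [make_problem("plus", 100, rng) for _ in range(2000)]  # 2D+
one_digit_composite = [make_composite(rng) for _ in range(2000)]            # 1DC
```

The fixed seed matters: every model should be evaluated on the same 2,000 instances, as in the paper.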

- [x] Data processing code implemented
- [x] Evaluation implemented

This should be modeled after the BoolQ task in lm_eval/tasks/superglue.py.
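For whoever picks this up, a very rough sketch of the shape such a task class could take. The base class, import path, and method names below are assumptions patterned on the BoolQ example, not the harness's confirmed interface, so check lm_eval/tasks/superglue.py for the real signatures:

```python
# Sketch only: Task, doc_to_text, doc_to_target, and evaluate are assumed
# from the BoolQ pattern and may not match the harness's actual base class.
from lm_eval.base import Task  # assumed import path

class TwoDigitAddition(Task):
    def validation_docs(self):
        # 2,000 generated instances, as in the GPT-3 paper.
        return two_digit_addition  # from the generation sketch above

    def doc_to_text(self, doc):
        # Prompt up to and including "A:"; the model completes the answer.
        return doc["context"]

    def doc_to_target(self, doc):
        # Gold completion, e.g. " 124".
        return doc["completion"]

    def evaluate(self, docs, lm):
        # The paper scores exact match: the generation must equal the answer.
        correct = sum(
            lm.generate(self.doc_to_text(doc)).strip() == doc["completion"].strip()
            for doc in docs
        )
        return {"acc": correct / len(docs)}
```

One class per task (2D+, 2D-, ..., 1DC) keeps the ten results separate, matching how the paper reports them.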

@StellaAthena StellaAthena added the feature request A feature that isn't implemented yet. label Sep 16, 2020
@StellaAthena StellaAthena added this to To do in Implementing Evaluations via automation Sep 16, 2020
@VitamintK (Collaborator) commented

I'll take this!

@StellaAthena StellaAthena moved this from To do to Review in progress in Implementing Evaluations Sep 20, 2020
@StellaAthena StellaAthena moved this from Review in progress to In progress in Implementing Evaluations Sep 20, 2020
@StellaAthena StellaAthena added Eval Set and removed feature request A feature that isn't implemented yet. labels Oct 23, 2020
@StellaAthena (Member, Author) commented

@VitamintK Hey Kevin, how's this coming? Still planning on doing it?

@VitamintK (Collaborator) commented

I am, but I've been putting it off and haven't started yet.

@StellaAthena (Member, Author) commented

No worries! Just wanted to check in and make sure it was on your radar. Do you think you'll be able to get it done in the next week or two? Finishing the eval dataset implementations is currently the major project blocker; we need to finish this before we can dedupe, and we need to do that before we can start training models.

@VitamintK (Collaborator) commented

Yup, I'll try to get it done soon!

@StellaAthena (Member, Author) commented

@VitamintK What's the status on this? Is it something you've been able to get done? Should I assign it to someone else?

@StellaAthena StellaAthena added feature request A feature that isn't implemented yet. good first issue Good for newcomers labels Jan 5, 2021
@VitamintK (Collaborator) commented

@StellaAthena hey, sorry for not following up with this until now, I am the worst. Finally got around to it, though!

@leogao2 leogao2 added this to To do in Implementing Evaluations via automation Jan 28, 2021
@leogao2 leogao2 moved this from To do to Done in Implementing Evaluations Jan 28, 2021
@leogao2 leogao2 closed this as completed Jan 28, 2021
StellaAthena added a commit that referenced this issue Apr 29, 2022
Adding CrowsPairs task for English and French
qmdnls pushed a commit to qmdnls/lm-evaluation-harness that referenced this issue Aug 17, 2023
Adding CrowsPairs task for English and French
LZY-the-boys pushed a commit to LZY-the-boys/lm-evaluation-harness-fast that referenced this issue Sep 12, 2023
Adding CrowsPairs task for English and French