
Implement the symbolic manipulations evaluation #26

Closed
2 tasks done
StellaAthena opened this issue Sep 16, 2020 · 9 comments

Comments

@StellaAthena
Member

StellaAthena commented Sep 16, 2020

From the GPT-3 paper:

To test GPT-3’s ability to learn novel symbolic manipulations from a few examples, we designed a small battery of 5 “character manipulation” tasks. Each task involves giving the model a word distorted by some combination of scrambling, addition, or deletion of characters, and asking it to recover the original word. The 5 tasks are:
• Cycle letters in word (CL) – The model is given a word with its letters cycled, then the “=” symbol, and is expected to generate the original word. For example, it might be given “lyinevitab” and should output “inevitably”.
• Anagrams of all but first and last characters (A1) – The model is given a word where every letter except the first and last have been scrambled randomly, and must output the original word. Example: criroptuon = corruption.
• Anagrams of all but first and last 2 characters (A2) – The model is given a word where every letter except the first 2 and last 2 have been scrambled randomly, and must recover the original word. Example: opoepnnt → opponent.
• Random insertion in word (RI) – A random punctuation or space character is inserted between each letter of a word, and the model must output the original word. Example: s.u!c/c!e.s s i/o/n = succession.
• Reversed words (RW) – The model is given a word spelled backwards, and must output the original word. Example: stcejbo → objects.
For each task we generate 10,000 examples, which we chose to be the top 10,000 most frequent words as measured by [Nor09] of length more than 4 characters and less than 15 characters. The few-shot results are shown in Figure 3.11. Task performance tends to grow smoothly with model size, with the full GPT-3 model achieving 66.9% on removing random insertions, 38.6% on cycling letters, 40.2% on the easier anagram task, and 15.1% on the more difficult anagram task (where only the first and last letters are held fixed). None of the models can reverse the letters in a word. In the one-shot setting, performance is significantly weaker (dropping by half or more), and in the zero-shot setting the model can rarely perform any of the tasks (Table 3.10). This suggests that the model really does appear to learn these tasks at test time, as the model cannot perform them zero-shot and their artificial nature makes them unlikely to appear in the pre-training data (although we cannot confirm this with certainty).
We can further quantify performance by plotting “in-context learning curves”, which show task performance as a function of the number of in-context examples. We show in-context learning curves for the Symbol Insertion task in Figure 1.2. We can see that larger models are able to make increasingly effective use of in-context information, including both task examples and natural language task descriptions.
Finally, it is worth adding that solving these tasks requires character-level manipulations, whereas our BPE encoding operates on significant fractions of a word (on average ∼ 0.7 words per token), so from the LM’s perspective succeeding at these tasks involves not just manipulating BPE tokens but understanding and pulling apart their substructure. Also, CL, A1, and A2 are not bijective (that is, the unscrambled word is not a deterministic function of the scrambled word), requiring the model to perform some search to find the correct unscrambling. Thus, the skills involved appear to require non-trivial pattern-matching and computation.

This is a task where we need to create a custom dataset for evaluation.

  • Data processing code implemented
  • Evaluation implemented
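
For reference, a minimal sketch of what the data-processing side could look like (plain Python; it assumes a local copy of Norvig's count_1w.txt with one word<TAB>count pair per line, and all helper names here are illustrative, not part of the harness):

```python
import random

random.seed(0)  # make the scrambling reproducible

def load_words(path="count_1w.txt", n=10_000):
    """Top-n most frequent words of length 5..14, read from [Nor09]'s
    count_1w.txt (assumed layout: `word<TAB>count`, sorted by frequency)."""
    words = []
    with open(path) as f:
        for line in f:
            word = line.split("\t")[0].strip()
            if 4 < len(word) < 15:
                words.append(word)
            if len(words) == n:
                break
    return words

def cycle_letters(word):          # CL: rotate the word by a random offset
    k = random.randrange(1, len(word))
    return word[k:] + word[:k]

def anagram_inner(word, keep):    # A1 (keep=1) / A2 (keep=2): shuffle the middle
    head, mid, tail = word[:keep], list(word[keep:-keep]), word[-keep:]
    random.shuffle(mid)
    return head + "".join(mid) + tail

def random_insertion(word):       # RI: punctuation/space between each letter
    fillers = list(" .,;!?/-'\"")
    return "".join(c + random.choice(fillers) for c in word[:-1]) + word[-1]

def reverse_word(word):           # RW: spell the word backwards
    return word[::-1]

if __name__ == "__main__":
    words = load_words()
    # One (scrambled, original) pair per word per task, e.g. for CL:
    cl_examples = [(cycle_letters(w), w) for w in words]
    print(cl_examples[:3])
```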

The evaluation code should be modeled after the interface in lm_eval/base.py and the example of the BoolQ task in lm_eval/tasks/superglue.py.
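
The scoring itself is just few-shot prompting plus exact-match accuracy; a rough sketch follows (this is not the lm_eval/base.py interface itself — the prompt wording and the `model_complete` hook below are placeholder assumptions):

```python
def build_prompt(fewshot_pairs, query):
    """Few-shot prompt in the `scrambled = original` style used by the GPT-3
    paper's symbol-manipulation tasks (exact Appendix G wording not reproduced)."""
    shots = "\n".join(f"{src} = {tgt}" for src, tgt in fewshot_pairs)
    return f"{shots}\n{query} ="

def exact_match_accuracy(model_complete, fewshot_pairs, eval_pairs):
    """`model_complete(prompt) -> str` stands in for whatever LM call the
    harness exposes; scoring is plain exact match on the first output line."""
    correct = 0
    for src, tgt in eval_pairs:
        completion = model_complete(build_prompt(fewshot_pairs, src)).strip()
        pred = completion.splitlines()[0].strip() if completion else ""
        correct += int(pred == tgt)
    return correct / len(eval_pairs)
```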

StellaAthena added the feature request label Sep 16, 2020
StellaAthena added this to To do in Implementing Evaluations via automation Sep 16, 2020
@nicholaskross
Contributor

I'd like to do this!
I am pretty new to all of this, but let me see if I understand this correctly:

  1. Take the tests described above.
  2. Convert them into tasks we could feed to GPT.
  3. Put those into whatever format we're using for our tests.
  4. PR those into this and close the issue.

Are there any specific tools I should use, or will Python suffice?

@StellaAthena
Member Author

StellaAthena commented Oct 10, 2020

@nicholaskross that’s correct! The text above is the description of the task as it is presented in the GPT-3 paper, “Language Models are Few-Shot Learners.” If you look at the existing tests in this repo you can use them as a template. For this task there isn’t any external data – you should randomly generate a dataset that meets the specifications. And yes, everything should be in Python.

Welcome to the team :) Let me know if there are any questions I can answer. And feel free to ask anything in the #lm-thunderdome channel on Discord.

@nicholaskross
Contributor

Thanks! Will do

StellaAthena added the Eval Set label and removed the feature request label Oct 23, 2020
@StellaAthena
Member Author

@nicholaskross Hey, wanted to ping you and check in. How is it coming?

@nicholaskross
Contributor

Ah, sorry I haven't made much progress yet! I was busy with schoolwork etc. Hoping to get more energy soon...
I looked at the other tests in this repo, but I wasn't sure I found what the test format should look like (a lot were links to that one site with NLP things). What meta-format should GPT tests look like here?

@anishthite
Member

Not sure what you mean by meta-format, but sample texts can be found in Appendix G of the GPT-3 paper (https://arxiv.org/pdf/2005.14165.pdf). For example, on page 55 you can see what a CL task doc looks like.

@nicholaskross
Contributor

> Not sure what you mean by meta-format, but sample texts can be found in Appendix G of the GPT-3 paper (https://arxiv.org/pdf/2005.14165.pdf). For example, on page 55 you can see what a CL task doc looks like.

Ah, okay thanks! I'll have it output tests in that format, then (txt files until/unless we use a different one).

@nicholaskross
Contributor

@StellaAthena I have code that can create the dataset of common words (from [Nor09]) and the transformed words for the tasks.
What class should my sym_manip.py extend, in /lm_eval? And how should I put the tasks in a usable form after downloading the [Nor09] words cited above (count_1w.txt)?

@StellaAthena
Member Author

> @StellaAthena I have code that can create the dataset of common words (from [Nor09]) and the transformed words for the tasks.
> What class should my sym_manip.py extend, in /lm_eval? And how should I put the tasks in a usable form after downloading the [Nor09] words cited above (count_1w.txt)?

Nicholas,

Sorry about disappearing on you; I got distracted by other things. Are you still interested in doing this?

StellaAthena added the feature request and good first issue labels Jan 5, 2021
jon-tow self-assigned this Feb 26, 2021
jon-tow closed this as completed Feb 26, 2021
jon-tow added this to Done, evaluations in Implementing Evaluations Feb 26, 2021
StellaAthena added a commit that referenced this issue Apr 29, 2022
qmdnls pushed a commit to qmdnls/lm-evaluation-harness that referenced this issue Aug 17, 2023
LZY-the-boys pushed a commit to LZY-the-boys/lm-evaluation-harness-fast that referenced this issue Sep 12, 2023