
Implement the symbolic manipulations evaluation #26

Closed
2 tasks done
StellaAthena opened this issue Sep 16, 2020 · 9 comments

Comments

@StellaAthena
Member

StellaAthena commented Sep 16, 2020

From the GPT-3 paper:

To test GPT-3’s ability to learn novel symbolic manipulations from a few examples, we designed a small battery of 5 “character manipulation” tasks. Each task involves giving the model a word distorted by some combination of scrambling, addition, or deletion of characters, and asking it to recover the original word. The 5 tasks are:
• Cycle letters in word (CL) – The model is given a word with its letters cycled, then the “=” symbol, and is expected to generate the original word. For example, it might be given “lyinevitab” and should output “inevitably”.
• Anagrams of all but first and last characters (A1) – The model is given a word where every letter except the first and last have been scrambled randomly, and must output the original word. Example: criroptuon = corruption.
• Anagrams of all but first and last 2 characters (A2) – The model is given a word where every letter except the first 2 and last 2 have been scrambled randomly, and must recover the original word. Example: opoepnnt → opponent.
• Random insertion in word (RI) – A random punctuation or space character is inserted between each letter of a word, and the model must output the original word. Example: s.u!c/c!e.s s i/o/n = succession.
• Reversed words (RW) – The model is given a word spelled backwards, and must output the original word. Example: stcejbo → objects.
For each task we generate 10,000 examples, which we chose to be the top 10,000 most frequent words as measured by [Nor09] of length more than 4 characters and less than 15 characters. The few-shot results are shown in Figure 3.11. Task performance tends to grow smoothly with model size, with the full GPT-3 model achieving 66.9% on removing random insertions, 38.6% on cycling letters, 40.2% on the easier anagram task, and 15.1% on the more difficult anagram task (where only the first and last letters are held fixed). None of the models can reverse the letters in a word. In the one-shot setting, performance is significantly weaker (dropping by half or more), and in the zero-shot setting the model can rarely perform any of the tasks (Table 3.10). This suggests that the model really does appear to learn these tasks at test time, as the model cannot perform them zero-shot and their artificial nature makes them unlikely to appear in the pre-training data (although we cannot confirm this with certainty).
We can further quantify performance by plotting “in-context learning curves”, which show task performance as a function of the number of in-context examples. We show in-context learning curves for the Symbol Insertion task in Figure 1.2. We can see that larger models are able to make increasingly effective use of in-context information, including both task examples and natural language task descriptions.
Finally, it is worth adding that solving these tasks requires character-level manipulations, whereas our BPE encoding operates on significant fractions of a word (on average ∼ 0.7 words per token), so from the LM’s perspective succeeding at these tasks involves not just manipulating BPE tokens but understanding and pulling apart their substructure. Also, CL, A1, and A2 are not bijective (that is, the unscrambled word is not a deterministic function of the scrambled word), requiring the model to perform some search to find the correct unscrambling. Thus, the skills involved appear to require non-trivial pattern-matching and computation.

This is a task where we need to create a custom dataset for evaluation.

  • Data processing code implemented
  • Evaluation implemented
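
For reference, a minimal sketch of what the data-processing side could look like (plain Python; it assumes a local copy of Norvig's count_1w.txt with one word<TAB>count pair per line, and all helper names here are illustrative, not part of the harness):

```python
import random

random.seed(0)  # make the scrambling reproducible

def load_words(path="count_1w.txt", n=10_000):
    """Top-n most frequent words of length 5..14, read from [Nor09]'s
    count_1w.txt (assumed layout: `word<TAB>count`, sorted by frequency)."""
    words = []
    with open(path) as f:
        for line in f:
            word = line.split("\t")[0].strip()
            if 4 < len(word) < 15:
                words.append(word)
            if len(words) == n:
                break
    return words

def cycle_letters(word):          # CL: rotate the word by a random offset
    k = random.randrange(1, len(word))
    return word[k:] + word[:k]

def anagram_inner(word, keep):    # A1 (keep=1) / A2 (keep=2): shuffle the middle
    head, mid, tail = word[:keep], list(word[keep:-keep]), word[-keep:]
    random.shuffle(mid)
    return head + "".join(mid) + tail

def random_insertion(word):       # RI: punctuation/space between each letter
    fillers = list(" .,;!?/-'\"")
    return "".join(c + random.choice(fillers) for c in word[:-1]) + word[-1]

def reverse_word(word):           # RW: spell the word backwards
    return word[::-1]

if __name__ == "__main__":
    words = load_words()
    # One (scrambled, original) pair per word per task, e.g. for CL:
    cl_examples = [(cycle_letters(w), w) for w in words]
    print(cl_examples[:3])
```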

The evaluation code should be modeled after the interface in lm_eval/base.py and the example of the BoolQ task in lm_eval/tasks/superglue.py.
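
The scoring itself is just few-shot prompting plus exact-match accuracy; a rough sketch follows (this is not the lm_eval/base.py interface itself — the prompt wording and the `model_complete` hook below are placeholder assumptions):

```python
def build_prompt(fewshot_pairs, query):
    """Few-shot prompt in the `scrambled = original` style used by the GPT-3
    paper's symbol-manipulation tasks (exact Appendix G wording not reproduced)."""
    shots = "\n".join(f"{src} = {tgt}" for src, tgt in fewshot_pairs)
    return f"{shots}\n{query} ="

def exact_match_accuracy(model_complete, fewshot_pairs, eval_pairs):
    """`model_complete(prompt) -> str` stands in for whatever LM call the
    harness exposes; scoring is plain exact match on the first output line."""
    correct = 0
    for src, tgt in eval_pairs:
        completion = model_complete(build_prompt(fewshot_pairs, src)).strip()
        pred = completion.splitlines()[0].strip() if completion else ""
        correct += int(pred == tgt)
    return correct / len(eval_pairs)
```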

StellaAthena added the feature request label Sep 16, 2020
StellaAthena added this to To do in Implementing Evaluations via automation Sep 16, 2020
@nicholaskross
Contributor

I'd like to do this!
I am pretty new to all of this, but let me see if I understand this correctly:

  1. Take the tests described above.
  2. Convert them into tasks we could feed to GPT.
  3. Put those into whatever format we're using for our tests.
  4. PR those into this and close the issue.

Are there any specific tools I should use, or will Python suffice?

@StellaAthena
Member Author

StellaAthena commented Oct 10, 2020

@nicholaskross that’s correct! The text above is the description of the task as it is presented in the GPT-3 paper, “Language Models are Few-Shot Learners.” If you look at the existing tests in this repo you can use them as a template. For this task there isn’t any external data – you should randomly generate a dataset that meets the specifications. And yes, everything should be in Python.

Welcome to the team :) Let me know if there are any questions I can answer. And feel free to ask anything in the #lm-thunderdome channel on Discord.

@nicholaskross
Contributor

Thanks! Will do

StellaAthena added the Eval Set label and removed the feature request label Oct 23, 2020
@StellaAthena
Member Author

@nicholaskross Hey, wanted to ping you and check in. How is it coming?

@nicholaskross
Contributor

Ah, sorry I haven't made much progress yet! I was busy with schoolwork etc. Hoping to get more energy soon...
I looked at the other tests in this repo, but I wasn't sure I found what the test format should look like (a lot were links to that one site with NLP things). What meta-format should GPT tests look like here?

@anishthite
Member

Not sure what you mean by meta-format, but sample texts can be found in Appendix G of the GPT-3 paper (https://arxiv.org/pdf/2005.14165.pdf). For example, on page 55 you can see what a CL task doc looks like.

@nicholaskross
Contributor

> Not sure what you mean by meta-format, but sample texts can be found in Appendix G of the GPT-3 paper (https://arxiv.org/pdf/2005.14165.pdf). For example, on page 55 you can see what a CL task doc looks like.

Ah, okay thanks! I'll have it output tests in that format, then (txt files until/unless we use a different one).

@nicholaskross
Contributor

@StellaAthena I have code that can create the dataset of common words (from [Nor09]) and the transformed words for the tasks.
What class should my sym_manip.py extend, in /lm_eval? And how should I put the tasks in a usable form after downloading the [Nor09] words cited above (count_1w.txt)?

@StellaAthena
Member Author

> @StellaAthena I have code that can create the dataset of common words (from [Nor09]) and the transformed words for the tasks.
> What class should my sym_manip.py extend, in /lm_eval? And how should I put the tasks in a usable form after downloading the [Nor09] words cited above (count_1w.txt)?

Nicholas,

Sorry about disappearing on you; I got distracted by other things. Are you still interested in doing this?

StellaAthena added the feature request and good first issue labels Jan 5, 2021
jon-tow self-assigned this Feb 26, 2021
jon-tow closed this as completed Feb 26, 2021
jon-tow added this to Done, evaluations in Implementing Evaluations Feb 26, 2021
StellaAthena added a commit that referenced this issue Apr 29, 2022
qmdnls pushed a commit to qmdnls/lm-evaluation-harness that referenced this issue Aug 17, 2023
LZY-the-boys pushed a commit to LZY-the-boys/lm-evaluation-harness-fast that referenced this issue Sep 12, 2023