---
title: "Activation function ablation"
date: 2021-05-24T14:00:00-06:00
draft: false
---

by Leo Gao

This was an ablation of activation functions on GPT-like models of ~100M parameters that I ran ages ago. Each model was run for 10k iterations, which isn't very long. My original goal was to show that the activation function doesn't matter that much, but to do that I'd need a bunch more runs to estimate variance and show no statistical significance, and I don't plan on running a more exhaustive version of this experiment any time soon. So, I'm just dumping these results here in case anyone has a use for them. All the activation definitions are [here](https://github.com/EleutherAI/gpt-neo/blob/master/models/activations.py#L44).

| Name | Pile Validation BPB | LAMBADA acc (%) | LAMBADA ppl |
| --- | --- | --- | --- |
| softsign | 1.1485 | 34.3 | 81.32 |
| ReLU | 1.1482 | 34.3 | 82.01 |
| spike2 | 1.1480 | 34.4 | 83.13 |
| selu | 1.1485 | 34.5 | 83.32 |
| elish | 1.1492 | 33.9 | 84.04 |
| tanhexp | 1.1474 | 33.7 | 84.06 |
| sigmoid | 1.1484 | 33.9 | 85.20 |
| tanhshrink | 1.1483 | 33.9 | 85.42 |
| maxtanh | 1.1479 | 33.7 | 85.53 |
| roottanh | 1.1485 | 33.4 | 86.00 |
| softplusmone | 1.1488 | 34.1 | 86.21 |
| logsoftmax | 1.1492 | 34.2 | 86.29 |
| ELU | 1.1496 | 33.8 | 86.37 |
| Swish | 1.1482 | 33.7 | 86.42 |
| softmax | 1.1491 | 33.2 | 86.74 |
| square_relax | 1.1484 | 33.5 | 86.92 |
| lisht | 1.1500 | 33.8 | 87.17 |
| GELU | 1.1453 | 34.0 | 87.84 |
| abs | 1.1489 | 33.5 | 87.96 |
| tanh | 1.1481 | 33.2 | 89.28 |
| Mish | 1.1482 | 33.6 | 89.84 |
| triangle_relax | 1.1502 | 33.7 | 89.91 |
| seagull | 1.1487 | 33.3 | 90.08 |
| maxsig | 1.1480 | 33.3 | 90.23 |
| softplus | 1.1460 | 33.1 | 90.74 |
| minsin | 1.1498 | 33.3 | 91.18 |
| snake | 1.1484 | 33.1 | 91.93 |
| cosid | 1.1490 | 33.3 | 92.99 |
| spike | 1.1498 | 33.3 | 93.78 |
| bipolarsigmoid | 1.1513 | 32.8 | 96.73 |
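
For reference, a few of the activations above have standard formulas. Here's a minimal NumPy sketch of some of them, written from the standard published definitions rather than copied from the linked activations.py, so treat it as illustrative:

```python
import numpy as np

def softsign(x):
    # x / (1 + |x|): bounded like tanh, but with slower polynomial tails
    return x / (1.0 + np.abs(x))

def tanhexp(x):
    # TanhExp: x * tanh(exp(x))
    return x * np.tanh(np.exp(x))

def snake(x, a=1.0):
    # Snake: x + sin^2(a*x) / a
    return x + np.sin(a * x) ** 2 / a

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))
```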
---
title: "Trying different fewshot description prompting for GPT-3"
date: 2021-05-24T14:00:02-06:00
draft: false
---

by Leo Gao

[Adam Shimi](https://www.alignmentforum.org/users/adamshimi) suggested trying different fewshot prompts on GPT-3, in the hope of observing evidence that larger models can handle a wider variety of prompting. He also wrote up a bunch of prompts to try on SST.

Unfortunately, the results were kinda mixed: the GPT-2 models all did absolutely terribly and their results were basically useless, and performance wasn't monotonic with model size (2.7B did better than 1.3B, and babbage did better than curie). Also, the variance generally *increased* with performance.

| Model | Mean accuracy (%) | Stddev in accuracy |
|--------------|---------------|--------------------|
| gpt3-ada | 51.9 | 0.0368 |
| gpt3-babbage | 69.4 | 0.0840 |
| gpt3-curie | 67.4 | 0.0807 |
| neo-1.3B | 63.0 | 0.0522 |
| neo-2.7B | 56.5 | 0.0684 |

However, there was one interesting and unexpected result: there's basically no correlation between different models in which prompts do best. This is highly unexpected, because *a priori* I'd expect models trained on the same or similar data to have similar preferences for what kinds of prompts work well, and that surely some prompts must be better than other prompts in general.
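
The check behind that claim is simple: collect per-prompt accuracy for each model, then correlate across prompts. A minimal sketch with made-up numbers (the real per-prompt data lives in the linked experiment code; these arrays are purely illustrative):

```python
import numpy as np

# Hypothetical per-prompt SST accuracies for two models (NOT the real data):
# one entry per prompt, same prompt order in both arrays.
acc_model_a = np.array([0.62, 0.55, 0.70, 0.58, 0.65, 0.60])
acc_model_b = np.array([0.66, 0.69, 0.60, 0.71, 0.63, 0.68])

# Pearson correlation across prompts; a value near zero means the two
# models don't agree on which prompts work well.
r = np.corrcoef(acc_model_a, acc_model_b)[0, 1]
print(f"cross-model prompt correlation: r = {r:.2f}")
```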

Here's what that looks like plotted out. Each point in these plots is one prompt, and the axes are different models. The values are SST accuracy:

![](/images/research-log/fig_gpt2_pretrained-EleutherAI_gpt-neo-1.3B_gpt2_pretrained-EleutherAI_gpt-neo-2.7B.png)
![](/images/research-log/fig_gpt3_engine-ada_gpt2_pretrained-EleutherAI_gpt-neo-1.3B.png)
![](/images/research-log/fig_gpt3_engine-ada_gpt2_pretrained-EleutherAI_gpt-neo-2.7B.png)
![](/images/research-log/fig_gpt3_engine-ada_gpt3_engine-babbage.png)
![](/images/research-log/fig_gpt3_engine-ada_gpt3_engine-curie.png)
![](/images/research-log/fig_gpt3_engine-babbage_gpt2_pretrained-EleutherAI_gpt-neo-1.3B.png)
![](/images/research-log/fig_gpt3_engine-babbage_gpt2_pretrained-EleutherAI_gpt-neo-2.7B.png)
![](/images/research-log/fig_gpt3_engine-babbage_gpt3_engine-curie.png)
![](/images/research-log/fig_gpt3_engine-curie_gpt2_pretrained-EleutherAI_gpt-neo-1.3B.png)
![](/images/research-log/fig_gpt3_engine-curie_gpt2_pretrained-EleutherAI_gpt-neo-2.7B.png)

The code for the experiment is [here](https://gist.github.com/leogao2/d156d8e0f49ac83b239dde3819668b4b).