---
title: "Activation function ablation"
date: 2021-05-24T14:00:00-06:00
draft: false
---

by Leo Gao

This was an ablation of activation functions on GPT-like models of ~100M parameters that I ran ages ago. Each model was run for 10k iterations, which isn't very long. My original goal was to show that the activation function doesn't matter that much, but to do that I'd need a bunch more runs to estimate variance and show no statistical significance, and I don't plan on running a more exhaustive version of this experiment any time soon. So, I'm just dumping these results here in case anyone has a use for them. All the activation definitions are [here](https://github.com/EleutherAI/gpt-neo/blob/master/models/activations.py#L44).

| Name | Pile Validation BPB | LAMBADA acc (%) | LAMBADA ppl |
| --- | --- | --- | --- |
| softsign | 1.1485 | 34.3 | 81.32 |
| ReLU | 1.1482 | 34.3 | 82.01 |
| spike2 | 1.1480 | 34.4 | 83.13 |
| selu | 1.1485 | 34.5 | 83.32 |
| elish | 1.1492 | 33.9 | 84.04 |
| tanhexp | 1.1474 | 33.7 | 84.06 |
| sigmoid | 1.1484 | 33.9 | 85.20 |
| tanhshrink | 1.1483 | 33.9 | 85.42 |
| maxtanh | 1.1479 | 33.7 | 85.53 |
| roottanh | 1.1485 | 33.4 | 86.00 |
| softplusmone | 1.1488 | 34.1 | 86.21 |
| logsoftmax | 1.1492 | 34.2 | 86.29 |
| ELU | 1.1496 | 33.8 | 86.37 |
| Swish | 1.1482 | 33.7 | 86.42 |
| softmax | 1.1491 | 33.2 | 86.74 |
| square_relax | 1.1484 | 33.5 | 86.92 |
| lisht | 1.1500 | 33.8 | 87.17 |
| GELU | 1.1453 | 34.0 | 87.84 |
| abs | 1.1489 | 33.5 | 87.96 |
| tanh | 1.1481 | 33.2 | 89.28 |
| Mish | 1.1482 | 33.6 | 89.84 |
| triangle_relax | 1.1502 | 33.7 | 89.91 |
| seagull | 1.1487 | 33.3 | 90.08 |
| maxsig | 1.1480 | 33.3 | 90.23 |
| softplus | 1.1460 | 33.1 | 90.74 |
| minsin | 1.1498 | 33.3 | 91.18 |
| snake | 1.1484 | 33.1 | 91.93 |
| cosid | 1.1490 | 33.3 | 92.99 |
| spike | 1.1498 | 33.3 | 93.78 |
| bipolarsigmoid | 1.1513 | 32.8 | 96.73 |
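
For reference, a few of the activations above have standard formulas. Here's a minimal NumPy sketch of some of them, written from the standard published definitions rather than copied from the linked activations.py, so treat it as illustrative:

```python
import numpy as np

def softsign(x):
    # x / (1 + |x|): bounded like tanh, but with slower polynomial tails
    return x / (1.0 + np.abs(x))

def tanhexp(x):
    # TanhExp: x * tanh(exp(x))
    return x * np.tanh(np.exp(x))

def snake(x, a=1.0):
    # Snake: x + sin^2(a*x) / a
    return x + np.sin(a * x) ** 2 / a

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))
```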
---
title: "Trying different fewshot description prompting for GPT-3"
date: 2021-05-24T14:00:02-06:00
draft: false
---

by Leo Gao

[Adam Shimi](https://www.alignmentforum.org/users/adamshimi) suggested trying different fewshot prompts on GPT-3, in the hope of observing evidence that larger models can handle a wider variety of prompting. He also wrote up a bunch of prompts to try on SST.

Unfortunately, the results were kinda mixed: the GPT-2 models all did absolutely terribly and their results were basically useless, and performance wasn't monotonic with model size (2.7B did better than 1.3B, and babbage did better than curie). Also, the variance generally *increased* with performance.

| Model | Mean accuracy (%) | Stddev in accuracy |
|--------------|---------------|--------------------|
| gpt3-ada | 51.9 | 0.0368 |
| gpt3-babbage | 69.4 | 0.0840 |
| gpt3-curie | 67.4 | 0.0807 |
| neo-1.3B | 63.0 | 0.0522 |
| neo-2.7B | 56.5 | 0.0684 |

However, there was one interesting and unexpected result: there's basically no correlation between different models in which prompts do best. This is highly unexpected, because *a priori* I'd expect models trained on the same or similar data to have similar preferences for what kinds of prompts work well, and that surely some prompts must be better than other prompts in general.
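
The check behind that claim is simple: collect per-prompt accuracy for each model, then correlate across prompts. A minimal sketch with made-up numbers (the real per-prompt data lives in the linked experiment code; these arrays are purely illustrative):

```python
import numpy as np

# Hypothetical per-prompt SST accuracies for two models (NOT the real data):
# one entry per prompt, same prompt order in both arrays.
acc_model_a = np.array([0.62, 0.55, 0.70, 0.58, 0.65, 0.60])
acc_model_b = np.array([0.66, 0.69, 0.60, 0.71, 0.63, 0.68])

# Pearson correlation across prompts; a value near zero means the two
# models don't agree on which prompts work well.
r = np.corrcoef(acc_model_a, acc_model_b)[0, 1]
print(f"cross-model prompt correlation: r = {r:.2f}")
```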

Here's what that looks like plotted out. Each point in these plots is one prompt, and the axes are different models. The values are SST accuracy:

![](/images/research-log/fig_gpt2_pretrained-EleutherAI_gpt-neo-1.3B_gpt2_pretrained-EleutherAI_gpt-neo-2.7B.png)
![](/images/research-log/fig_gpt3_engine-ada_gpt2_pretrained-EleutherAI_gpt-neo-1.3B.png)
![](/images/research-log/fig_gpt3_engine-ada_gpt2_pretrained-EleutherAI_gpt-neo-2.7B.png)
![](/images/research-log/fig_gpt3_engine-ada_gpt3_engine-babbage.png)
![](/images/research-log/fig_gpt3_engine-ada_gpt3_engine-curie.png)
![](/images/research-log/fig_gpt3_engine-babbage_gpt2_pretrained-EleutherAI_gpt-neo-1.3B.png)
![](/images/research-log/fig_gpt3_engine-babbage_gpt2_pretrained-EleutherAI_gpt-neo-2.7B.png)
![](/images/research-log/fig_gpt3_engine-babbage_gpt3_engine-curie.png)
![](/images/research-log/fig_gpt3_engine-curie_gpt2_pretrained-EleutherAI_gpt-neo-1.3B.png)
![](/images/research-log/fig_gpt3_engine-curie_gpt2_pretrained-EleutherAI_gpt-neo-2.7B.png)

The code for the experiment is [here](https://gist.github.com/leogao2/d156d8e0f49ac83b239dde3819668b4b).