The performance of the Pythia vs. LLaMA model architecture #122
Yes! Putting the MLP and attention layers in parallel is known to not hurt performance at scale while providing a substantial increase in training speed. It was introduced by GPT-J-6B and has since been used by GPT-NeoX-20B, PaLM 1 and 2, ViT-22B, and many more. Experiments at different labs consistently report a roughly 15% speed-up in training. It's generally reported on without a full ablation, but the PaLM 1 paper and the GPT-NeoX-20B paper both describe experiments showing this.
Hi,
first of all, thanks for your great contributions to open research!
I am a bit confused about how the model architecture influences model performance. I note that the Pythia layer block looks like
pseudocode:
x = x + attn(ln1(x)) + mlp(ln2(x))
while the GPT or LLaMA layer block looks like
pseudocode:
x = x + attn(ln1(x))
x = x + mlp(ln2(x))
Have you tested how this architectural difference affects performance?
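
For illustration, here is a minimal PyTorch sketch contrasting the two block variants described by the pseudocode above. The SimpleAttention and SimpleMLP modules are simplified placeholders, not Pythia's or LLaMA's actual implementations; only the residual wiring in the forward passes is the point.

import torch
import torch.nn as nn

class SimpleMLP(nn.Module):
    """Placeholder feed-forward network (stand-in for the real MLP)."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.net(x)

class SimpleAttention(nn.Module):
    """Placeholder self-attention layer (stand-in for the real attention)."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return out

class SequentialBlock(nn.Module):
    """GPT/LLaMA-style block: the MLP reads the output of the attention step."""
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = SimpleAttention(d_model, n_heads)
        self.mlp = SimpleMLP(d_model, d_ff)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # x = x + attn(ln1(x))
        x = x + self.mlp(self.ln2(x))    # x = x + mlp(ln2(x))
        return x

class ParallelBlock(nn.Module):
    """Pythia/GPT-NeoX-style block: attention and MLP both read the block input."""
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = SimpleAttention(d_model, n_heads)
        self.mlp = SimpleMLP(d_model, d_ff)

    def forward(self, x):
        # x = x + attn(ln1(x)) + mlp(ln2(x)): the MLP no longer depends on the
        # attention output, so the two branches can be computed concurrently.
        return x + self.attn(self.ln1(x)) + self.mlp(self.ln2(x))

if __name__ == "__main__":
    x = torch.randn(2, 16, 64)  # (batch, seq_len, d_model)
    print(SequentialBlock(64, 4, 256)(x).shape)
    print(ParallelBlock(64, 4, 256)(x).shape)

Both blocks map a (batch, seq_len, d_model) tensor to the same shape; the difference is only whether the MLP branch waits for the attention branch, which is what enables the training speed-up discussed above.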