
Performance of the Pythia vs. LLaMA layer-block architecture #122

Closed
peiyingxin opened this issue Oct 12, 2023 · 1 comment

peiyingxin commented Oct 12, 2023

Hi,
first of all, thanks for your great contributions to open research!
I'm confused about how the model architecture influences model performance. I note that the Pythia layer block looks like

pseudocode:
x = x + attn(ln1(x)) + mlp(ln2(x))

while the GPT/LLaMA layer block looks like

pseudocode:
x = x + attn(ln1(x))
x = x + mlp(ln2(x))

Have you tested how this architectural difference affects performance?
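
For concreteness, here is a minimal PyTorch sketch of the two block variants (illustrative only, not the actual Pythia or LLaMA code; `attn` and `mlp` are placeholder linear layers standing in for the real self-attention and feed-forward modules):

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    """Pythia / GPT-NeoX style: attention and MLP branches applied in parallel."""
    def __init__(self, d_model: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = nn.Linear(d_model, d_model)  # placeholder for self-attention
        self.mlp = nn.Linear(d_model, d_model)   # placeholder for the feed-forward block

    def forward(self, x):
        # x = x + attn(ln1(x)) + mlp(ln2(x))
        return x + self.attn(self.ln1(x)) + self.mlp(self.ln2(x))


class SequentialBlock(nn.Module):
    """GPT / LLaMA style: attention residual first, then MLP residual."""
    def __init__(self, d_model: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = nn.Linear(d_model, d_model)  # placeholder for self-attention
        self.mlp = nn.Linear(d_model, d_model)   # placeholder for the feed-forward block

    def forward(self, x):
        # x = x + attn(ln1(x)), then x = x + mlp(ln2(x))
        x = x + self.attn(self.ln1(x))
        return x + self.mlp(self.ln2(x))


x = torch.randn(2, 16, 64)           # (batch, seq_len, d_model)
print(ParallelBlock(64)(x).shape)    # torch.Size([2, 16, 64])
print(SequentialBlock(64)(x).shape)  # torch.Size([2, 16, 64])
```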

StellaAthena (Member) commented Oct 21, 2023

Yes! Putting the MLP and attention layers in parallel is known not to hurt performance at scale while providing a substantial increase in training speed. It was introduced by GPT-J-6B and has since been used by GPT-NeoX-20B, PaLM 1 and 2, ViT-22B, and many more. Experiments at different labs consistently report a 15% speed-up in training.

It's generally reported without a full ablation, but the PaLM 1 paper and the GPT-NeoX-20B paper both describe experiments showing this.
