About gpt-neox-20B model hyperparameter #989
I'm confused about the 20B model hyperparameters: hidden-size=6144, num-attention-heads=64, num-layers=44. LLaMA and GPT models use different hyperparameters; for example, LLaMA-65B has hidden-size=8192, num-attention-heads=64, num-layers=80. It seems that LLaMA-65B is deeper while gpt-neox-20B is wider. Which choice of hyperparameters is better?
Thank you~
Comments
There is no compelling evidence for what precise width-to-depth ratio is best.
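
To make the width-versus-depth comparison concrete, here is a rough back-of-the-envelope sketch (an illustration, not from the thread) that estimates parameter counts from the quoted hyperparameters using the common ~12 · num_layers · hidden_size² approximation and prints each model's width-to-depth ratio. The `approx_params_billions` helper and the `configs` dictionary are hypothetical, introduced only to hold the numbers quoted above:

```python
# Back-of-the-envelope parameter estimate from the hyperparameters quoted above.
# Uses the common approximation ~12 * num_layers * hidden_size^2 (attention + MLP
# weights), ignoring embeddings, biases, and layer norms.

def approx_params_billions(num_layers: int, hidden_size: int) -> float:
    """Approximate transformer parameter count, in billions."""
    return 12 * num_layers * hidden_size ** 2 / 1e9

# Hypothetical config table holding the values from the question.
configs = {
    "gpt-neox-20b": {"num_layers": 44, "hidden_size": 6144},  # wider, shallower
    "llama-65b":    {"num_layers": 80, "hidden_size": 8192},  # deeper, narrower per layer
}

for name, cfg in configs.items():
    ratio = cfg["hidden_size"] / cfg["num_layers"]  # width-to-depth ratio
    print(f"{name}: ~{approx_params_billions(**cfg):.1f}B params, "
          f"width/depth ≈ {ratio:.0f}")
```

Both estimates land close to the advertised 20B and 65B sizes, so the two configurations simply sit at different points on the width-to-depth spectrum, which is consistent with the reply above.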