misindexing when converting llama weights to gpt-neox format #971
Try removing the "+2" on this line.
@CRSilkworth can you check if the code on the llama-conversion branch works for you?
@StellaAthena It looks like that solves the original issue, but there is another, somewhat unrelated issue, which I believe is due to a DeepSpeed update that gets pulled in when installing gpt-neox from scratch. It looks like this line was added in the latest DeepSpeed, and it assumes a 'module' key in the checkpoint dict.
I can get it to load if I set this line to 'None' instead of an empty dict. Not sure if that's kosher. I suspect there is some kind of recursion going on, although I'm not very familiar with this code, so it's a little hard to follow.
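For illustration, a minimal sketch of what I mean (the checkpoint contents and the fallback here are made up for the example; this is not the actual DeepSpeed code):

```python
# Minimal sketch of the failure mode, assuming a checkpoint dict that carries
# an (empty) 'module' entry; NOT the actual DeepSpeed source.
checkpoint = {"optimizer_states": {}, "module": {}}

def load_module_state(ckpt):
    # If the loader unconditionally indexes ckpt['module'] and then recurses
    # into an empty dict, loading fails; treating a missing/empty entry as
    # None lets the caller skip the module load entirely.
    state = ckpt.get("module") or None
    if state is None:
        print("no 'module' state dict; falling back to per-layer files")
    return state

load_module_state(checkpoint)
```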
Probably a question best posed to @Quentin-Anthony
I'll take a look.
@CRSilkworth This error should be able to be fixed by either passing … I can make a PR to make …
@haileyschoelkopf Actually, this error occurs when setting --pipeline_parallel for tools/convert_raw_llama_weights_to_neox.py and then running with pipe_parallel_size > 1. |
I got the same error. Are there any updates?
Yes, the most recent version (#1124) of the conversion script should no longer have this error; I have tested both round-trip conversion and training.
I also encountered this problem. My DeepSpeed branch is bf16_zero1, and I noticed some changes in the new version of the code. If you know how to modify it, please let me know. Thank you.
@linjiadegou2 when running the …
If pipeline parallel size is set to 0, then the checkpoint save/load format is different and neox tries to load from this "module" key, whereas if pipeline parallelism is being used, the weights are saved and loaded from per-layer files.
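For illustration, a rough sketch of how the two layouts differ (the file-name pattern and helper below are assumptions based on this thread, not taken from the NeoX source):

```python
import os

# Pipeline-parallel runs save per-layer files, while pipe_parallel_size = 0
# saves a single state dict that carries the 'module' key.
def checkpoint_layout(ckpt_dir):
    layer_files = [f for f in os.listdir(ckpt_dir)
                   if f.startswith("layer_") and f.endswith("model_states.pt")]
    if layer_files:
        return f"pipeline-parallel layout with {len(layer_files)} per-layer files"
    return "sequential layout: a single state dict with a 'module' key"

# e.g. checkpoint_layout("checkpoints/global_step0")  # path is hypothetical
```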
I converted the raw llama2 parameters to a format supported by NeoX using convert_raw_llama_weights_to_neox.py with --pipeline_parallel, and in my configuration file "pipe_parallel_size": 1. The problem still arises.
Could you open a new issue for this? I'll have to try to replicate this. |
Describe the bug
After running convert_raw_llama_weights_to_neox.py with --pipeline_parallel, the checkpoints are missing the 2nd and 3rd layer files, i.e.:
layer_02-model_-model_states.pt
layer_03-model_-model_states.pt
The first layer files after layer_00-model_* are the layer_04-model_* files, but the other gpt-neox checkpoints have the layer_02 and layer_03 files, which is what GPTModelPipe expects.
This causes errors when loading the model for training or inference, since those weights are not found.
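For reference, a quick check along these lines can list which layer indices were actually written (the checkpoint path and the naming pattern here are assumptions based on the file names above):

```python
import glob
import os
import re

# List the layer indices present in a converted checkpoint directory.
def present_layer_indices(ckpt_dir):
    indices = set()
    for path in glob.glob(os.path.join(ckpt_dir, "layer_*model_states.pt")):
        match = re.search(r"layer_(\d+)", os.path.basename(path))
        if match:
            indices.add(int(match.group(1)))
    return sorted(indices)

# With the bug, the result jumps from 0 straight to 4; a usable checkpoint
# should also contain indices 2 and 3.
print(present_layer_indices("checkpoints/llama-7b-neox"))  # path is hypothetical
```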
To Reproduce
I'm pretty sure the error occurs when it tries to load the layer_01_* and layer_02_* checkpoint files.
Expected behavior
Checkpoints should load successfully.
Proposed solution
I believe the issue was caused by accidentally adding 'layer_i + 2' in two locations instead of one (here and here).
I would just take out the second one, so that the pipeline_parallel version more closely matches the sequential version; a sketch of the indexing change is below.
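As a rough sketch of what I mean (the variable names and loop are hypothetical, not copied from convert_raw_llama_weights_to_neox.py):

```python
# Transformer layers are meant to start at file index 2 (indices 0-1 are
# reserved for the pre-transformer layers), so the "+2" offset should be
# applied exactly once.
num_layers = 4

for layer_i in range(num_layers):
    # Buggy: the offset is added once when computing the index and again when
    # naming the file, so output starts at layer_04 and 02/03 are never written.
    buggy = f"layer_{(layer_i + 2) + 2:02d}-model_00-model_states.pt"
    # Proposed fix: drop the second "+2" so naming matches the sequential path.
    fixed = f"layer_{layer_i + 2:02d}-model_00-model_states.pt"
    print(buggy, "->", fixed)
```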
Environment (please complete the following information):
my 7B llama config