tuple index out of range in _exec_send_grads p2p.send #884
I ran into the same problem. Did you manage to solve it?
I replaced the `_exec_send_grads` function that raised the error in gpt-neox 2.0 (DeepSpeed 0.8.3) with the `_exec_send_grads` function from gpt-neox 1.0 (DeepSpeed 0.3.15), and now the program runs fine.

DeepSpeed 0.8.3:

DeepSpeed 0.3.15:
@Zlzzzupup thanks for looking into this! I've created a code diff to help understand what's going on. I'm having trouble spotting any functional differences. In my understanding there are three blocks of differences:

Can you try restoring the current version of the code, and then introducing each of these changes in isolation? I'm very curious to see which one(s) break the code.
@StellaAthena Thanks for your reply. I reproduced the three parts of the modified code and located the problem in:

The other parts of the modification do not affect the running of the program.
@Quentin-Anthony do you have any idea why this would cause an error? I can run some tests, but since
I'm still discussing this with the DeepSpeed team. Please either apply microsoft/DeepSpeed#2538 or install from the latest DeeperSpeed, which already has this patch applied. In general, please use the latest DeeperSpeed for running gpt-neox. We use it as a staging ground for fixes like these before they get merged into upstream DeepSpeed.
@drcege I've corrected
Describe the bug

When setting both `pipe-parallel-size` and `model-parallel-size` to 2, the training crashes. However, setting each individually to 2 (and keeping the other at 1) works fine.

The stack trace:
To Reproduce

Steps to reproduce the behavior:
1. Use the Docker image `leogao2/gpt-neox:sha-61b5eee`
2. Set up `/job/hostfile` as follows
3. Run `python ./deepy.py train.py -d configs 1-3B.yml local_setup.yml` after modifying `data-path`, `vocab-file`, and `merge-file`
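For reference, the failing parallelism combination can be expressed as a config override along these lines (a sketch only; the key names come from this issue, but the surrounding file layout of the gpt-neox YAML configs is assumed):

```yaml
# Hypothetical override illustrating the failing combination:
# setting both degrees to 2 triggers the crash, while setting
# either one to 2 (with the other left at 1) works.
{
  "pipe-parallel-size": 2,
  "model-parallel-size": 2,
}
```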
Expected behavior
Should train without error.
Proposed solution
After debugging, I believe the error was triggered here:
https://github.com/EleutherAI/DeeperSpeed/blob/457850dc5ad72960f0e8a8f1597914d682a7792c/deepspeed/runtime/pipe/engine.py#L1023-L1025
It seems that the length of `inputs` is less than 2, so the indexing is out of range. Does this mean the grad is not properly partitioned when `model-parallel-size` > 1?

I know the code comes from inside DeepSpeed, but these lines were written several years ago and have been used by many tools, which suggests the error may be caused by NeoX passing incorrect arguments from the outside.
Screenshots
If applicable, add screenshots to help explain your problem.
Environment (please complete the following information):
Additional context
Add any other context about the problem here.