[Training followed tutorial] error: exits with return code = -7 #940
Comments
The error appears to be
The reason this is an issue is detailed here. The quoted passage tells you how to bypass this check, but if you are using a shared computer (e.g., a university cluster) you should not do so without thinking about it very carefully. The most likely explanation is that something in your computer's permissions is misconfigured.
Hi @StellaAthena, here are my logs:
@StellaAthena Thanks for your support. I have found my problem.
Hi @phamkhactu, would you be able to share how you fixed this issue? I'm running into the same problems.
It means that some packages are not compatible with the environment. You should build the Docker image yourself or pull a shared image.
Hi, sorry for bumping, but I had a similar error with the same return code and no detailed explanation. I was running GPT-NeoX in a Docker container in a local k8s cluster. My solution was to increase the container's shm size, as noted in the README and NCCL's docs. Cheers.
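For anyone landing here with the same return code, here is a minimal sketch of two quick checks, not an official diagnostic. It assumes the launcher reports the child's exit status using Python's subprocess convention, where a negative value -N means the process was killed by signal N. On Linux x86, signal 7 is SIGBUS, which fits the shared-memory explanation above, but the number-to-name mapping is platform dependent. The helper names `describe_return_code` and `shm_free_gib` are made up for this example and are not part of GPT-NeoX or DeepSpeed.

```python
import os
import shutil
import signal


def describe_return_code(code: int) -> str:
    """Explain a subprocess-style return code.

    Assumption: the code follows Python's subprocess convention,
    where a child killed by signal N is reported as -N.
    """
    if code >= 0:
        return f"exited normally with status {code}"
    try:
        # Signal numbering is platform dependent; look it up locally.
        name = signal.Signals(-code).name
    except ValueError:
        name = f"signal {-code}"
    return f"killed by {name}"


def shm_free_gib(path: str = "/dev/shm") -> float:
    """Return the free space under `path` in GiB."""
    return shutil.disk_usage(path).free / 2**30


if __name__ == "__main__":
    print(describe_return_code(-7))
    # /dev/shm exists on typical Linux systems and containers only.
    if os.path.exists("/dev/shm"):
        print(f"/dev/shm free: {shm_free_gib():.2f} GiB")
```

If the free space reported inside the container is very small, raising the shared-memory limit (for plain Docker, the `--shm-size` option of `docker run`) is worth trying before anything else.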
Thanks for the excellent repo.
I followed the tutorial to train models, but I get an error.
My steps:
pip install -r requirements/requirements.txt
python ./megatron/fused_kernels/setup.py install
python prepare_data.py -d ./data
python ./deepy.py train.py -d configs bf16_125M.yml local_setup.yml
Environment:
Here are my full logs: