Dockerfile error #933
Comments
If you'd like to use CUDA 11.7, change it locally. I don't think a global update to the Dockerfile CUDA version is necessary.
So, what should be done about this error if it is not caused by the CUDA version?
You can either change your local copy of the file to say CUDA 11.7, or you can install a version of PyTorch compiled with CUDA 11.1.
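To see which side of the mismatch you are on, something like the following could help. It is only a rough sketch: it compares the CUDA version torch was compiled with (`torch.version.cuda`) against the release printed by `nvcc --version` inside the container. The helper names are made up for illustration, and treating the two as compatible only when major.minor are equal is an assumption; some build checks are looser than that.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a string such as '11.7' or '11.1.1'."""
    match = re.search(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"no CUDA version found in {version!r}")
    return int(match.group(1)), int(match.group(2))


def cuda_versions_match(torch_cuda: str, toolkit_cuda: str) -> bool:
    """Assumption: versions are compatible only if major.minor are equal."""
    return major_minor(torch_cuda) == major_minor(toolkit_cuda)


def toolkit_cuda_version() -> Optional[str]:
    """Return the release reported by `nvcc --version`, or None if unavailable."""
    if shutil.which("nvcc") is None:
        return None
    out = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True
    ).stdout
    match = re.search(r"release (\d+\.\d+)", out)
    return match.group(1) if match else None


def torch_cuda_version() -> Optional[str]:
    """Return the CUDA version torch was built with, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda  # None for CPU-only builds


if __name__ == "__main__":
    torch_cuda = torch_cuda_version()
    toolkit_cuda = toolkit_cuda_version()
    print("torch built with CUDA:", torch_cuda)
    print("nvcc toolkit CUDA:", toolkit_cuda)
    if torch_cuda and toolkit_cuda:
        print("match:", cuda_versions_match(torch_cuda, toolkit_cuda))
```

If the two versions differ, either change the base image to match torch or reinstall torch to match the image, as described above.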
It is caused by the CUDA version, specifically by a mismatch with the CUDA version your local torch was built with. I'm recommending that you personally switch to the CUDA 11.7 NVIDIA Docker image (
I have changed the CUDA version to 11.7, and now I am encountering the following error.
When it says
Piggybacking off @StellaAthena's comment, the specific issue is with Apex:
Your possible fixes here are to either comment out this check in apex's
I managed to start training using the 6-7B.yml configuration, but when I try to use flash-attention by adding "attention_config": [[["flash"], 32]] to the config, I encounter the following error. I should mention that I'm working with 4 A100 cards. What can fix this error?
This still seems to be a version mismatch issue; see pytorch/pytorch#91186.
Closing due to inactivity. Feel free to re-open if this is encountered again!
I used git clone to download this repository and then downloaded the Slim weights. Next, I built the image and ran the container. I intended to generate text by executing the command ./deepy.py generate.py ./configs/20B.yml, but I encountered the following error:
After running pip install /lustre/scratch/tmp/1503311/gpt-neox/megatron/fused_kernels, I got this:
Can you tell me what the solution might be?
Also, it seems to me that there is a mistake on line 15 of the Dockerfile:
FROM nvidia/cuda:11.1.1-devel-ubuntu20.04
It says cuda:11.1, but I think it should be cuda:11.7.