Describe the bug
When I run the command `docker build -t gpt-neox -f Dockerfile .`, the build fails with an nvcc-not-found error while installing deepspeed (DeeperSpeed) from requirements.txt.
The output looks like this:
`Collecting deepspeed (from -r requirements.txt (line 2))
Cloning https://github.com/EleutherAI/DeeperSpeed.git to /tmp/pip-install-ami7m0w2/deepspeed_b51158c86bfd44cb85491c473a45e40a
Running command git clone --filter=blob:none --quiet https://github.com/EleutherAI/DeeperSpeed.git /tmp/pip-install-ami7m0w2/deepspeed_b51158c86bfd44cb85491c473a45e40a
Resolved https://github.com/EleutherAI/DeeperSpeed.git to commit 22fda1e0ee462c2b411575dc954cc8a29d78a7b2
Preparing metadata (setup.py): started
Preparing metadata (setup.py): finished with status 'error'
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [21 lines of output]
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
Traceback (most recent call last):
File "", line 2, in
File "", line 34, in
File "/tmp/pip-install-ami7m0w2/deepspeed_b51158c86bfd44cb85491c473a45e40a/setup.py", line 119, in
os.environ["TORCH_CUDA_ARCH_LIST"] = get_default_compute_capabilities()
File "/tmp/pip-install-ami7m0w2/deepspeed_b51158c86bfd44cb85491c473a45e40a/op_builder/builder.py", line 55, in get_default_compute_capabilities
if torch.utils.cpp_extension.CUDA_HOME is not None and installed_cuda_version()[0] >= 11:
File "/tmp/pip-install-ami7m0w2/deepspeed_b51158c86bfd44cb85491c473a45e40a/op_builder/builder.py", line 43, in installed_cuda_version
output = subprocess.check_output([cuda_home + "/bin/nvcc", "-V"], universal_newlines=True)
File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/usr/lib/python3.8/subprocess.py", line 493, in run
with Popen(*popenargs, **kwargs) as process:
File "/usr/lib/python3.8/subprocess.py", line 858, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/lib/python3.8/subprocess.py", line 1704, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/cuda/bin/nvcc'
Setting ds_accelerator to cuda (auto detect)
[WARNING] Torch did not find cuda available, if cross-compiling or running with cpu only you can ignore this message. Adding compute capability for Pascal, Volta, and Turing (compute capabilities 6.0, 6.1, 6.2)
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
`
changingivan changed the title from "Bug: nvcc does not exists in runtime version of nvidia base image used by Dockerfile" to "Bug: nvcc does not exists in runtime version of nvidia base image used in Dockerfile" on Sep 8, 2023
Proposed solution
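This is a bug in the Dockerfile: the runtime base image nvidia/cuda:11.7.1-runtime-ubuntu20.04 does not include nvcc, but the DeeperSpeed setup.py calls /usr/local/cuda/bin/nvcc during installation. The devel image nvidia/cuda:11.7.1-devel-ubuntu20.04, which ships nvcc, should be used instead. https://github.com/EleutherAI/gpt-neox/blob/main/Dockerfile#L15C37-L15C38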
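To confirm the diagnosis inside a given image, a small check like the one below can help. It is only a sketch: `find_nvcc` and `nvcc_version` are hypothetical helper names, not part of DeeperSpeed or gpt-neox. It looks for `bin/nvcc` under the CUDA home directory, the same path the traceback shows `installed_cuda_version` calling, and returns `None` instead of raising `FileNotFoundError` when it is absent.

```python
import os
import subprocess
from typing import Optional


def find_nvcc(cuda_home: str = "/usr/local/cuda") -> Optional[str]:
    """Return the path to an executable nvcc under cuda_home, or None if missing.

    Hypothetical helper for illustration; not part of DeeperSpeed.
    """
    nvcc = os.path.join(cuda_home, "bin", "nvcc")
    if os.path.isfile(nvcc) and os.access(nvcc, os.X_OK):
        return nvcc
    return None


def nvcc_version(cuda_home: str = "/usr/local/cuda") -> Optional[str]:
    """Return the output of `nvcc -V`, or None when nvcc is not installed."""
    nvcc = find_nvcc(cuda_home)
    if nvcc is None:
        return None
    # Same invocation the traceback shows in op_builder/builder.py.
    return subprocess.check_output([nvcc, "-V"], universal_newlines=True)


if __name__ == "__main__":
    # Assumption: CUDA_HOME falls back to /usr/local/cuda, as in the log above.
    home = os.environ.get("CUDA_HOME", "/usr/local/cuda")
    if find_nvcc(home) is None:
        print(f"nvcc not found under {home}; the image is probably a runtime variant")
    else:
        print(nvcc_version(home))
```

Run inside a container started from the runtime image, this should report that nvcc is missing; in the devel image it should print the nvcc version banner.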