Bug: nvcc does not exist in runtime version of nvidia base image used in Dockerfile #1021

Closed
changingivan opened this issue Sep 8, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@changingivan

Describe the bug
When I run `docker build -t gpt-neox -f Dockerfile .`, the build fails with an nvcc-not-found error.

The output looks like this:

`Collecting deepspeed (from -r requirements.txt (line 2))
Cloning https://github.com/EleutherAI/DeeperSpeed.git to /tmp/pip-install-ami7m0w2/deepspeed_b51158c86bfd44cb85491c473a45e40a
Running command git clone --filter=blob:none --quiet https://github.com/EleutherAI/DeeperSpeed.git /tmp/pip-install-ami7m0w2/deepspeed_b51158c86bfd44cb85491c473a45e40a
Resolved https://github.com/EleutherAI/DeeperSpeed.git to commit 22fda1e0ee462c2b411575dc954cc8a29d78a7b2
Preparing metadata (setup.py): started
Preparing metadata (setup.py): finished with status 'error'
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [21 lines of output]
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
Traceback (most recent call last):
File "", line 2, in
File "", line 34, in
File "/tmp/pip-install-ami7m0w2/deepspeed_b51158c86bfd44cb85491c473a45e40a/setup.py", line 119, in
os.environ["TORCH_CUDA_ARCH_LIST"] = get_default_compute_capabilities()
File "/tmp/pip-install-ami7m0w2/deepspeed_b51158c86bfd44cb85491c473a45e40a/op_builder/builder.py", line 55, in get_default_compute_capabilities
if torch.utils.cpp_extension.CUDA_HOME is not None and installed_cuda_version()[0] >= 11:
File "/tmp/pip-install-ami7m0w2/deepspeed_b51158c86bfd44cb85491c473a45e40a/op_builder/builder.py", line 43, in installed_cuda_version
output = subprocess.check_output([cuda_home + "/bin/nvcc", "-V"], universal_newlines=True)
File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/usr/lib/python3.8/subprocess.py", line 493, in run
with Popen(*popenargs, **kwargs) as process:
File "/usr/lib/python3.8/subprocess.py", line 858, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/lib/python3.8/subprocess.py", line 1704, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/cuda/bin/nvcc'
Setting ds_accelerator to cuda (auto detect)
[WARNING] Torch did not find cuda available, if cross-compiling or running with cpu only you can ignore this message. Adding compute capability for Pascal, Volta, and Turing (compute capabilities 6.0, 6.1, 6.2)
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
`

Proposed solution
This is a bug: the runtime image nvidia/cuda:11.7.1-runtime-ubuntu20.04 does not include nvcc, so building DeeperSpeed fails during `pip install`. The nvidia/cuda:11.7.1-devel-ubuntu20.04 image should be used here instead. https://github.com/EleutherAI/gpt-neox/blob/main/Dockerfile#L15C37-L15C38
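
For anyone who wants to confirm whether a given base image has the compiler before kicking off a full build, here is a minimal sketch of a check that could be run inside the container. It looks at the same path the traceback above tried to execute (`/usr/local/cuda/bin/nvcc`) and then falls back to `PATH`. The helper name `find_nvcc` is made up for illustration and is not part of gpt-neox or DeeperSpeed; the default CUDA location is taken from the log and may differ on other images.

```python
import os
import shutil
from typing import Optional


def find_nvcc(cuda_home: str = "/usr/local/cuda") -> Optional[str]:
    """Return the path to nvcc, or None if the CUDA compiler is missing.

    Hypothetical helper: checks <cuda_home>/bin/nvcc first (the path the
    failing build tried to run), then falls back to searching PATH.
    """
    candidate = os.path.join(cuda_home, "bin", "nvcc")
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    # shutil.which returns None when no executable named nvcc is on PATH.
    return shutil.which("nvcc")


if __name__ == "__main__":
    # CUDA_HOME is honoured if set; otherwise assume the location from the log.
    nvcc = find_nvcc(os.environ.get("CUDA_HOME", "/usr/local/cuda"))
    if nvcc is None:
        print("nvcc not found; packages that compile CUDA code will likely fail to build")
    else:
        print(f"nvcc found at {nvcc}")
```

On the runtime image from this report I would expect the check to come back empty, since the log shows that path does not exist there; I have not run it against the devel image myself.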

@segyges
Contributor

segyges commented Jan 4, 2024

This is resolved in my testing by #1106.

@Quentin-Anthony
Member

> This is resolved in my testing by #1106.

Agree. Closing.
