training problem? #58

Open
Wanghc233 opened this issue Nov 18, 2023 · 6 comments

Comments

@Wanghc233

Wanghc233 commented Nov 18, 2023

When I train the VAE, the following problem occurred:
Using /home/whc/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/whc/.cache/torch_extensions/emd_ext/build.ninja...
Building extension module emd_ext...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] /usr/local/cuda-11.0/bin/nvcc -DTORCH_EXTENSION_NAME=emd_ext -DTORCH_API_INCLUDE_EXTENSION_H -isystem /home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include -isystem /home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/TH -isystem /home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-11.0/include -isystem /home/whc/miniconda3/envs/lion_env/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -std=c++14 -c /home/whc/LION/third_party/PyTorchEMD/cuda/emd_kernel.cu -o emd_kernel.cuda.o
FAILED: emd_kernel.cuda.o
/usr/local/cuda-11.0/bin/nvcc -DTORCH_EXTENSION_NAME=emd_ext -DTORCH_API_INCLUDE_EXTENSION_H -isystem /home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include -isystem /home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/TH -isystem /home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-11.0/include -isystem /home/whc/miniconda3/envs/lion_env/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -std=c++14 -c /home/whc/LION/third_party/PyTorchEMD/cuda/emd_kernel.cu -o emd_kernel.cuda.o
nvcc fatal : Unsupported gpu architecture 'compute_86'
ninja: build stopped: subcommand failed.
2023-11-20 16:21:35.117 | ERROR | utils.utils:init_processes:1158 - An error has been caught in function 'init_processes', process 'MainProcess' (1741727), thread 'MainThread' (140035078779840):
Traceback (most recent call last):

File "/home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1516, in _run_ninja_build
subprocess.run(
│ └ <function run at 0x7f5c743be430>
└ <module 'subprocess' from '/home/whc/miniconda3/envs/lion_env/lib/python3.8/subprocess.py'>
File "/home/whc/miniconda3/envs/lion_env/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
│ │ │ └ ['ninja', '-v']
│ │ └ <subprocess.Popen object at 0x7f5ab08a1520>
│ └ 1
└ <class 'subprocess.CalledProcessError'>

subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

File "train_dist.py", line 251, in
utils.init_processes(0, size, main, args, config)
│ │ │ │ │ └ CfgNode({'dpm_ckpt': '', 'clipforge': CfgNode({'clip_model': 'ViT-B/32', 'enable': 0, 'feat_dim': 512}), 'eval_trainnll': 0, ...
│ │ │ │ └ Namespace(autocast_eval=True, autocast_train=True, config='none', data='/tmp/nvae-diff/data', dataset='cifar10', distributed=...
│ │ │ └ <function main at 0x7f5bc982a160>
│ │ └ 1
│ └ <function init_processes at 0x7f5bc98263a0>
└ <module 'utils.utils' from '/home/whc/LION/utils/utils.py'>

File "/home/whc/LION/utils/utils.py", line 1158, in init_processes
fn(args, config)
│ │ └ CfgNode({'dpm_ckpt': '', 'clipforge': CfgNode({'clip_model': 'ViT-B/32', 'enable': 0, 'feat_dim': 512}), 'eval_trainnll': 0, ...
│ └ Namespace(autocast_eval=True, autocast_train=True, config='none', data='/tmp/nvae-diff/data', dataset='cifar10', distributed=...
└ <function main at 0x7f5bc982a160>

File "train_dist.py", line 31, in main
trainer_lib = importlib.import_module(config.trainer.type)
│ │ └ CfgNode({'dpm_ckpt': '', 'clipforge': CfgNode({'clip_model': 'ViT-B/32', 'enable': 0, 'feat_dim': 512}), 'eval_trainnll': 0, ...
│ └ <function import_module at 0x7f5c7481ad30>
└ <module 'importlib' from '/home/whc/miniconda3/envs/lion_env/lib/python3.8/importlib/__init__.py'>

File "/home/whc/miniconda3/envs/lion_env/lib/python3.8/importlib/init.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
│ │ │ │ │ └ 0
│ │ │ │ └ None
│ │ │ └ 0
│ │ └ 'trainers.hvae_trainer'
│ └ <function _gcd_import at 0x7f5c74943430>
└ <module 'importlib._bootstrap' (frozen)>
File "", line 1014, in _gcd_import
File "", line 991, in _find_and_load
File "", line 975, in _find_and_load_unlocked
File "", line 671, in _load_unlocked
File "", line 843, in exec_module
File "", line 219, in _call_with_frames_removed

File "/home/whc/LION/trainers/hvae_trainer.py", line 18, in
from trainers.base_trainer import BaseTrainer

File "/home/whc/LION/trainers/base_trainer.py", line 19, in
from utils.evaluation_metrics_fast import print_results

File "/home/whc/LION/utils/evaluation_metrics_fast.py", line 24, in
from third_party.PyTorchEMD.emd_nograd import earth_mover_distance_nograd

File "/home/whc/LION/third_party/PyTorchEMD/emd_nograd.py", line 4, in
from third_party.PyTorchEMD.backend import emd_cuda_dynamic as emd_cuda

File "/home/whc/LION/third_party/PyTorchEMD/backend.py", line 10, in
emd_cuda_dynamic = load(name='emd_ext',
└ <function load at 0x7f5aafd78280>

File "/home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 969, in load
return _jit_compile(
└ <function _jit_compile at 0x7f5aafd783a0>
File "/home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1176, in _jit_compile
_write_ninja_file_and_build_library(
└ <function _write_ninja_file_and_build_library at 0x7f5aafd784c0>
File "/home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1280, in _write_ninja_file_and_build_library
_run_ninja_build(
└ <function _run_ninja_build at 0x7f5aafd78940>
File "/home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1538, in _run_ninja_build
raise RuntimeError(message) from e
└ "Error building extension 'emd_ext'"

RuntimeError: Error building extension 'emd_ext'
If I set:
export TORCH_CUDA_ARCH_LIST="7.5"
another problem occurs:
CUDA kernel failed : no kernel image is available for execution on the device
void avg_voxelize(int, int, int, int, int, int, const int*, const float*, int*, int*, float*) at L:118 in /home/whc/LION/third_party/pvcnn/functional/src/voxelization/vox.cu
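
For reference, here is a minimal sketch (not part of the repo) that checks which compute capability the GPU actually reports, using the PyTorch install from the traceback; the value passed to TORCH_CUDA_ARCH_LIST has to be an architecture that the installed CUDA toolkit understands:

import torch

# Prints the compute capability of GPU 0, e.g. (8, 6) for an RTX 3090.
# The CUDA 11.0 toolkit invoked above does not know compute_86, which is why
# nvcc rejects the -gencode=arch=compute_86 flag.
print(torch.cuda.get_device_capability(0))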

@ZENGXH
Collaborator

ZENGXH commented Dec 4, 2023

It seems the EMD build failed: nvcc fatal : Unsupported gpu architecture 'compute_86'

  • What GPU are you using?
  • Could you try installing this repo: https://github.com/daerduoCarey/PyTorchEMD ?
  • It is possible to remove the EMD requirement; it is only used during evaluation, not in training. This requires commenting out the related code that imports and calls EMD (see the sketch below).
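
A minimal sketch of what that could look like, assuming the import in utils/evaluation_metrics_fast.py is guarded rather than deleted (the try/except wrapper and the helper below are suggestions, not code from the repo):

# utils/evaluation_metrics_fast.py (sketch): make the EMD extension optional so
# training can proceed even when the emd_ext CUDA extension fails to build.
try:
    from third_party.PyTorchEMD.emd_nograd import earth_mover_distance_nograd
    emd_available = True
except (ImportError, RuntimeError):  # JIT build failures surface as RuntimeError
    earth_mover_distance_nograd = None
    emd_available = False

def compute_emd_if_available(sample_pcs, ref_pcs):
    # Hypothetical helper: skip the EMD metric when the extension is not built.
    if not emd_available:
        return None
    return earth_mover_distance_nograd(sample_pcs, ref_pcs)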

@Wanghc233
Author

I use an RTX 3090. Entering export TORCH_CUDA_ARCH_LIST="8.0" solves the "nvcc fatal : Unsupported gpu architecture 'compute_86'" problem.

@ZENGXH
Collaborator

ZENGXH commented Jan 11, 2024

Thanks for the update!

@Wanghc233
Author

Haha, wishing you an early graduation!

@Philcalab

I use an RTX 3090. Entering export TORCH_CUDA_ARCH_LIST="8.0" solves the "nvcc fatal : Unsupported gpu architecture 'compute_86'" problem.

Should export TORCH_CUDA_ARCH_LIST="8.0" go at the end of the .bashrc file?

@Wanghc233
Author

You only need to enter it once in the terminal; there is no need to change the global environment, otherwise your other code might stop running. First type export TORCH_CUDA_ARCH_LIST="8.0" in the terminal, press Enter, and then run python train.py.
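
In other words, something like this in a single terminal session (a sketch of the steps above; clearing the cached build directory is an extra suggestion to force a clean rebuild of the extension, not something required by the repo):

# optional: remove the previously failed build of the extension
rm -rf ~/.cache/torch_extensions/emd_ext
# set the architecture for this shell session only
export TORCH_CUDA_ARCH_LIST="8.0"
# then launch training in the same session
python train.py   # or train_dist.py, as in the traceback above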
