training problem? #58

Open
Wanghc233 opened this issue Nov 18, 2023 · 6 comments

Comments

@Wanghc233

Wanghc233 commented Nov 18, 2023

When I train the VAE, the following problem occurred:
Using /home/whc/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/whc/.cache/torch_extensions/emd_ext/build.ninja...
Building extension module emd_ext...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] /usr/local/cuda-11.0/bin/nvcc -DTORCH_EXTENSION_NAME=emd_ext -DTORCH_API_INCLUDE_EXTENSION_H -isystem /home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include -isystem /home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/TH -isystem /home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-11.0/include -isystem /home/whc/miniconda3/envs/lion_env/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -std=c++14 -c /home/whc/LION/third_party/PyTorchEMD/cuda/emd_kernel.cu -o emd_kernel.cuda.o
FAILED: emd_kernel.cuda.o
/usr/local/cuda-11.0/bin/nvcc -DTORCH_EXTENSION_NAME=emd_ext -DTORCH_API_INCLUDE_EXTENSION_H -isystem /home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include -isystem /home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/TH -isystem /home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-11.0/include -isystem /home/whc/miniconda3/envs/lion_env/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -std=c++14 -c /home/whc/LION/third_party/PyTorchEMD/cuda/emd_kernel.cu -o emd_kernel.cuda.o
nvcc fatal : Unsupported gpu architecture 'compute_86'
ninja: build stopped: subcommand failed.
2023-11-20 16:21:35.117 | ERROR | utils.utils:init_processes:1158 - An error has been caught in function 'init_processes', process 'MainProcess' (1741727), thread 'MainThread' (140035078779840):
Traceback (most recent call last):

File "/home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1516, in _run_ninja_build
subprocess.run(
│ └ <function run at 0x7f5c743be430>
└ <module 'subprocess' from '/home/whc/miniconda3/envs/lion_env/lib/python3.8/subprocess.py'>
File "/home/whc/miniconda3/envs/lion_env/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
│ │ │ └ ['ninja', '-v']
│ │ └ <subprocess.Popen object at 0x7f5ab08a1520>
│ └ 1
└ <class 'subprocess.CalledProcessError'>

subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

File "train_dist.py", line 251, in
utils.init_processes(0, size, main, args, config)
│ │ │ │ │ └ CfgNode({'dpm_ckpt': '', 'clipforge': CfgNode({'clip_model': 'ViT-B/32', 'enable': 0, 'feat_dim': 512}), 'eval_trainnll': 0, ...
│ │ │ │ └ Namespace(autocast_eval=True, autocast_train=True, config='none', data='/tmp/nvae-diff/data', dataset='cifar10', distributed=...
│ │ │ └ <function main at 0x7f5bc982a160>
│ │ └ 1
│ └ <function init_processes at 0x7f5bc98263a0>
└ <module 'utils.utils' from '/home/whc/LION/utils/utils.py'>

File "/home/whc/LION/utils/utils.py", line 1158, in init_processes
fn(args, config)
│ │ └ CfgNode({'dpm_ckpt': '', 'clipforge': CfgNode({'clip_model': 'ViT-B/32', 'enable': 0, 'feat_dim': 512}), 'eval_trainnll': 0, ...
│ └ Namespace(autocast_eval=True, autocast_train=True, config='none', data='/tmp/nvae-diff/data', dataset='cifar10', distributed=...
└ <function main at 0x7f5bc982a160>

File "train_dist.py", line 31, in main
trainer_lib = importlib.import_module(config.trainer.type)
│ │ └ CfgNode({'dpm_ckpt': '', 'clipforge': CfgNode({'clip_model': 'ViT-B/32', 'enable': 0, 'feat_dim': 512}), 'eval_trainnll': 0, ...
│ └ <function import_module at 0x7f5c7481ad30>
└ <module 'importlib' from '/home/whc/miniconda3/envs/lion_env/lib/python3.8/importlib/__init__.py'>

File "/home/whc/miniconda3/envs/lion_env/lib/python3.8/importlib/init.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
│ │ │ │ │ └ 0
│ │ │ │ └ None
│ │ │ └ 0
│ │ └ 'trainers.hvae_trainer'
│ └ <function _gcd_import at 0x7f5c74943430>
└ <module 'importlib._bootstrap' (frozen)>
File "", line 1014, in _gcd_import
File "", line 991, in _find_and_load
File "", line 975, in _find_and_load_unlocked
File "", line 671, in _load_unlocked
File "", line 843, in exec_module
File "", line 219, in _call_with_frames_removed

File "/home/whc/LION/trainers/hvae_trainer.py", line 18, in
from trainers.base_trainer import BaseTrainer

File "/home/whc/LION/trainers/base_trainer.py", line 19, in
from utils.evaluation_metrics_fast import print_results

File "/home/whc/LION/utils/evaluation_metrics_fast.py", line 24, in
from third_party.PyTorchEMD.emd_nograd import earth_mover_distance_nograd

File "/home/whc/LION/third_party/PyTorchEMD/emd_nograd.py", line 4, in
from third_party.PyTorchEMD.backend import emd_cuda_dynamic as emd_cuda

File "/home/whc/LION/third_party/PyTorchEMD/backend.py", line 10, in
emd_cuda_dynamic = load(name='emd_ext',
└ <function load at 0x7f5aafd78280>

File "/home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 969, in load
return _jit_compile(
└ <function _jit_compile at 0x7f5aafd783a0>
File "/home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1176, in _jit_compile
_write_ninja_file_and_build_library(
└ <function _write_ninja_file_and_build_library at 0x7f5aafd784c0>
File "/home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1280, in _write_ninja_file_and_build_library
_run_ninja_build(
└ <function _run_ninja_build at 0x7f5aafd78940>
File "/home/whc/miniconda3/envs/lion_env/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1538, in _run_ninja_build
raise RuntimeError(message) from e
└ "Error building extension 'emd_ext'"

RuntimeError: Error building extension 'emd_ext'
If I set:
export TORCH_CUDA_ARCH_LIST="7.5"
another problem occurs:
CUDA kernel failed : no kernel image is available for execution on the device
void avg_voxelize(int, int, int, int, int, int, const int*, const float*, int*, int*, float*) at L:118 in /home/whc/LION/third_party/pvcnn/functional/src/voxelization/vox.cu
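
For reference, here is a minimal sketch (not part of the repo) that checks which compute capability the GPU actually reports, using the PyTorch install from the traceback; the value passed to TORCH_CUDA_ARCH_LIST has to be an architecture that the installed CUDA toolkit understands:

import torch

# Prints the compute capability of GPU 0, e.g. (8, 6) for an RTX 3090.
# The CUDA 11.0 toolkit invoked above does not know compute_86, which is why
# nvcc rejects the -gencode=arch=compute_86 flag.
print(torch.cuda.get_device_capability(0))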

@ZENGXH
Collaborator

ZENGXH commented Dec 4, 2023

It seems the EMD build failed: nvcc fatal : Unsupported gpu architecture 'compute_86'

  • What GPU are you using?
  • Could you try installing this repo: https://github.com/daerduoCarey/PyTorchEMD ?
  • It is possible to remove the EMD requirement; it is only used during evaluation, not in training. This requires commenting out the related code that imports and calls EMD (see the sketch below).
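
A minimal sketch of what that could look like, assuming the import in utils/evaluation_metrics_fast.py is guarded rather than deleted (the try/except wrapper and the helper below are suggestions, not code from the repo):

# utils/evaluation_metrics_fast.py (sketch): make the EMD extension optional so
# training can proceed even when the emd_ext CUDA extension fails to build.
try:
    from third_party.PyTorchEMD.emd_nograd import earth_mover_distance_nograd
    emd_available = True
except (ImportError, RuntimeError):  # JIT build failures surface as RuntimeError
    earth_mover_distance_nograd = None
    emd_available = False

def compute_emd_if_available(sample_pcs, ref_pcs):
    # Hypothetical helper: skip the EMD metric when the extension is not built.
    if not emd_available:
        return None
    return earth_mover_distance_nograd(sample_pcs, ref_pcs)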

@Wanghc233
Author

I use an RTX 3090. Entering export TORCH_CUDA_ARCH_LIST="8.0" solves the "nvcc fatal : Unsupported gpu architecture 'compute_86'" problem.

@ZENGXH
Collaborator

ZENGXH commented Jan 11, 2024

Thanks for the update!

@Wanghc233
Author

Haha, wishing you an early graduation!

@Philcalab

I use an RTX 3090. Entering export TORCH_CUDA_ARCH_LIST="8.0" solves the "nvcc fatal : Unsupported gpu architecture 'compute_86'" problem.

Should export TORCH_CUDA_ARCH_LIST="8.0" go at the end of the .bashrc file?

@Wanghc233
Author

You only need to enter it once in the terminal; there is no need to change the global environment, otherwise your other code might stop running. First type export TORCH_CUDA_ARCH_LIST="8.0" in the terminal, press Enter, and then run python train.py.
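
In other words, something like this in a single terminal session (a sketch of the steps above; clearing the cached build directory is an extra suggestion to force a clean rebuild of the extension, not something required by the repo):

# optional: remove the previously failed build of the extension
rm -rf ~/.cache/torch_extensions/emd_ext
# set the architecture for this shell session only
export TORCH_CUDA_ARCH_LIST="8.0"
# then launch training in the same session
python train.py   # or train_dist.py, as in the traceback above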
