run_funsd.py fails with NCCL error in: /opt/conda/conda-bld/pytorch_1607370156314/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8 #18

Closed
unreal91 opened this issue Sep 13, 2022 · 2 comments

Comments

@unreal91

CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=4 run_funsd.py --model_name_or_path lilt-roberta-en-base --tokenizer_name roberta-base --output_dir ser_funsd_lilt-roberta-en-base --do_train --do_predict --max_steps 2000 --per_device_train_batch_size 8 --warmup_ratio 0.1 --fp16

The command above fails with the error below on PyTorch 1.7.1 with CUDA 11.0:

Each of the four worker processes prints the same traceback:

Traceback (most recent call last):
  File "run_funsd.py", line 369, in <module>
    main()
  File "run_funsd.py", line 50, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/transformers/hf_argparser.py", line 187, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 67, in __init__
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/transformers/training_args.py", line 570, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/transformers/file_utils.py", line 1470, in wrapper
    return func(*args, **kwargs)
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/transformers/training_args.py", line 717, in device
    return self._setup_devices
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/transformers/file_utils.py", line 1460, in __get__
    cached = self.fget(obj)
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/transformers/file_utils.py", line 1470, in wrapper
    return func(*args, **kwargs)
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/transformers/training_args.py", line 702, in _setup_devices
    torch.distributed.init_process_group(backend="nccl")
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370156314/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
Traceback (most recent call last):
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/cydal/anaconda3/envs/liltfinetune/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/cydal/anaconda3/envs/liltfinetune/bin/python', '-u', 'run_funsd.py', '--local_rank=3', '--model_name_or_path', 'lilt-roberta-en-base', '--tokenizer_name', 'roberta-base', '--output_dir', 'ser_funsd_lilt-roberta-en-base', '--do_train', '--do_predict', '--max_steps', '2000', '--per_device_train_batch_size', '8', '--warmup_ratio', '0.1', '--fp16']' returned non-zero exit status 1.

Below is conda list:

# packages in environment at /home/cydal/anaconda3/envs/liltfinetune:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                       1_gnu  
absl-py                   1.2.0                    pypi_0    pypi
antlr4-python3-runtime    4.9.3                    pypi_0    pypi
appdirs                   1.4.4                    pypi_0    pypi
astunparse                1.6.3                      py_0  
black                     21.4b2                   pypi_0    pypi
blas                      1.0                         mkl  
brotlipy                  0.7.0           py37h27cfd23_1003  
bzip2                     1.0.8                h7b6447c_0  
c-ares                    1.18.1               h7f8727e_0  
ca-certificates           2022.07.19           h06a4308_0  
cachetools                5.2.0                    pypi_0    pypi
certifi                   2022.6.15        py37h06a4308_0  
cffi                      1.15.1           py37h74dc2b5_0  
charset-normalizer        2.1.1                    pypi_0    pypi
click                     8.1.3                    pypi_0    pypi
cloudpickle               2.2.0                    pypi_0    pypi
cmake                     3.19.6               h973ab73_0  
cryptography              37.0.1           py37h9ce1e76_0  
cudatoolkit               11.0.221             h6bb024c_0  
cycler                    0.11.0                   pypi_0    pypi
dataclasses               0.8                pyh6d0b6a4_7  
datasets                  1.6.2                    pypi_0    pypi
detectron2                0.5+cu110                pypi_0    pypi
dill                      0.3.5.1                  pypi_0    pypi
expat                     2.4.4                h295c915_0  
filelock                  3.8.0                    pypi_0    pypi
fonttools                 4.37.1                   pypi_0    pypi
freetype                  2.11.0               h70c0345_0  
fsspec                    2022.8.2                 pypi_0    pypi
future                    0.18.2                   py37_1  
fvcore                    0.1.5.post20220512          pypi_0    pypi
giflib                    5.2.1                h7b6447c_0  
google-auth               2.11.0                   pypi_0    pypi
google-auth-oauthlib      0.4.6                    pypi_0    pypi
grpcio                    1.48.1                   pypi_0    pypi
huggingface-hub           0.0.19                   pypi_0    pypi
hydra-core                1.2.0                    pypi_0    pypi
idna                      3.3                pyhd3eb1b0_0  
importlib-metadata        4.12.0                   pypi_0    pypi
importlib-resources       5.9.0                    pypi_0    pypi
intel-openmp              2021.4.0          h06a4308_3561  
iopath                    0.1.8                    pypi_0    pypi
joblib                    1.1.0                    pypi_0    pypi
jpeg                      9b                   h024ee3a_2  
kiwisolver                1.4.4                    pypi_0    pypi
krb5                      1.19.2               hac12032_0  
lcms2                     2.12                 h3be6417_0  
ld_impl_linux-64          2.38                 h1181459_1  
libcurl                   7.84.0               h91b91d3_0  
libedit                   3.1.20210910         h7f8727e_0  
libev                     4.33                 h7f8727e_1  
libffi                    3.3                  he6710b0_2  
libgcc-ng                 11.2.0               h1234567_1  
libgomp                   11.2.0               h1234567_1  
libnghttp2                1.46.0               hce63b2e_0  
libpng                    1.6.37               hbc83047_0  
libssh2                   1.10.0               h8f2d780_0  
libstdcxx-ng              11.2.0               h1234567_1  
libtiff                   4.1.0                h2733197_1  
libuv                     1.40.0               h7b6447c_0  
libwebp                   1.2.0                h89dd481_0  
liltfinetune              1.0                      pypi_0    pypi
lz4-c                     1.9.3                h295c915_1  
magma-cuda110             2.5.2                         1    pytorch
markdown                  3.4.1                    pypi_0    pypi
markupsafe                2.1.1                    pypi_0    pypi
matplotlib                3.5.3                    pypi_0    pypi
mkl                       2021.4.0           h06a4308_640  
mkl-include               2022.1.0           h06a4308_224  
mkl-service               2.4.0            py37h7f8727e_0  
mkl_fft                   1.3.1            py37hd3c417c_0  
mkl_random                1.2.2            py37h51133e4_0  
multiprocess              0.70.13                  pypi_0    pypi
mypy-extensions           0.4.3                    pypi_0    pypi
ncurses                   6.3                  h5eee18b_3  
ninja                     1.10.2               h06a4308_5  
ninja-base                1.10.2               hd09550d_5  
numpy                     1.21.6                   pypi_0    pypi
numpy-base                1.21.5           py37ha15fc14_3  
oauthlib                  3.2.1                    pypi_0    pypi
omegaconf                 2.2.3                    pypi_0    pypi
openssl                   1.1.1q               h7f8727e_0  
packaging                 21.3                     pypi_0    pypi
pandas                    1.3.5                    pypi_0    pypi
pathspec                  0.10.1                   pypi_0    pypi
pillow                    9.2.0                    pypi_0    pypi
pip                       22.1.2           py37h06a4308_0  
portalocker               2.5.1                    pypi_0    pypi
protobuf                  3.19.4                   pypi_0    pypi
pyarrow                   9.0.0                    pypi_0    pypi
pyasn1                    0.4.8                    pypi_0    pypi
pyasn1-modules            0.2.8                    pypi_0    pypi
pycocotools               2.0.4                    pypi_0    pypi
pycparser                 2.21               pyhd3eb1b0_0  
pydot                     1.4.2                    pypi_0    pypi
pyopenssl                 22.0.0             pyhd3eb1b0_0  
pyparsing                 3.0.9                    pypi_0    pypi
pysocks                   1.7.1                    py37_1  
python                    3.7.13               h12debd9_0  
python-dateutil           2.8.2                    pypi_0    pypi
pytorch                   1.7.1           py3.7_cuda11.0.221_cudnn8.0.5_0    pytorch
pytz                      2022.2.1                 pypi_0    pypi
pyyaml                    6.0                      pypi_0    pypi
readline                  8.1.2                h7f8727e_1  
regex                     2022.9.13                pypi_0    pypi
requests                  2.28.1           py37h06a4308_0  
requests-oauthlib         1.3.1                    pypi_0    pypi
rhash                     1.4.1                h3c74f83_1  
rsa                       4.9                      pypi_0    pypi
sacremoses                0.0.53                   pypi_0    pypi
scikit-learn              1.0.2                    pypi_0    pypi
scipy                     1.7.3                    pypi_0    pypi
seqeval                   1.2.2                    pypi_0    pypi
setuptools                63.4.1           py37h06a4308_0  
six                       1.16.0             pyhd3eb1b0_1  
sqlite                    3.39.2               h5082296_0  
tabulate                  0.8.10                   pypi_0    pypi
tensorboard               2.10.0                   pypi_0    pypi
tensorboard-data-server   0.6.1                    pypi_0    pypi
tensorboard-plugin-wit    1.8.1                    pypi_0    pypi
termcolor                 2.0.1                    pypi_0    pypi
threadpoolctl             3.1.0                    pypi_0    pypi
tk                        8.6.12               h1ccaba5_0  
tokenizers                0.10.3                   pypi_0    pypi
toml                      0.10.2                   pypi_0    pypi
torch                     1.7.1+cu110              pypi_0    pypi
torchaudio                0.7.2                    pypi_0    pypi
torchvision               0.8.2+cu110              pypi_0    pypi
tqdm                      4.49.0                   pypi_0    pypi
transformers              4.5.1                    pypi_0    pypi
typed-ast                 1.5.4                    pypi_0    pypi
typing_extensions         4.3.0            py37h06a4308_0  
urllib3                   1.26.12                  pypi_0    pypi
werkzeug                  2.2.2                    pypi_0    pypi
wheel                     0.37.1             pyhd3eb1b0_0  
xxhash                    3.0.0                    pypi_0    pypi
xz                        5.2.5                h7f8727e_1  
yacs                      0.1.8                    pypi_0    pypi
yaml                      0.2.5                h7b6447c_0  
zipp                      3.8.1                    pypi_0    pypi
zlib                      1.2.12               h5eee18b_3  
zstd                      1.4.9                haebb681_0 

nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   76C    P0    33W /  70W |   5874MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

If I upgrade to PyTorch 1.8 with CUDA 11.1, the error changes to a CUDA "invalid device ordinal" error. I have been trying to set up this environment for the last 3 days and have tried various combinations of versions; none of them worked. Can you provide a list of dependencies with the exact versions that work on a fresh Ubuntu 18.04 instance?
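A likely reading of both errors, given that CUDA_VISIBLE_DEVICES=0 exposes a single GPU and nvidia-smi lists one Tesla T4: the launcher starts four processes with local ranks 0 to 3, and ranks 1 to 3 have no GPU of their own. The sketch below is only an illustration under that assumption. The helper name is hypothetical, and the visible GPU count would normally come from torch.cuda.device_count().

```python
def ranks_without_gpu(nproc_per_node, visible_gpus):
    """Return the local ranks that cannot be given a dedicated GPU.

    Assumes the common convention that local rank r uses device cuda:r.
    """
    return [rank for rank in range(nproc_per_node) if rank >= visible_gpus]


# The setup from this issue: 4 launched processes, 1 visible GPU.
stranded = ranks_without_gpu(nproc_per_node=4, visible_gpus=1)
print(stranded)  # [1, 2, 3]
```

Any rank in that list either asks for a device index that does not exist or ends up sharing GPU 0 with another rank, which fits the "invalid device ordinal" and NCCL "invalid usage" messages respectively.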

@jpWang
Owner

jpWang commented Sep 13, 2022

Hi,
maybe you should try to change --nproc_per_node=4 to --nproc_per_node=1.
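The suggestion amounts to keeping --nproc_per_node no larger than the number of GPUs the processes can see. A minimal sketch of that check, assuming CUDA_VISIBLE_DEVICES holds only comma-separated integer indices (it can also hold GPU UUIDs, which this simplified version does not handle); both function names are illustrative and not part of run_funsd.py or PyTorch:

```python
def count_visible_gpus(physical_gpus, cuda_visible_devices=None):
    """Estimate how many GPUs are visible for a given CUDA_VISIBLE_DEVICES value.

    None means the variable is unset, so every physical GPU is visible.
    Entries after the first invalid index are ignored, as CUDA does.
    """
    if cuda_visible_devices is None:
        return physical_gpus
    count = 0
    for entry in cuda_visible_devices.split(","):
        entry = entry.strip()
        if not entry.isdigit() or int(entry) >= physical_gpus:
            break
        count += 1
    return count


def safe_nproc_per_node(requested, physical_gpus, cuda_visible_devices=None):
    """Cap the requested process count at the number of visible GPUs."""
    return min(requested, count_visible_gpus(physical_gpus, cuda_visible_devices))


# The setup from this issue: 4 requested, 1 physical GPU, CUDA_VISIBLE_DEVICES=0.
print(safe_nproc_per_node(4, 1, "0"))  # 1
```

With one Tesla T4 and CUDA_VISIBLE_DEVICES=0 the cap is 1, which matches the value suggested above.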

@unreal91
Author

@jpWang thanks a lot for the suggestion, it worked.
