Distributed training raise an error #230

Open
mkhoshle opened this issue May 5, 2022 · 8 comments

Comments

@mkhoshle

mkhoshle commented May 5, 2022

Hi,

I am trying to run ROMP in distributed mode, following this script. Since the repository has no folder called core, I replaced it with romp. However, when I run the code, it raises an error saying there is no file called train.py. How can I avoid this error?

Thanks

@Arthur151
Owner

Arthur151 commented May 6, 2022

Thanks for the bug report.
Please replace it with romp.train, as in this script:
https://github.com/Arthur151/ROMP/blob/master/scripts/train_distributed.sh

@mkhoshle
Author

mkhoshle commented May 6, 2022

@Arthur151 I did try romp.train, but I still get the error. Here is the output:

*****************************************
/z/home/mahzad-khosh/env/romp/bin/python: can't open file 'romp.train': [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site-packages/torch/distributed/launch.py", line 256, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/z/home/mahzad-khosh/env/romp/bin/python', '-u', 'romp.train', '--local_rank=3', '--GPUS=0,1,2,3', '--configs_yml=configs/v1_hrnet_3dpw_ft.yml', '--distributed_training=1']' returned non-zero exit status 2.
/z/home/mahzad-khosh/env/romp/bin/python: can't open file 'romp.train': [Errno 2] No such file or directory
/z/home/mahzad-khosh/env/romp/bin/python: can't open file 'romp.train': [Errno 2] No such file or directory
/z/home/mahzad-khosh/env/romp/bin/python: can't open file 'romp.train': [Errno 2] No such file or directory

What is the reason?

@Arthur151
Owner

Arthur151 commented May 6, 2022

Oh, the command you are using differs from the one in my repo.
Also, please make sure that you run it from the ROMP folder.

CUDA_VISIBLE_DEVICES=${GPUS} nohup python -u -m torch.distributed.launch --nproc_per_node=4 romp.train --GPUS=${GPUS} --configs_yml=${TRAIN_CONFIGS} --distributed_training=1 > '../log/'${TAB}'_'${DATASET}'_g'${GPUS}.log 2>&1 &

Your command drops the -m option, which tells Python to look up the target as a module rather than as a file.

Here is another way to do this, if you don't want to use nohup:

CUDA_VISIBLE_DEVICES=${GPUS} python -u torch.distributed.launch --nproc_per_node=4 /path/to/romp/train.py --GPUS=${GPUS} --configs_yml=${TRAIN_CONFIGS} --distributed_training=1

The key is to use the absolute path to the train.py file.
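
The worker command shown in the CalledProcessError (`python -u romp.train ...`) passes `romp.train` to Python as a file path, which matches the "can't open file" message. The following is a minimal sketch of the two lookups involved, using the standard-library module `json.tool` as a stand-in for `romp.train`. The helper name `describe_target` is hypothetical and is not part of ROMP or PyTorch.

```python
import importlib.util
import os


def describe_target(target):
    """Report how Python could resolve `target`.

    Returns a (is_file, is_module) pair:
    - is_file: `python target` would find a file at that path
    - is_module: `python -m target` would find an importable module
    """
    is_file = os.path.isfile(target)
    try:
        # find_spec imports parent packages, so a missing parent raises.
        is_module = importlib.util.find_spec(target) is not None
    except (ImportError, ValueError):
        is_module = False
    return is_file, is_module


if __name__ == "__main__":
    # "json.tool" is a dotted module name, not a file on disk, so only
    # the module lookup is expected to succeed.
    print(describe_target("json.tool"))
    # A real path to a script is the file-style alternative.
    print(describe_target(os.path.join("romp", "train.py")))
```

Run from the ROMP folder, `describe_target("romp.train")` would be expected to report a module but no file, assuming the `romp` package directory is importable from there.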

@mkhoshle
Author

mkhoshle commented May 6, 2022

When I use

CUDA_VISIBLE_DEVICES=${GPUS} python -u torch.distributed.launch --nproc_per_node=4 /path/to/romp/train.py --GPUS=${GPUS} --configs_yml=${TRAIN_CONFIGS} --distributed_training=1

I get the following error: python: can't open file 'torch.distributed.launch': [Errno 2] No such file or directory

When I run with this command:

CUDA_VISIBLE_DEVICES=${GPUS} nohup python -u -m torch.distributed.launch --nproc_per_node=4 romp.train --GPUS=${GPUS} --configs_yml=${TRAIN_CONFIGS} --distributed_training=1

I get this error:

/z/home/mahzad-khosh/env/romp/bin/python: can't open file 'romp.train': [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site-packages/torch/distributed/launch.py", line 256, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/z/home/mahzad-khosh/env/romp/bin/python', '-u', 'romp.train', '--local_rank=3', '--GPUS=0,1,2,3', '--configs_yml=configs/v1_hrnet_3dpw_ft.yml', '--distributed_training=1']' returned non-zero exit status 2.
/z/home/mahzad-khosh/env/romp/bin/python: can't open file 'romp.train': [Errno 2] No such file or directory
/z/home/mahzad-khosh/env/romp/bin/python: can't open file 'romp.train': [Errno 2] No such file or directory
/z/home/mahzad-khosh/env/romp/bin/python: can't open file 'romp.train': [Errno 2] No such file or directory

@Arthur151
Owner

It seems that torch.distributed.launch has been deprecated in newer versions of PyTorch, which use torchrun instead.
I have tested that this works:

CUDA_VISIBLE_DEVICES=${GPUS} nohup torchrun --nproc_per_node=4 -m romp.train --GPUS=${GPUS} --configs_yml=${TRAIN_CONFIGS} --distributed_training=1 > '../log/'${TAB}'_'${DATASET}'_g'${GPUS}.log 2>&1 &

@mkhoshle
Author

mkhoshle commented May 9, 2022

@Arthur151 OK, I replaced torch.distributed.launch with torchrun. When I run my code, I get the following error:

Fatal Python error: init_import_size: Failed to import the site module
Python runtime state: initialized
Error processing line 1 of /z/home/mahzad-khosh/env/romp/lib/python3.8/site-packages/google_auth-2.6.2-py3.10-nspkg.pth:

Traceback (most recent call last):
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 169, in addpackage
    exec(line)
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'types'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 580, in <module>
    main()
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 567, in main
    known_paths = addsitepackages(known_paths)
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 350, in addsitepackages
    addsitedir(sitedir, known_paths)
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 208, in addsitedir
    addpackage(sitedir, name, known_paths)
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 179, in addpackage
    import traceback
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/traceback.py", line 5, in <module>
    import linecache
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/linecache.py", line 11, in <module>
    import tokenize
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/tokenize.py", line 32, in <module>
    import re
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/re.py", line 124, in <module>
    import enum
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/enum.py", line 2, in <module>
    from types import MappingProxyType, DynamicClassAttribute
ModuleNotFoundError: No module named 'types'
Error processing line 1 of /z/home/mahzad-khosh/env/romp/lib/python3.8/site-packages/google_auth-2.6.2-py3.10-nspkg.pth:

Fatal Python error: init_import_size: Failed to import the site module
Python runtime state: initialized
Traceback (most recent call last):
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 169, in addpackage
    exec(line)
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'types'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 580, in <module>
    main()
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 567, in main
    known_paths = addsitepackages(known_paths)
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 350, in addsitepackages
    addsitedir(sitedir, known_paths)
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 208, in addsitedir
    addpackage(sitedir, name, known_paths)
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 179, in addpackage
    import traceback
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/traceback.py", line 5, in <module>
    import linecache
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/linecache.py", line 11, in <module>
    import tokenize
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/tokenize.py", line 32, in <module>
    import re
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/re.py", line 124, in <module>
    import enum
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/enum.py", line 2, in <module>
    from types import MappingProxyType, DynamicClassAttribute
ModuleNotFoundError: No module named 'types'

@Arthur151
Owner

This is pretty weird. Don't you have this basic Python module?
Please try running import types.
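
One detail in the traceback is that the failing file, google_auth-2.6.2-py3.10-nspkg.pth, carries a py3.10 tag although it sits under a python3.8 site-packages directory. Whether that is the actual cause is not established in this thread. As a rough diagnostic sketch, the helper below (hypothetical name `mismatched_nspkg_files`) lists namespace-package .pth files whose version tag differs from the running interpreter. It only helps if the interpreter can start at all.

```python
import os
import re
import sys
import sysconfig

# Matches names like "google_auth-2.6.2-py3.10-nspkg.pth" and captures "3.10".
_NSPKG_RE = re.compile(r"-py(\d+\.\d+)-nspkg\.pth$")


def mismatched_nspkg_files(filenames, running_version):
    """Return the nspkg .pth files tagged for a different Python version."""
    mismatched = []
    for name in filenames:
        match = _NSPKG_RE.search(name)
        if match and match.group(1) != running_version:
            mismatched.append(name)
    return mismatched


if __name__ == "__main__":
    # Inspect the site-packages directory of the running interpreter.
    site_dir = sysconfig.get_paths()["purelib"]
    version = "{}.{}".format(*sys.version_info[:2])
    names = sorted(os.listdir(site_dir)) if os.path.isdir(site_dir) else []
    for name in mismatched_nspkg_files(names, version):
        print("version mismatch:", os.path.join(site_dir, name))
```

Any file it reports was probably installed by a different Python version, so reinstalling that package inside the environment is one thing to try.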

@Arthur151 Arthur151 reopened this May 9, 2022
@mkhoshle
Author

mkhoshle commented May 9, 2022

@Arthur151 OK, I have CUDA 10.2, pytorch==1.10.0, and torchvision==0.11.1, and I am now getting this error:
/z/home/mahzad-khosh/env/romp/bin/python: No module named torchrun.
My Python version is 3.8.13.
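
The "No module named torchrun" message has the form Python prints when asked to run `-m torchrun`. torchrun is installed as a console script, not as an importable module; its module form is `python -m torch.distributed.run`. The sketch below picks a launcher prefix accordingly. The function names are hypothetical, and the fallback branch assumes an older PyTorch that only provides torch.distributed.launch.

```python
import importlib.util
import shutil
import sys


def has_module(name):
    """True if `name` can be located as an importable module."""
    try:
        return importlib.util.find_spec(name) is not None
    except Exception:
        # A missing or broken parent package counts as "not available".
        return False


def launcher_prefix(torchrun_on_path, run_module_available):
    """Choose the command prefix for starting distributed workers."""
    if torchrun_on_path:
        return ["torchrun"]
    if run_module_available:
        # Module form of the same entry point as the torchrun script.
        return [sys.executable, "-m", "torch.distributed.run"]
    # Assumed fallback for older PyTorch releases.
    return [sys.executable, "-m", "torch.distributed.launch"]


if __name__ == "__main__":
    prefix = launcher_prefix(
        torchrun_on_path=shutil.which("torchrun") is not None,
        run_module_available=has_module("torch.distributed.run"),
    )
    # A subset of the arguments from the torchrun command in this thread.
    args = ["--nproc_per_node=4", "-m", "romp.train", "--distributed_training=1"]
    print(" ".join(prefix + args))
```

Printing the command before running it makes it easy to confirm which launcher was picked and that `-m` precedes `romp.train`.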
