
Training with one GPU #43

Open
enrico310786 opened this issue Mar 31, 2022 · 12 comments

@enrico310786 commented Mar 31, 2022

Hi,
As you explain in the README, you fine-tuned the pre-trained checkpoint on 8 A100 GPUs using the distributed package. Unfortunately, I have just one GPU, so I cannot use the distributed package. I therefore changed the command from

python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco

to

python train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco

Moreover, I set the number of workers to 0 in the create_loader function and set the --distributed argument in the main function to False, but I received the error "RuntimeError: Default process group has not been initialized, please make sure to call init_process_group".

Could you give me some hints?
Thanks

@woctezuma commented Mar 31, 2022

Have you tried setting nproc_per_node to 1?

python -m torch.distributed.run --nproc_per_node=1 train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco 

@enrico310786 (Author)

Hi, no. I will try this option! If I use it, should I just leave the --distributed flag set to True and let the configuration handle the rest on its own? I will let you know. Thanks

@enrico310786 (Author) commented Mar 31, 2022

Hi, using your command and leaving --distributed=True, training starts. Thanks. What are the training times for the train_retrieval fine-tuning on the COCO dataset? My setup uses a Tesla T4 GPU, and the COCO training set is on the order of 500k images. How long did one epoch take with 8 A100s on 500k images? If I am able to use 4 GPUs, is it enough to set --nproc_per_node=4?

@woctezuma

Sorry, I don't know the answers to these questions.

@LiJunnan1992 (Contributor)

> How long did one epoch take with 8 A100s on 500k COCO images?

Finetuning on COCO with 8 A100s takes a few hours.

@enrico310786 (Author) commented Apr 1, 2022

Hi,

I tried training with 4 Tesla T4 GPUs, but I am getting these warnings and errors:

  1. Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device. (The same warning appears for the other ranks.)

  2. RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8

I am using torch==1.9.1+cu111 and torchvision==0.10.1+cu111. What version of PyTorch did you use? Did you also set other options on your machine? For example, Lightning-AI/pytorch-lightning#4420 (comment) suggests setting export NCCL_IB_DISABLE=1 and export NCCL_P2P_DISABLE=1 on A100s.
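
In shell terms, I guess that workaround would look roughly like this before launching (just a sketch based on that Lightning thread, not something I have verified on my setup):

export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
python -m torch.distributed.run --nproc_per_node=4 train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco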

@LiJunnan1992 (Contributor)

I used PyTorch 1.10.

@helleuch

Hello, just use torchrun instead, e.g. torchrun train_caption.py.
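
For a single GPU that would presumably be something like the following (the caption config and output paths are only my guess, mirroring the retrieval commands above; adjust them to your setup):

torchrun --nproc_per_node=1 train_caption.py \
--config ./configs/caption_coco.yaml \
--output_dir output/caption_coco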

@rocklee2022

> I set the number of workers to 0 in the create_loader function and set the --distributed argument in the main function to False, but I received the error "RuntimeError: Default process group has not been initialized, please make sure to call init_process_group". Could you give me some hints?

For debugging, you need to change the 'all_gather' function to just return the input tensor.

@shams2023

> "RuntimeError: Default process group has not been initialized, please make sure to call init_process_group"

[screenshots of the two modified functions]

If you are not using distributed training, changing these two functions to the format shown below can solve the problem. The specific functions are located in blip_retrieval.py.
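
Roughly, the change is a sketch like this (I am assuming the two helpers are named concat_all_gather and all_gather_with_grad; check the exact names in your copy of blip_retrieval.py):

import torch

@torch.no_grad()
def concat_all_gather(tensor):
    # With a single process there is nothing to gather from other ranks,
    # so the local tensor is already the full batch.
    return tensor

def all_gather_with_grad(tensors):
    # Same idea for the differentiable variant used in the contrastive loss.
    return tensors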

@Y-HuiMing-Y

> Have you tried setting nproc_per_node to 1?

Thank you for taking the time to look at my problem!
When I was training train_caption.py on Windows, I set nproc_per_node=1 and default=False in the main function, but I still got

"RuntimeError: Default process group has not been initialized, please make sure to call init_process_group "

Is there anything else I need to set?
Thank you very much for your help!
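
One thing I am considering trying (only my assumption, not confirmed in this thread) is initializing a one-process group manually at the start of main(), using the gloo backend since NCCL is not available on Windows:

import os
import torch.distributed as dist

# Address/port for the single-process rendezvous; any free local port works.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)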

@Fir1314 commented Mar 31, 2024

> When I was training train_caption.py on Windows, I set nproc_per_node=1 and default=False in the main function, but I still got "RuntimeError: Default process group has not been initialized, please make sure to call init_process_group". Is there anything else I need to set?

I have the same question. May I ask if you have resolved it?
