
Training with one GPU #43

Open
enrico310786 opened this issue Mar 31, 2022 · 12 comments

@enrico310786 commented Mar 31, 2022

Hi,
As you explain in the README, you fine-tuned the pre-trained checkpoint on 8 A100 GPUs using the distributed package. Unfortunately, I have just one GPU, so I cannot use the distributed package. I therefore changed the command from

python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco

to

python train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco

Moreover, I set the number of workers to 0 in the create_loader function and set the --distributed argument in the main function to False, but I received the error "RuntimeError: Default process group has not been initialized, please make sure to call init_process_group".

Could you give me some hints?
Thanks

@woctezuma commented Mar 31, 2022

Have you tried setting nproc_per_node to 1?

python -m torch.distributed.run --nproc_per_node=1 train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco 

@enrico310786 (Author)

Hi, no. I will try this option! If I use it, should I just leave the --distributed flag set to True and let the configuration handle the rest on its own? I will let you know. Thanks

@enrico310786 (Author) commented Mar 31, 2022

Hi, using your command and leaving --distributed=True, training starts. Thanks. What are the training times for the train_retrieval fine-tuning on the COCO dataset? My setup uses a Tesla T4 GPU, and the COCO training set is on the order of 500k images. How long did one epoch take with 8 A100s on 500k images? If I am able to use 4 GPUs, is it enough to set --nproc_per_node=4?

@woctezuma

Sorry, I don't know the answers to these questions.

@LiJunnan1992 (Contributor)

> How long did one epoch take with 8 A100s on 500k COCO images?

Finetuning on COCO with 8 A100s takes a few hours.

@enrico310786 (Author) commented Apr 1, 2022

Hi,

I tried training with 4 Tesla T4 GPUs, but I am getting these warnings and errors:

  1. Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device. (The same warning appears for the other ranks.)

  2. RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8

I am using torch==1.9.1+cu111 and torchvision==0.10.1+cu111. What version of PyTorch did you use? Did you also set other options on your machine? For example, Lightning-AI/pytorch-lightning#4420 (comment) suggests setting export NCCL_IB_DISABLE=1 and export NCCL_P2P_DISABLE=1 on A100s.
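
In shell terms, I guess that workaround would look roughly like this before launching (just a sketch based on that Lightning thread, not something I have verified on my setup):

export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
python -m torch.distributed.run --nproc_per_node=4 train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco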

@LiJunnan1992 (Contributor)

I used PyTorch 1.10.

@helleuch

Hello, just use torchrun instead, e.g. torchrun train_caption.py.
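
For a single GPU that would presumably be something like the following (the caption config and output paths are only my guess, mirroring the retrieval commands above; adjust them to your setup):

torchrun --nproc_per_node=1 train_caption.py \
--config ./configs/caption_coco.yaml \
--output_dir output/caption_coco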

@rocklee2022

> I set the number of workers to 0 in the create_loader function and set the --distributed argument in the main function to False, but I received the error "RuntimeError: Default process group has not been initialized, please make sure to call init_process_group". Could you give me some hints?

For debugging, you need to change the 'all_gather' function to just return the input tensor.

@shams2023

> "RuntimeError: Default process group has not been initialized, please make sure to call init_process_group"

[screenshots of the two modified functions]

If you are not using distributed training, changing these two functions to the format shown below can solve the problem. The specific functions are located in blip_retrieval.py.
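
Roughly, the change is a sketch like this (I am assuming the two helpers are named concat_all_gather and all_gather_with_grad; check the exact names in your copy of blip_retrieval.py):

import torch

@torch.no_grad()
def concat_all_gather(tensor):
    # With a single process there is nothing to gather from other ranks,
    # so the local tensor is already the full batch.
    return tensor

def all_gather_with_grad(tensors):
    # Same idea for the differentiable variant used in the contrastive loss.
    return tensors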

@Y-HuiMing-Y

> Have you tried setting nproc_per_node to 1?

Thank you for taking the time to look at my problem!
When I was training train_caption.py on Windows, I set nproc_per_node=1 and default=False in the main function, but I still got

"RuntimeError: Default process group has not been initialized, please make sure to call init_process_group "

Is there anything else I need to set?
Thank you very much for your help!
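
One thing I am considering trying (only my assumption, not confirmed in this thread) is initializing a one-process group manually at the start of main(), using the gloo backend since NCCL is not available on Windows:

import os
import torch.distributed as dist

# Address/port for the single-process rendezvous; any free local port works.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)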

@Fir1314 commented Mar 31, 2024

> When I was training train_caption.py on Windows, I set nproc_per_node=1 and default=False in the main function, but I still got "RuntimeError: Default process group has not been initialized, please make sure to call init_process_group". Is there anything else I need to set?

I have the same question. May I ask if you have resolved it?
