Training with one GPU #43
Have you tried running the following?
python -m torch.distributed.run --nproc_per_node=1 train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco
Hi, no, I haven't. I will try this option! If I use it, should I leave the --distributed flag set to True and let the configuration figure out what to do? I will let you know. Thanks.
Hi, using your command and leaving --distributed=True, training starts. Thanks. What are the training times for the train_retrieval fine-tuning on the COCO dataset? My setup uses a Tesla T4 GPU, and the COCO training set is on the order of 500k images. How long did one epoch take you with 8 A100s on 500k images? If I am able to use 4 GPUs, is it enough to set --nproc_per_node=4?
Sorry, I don't know the answers to these questions.
Finetuning on COCO with 8 A100s takes a few hours. |
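Regarding the --nproc_per_node question above: torch.distributed.run launches one process per GPU, so setting it to the number of local GPUs should be enough. A sketch, assuming the same script and config as above:
python -m torch.distributed.run --nproc_per_node=4 train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco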
Hi, I tried training with 4 Tesla T4 GPUs, but I am getting these warnings and errors:
I am using torch==1.9.1+cu111 and torchvision==0.10.1+cu111. Which version of PyTorch did you use? Did you also set any other options on your machine? For example, in Lightning-AI/pytorch-lightning#4420 (comment) they set export NCCL_IB_DISABLE=1 and export NCCL_P2P_DISABLE=1 on A100s.
I used PyTorch 1.10.
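If NCCL is what hangs or errors out, a common debugging step is to disable InfiniBand and peer-to-peer transport for a run and see whether the problem goes away. A sketch using the launch command from above (NCCL_IB_DISABLE and NCCL_P2P_DISABLE are standard NCCL environment variables, not something specific to this repo):
NCCL_IB_DISABLE=1 NCCL_P2P_DISABLE=1 \
python -m torch.distributed.run --nproc_per_node=4 train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco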
Hello, just use torchrun instead: |
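For example, the single-GPU launch from above would presumably become:
torchrun --nproc_per_node=1 train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco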
For debugging, you need to change the 'all_gather' function to just return the input tensor.
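A minimal sketch of that workaround, assuming the repo defines an all_gather-style helper for gathering features across GPUs (the name and signature below are illustrative, not the repo's actual code):

```python
import torch

def all_gather_debug(tensor: torch.Tensor) -> torch.Tensor:
    # Single-process stand-in: the real helper would call
    # torch.distributed.all_gather and concatenate results from all ranks;
    # for one-GPU debugging we simply return the input unchanged.
    return tensor
```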
Thank you for taking the time to look at my problem!
Is there anything else I need to set?
I have the same question. May I ask if you have resolved it?
Hi,
as you explain in the README, you finetuned the pre-trained checkpoint on 8 A100 GPUs using the distributed package. Unfortunately, I have just one GPU, so I cannot use the distributed package. I therefore changed the command from
python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco
to
python train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco
Moreover, I set the number of workers to 0 in the create_loader function and set the --distributed arg in the main function to False, but I received the error "RuntimeError: Default process group has not been initialized, please make sure to call init_process_group".
Could you give me some hints?
Thanks
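One way to run the script without the launcher is to initialize a one-process group manually before the training code touches torch.distributed. A minimal sketch, assuming a CUDA build of PyTorch with the NCCL backend and that no launcher has already set the rendezvous environment variables (this is a generic workaround, not the repo's own code path):

```python
import os
import torch
import torch.distributed as dist

# Single-GPU fallback: create a world of size 1 so that code paths which
# assume an initialized process group (e.g. all_gather, barriers) still work.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="nccl", rank=0, world_size=1)
torch.cuda.set_device(0)
```

With something like this at the top of main(), the plain python train_retrieval.py invocation should get past the "Default process group has not been initialized" error, although launching through torch.distributed.run with --nproc_per_node=1, as suggested above, achieves the same result with no code changes.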