Get this error on run_sft.py when calling "trainer.push_to_hub": [Rank 0] Watchdog caught collective operation timeout #59
Here's the call I'm using to run the script:

Here's the full trace of the error:

Any ideas on how to resolve this?

Thanks

Comments
Hi @ohmeow this looks like an issue with the model taking too long to push to the Hub before the default 30min timeout from torch.distributed. One thing you can do is tweak the timeout when the accelerator is instantiated, e.g.

```python
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Increase distributed timeout to 3h to enable push to Hub to complete
accelerator = Accelerator(kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=6 * 1800))])
```
I'll try that. What's funny is that it looks like all the files get uploaded ... it just gets stuck and eventually times out.
Same here, everything's pushed to the HuggingFace Hub after fine-tuning but then the run crashes for no reason, so removing the integrated `push_to_hub` works as a workaround.
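For anyone who wants that workaround, here's a minimal sketch, assuming the script's options map onto transformers' `TrainingArguments` (the argument values below are hypothetical, not taken from this thread):

```python
from transformers import TrainingArguments

# Disable the integrated in-run push that triggers the timeout; the saved
# checkpoint in `output_dir` can then be uploaded manually after training,
# e.g. with `huggingface-cli upload`.
training_args = TrainingArguments(
    output_dir="output",  # hypothetical value
    push_to_hub=False,
)
```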
Thanks for checking @alvarobartt - this is very strange and I can't reproduce on my setup 🤔. How many nodes / GPUs are you running on?
I think the problem is that evaluation is fairly long and goes beyond the 30 min timeout. It should then reproduce even with a low GPU count. Moreover, I wasn't able to increase the timeout by passing the parameter to Accelerate as proposed.
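One other knob worth noting: since run_sft.py drives a Trainer, recent transformers versions expose a `ddp_timeout` field on `TrainingArguments` (default 1800 seconds) that sets the same process-group timeout. A minimal sketch, not taken from this thread:

```python
from transformers import TrainingArguments

# Raise the distributed (NCCL) timeout to 3 hours via the Trainer arguments
# instead of instantiating the Accelerator manually. Requires a transformers
# version that supports `ddp_timeout`.
training_args = TrainingArguments(
    output_dir="output",      # hypothetical value
    ddp_timeout=3 * 60 * 60,  # seconds
)
```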
I tried out your suggestion to explore that further, because I was seeing the same error when running with:

```python
# Increase distributed timeout to 3h to enable push to Hub to complete
accelerator = Accelerator(kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=6 * 1800))])
```

But it kept on failing on 8 x A100, both 40GB and 80GB, and it even failed on 8 x H100 80GB. I adjusted the timeouts so that the fine-tunes could be pushed to the Hub, but had no success, even though everything was indeed pushed.
Hi folks, I was able to repro the issue and AFAICT it only happens for full training (i.e. with ZeRO-3) and not with QLoRA (DDP). The solution I've implemented in the linked PR above is to pull the ...
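A hedged sketch of that general idea (the exact change lives in the linked PR; it assumes the `accelerator` and `model` objects from the training script, and "user/repo" is a hypothetical repo id): gather the full ZeRO-3 state dict, then save and upload from the main process only, so the other ranks never sit in a collective until the NCCL watchdog fires.

```python
from huggingface_hub import HfApi

# After training: sync all ranks, gather the sharded ZeRO-3 weights into a
# full state dict, then save and upload from rank 0 only.
accelerator.wait_for_everyone()
state_dict = accelerator.get_state_dict(model)  # gathers ZeRO-3 shards
if accelerator.is_main_process:
    unwrapped = accelerator.unwrap_model(model)
    unwrapped.save_pretrained("output", state_dict=state_dict)
    # "user/repo" is a hypothetical repo id
    HfApi().upload_folder(folder_path="output", repo_id="user/repo")
```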