Get this error on run_sft.py when calling "trainer.push_to_hub": [Rank 0] Watchdog caught collective operation timeout #59

ohmeow opened this issue Dec 1, 2023 · 7 comments

ohmeow commented Dec 1, 2023

Here's the call I'm using to run the script:

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file examples/hf-alignment-handbook/configs/accelerate_configs/deepspeed_zero3.yaml --num_processes=2 examples/hf-alignment-handbook/run_sft.py examples/hf-alignment-handbook/configs/training_configs/zephyr-7b-beta/config_lora_sft.yaml --load_in_4bit=true 

Here's the full trace of the error:

2023-12-01 00:05:43 - INFO - __main__ - Pushing to hub...
[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=130722, OpType=ALLGATHER, NumelIn=65536000, NumelOut=131072000, Timeout(ms)=1800000) ran for 1800460 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=130722, OpType=ALLGATHER, NumelIn=65536000, NumelOut=131072000, Timeout(ms)=1800000) ran for 1800460 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=130722, OpType=ALLGATHER, NumelIn=65536000, NumelOut=131072000, Timeout(ms)=1800000) ran for 1800460 milliseconds before timing out.
[2023-12-01 00:35:50,246] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 1817792) of binary: /home/wgilliam/mambaforge/envs/llms/bin/python3.11
Traceback (most recent call last):
  File "/home/wgilliam/mambaforge/envs/llms/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/wgilliam/mambaforge/envs/llms/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/wgilliam/mambaforge/envs/llms/lib/python3.11/site-packages/accelerate/commands/launch.py", line 979, in launch_command
    deepspeed_launcher(args)
  File "/home/wgilliam/mambaforge/envs/llms/lib/python3.11/site-packages/accelerate/commands/launch.py", line 695, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/wgilliam/mambaforge/envs/llms/lib/python3.11/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/wgilliam/mambaforge/envs/llms/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wgilliam/mambaforge/envs/llms/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

Any ideas on how to resolve this?

Thanks

lewtun (Member) commented Dec 1, 2023

Hi @ohmeow, this looks like an issue with the model taking too long to push to the Hub before the 30-minute timeout from accelerate kicked in. Do you by any chance know if your upload speed was bottlenecked?

One thing you can do is tweak the timeout when the accelerator is instantiated as follows, e.g.

  from datetime import timedelta
  from accelerate import Accelerator, InitProcessGroupKwargs

  # Increase distributed timeout to 3h to enable push to Hub to complete
  accelerator = Accelerator(kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=6 * 1800))])

ohmeow (Author) commented Dec 1, 2023

I'll try that. What's funny is that it looks like all the files get uploaded ... it just gets stuck and eventually times out.

alvarobartt (Member) commented:

Same here: everything is pushed to the Hugging Face Hub after fine-tuning, but then the run crashes for no apparent reason. For now I'm removing the integrated push_to_hub and running it manually afterwards, to keep the run from crashing (even though the push itself succeeds).
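
For reference, a minimal sketch of that manual workaround, assuming the model or adapter has already been saved locally by the training run; the folder and repo names below are illustrative, not taken from run_sft.py:

  # Hypothetical manual upload after training, using huggingface_hub directly.
  from huggingface_hub import HfApi

  api = HfApi()
  api.upload_folder(
      folder_path="data/zephyr-7b-sft-lora",       # illustrative local output dir
      repo_id="your-username/zephyr-7b-sft-lora",  # illustrative Hub repo
      repo_type="model",
  )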

lewtun (Member) commented Dec 15, 2023

Thanks for checking @alvarobartt - this is very strange and I can't reproduce it on my setup 🤔. How many nodes / GPUs are you running on?

Randl (Contributor) commented Dec 15, 2023

I think the problem is that evaluation takes fairly long and runs beyond the 30-minute timeout, so it should also reproduce on a low GPU count.

Moreover, I wasn't able to increase the timeout by passing the parameter to Accelerate as proposed.
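
As an aside, one alternative that may be worth trying is the ddp_timeout training argument, assuming the transformers version in use exposes it (recent releases do); it overrides the default 30-minute distributed timeout when the Trainer sets up the process group. A minimal sketch with illustrative values:

  from transformers import TrainingArguments

  # Raise the distributed timeout to 3 hours (the default is 1800 seconds).
  training_args = TrainingArguments(
      output_dir="output",  # illustrative
      ddp_timeout=10800,
      push_to_hub=True,
  )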

alvarobartt (Member) commented:

> Thanks for checking @alvarobartt - this is very strange and I can't reproduce it on my setup 🤔. How many nodes / GPUs are you running on?

I tried out your suggestion to explore this further, since I was seeing the same behaviour with push_to_hub=True; see the suggestion below:

  # Increase distributed timeout to 3h to enable push to Hub to complete
  accelerator = Accelerator(kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=6 * 1800))])

But it kept failing on 8 x A100 (both 40GB and 80GB) and even on 8 x H100 80GB. I adjusted the timeouts so that the fine-tuned models could be pushed to the Hub, but had no success, even though everything was indeed pushed.

lewtun (Member) commented Jan 9, 2024

Hi folks, I was able to repro the issue and AFAICT it only happens for full training (i.e. with ZeRO-3) and not with QLoRA (DDP).

The solution I've implemented in the linked PR above is to pull the push_to_hub() call out of the main-process-only block, since that guard seems to conflict with the trainer internals, which have their own checks for which process the call is being run from. Let me know if that helps once #88 is merged!
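
For anyone hitting this before the PR lands, the pattern being described is roughly the following (an illustrative sketch, not the exact diff from #88):

  # Before: guarding the push can hang under ZeRO-3, because only rank 0
  # reaches the collective ops (e.g. the all-gather of sharded weights) that
  # the other ranks are waiting on.
  #
  #   if accelerator.is_main_process:
  #       trainer.push_to_hub()
  #
  # After: call it on every process and let the Trainer's own internal checks
  # restrict the actual upload to the main process.
  trainer.push_to_hub()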
