
No effect from InitProcessGroupKwargs timeout #1403

Closed
Randl opened this issue Mar 7, 2024 · 6 comments

Randl commented Mar 7, 2024

Follow-up from huggingface/accelerate#2236 (comment)
cc @muellerzr

I'll copy the main text from there; there are some more details in the discussion.

System Info

- `Accelerate` version: 0.23.0
- Platform: Linux-6.2.0-37-generic-x86_64-with-glibc2.35
- Python version: 3.10.13
- Numpy version: 1.26.2
- PyTorch version (GPU?): 2.1.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 62.65 GB
- GPU type: NVIDIA RTX 6000 Ada Generation
- `Accelerate` default config:
	Not found

Reproduction

  1. Follow the instructions at https://github.com/huggingface/alignment-handbook/tree/main/scripts and install the environment to run LoRA SFT training.
  2. Change the timeout to 3 hours (a self-contained version of this snippet is given after the traceback below):

accelerator = Accelerator(kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=6 * 1800))])

     and run the training.
  3. The run crashes due to a timeout: https://wandb.ai/evgeniizh/huggingface/runs/pskgg48d

[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1124292, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800584 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1124292, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800584 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1124292, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800584 milliseconds before timing out.
[2023-12-09 08:46:08,664] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 54784 closing signal SIGTERM
[2023-12-09 08:46:11,834] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 1 (pid: 54785) of binary: /home/evgenii/.conda/envs/handbook/bin/python
Traceback (most recent call last):
  File "/home/evgenii/.conda/envs/handbook/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/evgenii/.conda/envs/handbook/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/evgenii/.conda/envs/handbook/lib/python3.10/site-packages/accelerate/commands/launch.py", line 971, in launch_command
    deepspeed_launcher(args)
  File "/home/evgenii/.conda/envs/handbook/lib/python3.10/site-packages/accelerate/commands/launch.py", line 687, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/evgenii/.conda/envs/handbook/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/evgenii/.conda/envs/handbook/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/evgenii/.conda/envs/handbook/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
scripts/run_sft.py FAILED

Note that the timeout is still 1800 seconds (see also huggingface/alignment-handbook#59).
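
For reference, a minimal self-contained sketch of the change from step 2 (imports spelled out; the rest of the alignment-handbook training setup is omitted):

# Minimal sketch of the step-2 change only; the rest of the SFT training
# script is omitted. 6 * 1800 seconds == 3 hours.
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs

# The handler's timeout is meant to be forwarded to
# torch.distributed.init_process_group when the Accelerator initializes
# the distributed process group.
accelerator = Accelerator(
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=6 * 1800))]
)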

Expected behavior

The timeout is increased and there is no crash.

@muellerzr (Contributor) commented

Actually, @Randl, at what point in your code are you doing this?

accelerator = Accelerator(kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=6 * 1800))])

And what is your full code?

(I still think it may be a TRL issue, but I need that to be 100% sure)

@muellerzr (Contributor) commented

I may have found the solution.

@Randl, can you try again (I know it'll take a while to run), installing transformers via pip install git+https://github.com/huggingface/transformers@muellerzr-fix-timeout?

Finally narrowed it down.
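
(A quick, optional sanity check before starting the long run, not part of the original suggestion: confirm that the source install from that branch actually replaced the PyPI release.)

# Optional check, assuming a standard pip environment: source builds of
# transformers typically report a ".devN" version, unlike PyPI releases.
import transformers
print(transformers.__version__)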

Randl commented Mar 7, 2024

I'll update you when I run it.

@muellerzr (Contributor) commented

@Randl were you able to try it out? 🤗


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@thepowerfuldeez commented

Wondering if this was addressed?
