
No effect from InitProcessGroupKwargs timeout #1403

Closed
Randl opened this issue Mar 7, 2024 · 6 comments

Randl commented Mar 7, 2024

Follow-up from huggingface/accelerate#2236 (comment)
cc @muellerzr

I'll copy the main text from there; there are some more details in the discussion.

System Info

- `Accelerate` version: 0.23.0
- Platform: Linux-6.2.0-37-generic-x86_64-with-glibc2.35
- Python version: 3.10.13
- Numpy version: 1.26.2
- PyTorch version (GPU?): 2.1.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 62.65 GB
- GPU type: NVIDIA RTX 6000 Ada Generation
- `Accelerate` default config:
	Not found

Reproduction

  1. Follow the instructions at https://github.com/huggingface/alignment-handbook/tree/main/scripts and install the environment to run LoRA SFT training.
  2. Change the timeout to 3 hours (a self-contained version of this snippet is given after the traceback below):

accelerator = Accelerator(kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=6 * 1800))])

     and run the training.
  3. The run crashes due to a timeout: https://wandb.ai/evgeniizh/huggingface/runs/pskgg48d

[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1124292, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800584 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1124292, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800584 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1124292, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800584 milliseconds before timing out.
[2023-12-09 08:46:08,664] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 54784 closing signal SIGTERM
[2023-12-09 08:46:11,834] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 1 (pid: 54785) of binary: /home/evgenii/.conda/envs/handbook/bin/python
Traceback (most recent call last):
  File "/home/evgenii/.conda/envs/handbook/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/evgenii/.conda/envs/handbook/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/evgenii/.conda/envs/handbook/lib/python3.10/site-packages/accelerate/commands/launch.py", line 971, in launch_command
    deepspeed_launcher(args)
  File "/home/evgenii/.conda/envs/handbook/lib/python3.10/site-packages/accelerate/commands/launch.py", line 687, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/evgenii/.conda/envs/handbook/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/evgenii/.conda/envs/handbook/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/evgenii/.conda/envs/handbook/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
scripts/run_sft.py FAILED

Note that the timeout is still 1800 seconds (see also huggingface/alignment-handbook#59).
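
For reference, a minimal self-contained sketch of the change from step 2 (imports spelled out; the rest of the alignment-handbook training setup is omitted):

# Minimal sketch of the step-2 change only; the rest of the SFT training
# script is omitted. 6 * 1800 seconds == 3 hours.
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs

# The handler's timeout is meant to be forwarded to
# torch.distributed.init_process_group when the Accelerator initializes
# the distributed process group.
accelerator = Accelerator(
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=6 * 1800))]
)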

Expected behavior

The timeout is increased and there is no crash.

@muellerzr (Contributor) commented

Actually, @Randl, at what point in your code are you doing this?

accelerator = Accelerator(kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=6 * 1800))])

And what is your full code?

(I still think it may be a TRL issue, but I need that to be 100% sure)

@muellerzr (Contributor) commented

I may have found the solution.

@Randl, can you try again (I know it'll take a while to run), installing transformers via pip install git+https://github.com/huggingface/transformers@muellerzr-fix-timeout?

Finally narrowed it down.
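
(A quick, optional sanity check before starting the long run, not part of the original suggestion: confirm that the source install from that branch actually replaced the PyPI release.)

# Optional check, assuming a standard pip environment: source builds of
# transformers typically report a ".devN" version, unlike PyPI releases.
import transformers
print(transformers.__version__)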

Randl commented Mar 7, 2024

I'll update you when I run it.

@muellerzr (Contributor) commented

@Randl were you able to try it out? 🤗


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@thepowerfuldeez commented

Wondering if this was addressed?
