Hangs up when finishing up a medium model training #473

Closed
sameeravithana opened this issue Dec 1, 2021 · 5 comments
Labels
bug Something isn't working

Comments

sameeravithana commented Dec 1, 2021

Describe the bug
Training hangs when finishing up a run with the default medium.yaml. No issue was observed with small.yaml.

Screenshots

-----------------------------------------------------------------------------------------------------------
 validation results at iteration 320000 | lm_loss value: 2.567550E+00 | lm_loss_ppl value: 1.303385E+01 |
-----------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------
 validation results at the end of training for val data | lm_loss value: 2.536536E+00 | lm_loss_ppl value: 1.263582E+01 |
---------------------------------------------------------------------------------------------------------------------------
[2021-11-30 11:32:23,930] [INFO] [logging.py:60:log_dist] [Rank 0] Saving model checkpoint: ../mp_rank_00_model_states.pt
[2021-11-30 11:32:43,555] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script ../zero_to_fp32.py
[2021-11-30 11:32:43,567] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved ../zero_pp_rank_0_mp_rank_00_optim_states.pt
[2021-11-30 11:32:43,668] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script ../zero_to_fp32.py
[2021-11-30 11:32:43,676] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved ../zero_pp_rank_3_mp_rank_00_optim_states.pt
[2021-11-30 11:32:43,807] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script ../zero_to_fp32.py
[2021-11-30 11:32:43,821] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved ../zero_pp_rank_2_mp_rank_00_optim_states.pt
[2021-11-30 11:32:43,840] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script ../zero_to_fp32.py
[2021-11-30 11:32:43,848] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved ../zero_pp_rank_1_mp_rank_00_optim_states.pt
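
For context on the files named in the log: each rank writes its own ZeRO-partitioned optimizer state (zero_pp_rank_*_mp_rank_00_optim_states.pt), and DeepSpeed drops a zero_to_fp32.py recovery script next to them for consolidating the shards. A minimal sketch of doing the same from Python, assuming a DeepSpeed version that exposes the helper below; the checkpoint path is a placeholder:

# Minimal sketch: merge ZeRO-partitioned checkpoint shards into a single fp32
# state dict. "checkpoints" is a placeholder for the save directory that
# contains the 'latest' tag file and the per-step checkpoint folder.
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

state_dict = get_fp32_state_dict_from_zero_checkpoint("checkpoints")
print(f"consolidated {len(state_dict)} fp32 tensors")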
@sameeravithana sameeravithana added the bug Something isn't working label Dec 1, 2021
@sameeravithana sameeravithana changed the title from "Hangs up when finishing up a medium model" to "Hangs up when finishing up a medium model training" Dec 1, 2021
sameeravithana (Author) commented Dec 2, 2021

This hang results in a segmentation fault at the end in multi-GPU settings, so far only with medium.yml as tested.

Traceback (most recent call last):
  File "../lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "../lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "../lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 179, in <module>
    main()
  File "../lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 169, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "../lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 147, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['../bin/python', '-u', 'train.py'...

died with <Signals.SIGSEGV: 11>.
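
For reference, the "died with <Signals.SIGSEGV: 11>" message means the launcher saw a worker exit via a signal: on POSIX, subprocess reports that as a negative return code. A minimal sketch of that pattern (not the DeepSpeed source; the worker command is a placeholder):

import signal
import subprocess

# Placeholder worker command; the real launcher builds this from its arguments.
proc = subprocess.Popen(["python", "-u", "train.py"])
returncode = proc.wait()

if returncode < 0:
    # A negative return code means the child was killed by a signal, e.g. -11 -> SIGSEGV.
    print(f"worker died with {signal.Signals(-returncode)!r}")
if returncode != 0:
    raise subprocess.CalledProcessError(returncode=returncode, cmd=proc.args)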

sdtblck (Contributor) commented Dec 12, 2021

Hi @SamTube405, I think this is an issue with how DeepSpeed closes processes. Is this on a single node, or across multiple nodes? Could you post a config we can reproduce this with? Ideally with environment details (machines, pip freeze, etc.) and a small number of steps, so we don't have to wait too long to get to the error.
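
One quick way to gather the requested environment details in a single report is PyTorch's built-in environment collector (a sketch; it prints OS, Python, PyTorch/CUDA, driver, and relevant pip package versions):

# Equivalent to running `python -m torch.utils.collect_env` from the shell.
from torch.utils import collect_env

collect_env.main()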

sameeravithana (Author) commented

Hi @sdtblck, we discovered this issue in the single-node, multi-GPU setup with the default medium.yaml (set the number of iterations to a low value like 100). As you pointed out, it could be due to how DeepSpeed closes the processes at the end of training. The Python environment is the same as the one described in the installation guide, with apex installed in addition.

sdtblck (Contributor) commented Dec 17, 2021

Huh, I've personally only ever experienced training hanging on multiple nodes.

Actually, Signals.SIGSEGV: 11 indicates a segfault. Is this something you can reproduce repeatedly with the medium.yml configuration at 100 steps?

sameeravithana (Author) commented

This was resolved by an update to DeepSpeed.
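
For anyone hitting the same hang later, a quick sanity check of which DeepSpeed version the training environment actually imports, before and after upgrading (sketch assuming a standard pip install):

# Confirm the DeepSpeed version visible to the training environment.
import deepspeed

print(deepspeed.__version__)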
