Hangs up when finishing up a medium model training #473

Closed
sameeravithana opened this issue Dec 1, 2021 · 5 comments
Labels
bug Something isn't working

Comments

sameeravithana commented Dec 1, 2021

Describe the bug
Training hangs when finishing up a run with the default medium.yaml. No issue was observed with small.yaml.

Screenshots

-----------------------------------------------------------------------------------------------------------
 validation results at iteration 320000 | lm_loss value: 2.567550E+00 | lm_loss_ppl value: 1.303385E+01 |
-----------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------
 validation results at the end of training for val data | lm_loss value: 2.536536E+00 | lm_loss_ppl value: 1.263582E+01 |
---------------------------------------------------------------------------------------------------------------------------
[2021-11-30 11:32:23,930] [INFO] [logging.py:60:log_dist] [Rank 0] Saving model checkpoint: ../mp_rank_00_model_states.pt
[2021-11-30 11:32:43,555] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script ../zero_to_fp32.py
[2021-11-30 11:32:43,567] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved ../zero_pp_rank_0_mp_rank_00_optim_states.pt
[2021-11-30 11:32:43,668] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script ../zero_to_fp32.py
[2021-11-30 11:32:43,676] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved ../zero_pp_rank_3_mp_rank_00_optim_states.pt
[2021-11-30 11:32:43,807] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script ../zero_to_fp32.py
[2021-11-30 11:32:43,821] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved ../zero_pp_rank_2_mp_rank_00_optim_states.pt
[2021-11-30 11:32:43,840] [INFO] [engine.py:1805:_copy_recovery_script] creating recovery script ../zero_to_fp32.py
[2021-11-30 11:32:43,848] [INFO] [engine.py:1818:_save_zero_checkpoint] zero checkpoint saved ../zero_pp_rank_1_mp_rank_00_optim_states.pt
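
For context on the files named in the log: each rank writes its own ZeRO-partitioned optimizer state (zero_pp_rank_*_mp_rank_00_optim_states.pt), and DeepSpeed drops a zero_to_fp32.py recovery script next to them for consolidating the shards. A minimal sketch of doing the same from Python, assuming a DeepSpeed version that exposes the helper below; the checkpoint path is a placeholder:

# Minimal sketch: merge ZeRO-partitioned checkpoint shards into a single fp32
# state dict. "checkpoints" is a placeholder for the save directory that
# contains the 'latest' tag file and the per-step checkpoint folder.
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

state_dict = get_fp32_state_dict_from_zero_checkpoint("checkpoints")
print(f"consolidated {len(state_dict)} fp32 tensors")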
@sameeravithana sameeravithana added the bug Something isn't working label Dec 1, 2021
@sameeravithana sameeravithana changed the title from "Hangs up when finishing up a medium model" to "Hangs up when finishing up a medium model training" Dec 1, 2021
sameeravithana (Author) commented Dec 2, 2021

This hang results in a segmentation fault at the end in multi-GPU settings, so far only with medium.yml as tested.

Traceback (most recent call last):
  File "../lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "../lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "../lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 179, in <module>
    main()
  File "../lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 169, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "../lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 147, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['../bin/python', '-u', 'train.py'...

died with <Signals.SIGSEGV: 11>.
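
For reference, the "died with <Signals.SIGSEGV: 11>" message means the launcher saw a worker exit via a signal: on POSIX, subprocess reports that as a negative return code. A minimal sketch of that pattern (not the DeepSpeed source; the worker command is a placeholder):

import signal
import subprocess

# Placeholder worker command; the real launcher builds this from its arguments.
proc = subprocess.Popen(["python", "-u", "train.py"])
returncode = proc.wait()

if returncode < 0:
    # A negative return code means the child was killed by a signal, e.g. -11 -> SIGSEGV.
    print(f"worker died with {signal.Signals(-returncode)!r}")
if returncode != 0:
    raise subprocess.CalledProcessError(returncode=returncode, cmd=proc.args)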

sdtblck (Contributor) commented Dec 12, 2021

Hi @SamTube405, I think this is an issue with how DeepSpeed closes processes. Is this on a single node, or across multiple nodes? Could you post a config we can reproduce this with? Ideally with environment details (machines, pip freeze, etc.) and a small number of steps, so we don't have to wait too long to get to the error.
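
One quick way to gather the requested environment details in a single report is PyTorch's built-in environment collector (a sketch; it prints OS, Python, PyTorch/CUDA, driver, and relevant pip package versions):

# Equivalent to running `python -m torch.utils.collect_env` from the shell.
from torch.utils import collect_env

collect_env.main()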

sameeravithana (Author) commented

Hi @sdtblck, we discovered this issue in the single-node, multi-GPU setup with the default medium.yaml (set the number of iterations to a low value like 100). As you pointed out, it could be due to how DeepSpeed closes the processes at the end of training. The Python environment is the same as the one described in the installation guide, with apex installed in addition.

sdtblck (Contributor) commented Dec 17, 2021

Huh, I've personally only ever experienced training hanging on multiple nodes.

Actually, Signals.SIGSEGV: 11 indicates a segfault. Is this something you can reproduce repeatedly with the medium.yml configuration at 100 steps?

sameeravithana (Author) commented

This was resolved by an update to DeepSpeed.
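
For anyone hitting the same hang later, a quick sanity check of which DeepSpeed version the training environment actually imports, before and after upgrading (sketch assuming a standard pip install):

# Confirm the DeepSpeed version visible to the training environment.
import deepspeed

print(deepspeed.__version__)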
