Upstream DeepSpeed breaks HF conversion script #750

Closed
haileyschoelkopf opened this issue Dec 20, 2022 · 4 comments · Fixed by #752
Labels
bug Something isn't working
Comments

@haileyschoelkopf
Contributor

The tools/convert_to_hf.py script will need to be updated, or a separate version created, for checkpoints saved with upstream DeepSpeed. Checkpoints no longer appear to be saved layer-by-layer; instead, all weights are stored in mp_rank_{MP_RANK}_model_states.pt files, one per model-parallel partition.

Upstream DeepSpeed checkpoint:

drwxr-xr-x 2 hailey eleuther     33280 Dec 18 14:29 configs
-rw-r--r-- 1 hailey eleuther 810771646 Dec 18 14:29 mp_rank_00_model_states.pt
-rw-r--r-- 1 hailey eleuther 608006863 Dec 18 14:29 zero_pp_rank_0_mp_rank_00_optim_states.pt
-rw-r--r-- 1 hailey eleuther 608008079 Dec 18 14:29 zero_pp_rank_1_mp_rank_00_optim_states.pt
-rw-r--r-- 1 hailey eleuther 608008079 Dec 18 14:29 zero_pp_rank_2_mp_rank_00_optim_states.pt
-rw-r--r-- 1 hailey eleuther 608008143 Dec 18 14:29 zero_pp_rank_3_mp_rank_00_optim_states.pt
-rw-r--r-- 1 hailey eleuther 608008143 Dec 18 14:29 zero_pp_rank_4_mp_rank_00_optim_states.pt
-rw-r--r-- 1 hailey eleuther 608008143 Dec 18 14:29 zero_pp_rank_5_mp_rank_00_optim_states.pt
-rw-r--r-- 1 hailey eleuther 608008079 Dec 18 14:29 zero_pp_rank_6_mp_rank_00_optim_states.pt
-rw-r--r-- 1 hailey eleuther 608006863 Dec 18 14:29 zero_pp_rank_7_mp_rank_00_optim_states.pt

DeeperSpeed checkpoint:

drwxrwxrwx 2 hailey eleuther     33280 Nov 18 04:55 configs
-rwxrwxrwx 1 hailey eleuther 206045931 Nov 18 04:55 layer_00-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_02-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_03-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_04-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_05-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_06-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_07-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_08-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_09-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_10-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_11-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_12-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_13-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_14-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_15-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_16-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_17-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_18-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_19-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_20-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_21-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_22-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_23-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_24-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_25-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther      9127 Nov 18 04:55 layer_27-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 206045931 Nov 18 04:55 layer_28-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther     16291 Nov 18 04:55 mp_rank_00_model_states.pt
-rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_0_mp_rank_00_optim_states.pt
-rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_10_mp_rank_00_optim_states.pt
-rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_11_mp_rank_00_optim_states.pt
-rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_12_mp_rank_00_optim_states.pt
-rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_13_mp_rank_00_optim_states.pt
-rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_14_mp_rank_00_optim_states.pt
-rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_15_mp_rank_00_optim_states.pt
...

Updating the script shouldn't be too hard, though.
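
For anyone picking this up, here is a rough sketch (not the actual tools/convert_to_hf.py logic) of how the two layouts could be inspected before mapping weights to HF names. The checkpoint path is a placeholder, and the "module" key is an assumption based on typical DeepSpeed model-states files, so verify it against a real checkpoint:

# Rough sketch, not the real conversion script. The "module" key and the
# checkpoint directory below are assumptions, not taken from this issue.
import glob
import torch

ckpt_dir = "path/to/global_step_dir"  # hypothetical checkpoint directory

# New-style (upstream DeepSpeed): all weights for a model-parallel partition in one file.
for path in sorted(glob.glob(f"{ckpt_dir}/mp_rank_*_model_states.pt")):
    state = torch.load(path, map_location="cpu")
    module = state.get("module", state)  # fall back to the raw dict if the key differs
    print(path, "->", len(module), "tensors")

# Old-style (DeeperSpeed): one plain state dict per pipeline layer.
for path in sorted(glob.glob(f"{ckpt_dir}/layer_*-model_*-model_states.pt")):
    layer_sd = torch.load(path, map_location="cpu")
    print(path, "->", sorted(layer_sd.keys())[:3], "...")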

@haileyschoelkopf haileyschoelkopf added the bug Something isn't working label Dec 20, 2022
@StellaAthena StellaAthena modified the milestones: Library Release V2, Release V2 Dec 20, 2022
@VHellendoorn
Contributor

I'm facing a somewhat similar problem, so I just want to check whether it has the same root cause before I open a new issue for it.

When I train with NeoX (cloned a few months back) on two nodes with 8 GPUs each, the main node saves checkpoints with files for all layer states, as well as the mp_rank_00_model_states and the rank 0-7 optimizer states. The other node only saves zero optimizer states for ranks 8-15.

I am unable to resume training from these checkpoints. In that state, the second machine skips loading checkpoints with "Client provided checkpoint load path: [...] does not exist" (as expected). If I copy the layer states and the mp_rank_00_model_states.pt file over to the second machine, it does load all layers from the checkpoint, but the loss immediately spikes to very high values and never recovers. The only sign that something might be off is the warning below: each machine complains about not finding the ZeRO checkpoints that belong to GPUs on the other machine, which seems like it shouldn't be a problem.

main: [WARNING] [engine.py:1656:_get_all_zero_checkpoints] The following zero checkpoints paths are missing: ['checkpoints/global_step9000/zero_pp_rank_8_mp_rank_00_optim_states.pt', 'checkpoints/global_step9000/zero_pp_rank_9_mp_rank_00_optim_states.pt', ...]
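
(Side note: a quick way to see which ZeRO shards a given node can actually see locally. This is a hedged sketch assuming the standard zero_pp_rank_*_mp_rank_00_optim_states.pt naming shown above, with the path taken from the warning and the world size from the 2-node x 8-GPU setup described here:)

# Hedged sketch: list the data-parallel ranks whose ZeRO optimizer shards are
# missing from this node's checkpoint directory.
import glob
import os
import re

ckpt_dir = "checkpoints/global_step9000"  # path from the warning above
world_size = 16                           # 2 nodes x 8 GPUs in this setup

present = {
    int(re.search(r"zero_pp_rank_(\d+)_", os.path.basename(p)).group(1))
    for p in glob.glob(f"{ckpt_dir}/zero_pp_rank_*_mp_rank_00_optim_states.pt")
}
missing = sorted(set(range(world_size)) - present)
print("ZeRO shards missing on this node, by rank:", missing)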

Do you think #752 will fix this problem? Please let me know if you think this isn't related to the above and I'll make a separate issue!

-Vincent

@StellaAthena
Member

@VHellendoorn I think you may have intended to leave this comment on a different issue (#732), as this one is about the HF conversion script rather than the checkpoints themselves.

It sounds like the issue is that you are writing to non-shared storage, and I'm not sure we've tested in that context. Because we use SLURM and Kubernetes to manage our runs on datacenter compute, we need to save our checkpoints to a storage device that's independent of the compute nodes if we want to be able to access them later. It seems plausible, though, that a bug crept in unnoticed that only affects runs writing checkpoints to node-local storage.

If I'm correct about what's going on, can you open an independent issue to discuss? It should be easy to replicate on our end, and if you can confirm that Megatron-DS doesn't have this problem, that would be appreciated.

@haileyschoelkopf
Contributor Author

Yep, happy to discuss in a separate issue!

Seconding Stella's point, it sounds like non-shared storage between nodes may be the issue.

@VHellendoorn
Contributor

Thanks both for chiming in! @StellaAthena you're right, my bad; I'd clicked through to here from that issue. I'll make a new issue soon-ish then; might try to debug it a bit more now that I have a lead :)
