Upstream DeepSpeed breaks HF conversion script #750

Closed
haileyschoelkopf opened this issue Dec 20, 2022 · 4 comments · Fixed by #752
Labels
bug Something isn't working
Comments

@haileyschoelkopf
Contributor

The tools/convert_to_hf.py script will need to be updated, or a separate version created, for checkpoints saved with upstream DeepSpeed. Checkpoints no longer appear to be saved layer-by-layer; instead, all weights are stored in mp_rank_{MP_RANK}_model_states.pt files, one per model-parallel partition.

Upstream DeepSpeed checkpoint:

drwxr-xr-x 2 hailey eleuther     33280 Dec 18 14:29 configs
-rw-r--r-- 1 hailey eleuther 810771646 Dec 18 14:29 mp_rank_00_model_states.pt
-rw-r--r-- 1 hailey eleuther 608006863 Dec 18 14:29 zero_pp_rank_0_mp_rank_00_optim_states.pt
-rw-r--r-- 1 hailey eleuther 608008079 Dec 18 14:29 zero_pp_rank_1_mp_rank_00_optim_states.pt
-rw-r--r-- 1 hailey eleuther 608008079 Dec 18 14:29 zero_pp_rank_2_mp_rank_00_optim_states.pt
-rw-r--r-- 1 hailey eleuther 608008143 Dec 18 14:29 zero_pp_rank_3_mp_rank_00_optim_states.pt
-rw-r--r-- 1 hailey eleuther 608008143 Dec 18 14:29 zero_pp_rank_4_mp_rank_00_optim_states.pt
-rw-r--r-- 1 hailey eleuther 608008143 Dec 18 14:29 zero_pp_rank_5_mp_rank_00_optim_states.pt
-rw-r--r-- 1 hailey eleuther 608008079 Dec 18 14:29 zero_pp_rank_6_mp_rank_00_optim_states.pt
-rw-r--r-- 1 hailey eleuther 608006863 Dec 18 14:29 zero_pp_rank_7_mp_rank_00_optim_states.pt

DeeperSpeed checkpoint:

drwxrwxrwx 2 hailey eleuther     33280 Nov 18 04:55 configs
-rwxrwxrwx 1 hailey eleuther 206045931 Nov 18 04:55 layer_00-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_02-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_03-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_04-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_05-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_06-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_07-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_08-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_09-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_10-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_11-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_12-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_13-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_14-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_15-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_16-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_17-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_18-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_19-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_20-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_21-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_22-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_23-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_24-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_25-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther      9127 Nov 18 04:55 layer_27-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther 206045931 Nov 18 04:55 layer_28-model_00-model_states.pt
-rwxrwxrwx 1 hailey eleuther     16291 Nov 18 04:55 mp_rank_00_model_states.pt
-rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_0_mp_rank_00_optim_states.pt
-rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_10_mp_rank_00_optim_states.pt
-rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_11_mp_rank_00_optim_states.pt
-rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_12_mp_rank_00_optim_states.pt
-rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_13_mp_rank_00_optim_states.pt
-rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_14_mp_rank_00_optim_states.pt
-rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_15_mp_rank_00_optim_states.pt
...

Updating the script shouldn't be too hard, though.
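
For anyone picking this up, here is a rough sketch (not the actual tools/convert_to_hf.py logic) of how the two layouts could be inspected before mapping weights to HF names. The checkpoint path is a placeholder, and the "module" key is an assumption based on typical DeepSpeed model-states files, so verify it against a real checkpoint:

# Rough sketch, not the real conversion script. The "module" key and the
# checkpoint directory below are assumptions, not taken from this issue.
import glob
import torch

ckpt_dir = "path/to/global_step_dir"  # hypothetical checkpoint directory

# New-style (upstream DeepSpeed): all weights for a model-parallel partition in one file.
for path in sorted(glob.glob(f"{ckpt_dir}/mp_rank_*_model_states.pt")):
    state = torch.load(path, map_location="cpu")
    module = state.get("module", state)  # fall back to the raw dict if the key differs
    print(path, "->", len(module), "tensors")

# Old-style (DeeperSpeed): one plain state dict per pipeline layer.
for path in sorted(glob.glob(f"{ckpt_dir}/layer_*-model_*-model_states.pt")):
    layer_sd = torch.load(path, map_location="cpu")
    print(path, "->", sorted(layer_sd.keys())[:3], "...")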

@haileyschoelkopf haileyschoelkopf added the bug Something isn't working label Dec 20, 2022
@StellaAthena StellaAthena modified the milestones: Library Release V2, Release V2 Dec 20, 2022
@VHellendoorn
Contributor

I'm facing a somewhat similar problem, so I just want to check whether it has the same root cause before I open a new issue for it.

When I train with NeoX (cloned a few months back) on two nodes with 8 GPUs each, the main node saves checkpoints with files for all layer states, as well as the mp_rank_00_model_states and the rank 0-7 optimizer states. The other node only saves zero optimizer states for ranks 8-15.

I am unable to resume training from these checkpoints. In that state, the second machine skips loading checkpoints with "Client provided checkpoint load path: [...] does not exist" (as expected). If I copy the layer states and the mp_rank_00_model_states.pt file over to the second machine, it does load all layers from the checkpoint, but the loss immediately spikes to very high values and never recovers. The only sign that something might be off is the warning below: each machine complains about not finding the ZeRO checkpoints that belong to GPUs on the other machine, which seems like it shouldn't be a problem.

main: [WARNING] [engine.py:1656:_get_all_zero_checkpoints] The following zero checkpoints paths are missing: ['checkpoints/global_step9000/zero_pp_rank_8_mp_rank_00_optim_states.pt', 'checkpoints/global_step9000/zero_pp_rank_9_mp_rank_00_optim_states.pt', ...]
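
(Side note: a quick way to see which ZeRO shards a given node can actually see locally. This is a hedged sketch assuming the standard zero_pp_rank_*_mp_rank_00_optim_states.pt naming shown above, with the path taken from the warning and the world size from the 2-node x 8-GPU setup described here:)

# Hedged sketch: list the data-parallel ranks whose ZeRO optimizer shards are
# missing from this node's checkpoint directory.
import glob
import os
import re

ckpt_dir = "checkpoints/global_step9000"  # path from the warning above
world_size = 16                           # 2 nodes x 8 GPUs in this setup

present = {
    int(re.search(r"zero_pp_rank_(\d+)_", os.path.basename(p)).group(1))
    for p in glob.glob(f"{ckpt_dir}/zero_pp_rank_*_mp_rank_00_optim_states.pt")
}
missing = sorted(set(range(world_size)) - present)
print("ZeRO shards missing on this node, by rank:", missing)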

Do you think #752 will fix this problem? Please let me know if you think this isn't related to the above and I'll make a separate issue!

-Vincent

@StellaAthena
Member

@VHellendoorn I think you may have intended to leave this comment on a different issue (#732), as this one is about the HF conversion script rather than the checkpoints themselves.

It sounds like the issue is that you are writing to non-shared storage, and I'm not sure we've tested in that context. Because we use SLURM and Kubernetes to manage our runs on datacenter compute, we need to save our checkpoints to a storage device that's independent of the compute nodes if we want to be able to access them later. It seems plausible, though, that a bug crept in unnoticed that only affects runs writing checkpoints to node-local storage.

If I'm correct about what's going on, can you open an independent issue to discuss? It should be easy to replicate on our end, and if you can confirm that Megatron-DS doesn't have this problem, that would be appreciated.

@haileyschoelkopf
Contributor Author

Yep, happy to discuss in a separate issue!

Seconding Stella's point, it sounds like non-shared storage between nodes may be the issue.

@VHellendoorn
Contributor

Thanks both for chiming in! @StellaAthena you're right, my bad; I'd clicked through to here from that issue. I'll make a new issue soon-ish then; might try to debug it a bit more now that I have a lead :)
