Save checkpoint metadata when using SlurmRunner (#733)
* Fix checkpointing bug by handling the case where configs are kept as dict objects, which happens when using the SlurmRunner

Signed-off-by: Dashiell Stander <[email protected]>

* Pre-commit

* Update NeoXArgs docs automatically

* Revert "Update NeoXArgs docs automatically"

This reverts commit 74fc1ef.

* Revert "Revert "Update NeoXArgs docs automatically""

This reverts commit 5c8f66c.

* Update NeoXArgs docs automatically

* Reverse pre-commit

* Update NeoXArgs docs automatically

* Revert "Update NeoXArgs docs automatically"

This reverts commit 1652b66.

* Update NeoXArgs docs automatically

Signed-off-by: Dashiell Stander <[email protected]>
Co-authored-by: github-actions <[email protected]>
Co-authored-by: Quentin-Anthony <[email protected]>
3 people committed Dec 8, 2022
1 parent 4fde59e commit 3a1bd8a
Showing 2 changed files with 6 additions and 2 deletions.
2 changes: 1 addition & 1 deletion configs/neox_arguments.md
@@ -111,7 +111,7 @@ Logging Arguments

- **git_hash**: str

-    Default = 70b6bf8
+    Default = 5bba068

current git hash of repository

6 changes: 5 additions & 1 deletion megatron/checkpointing.py
@@ -17,6 +17,7 @@

"""Input/output checkpointing."""

+import json
import os
import re
import shutil
@@ -198,7 +199,10 @@ def save_ds_checkpoint(iteration, model, neox_args):
         os.makedirs(configs_directory, exist_ok=True)
         for config_filename, config_data in neox_args.config_files.items():
             with open(os.path.join(configs_directory, config_filename), "w") as f:
-                f.write(config_data)
+                if isinstance(config_data, str):
+                    f.write(config_data)
+                else:
+                    json.dump(config_data, f)


def save_checkpoint(neox_args, iteration, model, optimizer, lr_scheduler):
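
The hunk above can be illustrated in isolation. The following is a minimal sketch, not the repository's code: `write_config_files`, the file names, and the sample config values are hypothetical and chosen only for illustration. It assumes, as the commit message states, that a config entry may arrive either as raw text or as an already-parsed dict. Passing a dict to `f.write` raises a `TypeError` on a text-mode file, so the non-string case is serialized with `json.dump` instead.

```python
import json
import os
import tempfile


def write_config_files(configs_directory, config_files):
    """Hypothetical helper mirroring the patched loop.

    Strings are written verbatim; any other object (e.g. a dict)
    is serialized as JSON.
    """
    os.makedirs(configs_directory, exist_ok=True)
    for config_filename, config_data in config_files.items():
        path = os.path.join(configs_directory, config_filename)
        with open(path, "w") as f:
            if isinstance(config_data, str):
                f.write(config_data)
            else:
                json.dump(config_data, f)


# Illustrative inputs: one config kept as text, one kept as a dict.
config_files = {
    "from_text.yml": "train_iters: 10\n",
    "from_dict.yml": {"train_iters": 10},
}

with tempfile.TemporaryDirectory() as tmp:
    write_config_files(tmp, config_files)
    with open(os.path.join(tmp, "from_text.yml")) as f:
        text_result = f.read()
    with open(os.path.join(tmp, "from_dict.yml")) as f:
        dict_result = json.load(f)
```

Note that under this approach a dict entry is saved as JSON regardless of the extension in its file name, so the saved file may differ in formatting from the original config it was parsed from.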