No longer ignore `--iteration` when passed to train.py #869

haileyschoelkopf · 2023-04-02T23:39:00Z

We currently resume training from the checkpoint tag stored in the "latest" file in the save folder, following DeepSpeed's default behavior. However, we want to be able to override this "latest" tag when launching a job, for example to resume from several checkpoints back after a run is corrupted.

An --iteration flag that can be passed to deepy.py already exists, and it can be used to evaluate a model from any step with evaluate.py, ignoring the "latest" file. However, although this flag correctly sets neox_args.iteration at initialization, training ignores it when initializing a model because the value is not passed to setup_model_and_optimizer.

This PR resolves this, allowing users to successfully use --iteration alongside train.py.
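To illustrate the intended behavior, here is a minimal sketch of the tag selection described above. It is not the actual gpt-neox code: `resolve_checkpoint_tag` is a hypothetical helper, and the `global_step<N>` tag format is an assumption made for this example.

```python
import os
import tempfile
from typing import Optional


def resolve_checkpoint_tag(load_dir: str, iteration: Optional[int] = None) -> str:
    """Pick the checkpoint tag to resume from (illustrative helper, not gpt-neox code)."""
    if iteration is not None:
        # Explicit override: assumes checkpoints are tagged "global_step<N>".
        return f"global_step{iteration}"
    # Default: fall back to the tag recorded in the "latest" file.
    with open(os.path.join(load_dir, "latest")) as f:
        return f.read().strip()


# Small demonstration using a throwaway save folder.
with tempfile.TemporaryDirectory() as save_dir:
    with open(os.path.join(save_dir, "latest"), "w") as f:
        f.write("global_step3000")
    default_tag = resolve_checkpoint_tag(save_dir)
    override_tag = resolve_checkpoint_tag(save_dir, iteration=1000)
```

Without an explicit iteration the helper reads the "latest" file; with one, the file is never consulted, which is the behavior this PR wants for train.py.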

@Quentin-Anthony @ShivanshuPurohit

StellaAthena

Tested and found to work, but didn’t stress test unusual cases or potential human errors. Approving, but I’ll leave it to @Quentin-Anthony to decide if it should be merged or tested further.

Also, I noticed we are passing use_cache=False… do we never want to use cache?

haileyschoelkopf · 2023-04-09T22:25:25Z

re: use_cache=False: yes, in training we don't ever want or need cache. If use_cache is ever true it gets set to false when enabling train mode:

gpt-neox/megatron/model/gpt2_model.py

Line 331 in 038b011

recursive_setattr(self.forward_funcs, "use_cache", False)
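As a rough illustration of what a call like this does, the sketch below walks a nested structure and overwrites an attribute wherever it is found. It is a simplified stand-in and not the gpt-neox implementation: `Layer` and `set_attr_recursively` are hypothetical names, and plain Python objects are used in place of real model layers.

```python
class Layer:
    """Toy stand-in for a model layer; not a real gpt-neox class."""

    def __init__(self, use_cache, children=()):
        self.use_cache = use_cache
        self.children = list(children)


def set_attr_recursively(node, attr, value):
    """Set `attr` to `value` on `node` and everything nested under it."""
    if isinstance(node, (list, tuple)):
        # Containers are walked element by element.
        for item in node:
            set_attr_recursively(item, attr, value)
        return
    if hasattr(node, attr):
        setattr(node, attr, value)
    # Descend into child layers, if the node has any.
    for child in getattr(node, "children", []):
        set_attr_recursively(child, attr, value)


# Some layers start with caching enabled, including a nested one.
forward_funcs = [Layer(True, [Layer(True)]), Layer(False)]
set_attr_recursively(forward_funcs, "use_cache", False)
```

After the call, every layer in the toy structure has use_cache set to False, regardless of its starting value.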

Quentin-Anthony · 2023-04-11T22:32:22Z

I'm happy with this. Merging.

pass iteration to setup_model_and_optimizer

cdd7bc8

haileyschoelkopf requested a review from a team as a code owner April 2, 2023 23:39

haileyschoelkopf requested review from Quentin-Anthony and StellaAthena April 2, 2023 23:39

Update NeoXArgs docs automatically

0709ab7

StellaAthena previously approved these changes Apr 9, 2023

Merge branch 'main' into iteration-kwarg

efc1184

Quentin-Anthony dismissed StellaAthena’s stale review via efc1184 April 11, 2023 22:32

Update NeoXArgs docs automatically

9dc964e

Quentin-Anthony approved these changes Apr 11, 2023

Quentin-Anthony merged commit 43cc879 into main Apr 11, 2023

Quentin-Anthony deleted the iteration-kwarg branch April 11, 2023 22:32
