Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No longer ignore --iteration when passed to train.py #869

Merged
merged 4 commits into from
Apr 11, 2023

Conversation

haileyschoelkopf
Copy link
Contributor

We currently resume training using the checkpoint tag provided in the "latest" file within the save folder, following Deepspeed's default behavior. However, we want to be able to override this "latest" tag when launching a job when, for example, resuming several checkpoints back from a corrupted run.

An --iteration flag that can be passed to deepy.py already exists, and can be successfully used to evaluate a model from any step with evaluate.py , ignoring the "latest" file. However, though this flag properly sets neox_args.iteration at initialization, when initializing a model we ignore this because it is not passed to setup_model_and_optimizer.

This PR resolves this, allowing users to successfully use --iteration alongside train.py.

@Quentin-Anthony @ShivanshuPurohit

StellaAthena
StellaAthena previously approved these changes Apr 9, 2023
Copy link
Member

@StellaAthena StellaAthena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested and found to work, but didn’t stress test unusual cases or potential human errors. Approving, but I’ll leave it to @Quentin-Anthony to decide if it should be merged or tested further.

Also, I noticed we are passing use_cache=False… do we never want to use cache?

@haileyschoelkopf
Copy link
Contributor Author

re: use_cache=False: yes, in training we don't ever want or need cache. If use_cache is ever true it gets set to false when enabling train mode:

recursive_setattr(self.forward_funcs, "use_cache", False)

@Quentin-Anthony
Copy link
Member

I'm happy with this. Merging.

@Quentin-Anthony Quentin-Anthony merged commit 43cc879 into main Apr 11, 2023
@Quentin-Anthony Quentin-Anthony deleted the iteration-kwarg branch April 11, 2023 22:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants