This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Multi-node checkpoint recovery fix #478

Merged: 23 commits, Jun 10, 2021


Labels
no changelog needed: CHANGELOG.md does not need to be updated in this PR
Development

Successfully merging this pull request may close these issues.

Download of recovery checkpoints should only happen on rank 0 in distributed training
3 participants