This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Fix timeouts when downloading multiple checkpoint files #498

Merged
ant0nsc merged 9 commits into main from antonsc/downloadall
Jun 22, 2021

Conversation

Contributor

@ant0nsc ant0nsc commented Jun 21, 2021

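Downloading multiple checkpoint files in a single operation goes through a code path that has a fixed 120-second timeout. This PR downloads each file with its own individual download operation instead, so the fixed timeout no longer covers the whole set of files.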
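
The idea can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `download_individually` and the `download_one` callable are hypothetical names, and the sketch assumes that the caller can supply some single-file download function whose timeout applies per call.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def download_individually(
    file_names: Iterable[str],
    output_dir: Path,
    download_one: Callable[[str, Path], None],
) -> List[Path]:
    """Download each file with its own call instead of one batch call.

    `download_one(name, destination)` is a hypothetical single-file download
    function supplied by the caller. Because it is invoked once per file,
    any timeout it enforces applies to one file at a time rather than to
    the whole batch.
    """
    downloaded: List[Path] = []
    for name in file_names:
        destination = output_dir / name
        # Create the target folder so nested checkpoint paths also work.
        destination.parent.mkdir(parents=True, exist_ok=True)
        download_one(name, destination)
        downloaded.append(destination)
    return downloaded
```

The trade-off is more round trips in exchange for a timeout budget that scales with the number of files rather than being shared across all of them.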

Please follow the guidelines for PRs contained here. Checklist:

  • Ensure that your PR is small, and implements one change.
  • Add unit tests for all functions that you introduced or modified.
  • Run PyCharm's code cleanup tools on your Python files.
  • Link the correct GitHub issue for tracking.
  • Update the Changelog file: Describe your change in terms of
    Added/Changed/Removed/... in the "Upcoming" section.
  • When merging your PR, replace the default merge message with a description of your PR,
    and if needed a motivation why that change was required.

@ant0nsc ant0nsc enabled auto-merge (squash) June 21, 2021 19:12
Contributor

@dumbledad dumbledad left a comment

All looks good.

@ant0nsc ant0nsc merged commit 7cd7e58 into main Jun 22, 2021
@ant0nsc ant0nsc deleted the antonsc/downloadall branch June 22, 2021 13:33
Development

Successfully merging this pull request may close these issues.

Download of recovery checkpoints should only happen on rank 0 in distributed training