[ASoC 2022] Implement native pytorch elastic training fashion based on torch-elastic protocol. #251

Open
SimonCqk opened this issue May 30, 2022 · 0 comments
Labels
asoc2022 Alibaba Summer of Code, 2022 community Community discussions enhancement New feature or request

SimonCqk commented May 30, 2022

Background:

As the official portal notes, torch-elastic has been upstreamed into PyTorch (>= 1.9). KubeDL manages the lifecycle of jobs and orchestrates their resources, so it is important for it to implement the torch-elastic distributed training protocol and offer a fault-tolerant, elastic training experience. With that in place, job completion time (JCT) can be significantly shortened and resources (CPU/memory as well as GPUs) can be better utilized.
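For context, here is a minimal sketch of what an elastic launch looks like from the worker side, assuming the job is started through the `torch.distributed.run` module that ships with PyTorch >= 1.9. The helper name `elastic_launch_args` and the example endpoint and job id are hypothetical and only serve as illustration; how KubeDL would actually inject these values is not decided in this issue.

```python
from typing import List


def elastic_launch_args(
    min_nodes: int,
    max_nodes: int,
    nproc_per_node: int,
    rdzv_endpoint: str,
    job_id: str,
    script: str,
    max_restarts: int = 3,
) -> List[str]:
    """Build the command line for an elastic torch.distributed.run launch.

    Hypothetical helper: a controller could render something like this
    into the container command of every worker pod.
    """
    if not 1 <= min_nodes <= max_nodes:
        raise ValueError("expected 1 <= min_nodes <= max_nodes")
    return [
        "python", "-m", "torch.distributed.run",
        # MIN:MAX range lets the job keep running while nodes join or leave.
        f"--nnodes={min_nodes}:{max_nodes}",
        f"--nproc_per_node={nproc_per_node}",
        # Number of worker-group restarts tolerated before the job fails.
        f"--max_restarts={max_restarts}",
        # All workers of one job must share the same rendezvous id and endpoint.
        f"--rdzv_id={job_id}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={rdzv_endpoint}",
        script,
    ]


if __name__ == "__main__":
    # Example values only; endpoint and job id are placeholders.
    print(" ".join(elastic_launch_args(1, 4, 8, "worker-0:29400", "job-1", "train.py")))
```

The `--nnodes=MIN:MAX` range is the part that makes the launch elastic: the rendezvous succeeds with any node count inside that range, which is what leaves room for a controller to add or remove workers.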

Goals to be achieved:

  • Design clean & user-friendly elastic training APIs and .
  • Implement elastic training control flow on pytorch-controller.
  • [Advanced] Design a scale-out/scale-in algorithm driven by user-customized metrics.
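As a starting point for discussing the advanced goal, here is one possible shape for a scaling decision. It is a sketch only, assuming a proportional rule in the spirit of the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current * metric / target)) clamped to the job's min/max replicas. The function name `decide_replicas`, its parameters, and the tolerance default are all hypothetical; the real algorithm is exactly what this task asks the contributor to design.

```python
import math


def decide_replicas(
    current: int,
    metric_value: float,
    target_value: float,
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,
) -> int:
    """Return the desired worker count for a user-defined metric.

    Hypothetical sketch of an HPA-style proportional rule, not KubeDL's
    algorithm. `metric_value` is the observed per-worker value of the
    user metric, `target_value` the value the user wants to hold.
    """
    if current < 1 or target_value <= 0:
        raise ValueError("current must be >= 1 and target_value must be > 0")
    if not 1 <= min_replicas <= max_replicas:
        raise ValueError("expected 1 <= min_replicas <= max_replicas")

    ratio = metric_value / target_value
    # Inside the tolerance band, keep the current size to avoid flapping,
    # since every resize forces a new rendezvous among workers.
    if abs(ratio - 1.0) <= tolerance:
        desired = current
    else:
        desired = math.ceil(current * ratio)

    # Never leave the elastic range declared for the job.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    print(decide_replicas(4, 90.0, 60.0, 1, 8))   # scale out
    print(decide_replicas(4, 30.0, 60.0, 2, 8))   # scale in
```

The clamp range would naturally line up with the min/max node range the elastic job was launched with, so a decision can never ask for a size the rendezvous would reject.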

Additional context:

This issue is part of our ASoC 2022 Program.

Difficulty: Normal
Mentor: Qiukai Chen (@SimonCqk)

SimonCqk changed the title from "[ASoC 2022] Implement native pytorch elastic training style based on torch-elastic protocol." to "[ASoC 2022] Implement native pytorch elastic training fashion based on torch-elastic protocol." on May 30, 2022