Background:
As announced on the official PyTorch site, torch-elastic has been upstreamed into PyTorch >= 1.9. Because KubeDL manages the lifecycle of jobs and orchestrates their resources, it is important for it to implement the torch-elastic distributed training protocol and provide a fault-tolerant, elastic training experience. With that in place, job completion time (JCT) can be significantly shortened and resources (CPU/memory as well as GPUs) can be better utilized.
Goals to be achieved:
Design clean, user-friendly elastic training APIs.
Implement elastic training control flow on pytorch-controller.
[Advanced] Design a scale-out/scale-in algorithm driven by user-customized metrics.
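To make the goals above a bit more concrete, here is a minimal sketch of what the scaling decision and the elastic launch could look like. Everything KubeDL-specific is an assumption for illustration: `ElasticPolicy`, `desired_replicas`, and `launcher_args` are hypothetical names, not an existing KubeDL API. The scaling rule borrows the proportional formula used by the Kubernetes Horizontal Pod Autoscaler rather than anything proposed in this issue, and the launcher flags are the standard `torch.distributed.run` options for elastic jobs.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class ElasticPolicy:
    """Hypothetical elastic settings for a job; not KubeDL's actual API."""
    min_replicas: int
    max_replicas: int
    target_metric: float  # desired per-worker value of a user-chosen metric


def desired_replicas(policy: ElasticPolicy, current: int, observed_metric: float) -> int:
    """Scale proportionally to observed/target (HPA-style), clamped to [min, max]."""
    if current <= 0 or policy.target_metric <= 0:
        # No meaningful signal yet: fall back to the lower bound.
        return policy.min_replicas
    proposed = math.ceil(current * observed_metric / policy.target_metric)
    return max(policy.min_replicas, min(policy.max_replicas, proposed))


def launcher_args(policy: ElasticPolicy, nproc_per_node: int,
                  rdzv_endpoint: str, job_id: str, script: str) -> List[str]:
    """Build an elastic launch command; --nnodes=MIN:MAX lets the group resize."""
    return [
        "python", "-m", "torch.distributed.run",
        f"--nnodes={policy.min_replicas}:{policy.max_replicas}",
        f"--nproc_per_node={nproc_per_node}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={rdzv_endpoint}",
        f"--rdzv_id={job_id}",
        script,
    ]
```

A controller could call `desired_replicas` on each reconcile and only resize the worker set when the result differs from the current count; the min/max bounds passed to the launcher keep the rendezvous valid while workers join or leave.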
SimonCqk changed the title from "[ASoC 2022] Implement native pytorch elastic training style based on torch-elastic protocol." to "[ASoC 2022] Implement native pytorch elastic training fashion based on torch-elastic protocol." on May 30, 2022
Additional context:
This issue is part of our ASoC 2022 Program.
Difficulty: Normal
Mentor: Qiukai Chen (@SimonCqk)