Insights: intelligent-machine-learning/dlrover
Overview
- 8 Merged pull requests
- 3 Open pull requests
- 0 Closed issues
- 2 New issues
8 Pull requests merged by 3 people
- Optimize hccl port detection (#1191, merged Jul 13, 2024)
- fix exception when plan is none (#1186, merged Jul 10, 2024)
- Skip restart training process on failure nodes (#1185, merged Jul 10, 2024)
- Unify job manager's stop status field (#1184, merged Jul 10, 2024)
- Sync internal modification. (#1183, merged Jul 9, 2024)
- Remove the debug code to print variables. (#1178, merged Jul 9, 2024)
- Improve training port conflict avoid (#1181, merged Jul 9, 2024)
- Multi issue fixed. (#1182, merged Jul 9, 2024)
3 Pull requests opened by 2 people
- Add std version output for agent. (#1188, opened Jul 12, 2024)
- Fix heart beat for concurency. (#1189, opened Jul 12, 2024)
- Optimize failure node detection (#1190, opened Jul 12, 2024)
2 Issues opened by 2 people
- Error encountered while using flash attention in TensorFlow (#1180, opened Jul 8, 2024)
3 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
- How to use the elasticity and fault tolerance in a Volcano job. (#1172, commented on Jul 9, 2024, 0 new comments)
- add util for loss spike save and decode. (#1044, commented on Jul 11, 2024, 0 new comments)
- [WIP] Pod scaler enhancement: support concurrent creation (#1173, commented on Jul 9, 2024, 0 new comments)