feat(controller): Retry pod creation on API timeout #4820
Conversation
@someshkoli Would you check the PR?
Signed-off-by: Daisuke Taniwaki <[email protected]>
workflow/controller/workflowpod.go
Outdated
@@ -394,6 +407,15 @@ func (woc *wfOperationCtx) createWorkflowPod(nodeName string, mainCtr apiv1.Cont
	return created, nil
}

// shouldRetryCreate returns whether a create request can be retried on the given error.
func (woc *wfOperationCtx) shouldRetryCreate(err error) bool {
Could you use IsTransientErr for simplicity?
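For context, a minimal sketch of what that suggestion might look like, assuming the shared helper lives in the util/errors package (the import path and alias are assumptions, not confirmed by this PR):

```go
import (
	errorsutil "github.com/argoproj/argo/util/errors" // assumed path/alias
)

// shouldRetryCreate delegates to the shared transient-error classification
// instead of maintaining its own list of retryable error kinds.
func (woc *wfOperationCtx) shouldRetryCreate(err error) bool {
	return errorsutil.IsTransientErr(err)
}
```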
workflow/controller/workflowpod.go
Outdated
@@ -375,7 +383,12 @@ func (woc *wfOperationCtx) createWorkflowPod(nodeName string, mainCtr apiv1.Cont
	pod.Spec.ActiveDeadlineSeconds = &newActiveDeadlineSeconds
}

created, err := woc.controller.kubeclientset.CoreV1().Pods(woc.wf.ObjectMeta.Namespace).Create(pod)
var created *apiv1.Pod
err = retryutil.OnError(createPodRetryBackoff, woc.shouldRetryCreate, func() error {
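For readers following along: retryutil here looks like an alias for client-go's k8s.io/client-go/util/retry (an assumption). A self-contained sketch of how retry.OnError behaves, with illustrative back-off values that may differ from the PR's:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

// Illustrative values only; the PR defines its own createPodRetryBackoff.
var createPodRetryBackoff = wait.Backoff{
	Steps:    5,                      // at most 5 attempts
	Duration: 100 * time.Millisecond, // initial delay between attempts
	Factor:   2.0,                    // double the delay after each failure
}

func main() {
	attempt := 0
	// retry.OnError re-invokes the closure while the predicate reports the
	// error as retryable, sleeping per the back-off between attempts.
	err := retry.OnError(createPodRetryBackoff, func(err error) bool {
		return true // stand-in for woc.shouldRetryCreate
	}, func() error {
		attempt++
		if attempt < 3 {
			return fmt.Errorf("transient failure %d", attempt)
		}
		return nil // succeeds on the third try
	})
	fmt.Println(attempt, err) // 3 <nil>
}
```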
I'm not sure of the best solution here. We have a limited time (30s) to reconcile a workflow. I wonder if it would be better, rather than failing the workflow on a transient error, to instead exit early and wait for the next reconciliation?
That makes sense. I updated the PR and included server timeout and also service unavailable in IsTransientErr.
Signed-off-by: Daisuke Taniwaki <[email protected]>
@@ -15,7 +15,7 @@ func IsTransientErr(err error) bool {
	return false
}
err = argoerrs.Cause(err)
return isExceededQuotaErr(err) || apierr.IsTooManyRequests(err) || isResourceQuotaConflictErr(err) || isTransientNetworkErr(err)
return isExceededQuotaErr(err) || apierr.IsTooManyRequests(err) || isResourceQuotaConflictErr(err) || isTransientNetworkErr(err) || apierr.IsServerTimeout(err) || apierr.IsServiceUnavailable(err)
needs test obvs
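A hedged sketch of such a test, using the fabricated-error constructors from k8s.io/apimachinery/pkg/api/errors; the package name, test name, and cases are illustrative, not the PR's actual test:

```go
package errors // assumed: the package holding IsTransientErr

import (
	"testing"

	"github.com/stretchr/testify/assert"
	apierr "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

func TestIsTransientErrServerTimeouts(t *testing.T) {
	// Server timeout and service unavailable should now classify as transient.
	assert.True(t, IsTransientErr(apierr.NewServerTimeout(schema.GroupResource{}, "create", 1)))
	assert.True(t, IsTransientErr(apierr.NewServiceUnavailable("backend scaling up")))
	// Ordinary client errors and nil must remain non-transient.
	assert.False(t, IsTransientErr(apierr.NewBadRequest("bad request")))
	assert.False(t, IsTransientErr(nil))
}
```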
@jessesuen I think I recall we excluded timeout as a transient error - but I feel it does fit the definition - can you remember why?
@dtaniwaki I spoke to Jesse. We need to be careful here. We call this func in a poll or back-off loop, and a timeout retried multiple times would end up being VERY long in those cases. However, without this change, the error results in a failed workflow - one that should have been retried. Therefore I think this change is correct.
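To make the timing concern concrete, a back-of-the-envelope sketch with entirely hypothetical numbers: if each attempt blocks for the API server's default 30s timeout and a 5-step exponential back-off sleeps in between, the worst case is roughly 2.5 minutes, far past a 30s reconciliation window:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const attempts = 5
	perAttempt := 30 * time.Second  // each request blocks until the server times out
	delay := 100 * time.Millisecond // initial back-off delay (illustrative)
	total := time.Duration(0)
	for i := 0; i < attempts; i++ {
		total += perAttempt
		if i < attempts-1 {
			total += delay
			delay *= 2 // exponential back-off between attempts
		}
	}
	fmt.Println(total) // 2m31.5s
}
```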
When the control plane hosts thousands of pods and it or its admission webhooks time out (default 30s), the pod creation request fails, and the workflow controller fails to proceed to evaluate the following steps. This becomes problematic when steps have retry nodes: they remain in the Running phase even after the control plane returns the timeout error, because the workflow controller marks the workflow Error and never checks the phases of the templates.

Fixes #4583
Checklist: