
feat(controller): Retry pod creation on API timeout #4820

Merged
merged 8 commits on Jan 6, 2021

Conversation

@dtaniwaki (Member) commented Jan 3, 2021:

When the control plane is handling thousands of pods and the API server itself or a subsequent admission webhook times out (the default timeout is 30s), the pod creation request fails and the workflow controller cannot proceed to evaluate the following steps. This becomes problematic when steps have a retry node: the nodes remain in the Running phase even after the control plane returns the timeout error, because the workflow controller marks the workflow as Error and never re-checks the phases of the templates.

Fixes #4583

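For context, the approach the PR takes, wrapping the pod Create call in retryutil.OnError with a backoff and a retryable-error predicate, can be sketched as a standalone example. This is a minimal illustration only: the backoff values, the predicate name isRetryableCreateErr, and the simulated create call below are assumptions for the sketch, not the controller's actual code.

```go
package main

import (
	"fmt"
	"time"

	apierr "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

// isRetryableCreateErr is an illustrative predicate: treat API-server timeouts
// and temporary unavailability as retryable, in the spirit of this PR.
// (The controller's real predicate delegates to IsTransientErr.)
func isRetryableCreateErr(err error) bool {
	return apierr.IsServerTimeout(err) || apierr.IsTimeout(err) ||
		apierr.IsServiceUnavailable(err) || apierr.IsTooManyRequests(err)
}

func main() {
	// Hypothetical backoff, similar in spirit to createPodRetryBackoff.
	backoff := wait.Backoff{Steps: 4, Duration: 500 * time.Millisecond, Factor: 2.0, Jitter: 0.1}

	attempts := 0
	err := retry.OnError(backoff, isRetryableCreateErr, func() error {
		// In the controller this would be the Pods(...).Create(pod) call;
		// here we simulate two server timeouts followed by success.
		attempts++
		if attempts < 3 {
			return apierr.NewServerTimeout(schema.GroupResource{Resource: "pods"}, "create", 1)
		}
		return nil
	})
	fmt.Printf("attempts=%d err=%v\n", attempts, err)
}
```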

@dtaniwaki (Member Author):

@someshkoli Would you check the PR?

@dtaniwaki requested a review from alexec January 3, 2021 15:31
@dtaniwaki (Member Author):

@alexec I made a fix for #4583, but it's hard to create a reliably reproducible environment, so I added a unit test with a client that simulates timeouts.

@dtaniwaki changed the title from "feat(controller): Retry pod creation" to "feat(controller): Retry pod creation on API timeout" on Jan 3, 2021
Signed-off-by: Daisuke Taniwaki <[email protected]>
@@ -394,6 +407,15 @@ func (woc *wfOperationCtx) createWorkflowPod(nodeName string, mainCtr apiv1.Cont
 	return created, nil
 }
 
+// shouldRetryCreate returns whether a create request can be retried on the given error.
+func (woc *wfOperationCtx) shouldRetryCreate(err error) bool {
Contributor:

Could you use IsTransientErr for simplicity?

@@ -375,7 +383,12 @@ func (woc *wfOperationCtx) createWorkflowPod(nodeName string, mainCtr apiv1.Cont
 		pod.Spec.ActiveDeadlineSeconds = &newActiveDeadlineSeconds
 	}
 
-	created, err := woc.controller.kubeclientset.CoreV1().Pods(woc.wf.ObjectMeta.Namespace).Create(pod)
+	var created *apiv1.Pod
+	err = retryutil.OnError(createPodRetryBackoff, woc.shouldRetryCreate, func() error {
Contributor:

I'm not sure what the best solution is here. We have limited time (30s) to reconcile a workflow. I wonder if it would be better, rather than failing the workflow on a transient error, to instead exit early and wait for the next reconciliation?

@dtaniwaki (Member Author):

That makes sense. I updated the PR and included server timeout and service unavailable in IsTransientErr.

@@ -15,7 +15,7 @@ func IsTransientErr(err error) bool {
 		return false
 	}
 	err = argoerrs.Cause(err)
-	return isExceededQuotaErr(err) || apierr.IsTooManyRequests(err) || isResourceQuotaConflictErr(err) || isTransientNetworkErr(err)
+	return isExceededQuotaErr(err) || apierr.IsTooManyRequests(err) || isResourceQuotaConflictErr(err) || isTransientNetworkErr(err) || apierr.IsServerTimeout(err) || apierr.IsServiceUnavailable(err)
Contributor:

needs test obvs
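A minimal test sketch for the two new cases might look like the following, assuming the tests sit in the same package as IsTransientErr; the test names and the way the errors are constructed here are hypothetical and not necessarily what the PR ultimately added:

```go
package errors

import (
	"testing"

	apierr "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// Hypothetical tests covering the two newly included reasons.
func TestIsTransientErrServerTimeout(t *testing.T) {
	err := apierr.NewServerTimeout(schema.GroupResource{Resource: "pods"}, "create", 1)
	if !IsTransientErr(err) {
		t.Error("expected a server timeout to be treated as transient")
	}
}

func TestIsTransientErrServiceUnavailable(t *testing.T) {
	if !IsTransientErr(apierr.NewServiceUnavailable("backend not ready")) {
		t.Error("expected service unavailable to be treated as transient")
	}
}
```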

Contributor:

@jessesuen I think I recall we excluded timeout as a transient error - but I feel it does fit the definition - can you remember why?

Contributor:

@dtaniwaki I spoke to Jesse. We need to be careful here. We call this func in poll and back-off loops, so a timeout retried multiple times could end up taking a very long time in those cases. However, as things stand, the error results in a failed workflow that should have been retried. Therefore I think this change is correct.

@alexec self-assigned this Jan 4, 2021
@alexec merged commit 6e15878 into argoproj:master Jan 6, 2021
@alexec added this to the v3.0 milestone Jan 6, 2021
@simster7 mentioned this pull request Jan 12, 2021

Successfully merging this pull request may close these issues.

Cannot retry a steps workflow w/ pod creation failure