feat(controller): Retry pod creation on API timeout #4820
Conversation
@someshkoli Would you check the PR?
Signed-off-by: Daisuke Taniwaki <[email protected]>
workflow/controller/workflowpod.go
Outdated
@@ -394,6 +407,15 @@ func (woc *wfOperationCtx) createWorkflowPod(nodeName string, mainCtr apiv1.Cont
	return created, nil
}

// shouldRetryCreate returns whether a create request can be retried on the given error.
func (woc *wfOperationCtx) shouldRetryCreate(err error) bool {
Could you use IsTransientErr for simplicity?
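For context, a minimal sketch of what that suggestion might look like, assuming the shared helper lives in the util/errors package (the import path and alias are assumptions, not confirmed by this PR):

```go
import (
	errorsutil "github.com/argoproj/argo/util/errors" // assumed path/alias
)

// shouldRetryCreate delegates to the shared transient-error classification
// instead of maintaining its own list of retryable error kinds.
func (woc *wfOperationCtx) shouldRetryCreate(err error) bool {
	return errorsutil.IsTransientErr(err)
}
```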
workflow/controller/workflowpod.go
Outdated
@@ -375,7 +383,12 @@ func (woc *wfOperationCtx) createWorkflowPod(nodeName string, mainCtr apiv1.Cont
	pod.Spec.ActiveDeadlineSeconds = &newActiveDeadlineSeconds
}

created, err := woc.controller.kubeclientset.CoreV1().Pods(woc.wf.ObjectMeta.Namespace).Create(pod)
var created *apiv1.Pod
err = retryutil.OnError(createPodRetryBackoff, woc.shouldRetryCreate, func() error {
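For readers following along: retryutil here looks like an alias for client-go's k8s.io/client-go/util/retry (an assumption). A self-contained sketch of how retry.OnError behaves, with illustrative back-off values that may differ from the PR's:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

// Illustrative values only; the PR defines its own createPodRetryBackoff.
var createPodRetryBackoff = wait.Backoff{
	Steps:    5,                      // at most 5 attempts
	Duration: 100 * time.Millisecond, // initial delay between attempts
	Factor:   2.0,                    // double the delay after each failure
}

func main() {
	attempt := 0
	// retry.OnError re-invokes the closure while the predicate reports the
	// error as retryable, sleeping per the back-off between attempts.
	err := retry.OnError(createPodRetryBackoff, func(err error) bool {
		return true // stand-in for woc.shouldRetryCreate
	}, func() error {
		attempt++
		if attempt < 3 {
			return fmt.Errorf("transient failure %d", attempt)
		}
		return nil // succeeds on the third try
	})
	fmt.Println(attempt, err) // 3 <nil>
}
```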
I'm not sure of the best solution here. We have a limited time (30s) to reconcile a workflow. I wonder if it would be better, rather than failing the workflow on a transient error, to instead exit early and wait for the next reconciliation?
That makes sense. I updated the PR and included server timeout and also service unavailable in IsTransientErr.
Signed-off-by: Daisuke Taniwaki <[email protected]>
@@ -15,7 +15,7 @@ func IsTransientErr(err error) bool {
	return false
}
err = argoerrs.Cause(err)
return isExceededQuotaErr(err) || apierr.IsTooManyRequests(err) || isResourceQuotaConflictErr(err) || isTransientNetworkErr(err)
return isExceededQuotaErr(err) || apierr.IsTooManyRequests(err) || isResourceQuotaConflictErr(err) || isTransientNetworkErr(err) || apierr.IsServerTimeout(err) || apierr.IsServiceUnavailable(err)
needs test obvs
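A hedged sketch of such a test, using the fabricated-error constructors from k8s.io/apimachinery/pkg/api/errors; the package name, test name, and cases are illustrative, not the PR's actual test:

```go
package errors // assumed: the package holding IsTransientErr

import (
	"testing"

	"github.com/stretchr/testify/assert"
	apierr "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

func TestIsTransientErrServerTimeouts(t *testing.T) {
	// Server timeout and service unavailable should now classify as transient.
	assert.True(t, IsTransientErr(apierr.NewServerTimeout(schema.GroupResource{}, "create", 1)))
	assert.True(t, IsTransientErr(apierr.NewServiceUnavailable("backend scaling up")))
	// Ordinary client errors and nil must remain non-transient.
	assert.False(t, IsTransientErr(apierr.NewBadRequest("bad request")))
	assert.False(t, IsTransientErr(nil))
}
```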
@jessesuen I think I recall we excluded timeout as a transient error - but I feel it does fit the definition - can you remember why?
@dtaniwaki I spoke to Jesse. We need to be careful here. We call this func in a poll or back-off loop, and a timeout retried multiple times would end up being VERY long in those cases. However, without this change, the error results in a failed workflow - one that should have been retried. Therefore I think this change is correct.
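To make the timing concern concrete, a back-of-the-envelope sketch with entirely hypothetical numbers: if each attempt blocks for the API server's default 30s timeout and a 5-step exponential back-off sleeps in between, the worst case is roughly 2.5 minutes, far past a 30s reconciliation window:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const attempts = 5
	perAttempt := 30 * time.Second  // each request blocks until the server times out
	delay := 100 * time.Millisecond // initial back-off delay (illustrative)
	total := time.Duration(0)
	for i := 0; i < attempts; i++ {
		total += perAttempt
		if i < attempts-1 {
			total += delay
			delay *= 2 // exponential back-off between attempts
		}
	}
	fmt.Println(total) // 2m31.5s
}
```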
When the control plane hosts thousands of pods and it or its admission webhooks time out (default 30s), the pod creation request fails, and the workflow controller fails to proceed to evaluate the following steps. This becomes problematic when steps have retry nodes: they remain in the Running phase even after the control plane returns the timeout error, because the workflow controller marks the workflow Error and never checks the phases of the templates.

Fixes #4583
Checklist: